Hadoop0.20.2 Bloom filter应用示例

2014-06-04 11:55 451人阅读评论(0) 收藏举报

1. 简介

参见《Hadoop in Action》P102 以及《Hadoop实战（第2版）》（陆嘉恒）P69

2. 案例

网上大部分的说明仅仅是按照《Hadoop in Action》中的示例代码给出，这里是Hadoop0.20.2版本，在该版本中已经实现了BloomFilter。

案例文件如下：

customers.txt

1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000

-----------------------------------------------------------------

orders.txt

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009
5,E,34.59,05-Jan-2010
6,F,28.67,16-Jan-2008
7,G,49.82,24-Jan-2009

两个文件通过customer ID关联。

3. 代码

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
public class BloomMRMain {
public static class BloomMapper extends Mapper<Object, Text, Text, Text> {
BloomFilter bloomFilter = new BloomFilter(10000, 6, Hash.MURMUR_HASH);
protected void setup(Context context) throws IOException ,InterruptedException {
Configuration conf = context.getConfiguration();
String path = "hdfs://localhost:9000/user/hezhixue/input/customers.txt";
Path file = new Path(path);
FileSystem hdfs = FileSystem.get(conf);
FSDataInputStream dis = hdfs.open(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(dis));
String temp;
while ((temp = reader.readLine()) != null) {
// System.out.println("bloom filter temp:" + temp);
String[] tokens = temp.split(",");
if (tokens.length > 0) {
bloomFilter.add(new Key(tokens[0].getBytes()));
}
}
}
protected void map(Object key, Text value, Context context) throws IOException ,InterruptedException {
//获得文件输入路径
String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
if (pathName.contains("customers")) {
String data = value.toString();
String[] tokens = data.split(",");
if (tokens.length == 3) {
String outKey = tokens[0];
String outVal = "0" + ":" + tokens[1] + "," + tokens[2];
context.write(new Text(outKey), new Text(outVal));
}
} else if (pathName.contains("orders")) {
String data = value.toString();
String[] tokens = data.split(",");
if (tokens.length == 4) {
String outKey = tokens[0];
System.out.println("in map and outKey:" + outKey);
if (bloomFilter.membershipTest(new Key(outKey.getBytes()))) {
String outVal = "1" + ":" + tokens[1] + "," + tokens[2]+ "," + tokens[3];
context.write(new Text(outKey), new Text(outVal));
}
}
}
}
}
public static class BloomReducer extends Reducer<Text, Text, Text, Text> {
ArrayList<Text> leftTable = new ArrayList<Text>();
ArrayList<Text> rightTable = new ArrayList<Text>();
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException ,InterruptedException {
leftTable.clear();
rightTable.clear();
for (Text val : values) {
String outVal = val.toString();
System.out.println("key: " + key.toString() + " : " + outVal);
int index = outVal.indexOf(":");
String flag = outVal.substring(0, index);
if ("0".equals(flag)) {
leftTable.add(new Text(outVal.substring(index+1)));
} else if ("1".equals(flag)) {
rightTable.add(new Text(outVal.substring(index + 1)));
}
}
if (leftTable.size() > 0 && rightTable.size() > 0) {
for(Text left : leftTable) {
for (Text right : rightTable) {
context.write(key, new Text(left.toString() + "," + right.toString()));
}
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: BloomMRMain <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "BloomMRMain");
job.setJarByClass(BloomMRMain.class);
job.setMapperClass(BloomMapper.class);
job.setReducerClass(BloomReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Hadoop Bloom filter应用示例的更多相关文章

Hadoop Bloom Filter 使用
1.Bloom Filter 默认的 BloomFilter filter =new BloomFilter(10,2,1); // 过滤器长度为10 ,用2哈希函数,MURMUR_HASH (1) ...
Bloom Filter 原理与应用
介绍 Bloom Filter是一种简单的节省空间的随机化的数据结构,支持用户查询的集合.一般我们使用STL的std::set, stdext::hash_set,std::set是用红黑树实现的,s ...
Hadoop0.20.2 Bloom filter应用演示样例
1. 简单介绍參见<Hadoop in Action>P102 以及 <Hadoop实战(第2版)>(陆嘉恒)P69 2. 案例网上大部分的说明不过依照<Hadoop ...
Skip List & Bloom Filter
Skip List | Set 1 (Introduction) Can we search in a sorted linked list in better than O(n) time?Th ...
Bloom Filter：海量数据的HashSet
Bloom Filter一般用于数据的去重计算,近似于HashSet的功能:但是不同于Bitmap(用于精确计算),其为一种估算的数据结构,存在误判(false positive)的情况. 1. 基本 ...
探索C#之布隆过滤器(Bloom filter)
阅读目录: 背景介绍算法原理误判率 BF改进总结背景介绍 Bloom filter(后面简称BF)是Bloom在1970年提出的二进制向量数据结构.通俗来说就是在大数据集合下高效判断某个成员是 ...
Bloom Filter 布隆过滤器
Bloom Filter 是由伯顿.布隆(Burton Bloom)在1970年提出的一种多hash函数映射的快速查找算法.它实际上是一个很长的二进制向量和一些列随机映射函数.应用在数据量很大的情况下 ...
Bloom Filter学习
参考文献: Bloom Filters - the math http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html B ...
【转】探索C#之布隆过滤器(Bloom filter)
原文:蘑菇先生,http://www.cnblogs.com/mushroom/p/4556801.html 背景介绍 Bloom filter(后面简称BF)是Bloom在1970年提出的二进制向量 ...

随机推荐

What is Proguard?
When packaging an apk, all classes of all libraries used by the program will be included, this makes ...
java编程之：Unsafe类
Unsafe类在jdk 源码的多个类中用到,这个类的提供了一些绕开JVM的更底层功能,基于它的实现可以提高效率.但是,它是一把双刃剑:正如它的名字所预示的那样,它是 Unsafe的,它所分配的内存需要 ...
论文笔记之：Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition
Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition Baidu ...
Articles Every Programmer Must Read
http://javarevisited.blogspot.sg/2014/05/10-articles-every-programmer-must-read.html Being a Java pr ...
OC IOS屏幕分辨率
CGRect screenRect=[UIScreenmainScreen].bounds; CGSize screenSize=screenRect.size; //屏幕分辨率 screenSize ...
oracle更新统计信息以及解锁统计信息
begin dbms_stats.unlock_table_stats(ownname => 'DM_MPAY', tabname => 'PLAT_INFO');end; begin d ...
PHP实现动态规划背包问题
有一堆货物,有各种大小和价值不等的多个物品,而你只有固定大小的背包,拿走哪些能保证你的背包带走的价值最多动态规划就是可以记录前一次递归过程中计算出的最大值,在之后的递归期间使用,以免重复计算. &l ...
设置EDIUS字幕时有哪些要注意的
我们在用EDIUS添加字幕,有时候可能会遇到以下麻烦.例如有的字体在EDIUS中找不到,诗歌的排版问题还有怎么给字幕加光效等等.今天小编主要来给大家解决这三个问题,让你们知道EDIUS字幕设置时应该注 ...
axure变量的使用
1.什么是变量? 变量在数学中的定义是可以改变的数,在计算机编程中,它是在内存中开辟的一块空间用于存储临时数据.Axure中的变量和计算机编程中一样,它是一个用于存储临时数据的容器. 2.变量的创建 ...
运行Appium碰到的坑们
运行Appium的时候,碰到的那些坑 1. java命令会出现error:could not open ...jvm.cfg 出现这种情况大多是因为电脑上之前安装过JDK,卸载重装之后,运行java命 ...

Hadoop Bloom filter应用示例

Hadoop0.20.2 Bloom filter应用示例

Hadoop Bloom filter应用示例的更多相关文章

随机推荐

热门专题