Hadoop 0.20.2 Bloom Filter Application Example

2014-06-04 11:55

1. Introduction

See "Hadoop in Action", p. 102, and "Hadoop实战" (2nd edition) by 陆嘉恒 (Lu Jiaheng), p. 69.

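2. Example

Most explanations found online simply reproduce the sample code from "Hadoop in Action". This post targets Hadoop 0.20.2, which already ships with its own BloomFilter implementation (org.apache.hadoop.util.bloom.BloomFilter), so no hand-written filter is needed.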
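For readers who have not used a Bloom filter before, the two operations the job relies on (adding a key and testing membership) can be sketched in plain Java. This is a toy for illustration only and is not Hadoop's implementation: the class name TinyBloomFilter and the double-hashing scheme built on String.hashCode() are assumptions made for this sketch, whereas the Hadoop class works on byte-array keys with Murmur or Jenkins hashing.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration; NOT org.apache.hadoop.util.bloom.BloomFilter.
class TinyBloomFilter {
    private final BitSet bits;
    private final int vectorSize; // number of bits in the vector
    private final int nbHash;     // number of bit positions set per key

    TinyBloomFilter(int vectorSize, int nbHash) {
        this.bits = new BitSet(vectorSize);
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
    }

    // i-th bit position for a key, derived from two base hashes
    // (double hashing; an assumption of this sketch).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = ((h1 * 0x9E3779B9) >>> 16) | 1;
        int combined = h1 + i * h2;
        return ((combined % vectorSize) + vectorSize) % vectorSize;
    }

    // Sets nbHash bits for the key.
    void add(String key) {
        for (int i = 0; i < nbHash; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean membershipTest(String key) {
        for (int i = 0; i < nbHash; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(10000, 6);
        for (String id : new String[] {"1", "2", "3", "4"}) {
            filter.add(id);
        }
        for (String id : new String[] {"3", "1", "2", "5", "6", "7"}) {
            System.out.println(id + " -> " + filter.membershipTest(id));
        }
    }
}
```

A key that was added always tests true, so the filter never loses a matching record. A key that was never added may occasionally test true as well (a false positive), which is why an exact check is still needed downstream.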

The sample files are as follows:

customers.txt

1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000

-----------------------------------------------------------------

orders.txt

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009
5,E,34.59,05-Jan-2010
6,F,28.67,16-Jan-2008
7,G,49.82,24-Jan-2009

The two files are joined on customer ID (the first field of each line).
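To make the expected result concrete, here is a small plain-Java sketch of the same inner join over the sample data, with no Hadoop involved. The class name JoinSketch and the tab-separated output layout are assumptions chosen for this illustration; a MapReduce job would also emit the rows grouped by key rather than in order-file order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the inner join on customer ID.
class JoinSketch {

    // Joins customer rows (id,name,phone) with order rows (id,item,price,date).
    static List<String> join(String[] customers, String[] orders) {
        // Index customers by ID; the value is the rest of the line ("name,phone").
        Map<String, String> byId = new LinkedHashMap<String, String>();
        for (String c : customers) {
            String[] t = c.split(",", 2);
            byId.put(t[0], t[1]);
        }
        List<String> out = new ArrayList<String>();
        for (String o : orders) {
            String[] t = o.split(",", 2);
            String customer = byId.get(t[0]);
            if (customer != null) { // orders without a matching customer are dropped
                out.add(t[0] + "\t" + customer + "," + t[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] customers = {
            "1,Stephanie Leung,555-555-5555",
            "2,Edward Kim,123-456-7890",
            "3,Jose Madriz,281-330-8004",
            "4,David Stork,408-555-0000"
        };
        String[] orders = {
            "3,A,12.95,02-Jun-2008",
            "1,B,88.25,20-May-2008",
            "2,C,32.00,30-Nov-2007",
            "3,D,25.02,22-Jan-2009",
            "5,E,34.59,05-Jan-2010",
            "6,F,28.67,16-Jan-2008",
            "7,G,49.82,24-Jan-2009"
        };
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

Orders E, F and G belong to customer IDs 5, 6 and 7, which do not appear in customers.txt, so they cannot contribute to the join. Those are the records a Bloom filter built from the customer IDs is meant to discard early.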

3. Code

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomMRMain {

    /**
     * Mapper: tags customer records with "0" and order records with "1".
     * Orders whose customer ID fails the Bloom filter test are dropped
     * here, before the shuffle.
     */
    public static class BloomMapper extends Mapper<Object, Text, Text, Text> {

        // 10000-bit vector, 6 hash functions, Murmur hash.
        private final BloomFilter bloomFilter = new BloomFilter(10000, 6, Hash.MURMUR_HASH);

        // Load every customer ID into the Bloom filter once per map task.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // Hard-coded location of the customers file; adjust for your cluster.
            String path = "hdfs://localhost:9000/user/hezhixue/input/customers.txt";
            Path file = new Path(path);
            FileSystem hdfs = FileSystem.get(conf);
            FSDataInputStream dis = hdfs.open(file);
            BufferedReader reader = new BufferedReader(new InputStreamReader(dis));
            try {
                String temp;
                while ((temp = reader.readLine()) != null) {
                    String[] tokens = temp.split(",");
                    // Skip blank lines so that no empty key is added.
                    if (tokens.length > 0 && tokens[0].length() > 0) {
                        bloomFilter.add(new Key(tokens[0].getBytes()));
                    }
                }
            } finally {
                reader.close(); // also closes the underlying HDFS stream
            }
        }

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Path of the input file for this split; used to tell the two inputs apart.
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            if (pathName.contains("customers")) {
                String data = value.toString();
                String[] tokens = data.split(",");
                if (tokens.length == 3) {
                    // Customer record: "0:<name>,<phone>"
                    String outKey = tokens[0];
                    String outVal = "0" + ":" + tokens[1] + "," + tokens[2];
                    context.write(new Text(outKey), new Text(outVal));
                }
            } else if (pathName.contains("orders")) {
                String data = value.toString();
                String[] tokens = data.split(",");
                if (tokens.length == 4) {
                    String outKey = tokens[0];
                    // Emit the order only if its customer ID may exist.
                    if (bloomFilter.membershipTest(new Key(outKey.getBytes()))) {
                        // Order record: "1:<item>,<price>,<date>"
                        String outVal = "1" + ":" + tokens[1] + "," + tokens[2] + "," + tokens[3];
                        context.write(new Text(outKey), new Text(outVal));
                    }
                }
            }
        }
    }

    /**
     * Reducer: separates the values of one customer ID into customer rows
     * (tag "0") and order rows (tag "1"), then emits their cross product.
     * Keys that passed the Bloom filter as false positives have no customer
     * row and therefore produce no output.
     */
    public static class BloomReducer extends Reducer<Text, Text, Text, Text> {

        private final ArrayList<Text> leftTable = new ArrayList<Text>();
        private final ArrayList<Text> rightTable = new ArrayList<Text>();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            leftTable.clear();
            rightTable.clear();
            for (Text val : values) {
                String outVal = val.toString();
                int index = outVal.indexOf(":");
                if (index < 0) {
                    continue; // not a tagged value; ignore it
                }
                String flag = outVal.substring(0, index);
                if ("0".equals(flag)) {
                    leftTable.add(new Text(outVal.substring(index + 1)));
                } else if ("1".equals(flag)) {
                    rightTable.add(new Text(outVal.substring(index + 1)));
                }
            }
            // Inner join: output only when both sides are present.
            if (leftTable.size() > 0 && rightTable.size() > 0) {
                for (Text left : leftTable) {
                    for (Text right : rightTable) {
                        context.write(key, new Text(left.toString() + "," + right.toString()));
                    }
                }
            }
        }
    }

    // Driver: <in> is the directory holding customers.txt and orders.txt.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: BloomMRMain <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "BloomMRMain");
        job.setJarByClass(BloomMRMain.class);
        job.setMapperClass(BloomMapper.class);
        job.setReducerClass(BloomReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
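
The mapper builds its filter as BloomFilter(10000, 6, Hash.MURMUR_HASH), that is, a 10000-bit vector with 6 hash functions. As a rough guide to how many customer IDs those settings can absorb, the sketch below evaluates the standard approximation p ≈ (1 - e^(-k*n/m))^k for m bits, k hash functions and n inserted keys. The class name BloomFalsePositive is made up for this illustration, and the formula assumes ideal, independent hash functions, so the figures are estimates rather than measurements of Hadoop's implementation.

```java
// Estimates the false-positive rate of a Bloom filter from its parameters.
class BloomFalsePositive {

    // m = vector size in bits, k = number of hash functions, n = keys inserted.
    static double rate(int m, int k, int n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        int m = 10000; // vector size used by the mapper
        int k = 6;     // number of hash functions used by the mapper
        for (int n : new int[] {4, 1000, 5000}) {
            System.out.println(n + " keys -> estimated false-positive rate " + rate(m, k, n));
        }
    }
}
```

With the four customers in this example the estimate is negligible, and at 1000 keys it is a little under 1%. For a much larger customer file the vector size should grow accordingly. False positives only cost extra shuffle traffic here, because the reducer still requires a real customer row before it emits anything.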
