Hadoop Map-Side Join
The previous two posts covered Hadoop's map-side join and reduce-side join; today we look at another join approach that sits somewhere between the two.
SemiJoin, usually called a semi-join, works by filtering out, on the map side, records that cannot take part in the join, which greatly shortens the reduce shuffle. With a pure reduce-side join, a dataset may contain a large amount of data that is irrelevant to the join; because nothing is pre-filtered, those records are only recognized as useless once the reduce function actually runs, yet by then they have already gone through shuffle, merge, and sort, wasting a great deal of network and disk I/O. Overall this drags performance down, and the more irrelevant data there is, the more pronounced the effect becomes.
The semi-join is really a variant of the reduce-side join: by dropping the useless records on the map side, we shorten the shuffle phase of the reduce stage and gain some performance.
Concretely, it again uses the DistributedCache to distribute the small table to every node. In the mapper's setup() method we read the cached file and store only the small table's join keys in a HashSet. Then, for every record handled in map(), we check its join key: if the key is empty or not present in the HashSet, the record is treated as useless, so it is never partitioned and written to disk and never takes part in the shuffle and sort of the reduce stage, which improves join performance to some extent. Note that if the small table's key set is itself very large, the job may run out of memory (OOM); in that case another join strategy should be considered.
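The core of the map-side filter can be sketched in a few lines. The following is only a minimal illustration (the class name SemiJoinFilterMapper and the comma-delimited format with the join key in the first field are assumptions for the sketch); the complete job listed further below additionally tags each record with a source flag and performs the actual join in the reducer.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Minimal semi-join filter: emit a record only if its join key appears in the cached small table. */
    public class SemiJoinFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Set<String> joinKeySet = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Load only the join keys of the small table from its DistributedCache copy.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            for (Path p : cached) {
                BufferedReader br = new BufferedReader(new FileReader(p.toString()));
                String line;
                while ((line = br.readLine()) != null) {
                    joinKeySet.add(line.split(",")[0]);
                }
                br.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Records whose join key is empty or absent from the small table are dropped here,
            // so they never enter partition/shuffle/sort and never reach the reducer.
            if (fields[0].isEmpty() || !joinKeySet.contains(fields[0])) {
                return;
            }
            context.write(new Text(fields[0]), value);
        }
    }

Only the keys, not the whole small-table records, are held in memory; that is what distinguishes the semi-join from a pure map-side (replicated) join.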
The test data is as follows:
Simulated small-table data (a.txt):
1,三劫散仙,13575468248
2,凤舞九天,18965235874
3,忙忙碌碌,15986854789
4,少林寺方丈,15698745862
Simulated big-table data (b.txt):
3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07
5,E,100,2013-09-09
6,H,200,2014-01-10
The code is as follows:
    package com.semijoin;

    import java.io.BufferedReader;
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    /***
     * Semi-join (SemiJoin) implementation for Hadoop 1.2.
     *
     * @author qindongliang
     *
     * Big data QQ group: 376932160
     * Search technology QQ group: 324714439
     ***/
    public class Semjoin {

        /**
         * Custom output entity carrying the join key, a source flag,
         * and the rest of the record.
         **/
        private static class CombineEntity implements WritableComparable<CombineEntity> {

            private Text joinKey;     // join key
            private Text flag;        // flag marking which file the record came from
            private Text secondPart;  // everything except the key

            public CombineEntity() {
                this.joinKey = new Text();
                this.flag = new Text();
                this.secondPart = new Text();
            }

            public Text getJoinKey() {
                return joinKey;
            }

            public void setJoinKey(Text joinKey) {
                this.joinKey = joinKey;
            }

            public Text getFlag() {
                return flag;
            }

            public void setFlag(Text flag) {
                this.flag = flag;
            }

            public Text getSecondPart() {
                return secondPart;
            }

            public void setSecondPart(Text secondPart) {
                this.secondPart = secondPart;
            }

            @Override
            public void readFields(DataInput in) throws IOException {
                this.joinKey.readFields(in);
                this.flag.readFields(in);
                this.secondPart.readFields(in);
            }

            @Override
            public void write(DataOutput out) throws IOException {
                this.joinKey.write(out);
                this.flag.write(out);
                this.secondPart.write(out);
            }

            @Override
            public int compareTo(CombineEntity o) {
                return this.joinKey.compareTo(o.joinKey);
            }
        }

        private static class JMapper extends Mapper<LongWritable, Text, Text, CombineEntity> {

            private CombineEntity combine = new CombineEntity();
            private Text flag = new Text();
            private Text joinKey = new Text();
            private Text secondPart = new Text();

            /** Join keys of the small table. */
            private HashSet<String> joinKeySet = new HashSet<String>();

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                BufferedReader br = null;
                String temp;
                // Fetch the files shared through the DistributedCache.
                Path path[] = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                for (Path p : path) {
                    if (p.getName().endsWith("a.txt")) {
                        br = new BufferedReader(new FileReader(p.toString()));
                        while ((temp = br.readLine()) != null) {
                            String ss[] = temp.split(",");
                            joinKeySet.add(ss[0]); // keep only the small table's join keys
                        }
                        br.close();
                    }
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Determine which input file this record came from.
                String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
                if (pathName.endsWith("a.txt")) {
                    String valueItems[] = value.toString().split(",");
                    // Only emit records whose join key exists in the small table's key set.
                    if (joinKeySet.contains(valueItems[0])) {
                        flag.set("0");                                         // source flag
                        joinKey.set(valueItems[0]);                            // join key
                        secondPart.set(valueItems[1] + "\t" + valueItems[2]);  // remaining columns
                        combine.setFlag(flag);
                        combine.setJoinKey(joinKey);
                        combine.setSecondPart(secondPart);
                        context.write(combine.getJoinKey(), combine);
                    } else {
                        System.out.println("In a.txt: no matching key in the small table, record filtered out!");
                        for (String v : valueItems) {
                            System.out.print(v + " ");
                        }
                        return;
                    }
                } else if (pathName.endsWith("b.txt")) {
                    String valueItems[] = value.toString().split(",");
                    // Check whether the join key exists in the small table's key set.
                    if (joinKeySet.contains(valueItems[0])) {
                        flag.set("1");               // source flag
                        joinKey.set(valueItems[0]);  // join key
                        // Note: the two files have different numbers of columns.
                        secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3]);
                        combine.setFlag(flag);
                        combine.setJoinKey(joinKey);
                        combine.setSecondPart(secondPart);
                        context.write(combine.getJoinKey(), combine);
                    } else {
                        // Filtered out ...
                        System.out.println("In b.txt: no matching key in the small table, record filtered out!");
                        for (String v : valueItems) {
                            System.out.print(v + " ");
                        }
                        return;
                    }
                }
            }
        }

        private static class JReduce extends Reducer<Text, CombineEntity, Text, Text> {

            // Records of the left table within one group
            private List<Text> leftTable = new ArrayList<Text>();
            // Records of the right table within one group
            private List<Text> rightTable = new ArrayList<Text>();
            private Text secondPart = null;
            private Text output = new Text();

            // Called once per group
            @Override
            protected void reduce(Text key, Iterable<CombineEntity> values, Context context)
                    throws IOException, InterruptedException {
                leftTable.clear();   // clear data of the previous group
                rightTable.clear();  // clear data of the previous group
                /**
                 * Put records from the two files into separate lists.
                 * Note: with very large groups this can cause an OOM error.
                 **/
                for (CombineEntity ce : values) {
                    this.secondPart = new Text(ce.getSecondPart().toString());
                    if (ce.getFlag().toString().trim().equals("0")) {
                        leftTable.add(secondPart);   // left table
                    } else if (ce.getFlag().toString().trim().equals("1")) {
                        rightTable.add(secondPart);  // right table
                    }
                }
                // Cross join the two lists within this group.
                for (Text left : leftTable) {
                    for (Text right : rightTable) {
                        output.set(left + "\t" + right);  // concatenate left and right parts
                        context.write(key, output);       // emit
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Semjoin.class);
            conf.set("mapred.job.tracker", "192.168.75.130:9001");
            conf.setJar("tt.jar");
            // Share the small table via the DistributedCache.
            String bpath = "hdfs://192.168.75.130:9000/root/dist/a.txt";
            DistributedCache.addCacheFile(new URI(bpath), conf);
            Job job = new Job(conf, "aaaaa");
            job.setJarByClass(Semjoin.class);
            System.out.println("Mode: " + conf.get("mapred.job.tracker"));
            // Custom Mapper and Reducer classes
            job.setMapperClass(JMapper.class);
            job.setReducerClass(JReduce.class);
            // Map-side output types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(CombineEntity.class);
            // Reduce-side output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileSystem fs = FileSystem.get(conf);
            Path op = new Path("hdfs://192.168.75.130:9000/root/outputjoindbnew4");
            if (fs.exists(op)) {
                fs.delete(op, true);
                System.out.println("Output path already exists, deleted!");
            }
            FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.75.130:9000/root/inputjoindb"));
            FileOutputFormat.setOutputPath(job, op);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
The run log is as follows:
    Mode: 192.168.75.130:9001
    Output path already exists, deleted!
    WARN - JobClient.copyAndConfigureFiles(746) | Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    INFO - FileInputFormat.listStatus(237) | Total input paths to process : 2
    WARN - NativeCodeLoader.<clinit>(52) | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    WARN - LoadSnappy.<clinit>(46) | Snappy native library not loaded
    INFO - JobClient.monitorAndPrintJob(1380) | Running job: job_201404260312_0002
    INFO - JobClient.monitorAndPrintJob(1393) | map 0% reduce 0%
    INFO - JobClient.monitorAndPrintJob(1393) | map 50% reduce 0%
    INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 0%
    INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 33%
    INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 100%
    INFO - JobClient.monitorAndPrintJob(1448) | Job complete: job_201404260312_0002
    INFO - Counters.log(585) | Counters: 29
    INFO - Counters.log(587) | Job Counters
    INFO - Counters.log(589) | Launched reduce tasks=1
    INFO - Counters.log(589) | SLOTS_MILLIS_MAPS=12445
    INFO - Counters.log(589) | Total time spent by all reduces waiting after reserving slots (ms)=0
    INFO - Counters.log(589) | Total time spent by all maps waiting after reserving slots (ms)=0
    INFO - Counters.log(589) | Launched map tasks=2
    INFO - Counters.log(589) | Data-local map tasks=2
    INFO - Counters.log(589) | SLOTS_MILLIS_REDUCES=9801
    INFO - Counters.log(587) | File Output Format Counters
    INFO - Counters.log(589) | Bytes Written=172
    INFO - Counters.log(587) | FileSystemCounters
    INFO - Counters.log(589) | FILE_BYTES_READ=237
    INFO - Counters.log(589) | HDFS_BYTES_READ=455
    INFO - Counters.log(589) | FILE_BYTES_WRITTEN=169503
    INFO - Counters.log(589) | HDFS_BYTES_WRITTEN=172
    INFO - Counters.log(587) | File Input Format Counters
    INFO - Counters.log(589) | Bytes Read=227
    INFO - Counters.log(587) | Map-Reduce Framework
    INFO - Counters.log(589) | Map output materialized bytes=243
    INFO - Counters.log(589) | Map input records=10
    INFO - Counters.log(589) | Reduce shuffle bytes=243
    INFO - Counters.log(589) | Spilled Records=16
    INFO - Counters.log(589) | Map output bytes=215
    INFO - Counters.log(589) | Total committed heap usage (bytes)=336338944
    INFO - Counters.log(589) | CPU time spent (ms)=1770
    INFO - Counters.log(589) | Combine input records=0
    INFO - Counters.log(589) | SPLIT_RAW_BYTES=228
    INFO - Counters.log(589) | Reduce input records=8
    INFO - Counters.log(589) | Reduce input groups=4
    INFO - Counters.log(589) | Combine output records=0
    INFO - Counters.log(589) | Physical memory (bytes) snapshot=442564608
    INFO - Counters.log(589) | Reduce output records=4
    INFO - Counters.log(589) | Virtual memory (bytes) snapshot=2184306688
    INFO - Counters.log(589) | Map output records=8
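Incidentally, the WARN line near the top of the log ("Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.") appears because the driver above builds and submits its configuration directly. A minimal, hypothetical sketch of how the same job could be wrapped in a Tool-based driver is shown below; the class name SemjoinDriver and the argument layout are assumptions, and it presumes JMapper, JReduce and CombineEntity are made accessible (e.g. declared public).

    package com.semijoin;

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /** Hypothetical Tool-based driver for the semi-join job above. */
    public class SemjoinDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            // args[0] = small-table cache file, args[1] = input dir, args[2] = output dir
            DistributedCache.addCacheFile(new URI(args[0]), conf);
            Job job = new Job(conf, "semijoin");
            job.setJarByClass(SemjoinDriver.class);
            job.setMapperClass(Semjoin.JMapper.class);
            job.setReducerClass(Semjoin.JReduce.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Semjoin.CombineEntity.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path(args[1]));
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner lets generic options such as -D, -files and -libjars be parsed
            // instead of being hard-coded in main().
            System.exit(ToolRunner.run(new Configuration(), new SemjoinDriver(), args));
        }
    }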
The records filtered out on the map side can be seen in the JobTracker web UI (port 50030); a screenshot is shown below:

The output of the job is as follows:
    1 三劫散仙 13575468248 B 89 2013-02-05
    2 凤舞九天 18965235874 C 69 2013-03-09
    3 忙忙碌碌 15986854789 A 99 2013-03-05
    3 忙忙碌碌 15986854789 D 56 2013-06-07
With that, the semi-join is complete and the results are correct. Among Hadoop's join strategies, only the map-side join is particularly efficient, but the right choice still depends on the specifics of the situation.