Hadoop MapReduce Chaining in Practice: ChainReducer

Versions: CDH 5.0.0, HDFS 2.3.0, MapReduce 2.3.0, YARN 2.3.0.

Scenario: find the maximum value for each category in a data set. For example, given the following data:

data1:

A,10

A,11

A,12

A,13

B,21

B,31

B,41

B,51

data2:

A,20

A,21

A,22

A,23

B,201

B,301

B,401

B,501

The final output should be:

A,23

B,501
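As a reference point for the expected result, here is a minimal single-process sketch of the same "maximum per category" logic in plain Java, without Hadoop. The class name `MaxPerCategory` and the method `maxByCategory` are made up for this illustration and are not part of the job shown later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MaxPerCategory {

    // Returns the largest value seen for each category, sorted by category name.
    static Map<String, Integer> maxByCategory(List<String> lines, String delimiter) {
        Map<String, Integer> max = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(delimiter);
            String category = parts[0];
            int value = Integer.parseInt(parts[1].trim());
            // Keep whichever is larger: the stored value or the new one.
            max.merge(category, value, Integer::max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "A,10", "A,11", "A,12", "A,13", "B,21", "B,31", "B,41", "B,51",       // data1
                "A,20", "A,21", "A,22", "A,23", "B,201", "B,301", "B,401", "B,501"); // data2
        maxByCategory(lines, ",").forEach((k, v) -> System.out.println(k + "," + v));
    }
}
```

Run on the sample data, this prints A,23 and B,501, which is what the MapReduce job is expected to produce as well.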

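Suppose the MapReduce data flow for this logic is as follows:

Suppose group C has a lot of data and the cluster has 2 nodes, so the job is given 2 reducers and group C's data is spread evenly across both. (This is for efficiency: with only one reducer, one node would sit idle while the other runs the reducer.) If a second reduce step could then run after those reducers, it could merge the results into a single file and compare the group C values once more to find the true maximum.

First, setting the hypothetical flow above aside, there are two common ways to do this: (1) read the output from HDFS directly and merge it, or (2) create a second Job to do the merging. In terms of efficiency, the first approach is probably somewhat faster.

With that data flow in mind, I vaguely remembered that ChainReducer might be able to do this, so I tried it. To learn a bit more along the way, I also used multiple Mappers (that is, ChainMapper).

The driver code is as follows: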

package chain;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.TextInputFormat;

import org.apache.hadoop.mapred.TextOutputFormat;

import org.apache.hadoop.mapred.lib.ChainMapper;

import org.apache.hadoop.mapred.lib.ChainReducer;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class ChainDriver2 extends Configured implements Tool{

	/**

	 * ChainReducer in practice

	 * Verifies whether the output of multiple reducers can be merged

	 * Logic: find the maximum value

	 * @param args

	 */

	private String input=null;

	private String output=null;

	private String delimiter=null;

	private int reducer=1;

	public static void main(String[] args) throws Exception {

		ToolRunner.run(new Configuration(), new ChainDriver2(),args);

	}

	@Override

	public int run(String[] arg0) throws Exception {

		configureArgs(arg0);

		checkArgs();

		Configuration conf = getConf();

		conf.set("delimiter", delimiter);

		JobConf  job= new JobConf(conf,ChainDriver2.class);

		ChainMapper.addMapper(job, MaxMapper.class, LongWritable.class,

				Text.class, Text.class, IntWritable.class, true, new JobConf(false)) ;

		ChainMapper.addMapper(job, MergeMaxMapper.class, Text.class,

				IntWritable.class, Text.class, IntWritable.class, true, new JobConf(false));

		ChainReducer.setReducer(job, MaxReducer.class, Text.class, IntWritable.class,

				Text.class, IntWritable.class, true, new JobConf(false));

		ChainReducer.addMapper(job, MergeMaxMapper.class, Text.class,

				IntWritable.class, Text.class, IntWritable.class, false, new JobConf(false));

		job.setJarByClass(ChainDriver2.class);

		job.setJobName("ChainReducer test job");

        job.setMapOutputKeyClass(Text.class);

        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(IntWritable.class);

       /* job.setMapperClass(MaxMapper.class);

        job.setReducerClass(MaxReducer.class);*/

        job.setInputFormat(TextInputFormat.class);

        job.setOutputFormat(TextOutputFormat.class);

        job.setNumReduceTasks(reducer);

        FileInputFormat.addInputPath(job, new Path(input));

        FileOutputFormat.setOutputPath(job, new Path(output));

        JobClient.runJob(job);

		return 0;

	}

	/**

	 * check the args

	 */

	private void checkArgs() {

		if(input==null||"".equals(input)){

			System.out.println("no input...");

			printUsage();

			System.exit(-1);

		}

		if(output==null||"".equals(output)){

			System.out.println("no output...");

			printUsage();

			System.exit(-1);

		}

		if(delimiter==null||"".equals(delimiter)){

			System.out.println("no delimiter...");

			printUsage();

			System.exit(-1);

		}

		if(reducer==0){

			System.out.println("no reducer...");

			printUsage();

			System.exit(-1);

		}

	}

	/**

	 * configuration the args

	 * @param args

	 */

	private void configureArgs(String[] args) {

    	for(int i=0;i<args.length;i++){

    		if("-i".equals(args[i])){

    			input=args[++i];

    		}

    		if("-o".equals(args[i])){

    			output=args[++i];

    		}

    		if("-delimiter".equals(args[i])){

    			delimiter=args[++i];

    		}

    		if("-reducer".equals(args[i])){

    			try {

    				reducer=Integer.parseInt(args[++i]);

				} catch (Exception e) {

					reducer=0;

				}

    		}

    	}

	}

	public static void printUsage(){

    	System.err.println("Usage:");

    	System.err.println("-i input \t cell data path.");

    	System.err.println("-o output \t output data path.");

    	System.err.println("-delimiter  data delimiter (required).");

    	System.err.println("-reducer  reducer number , default is 1  .");

    }

}

MaxMapper:

package chain;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

public class MaxMapper extends MapReduceBase implements Mapper<LongWritable ,Text,Text,IntWritable>{

	private Logger log = LoggerFactory.getLogger(MaxMapper.class);

	private String delimiter=null;

	@Override

	public void configure(JobConf conf){

		delimiter=conf.get("delimiter");

		log.info("delimiter:"+delimiter);

		log.info("This is the begin of MaxMapper");

	}

	@Override

	public void map(LongWritable key, Text value,

			OutputCollector<Text, IntWritable> out, Reporter reporter)

			throws IOException {

		// Split the line into category and value, then emit (category, value)

		String[] values= value.toString().split(delimiter);

		log.info(values[0]+"-->"+values[1]);

		out.collect(new Text(values[0]), new IntWritable(Integer.parseInt(values[1])));

	}

	public void close(){

		log.info("This is the end of MaxMapper");

	}

}

MaxReducer:

package chain;

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

public   class MaxReducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>{

	private Logger log = LoggerFactory.getLogger(MaxReducer.class);

	@Override

	public void configure(JobConf conf){

		log.info("This is the begin of the MaxReducer");

	}

	@Override

	public void reduce(Text key, Iterator<IntWritable> values,

			OutputCollector<Text, IntWritable> out, Reporter reporter)

			throws IOException {

		// Start from the smallest int so that negative values are handled correctly

		int max=Integer.MIN_VALUE;

		while(values.hasNext()){

			int value=values.next().get();

			if(value>max){

				max=value;

			}

		}

		log.info(key+"-->"+max);

		out.collect(key, new IntWritable(max));

	}

	@Override

	public void close(){

		log.info("This is the end of the MaxReducer");

	}

}

MergeMaxMapper:

package chain;

import java.io.IOException;

//import java.util.ArrayList;

//import java.util.HashMap;

//import java.util.Map;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

public class MergeMaxMapper extends MapReduceBase implements Mapper<Text ,IntWritable,Text,IntWritable>{

	private Logger log = LoggerFactory.getLogger(MergeMaxMapper.class);

//	private Map<Text,ArrayList<IntWritable>> outMap= new HashMap<Text,ArrayList<IntWritable>>();

	@Override

	public void configure(JobConf conf){

		log.info("This is the begin of MergeMaxMapper");

	}

	@Override

	public void map(Text key, IntWritable value,

			OutputCollector<Text, IntWritable> out, Reporter reporter)

			throws IOException {

		log.info(key.toString()+"_MergeMaxMapper"+"-->"+value.get());

		out.collect(new Text(key.toString()+"_MergeMaxMapper"), value);

	}

	@Override

	public void close(){

		log.info("this is the end of MergeMaxMapper");

	}

}

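The design is as follows: the raw test data data1 and data2 first goes through MaxMapper (there are two files, so two map tasks are created), then through MergeMaxMapper, then to MaxReducer, and finally through MergeMaxMapper once more.

The program logs the data it emits, so the logs show how records flow through each map and reduce stage.

Mapper-side log (from one of the mappers):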

2014-05-14 17:23:51,307 INFO [main] chain.MaxMapper: delimiter:,

2014-05-14 17:23:51,307 INFO [main] chain.MaxMapper: This is the begin of MaxMapper

2014-05-14 17:23:51,454 INFO [main] chain.MergeMaxMapper: This is the begin of MergeMaxMapper

2014-05-14 17:23:51,471 INFO [main] chain.MaxMapper: A-->20

2014-05-14 17:23:51,476 INFO [main] chain.MergeMaxMapper: A_MergeMaxMapper-->20

2014-05-14 17:23:51,476 INFO [main] chain.MaxMapper: A-->21

2014-05-14 17:23:51,477 INFO [main] chain.MergeMaxMapper: A_MergeMaxMapper-->21

2014-05-14 17:23:51,477 INFO [main] chain.MaxMapper: A-->22

2014-05-14 17:23:51,477 INFO [main] chain.MergeMaxMapper: A_MergeMaxMapper-->22

2014-05-14 17:23:51,477 INFO [main] chain.MaxMapper: A-->23

2014-05-14 17:23:51,477 INFO [main] chain.MergeMaxMapper: A_MergeMaxMapper-->23

2014-05-14 17:23:51,477 INFO [main] chain.MaxMapper: B-->201

2014-05-14 17:23:51,477 INFO [main] chain.MergeMaxMapper: B_MergeMaxMapper-->201

2014-05-14 17:23:51,477 INFO [main] chain.MaxMapper: B-->301

2014-05-14 17:23:51,477 INFO [main] chain.MergeMaxMapper: B_MergeMaxMapper-->301

2014-05-14 17:23:51,478 INFO [main] chain.MaxMapper: B-->401

2014-05-14 17:23:51,478 INFO [main] chain.MergeMaxMapper: B_MergeMaxMapper-->401

2014-05-14 17:23:51,478 INFO [main] chain.MaxMapper: B-->501

2014-05-14 17:23:51,478 INFO [main] chain.MergeMaxMapper: B_MergeMaxMapper-->501

2014-05-14 17:23:51,481 INFO [main] chain.MaxMapper: This is the end of MaxMapper

2014-05-14 17:23:51,481 INFO [main] chain.MergeMaxMapper: this is the end of MergeMaxMapper

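The log shows the processing order for mappers added through ChainMapper. First the first mapper is initialized (its configure method is called), then the second mapper is initialized (configure again). Then the map phase starts: each record is processed by mapper1 and then immediately by mapper2. Finally comes shutdown, in the same order: mapper1 is closed first (its close method is called), then mapper2.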
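The ordering seen in the mapper-side log can be replayed with a small plain-Java model. This is only a simplified sketch of the observed behaviour, not Hadoop's ChainMapper implementation; `ChainOrderSketch` and `Stage` are hypothetical names, and each "mapper" is reduced to a function on a log-style string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class ChainOrderSketch {

    // Hypothetical stand-in for one chained mapper: a name plus a per-record function.
    static final class Stage {
        final String name;
        final UnaryOperator<String> fn;

        Stage(String name, UnaryOperator<String> fn) {
            this.name = name;
            this.fn = fn;
        }
    }

    // Replays the order seen in the log: configure all, map record by record, close all.
    static List<String> run(List<Stage> stages, List<String> records) {
        List<String> events = new ArrayList<>();
        for (Stage s : stages) {
            events.add("configure " + s.name);
        }
        for (String record : records) {
            String current = record;
            // Each record passes through every stage before the next record is read.
            for (Stage s : stages) {
                current = s.fn.apply(current);
                events.add(s.name + ": " + current);
            }
        }
        for (Stage s : stages) {
            events.add("close " + s.name);
        }
        return events;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(
                new Stage("MaxMapper", r -> r),
                new Stage("MergeMaxMapper", r -> r.replace("-->", "_MergeMaxMapper-->")));
        run(stages, Arrays.asList("A-->20", "A-->21")).forEach(System.out::println);
    }
}
```

The point of the model is the interleaving: the two stages alternate per record, rather than the first stage finishing all records before the second one starts. It covers the map side only; the reducer-side log shows a different configure and close order.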

Reducer-side log (from one of the reducers):

2014-05-14 17:24:10,171 INFO [main] chain.MergeMaxMapper: This is the begin of MergeMaxMapper

2014-05-14 17:24:10,311 INFO [main] chain.MaxReducer: This is the begin of the MaxReducer

2014-05-14 17:24:10,671 INFO [main] chain.MaxReducer: B_MergeMaxMapper-->501

2014-05-14 17:24:10,672 INFO [main] chain.MergeMaxMapper: B_MergeMaxMapper_MergeMaxMapper-->501

2014-05-14 17:24:10,673 INFO [main] chain.MergeMaxMapper: this is the end of MergeMaxMapper

2014-05-14 17:24:10,673 INFO [main] chain.MaxReducer: This is the end of the MaxReducer

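The log shows the processing order when a mapper is added through ChainReducer: the Mapper that follows the Reducer is initialized first, then the Reducer (see the configure calls). Next the reducer runs, and its output is handed to the mapper. At the end the Mapper is closed first, then the Reducer.

Note also that there are two instances of the mapper that follows the reducer: there are as many of these mappers as there are reducers.

The experiment gives the ChainReducer processing flow described above. ChainReducer also has no addReducer method, so no further reducer can be added, which means the MapReduce data flow proposed at the start cannot be implemented this way.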

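Finally, the MapReduce data flow proposed at the start was itself wrong: at the reducer stage, group C's data is not split across two reducers, because identical keys are always sent to the same reducer. I ran an experiment to confirm this: running the program above on nearly 90 MB of data containing a single group, A, produced 2 mappers and 2 reducers (the reducer count was set explicitly), but one reducer received no data for group A at all; only the other reducer had data. The experiment was not strictly necessary, since the books generally state that identical keys go to the same reducer. If that is the case, though, processing data this skewed is probably not very efficient.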
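Why one key always lands on one reducer follows from how keys are assigned to partitions. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below applies that formula in plain Java; it uses `String.hashCode()` as a stand-in for the `Text` key's hash, so the concrete partition numbers are illustrative and may differ from those in the real job. `PartitionSketch` is a hypothetical name.

```java
final class PartitionSketch {

    // Same formula as the default hash partitioner; String.hashCode() is a stand-in
    // for the hash of the real key type.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;
        // Every record with key "A" maps to the same partition, no matter how many
        // records there are or which mapper emitted them.
        for (String key : new String[] {"A", "A", "A", "B"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Because the partition depends only on the key and the reducer count, a data set with a single key leaves all other reducers empty, which matches the 90 MB experiment.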

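Returning to the original scenario and question: if identical keys only ever enter one reducer, then the final 2 output files (2 reducers produce 2 files) cannot contain conflicting keys. Later steps can therefore simply read the multiple files as if they were one.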
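Reading several reducer output files as if they were one can be sketched as follows. This is a plain-Java illustration that uses the local filesystem in place of HDFS; the class `PartFileReader`, the temp directory and the file contents in `demo()` are assumptions made for the example, and only the `part-*` naming pattern mirrors what reducers write.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PartFileReader {

    // Reads every part-* file in the directory and returns all lines in file-name order.
    static List<String> readAllParts(Path outputDir) {
        try {
            List<Path> parts = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
                for (Path p : stream) {
                    parts.add(p);
                }
            }
            Collections.sort(parts);
            List<String> lines = new ArrayList<>();
            for (Path p : parts) {
                lines.addAll(Files.readAllLines(p));
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway output directory with two hypothetical part files and reads it back.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("chain-output");
            Files.write(dir.resolve("part-00000"), Arrays.asList("A\t23"));
            Files.write(dir.resolve("part-00001"), Arrays.asList("B\t501"));
            return readAllParts(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

No per-key merging is needed here: since each key lives in exactly one part file, concatenating the files already gives one record per key.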

This misunderstanding most likely came from not being clear on how MapReduce works.

Share, grow, be happy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990
