MapReduce基础知识
hadoop版本:1.1.2
一、Mapper类的结构
Mapper类是Job.setInputFormatClass()方法的默认值,Mapper类将输入的键值对原封不动地输出。
org.apache.hadoop.mapreduce.Mapper类的结构如下:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { public class Context
extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer,
StatusReporter reporter,
InputSplit split) throws IOException, InterruptedException {
super(conf, taskid, reader, writer, committer, reporter, split);
}
} /**
* Called once at the beginning of the task.
* 在task开始之前调用一次
*
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
} /**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
* 对数据分块中的每个键值对都调用一次
*
*/
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
} /**
* Called once at the end of the task.
* 在task结束后调用一次
*
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
} /**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* 默认先调用一次setup方法,然后循环对每个键值对调用map方法,最后调用一次cleanup方法。
*
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
}
二、Reducer类的结构
Reducer类是Job.setOutputFormatClass()方法的默认值,Reducer类将输入的键值对原封不动地输出。
org.apache.hadoop.mapreduce.Reduce与Mapper类似。
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public class Context
extends ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RawKeyValueIterator input,
Counter inputKeyCounter,
Counter inputValueCounter,
RecordWriter<KEYOUT,VALUEOUT> output,
OutputCommitter committer,
StatusReporter reporter,
RawComparator<KEYIN> comparator,
Class<KEYIN> keyClass,
Class<VALUEIN> valueClass
) throws IOException, InterruptedException {
super(conf, taskid, input, inputKeyCounter, inputValueCounter,
output, committer, reporter,
comparator, keyClass, valueClass);
}
} /**
* Called once at the start of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
} /**
* This method is called once for each key. Most applications will define
* their reduce class by overriding this method. The default implementation
* is an identity function.
*/
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
} /**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
} /**
* Advanced application writers can use the
* {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
* control how the reduce task works.
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKey()) {
reduce(context.getCurrentKey(), context.getValues(), context);
}
cleanup(context);
}
}
三、hadoop提供的mapper和reducer实现
我们不一定总是要从头开始自己编写自己的Mapper和Reducer类。Hadoop提供了几种常见的Mapper和Reducer的子类,这些类可以直接用于我们的作业当中。
mapper可以在org.apache.hadoop.mapreduce.lib.map包下面找到如下子类:
- InverseMapper:A Mapper hat swaps keys and values.
- MultithreadedMapper:Multithreaded implementation for org.apache.hadoop.mapreduce.Mapper.
- TokenCounterMapper:Tokenize the input values and emit each word with a count of 1.
reducer可以在org.apache.hadoop.mapreduce.lib.reduce包下面找到如下子类:
- IntSumReducer:它输出每个键对应的整数值列表的总和。
- LongSumReducer:它输出每个键对应的长整数值列表的总和。
四、MapReduce的输入

该类的作用是将输入的数据分割成一个个的split,并将split进一步拆分成键值对作为map函数的输入。
InputFormat
describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat
of the job to:
- Validate the input-specification of the job.
- Split-up the input file(s) into logical
InputSplit
s, each of which is then assigned to an individualMapper
. - Provide the
RecordReader
implementation to be used to glean input records from the logicalInputSplit
for processing by theMapper
.
The default behavior of file-based InputFormat
s, typically sub-classes of FileInputFormat
, is to split the input into logical InputSplit
s based on the total size, in bytes, of the input files. However, the FileSystem
blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.
Clearly, logical splits based on input-size is insufficient for many applications since record boundaries are to respected. In such cases, the application has to also implement a RecordReader
on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit
to the individual task.
2、RecordReader抽象类
The record reader breaks the data into key/value pairs for input to the Mapper
.
3、hadoop提供的InputFormat
hadoop在org.apache.hadoop.mapreduce.lib.input包下提供了一些InputFormat的实现。hadoop默认使用TextInputFormat类处理输入。
4、hadoop提供的RecordReader
hadoop在org.apache.hadoop.mapreduce.lib.input包下也提供了一些RecordReader的实现。
五、MapReduce的输出

OutputFormat
describes the output-specification for a Map-Reduce job.The Map-Reduce framework relies on the OutputFormat
of the job to:
- Validate the output-specification of the job. For e.g. check that the output directory doesn't already exist.
- Provide the
RecordWriter
implementation to be used to write out the output files of the job. Output files are stored in aFileSystem
.
2、RecordWriter抽象类
RecordWriter
writes the output <key, value> pairs to an output file.
RecordWriter
implementations write the job outputs to the FileSystem
.
3、hadoop提供的OutputFormat
hadoop在org.apache.hadoop.mapreduce.lib.output包下提供了一些OutputFormat的实现。hadoop默认使用TextOutputFormat类处理输出。
4、hadoop提供的RecordWriter
在org.apache.hadoop.mapreduce.lib.input包下的OutputFormat的实现类(子类)将它们所需的RecordWriter定义为内部类,因此不存在单独实现的RecordWriter类。
六、MapReduce各阶段涉及到的类
P70-71
1、InputFormat类
2、Mapper类
3、Combiner类
4、Partitioner类
5、Reducer类
6、OutputFormat类
7、其他
七、详解Shuffle过程:http://langyu.iteye.com/blog/992916
map->shuffle->reduce
P60-64,例子P64-68
附:WEB接口的端口号配置:
mapred-default.xml
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:50030</value>
<description>
The job tracker http server address and port the server will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
hdfs-default.xml
<property>
<name>dfs.http.address</name>
<value>0.0.0.0:50070</value>
<description>
The address and the base port where the dfs namenode web ui will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
MapReduce基础知识的更多相关文章
- 小记---------Hadoop的MapReduce基础知识
MapReduce是一种分布式计算模型,主要用于搜索领域,解决海量数据的计算问题 MR由两个阶段组成:Map和Reduce,用户只需要实现map()和reduce()两个函数,即可实现分布式计算. 两 ...
- 基于C#的MongoDB数据库开发应用(1)--MongoDB数据库的基础知识和使用
在花了不少时间研究学习了MongoDB数据库的相关知识,以及利用C#对MongoDB数据库的封装.测试应用后,决定花一些时间来总结一下最近的研究心得,把这个数据库的应用单独作为一个系列来介绍,希望从各 ...
- MongoDB基础知识 02
MongoDB基础知识 02 6 数据类型 6.1 null : 表示空值或者不存在的字段 {"x":null} 6.2 布尔型 : 布尔类型只有两个值true和false {&q ...
- 大数据基础知识问答----spark篇,大数据生态圈
Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...
- 最全的spark基础知识解答
原文:http://www.36dsj.com/archives/61155 一. Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduc ...
- JAVA基础知识|lambda与stream
lambda与stream是java8中比较重要两个新特性,lambda表达式采用一种简洁的语法定义代码块,允许我们将行为传递到函数中.之前我们想将行为传递到函数中,仅有的选择是使用匿名内部类,现在我 ...
- 常见问题:MongoDB基础知识
常见问题:MongoDB基础知识 ·MongoDB支持哪些平台? ·MongoDB作为托管服务提供吗? ·集合(collection)与表(table)有何不同? ·如何创建数据库(database) ...
- Hive 这些基础知识,你忘记了吗?
Hive 其实是一个客户端,类似于navcat.plsql 这种,不同的是Hive 是读取 HDFS 上的数据,作为离线查询使用,离线就意味着速度很慢,有可能跑一个任务需要几个小时甚至更长时间都有可能 ...
- [源码解析] 深度学习分布式训练框架 Horovod (1) --- 基础知识
[源码解析] 深度学习分布式训练框架 Horovod --- (1) 基础知识 目录 [源码解析] 深度学习分布式训练框架 Horovod --- (1) 基础知识 0x00 摘要 0x01 分布式并 ...
随机推荐
- 拒绝了对对象 'XXX' (数据库 'XXX',架构 'dbo')的 SELECT 权限
2010-04-17 23:16 在IIS里测试ASP.NET网站时会遇到这样的问题(ASP.NET+SQL2005)我自己的解决方法是这样的: 1.打开SQL2005管理界面(没有安装SQLServ ...
- Eclipse工作常见问题总结
一.Eclipse常见快捷键使用 自动完成单词:Alt+/ 重命名:Shift+Alt+r(统一改变字段或方法名) 生成getter/setter方法: Shift+Alt+s,然后r 删除当前行:C ...
- CentOs中mysql的安装与配置
在linux中安装数据库首选MySQL,Mysql数据库的第一个版本就是发行在Linux系统上,其他选择还可以有postgreSQL,oracle等 在Linux上安装mysql数据库,我们可以去其官 ...
- Javascript中判断数组的正确姿势
在 Javascript 中,如何判断一个变量是否是数组? 最好的方式是用 ES5 提供的 Array.isArray() 方法(毕竟原生的才是最屌的): var a = [0, 1, 2]; con ...
- HoloLens开发手记 - Unity development overview 使用Unity开发概述
Unity Technical Preview for HoloLens最新发行版为:Beta 24,发布于 09/07/2016 开始使用Unity开发HoloLens应用之前,确保你已经安装好了必 ...
- jQuery操作单选按钮(radio)用法
1.获取选中值,四种方法都可以: $('input:radio:checked').val():$("input[type='radio']:checked").val(); $( ...
- Nodejs进阶:核心模块https 之 如何优雅的访问12306
本文摘录自<Nodejs学习笔记>,更多章节及更新,请访问 github主页地址.欢迎加群交流,群号 197339705. 模块概览 这个模块的重要性,基本不用强调了.在网络安全问题日益严 ...
- retrofit2中ssl的Trust anchor for certification path not found问题
在retrofit2中使用ssl,刚刚接触,很可能会出现如下错误. java.security.cert.CertPathValidatorException: Trust anchor for ce ...
- [BZOJ3142][HNOI2013]数列(组合)
题目:http://www.lydsy.com:808/JudgeOnline/problem.php?id=3142 分析: 考虑差值序列a1,a2,...,ak-1 那么对于一个确定的差值序列,对 ...
- 最小/大费用最大流模板(codevs1914)
void addedge(int fr,int to,int cap,int cos){ sid[cnt].fr=fr;sid[cnt].des=to;sid[cnt].cap=cap;sid[cnt ...