Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

/**
 * InputFormat describes the input-specification for a Map-Reduce job.
 *
 * The Map-Reduce framework relies on the InputFormat of the job to:
 *
 * 1. Validate the input-specification of the job.
 *
 * 2. Split up the input file(s) into logical InputSplits, each of which is
 * then assigned to an individual Mapper.
 *
 * 3. Provide the RecordReader implementation to be used to glean input
 * records from the logical InputSplit for processing by the Mapper.
 *
 * The default behavior of file-based InputFormats, typically sub-classes of
 * FileInputFormat, is to split the input into logical InputSplits based on
 * the total size, in bytes, of the input files. However, the FileSystem
 * blocksize of the input files is treated as an upper bound for input
 * splits. A lower bound on the split size can be set via
 * mapred.min.split.size.
 *
 * Clearly, logical splits based on input size are insufficient for many
 * applications, since record boundaries must be respected. In such cases,
 * the application also has to implement a RecordReader, which is
 * responsible for respecting record boundaries and presenting a
 * record-oriented view of the logical InputSplit to the individual task.
 */
public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>
   * Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.
   * </p>
   *
   * <p>
   * <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For example, a split
   * could be an <i><input-file-path, start, offset></i> tuple. The
   * InputFormat also creates the {@link RecordReader} to read the
   * {@link InputSplit}.
   *
   * @param context
   *          job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   *
   * @param split
   *          the split to be read
   * @param context
   *          the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
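The default split computation described above — the block size acts as an upper bound on split size, and splits are logical (offset, length) ranges rather than physical chunks — can be sketched without any Hadoop dependencies. The names below (SplitSketch, SimpleSplit, computeSplits) are illustrative, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of FileInputFormat-style split computation: divide the
// total file length into logical (start, length) ranges no larger than the
// block size. The last split simply takes the remainder.
public class SplitSketch {

    static final class SimpleSplit {
        public final long start;
        public final long length;
        SimpleSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Block size is the upper bound on split size, as described above.
    static List<SimpleSplit> computeSplits(long fileLength, long blockSize) {
        List<SimpleSplit> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            long size = Math.min(blockSize, fileLength - offset);
            splits.add(new SimpleSplit(offset, size));
            offset += size;
        }
        return splits;
    }
}
```

A 250-byte file with a 100-byte block size yields three logical splits: (0, 100), (100, 100), and (200, 50); a real RecordReader would then adjust the boundaries so records are not cut in half.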
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>
 * Typically, it presents a byte-oriented view on the input, and it is the
 * responsibility of the {@link RecordReader} of the job to process this and
 * present a record-oriented view.
 *
 * @see InputFormat
 * @see RecordReader
 */
public abstract class InputSplit {

  /**
   * Get the size of the split, so that the input splits can be sorted by
   * size.
   *
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be
   * local. The locations do not need to be serialized.
   *
   * @return a new array of the node names.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException,
      InterruptedException;
}
RecordReader
import java.io.Closeable;
import java.io.IOException;

/**
 * The record reader breaks the data into key/value pairs for input to the
 * {@link Mapper}.
 *
 * @param <KEYIN>
 * @param <VALUEIN>
 */
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  /**
   * Called once at initialization.
   *
   * @param split
   *          the split that defines the range of records to read
   * @param context
   *          the information about the task
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void initialize(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;

  /**
   * Read the next key/value pair.
   *
   * @return true if a key/value pair was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract boolean nextKeyValue() throws IOException,
      InterruptedException;

  /**
   * Get the current key.
   *
   * @return the current key or null if there is no current key
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract KEYIN getCurrentKey() throws IOException,
      InterruptedException;

  /**
   * Get the current value.
   *
   * @return the object that was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract VALUEIN getCurrentValue() throws IOException,
      InterruptedException;

  /**
   * The current progress of the record reader through its data.
   *
   * @return a number between 0.0 and 1.0 that is the fraction of the data
   *         read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract float getProgress() throws IOException,
      InterruptedException;

  /**
   * Close the record reader.
   */
  public abstract void close() throws IOException;
}
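The RecordReader contract above — initialize once, then advance with nextKeyValue and read the pair via getCurrentKey/getCurrentValue, with getProgress reporting the fraction consumed — can be mimicked without Hadoop. LineSketchReader below is an illustrative, Hadoop-free sketch that keys each line of text by its record number, similar in spirit to (but much simpler than) Hadoop's LineRecordReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hadoop-free sketch of the RecordReader contract: key = record number,
// value = one line of text. Progress is an approximate fraction of lines read.
public class LineSketchReader {
    private BufferedReader in;
    private long key = -1;
    private String value;
    private int totalLines;
    private int readLines;

    // Plays the role of initialize(split, context): hand the reader its data.
    public void initialize(String data) {
        in = new BufferedReader(new StringReader(data));
        totalLines = data.isEmpty() ? 0 : data.split("\n", -1).length;
    }

    // Advance to the next record; returns false once the input is exhausted.
    public boolean nextKeyValue() throws IOException {
        value = in.readLine();
        if (value == null) {
            return false;
        }
        key++;
        readLines++;
        return true;
    }

    public long getCurrentKey() { return key; }

    public String getCurrentValue() { return value; }

    // Fraction of records consumed, mirroring getProgress().
    public float getProgress() {
        return totalLines == 0 ? 1.0f : (float) readLines / totalLines;
    }

    public void close() throws IOException { in.close(); }
}
```

The framework-facing loop is always the same shape: while nextKeyValue() returns true, pass getCurrentKey()/getCurrentValue() to the Mapper.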
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>OutputFormat</code> describes the output-specification for a
 * Map-Reduce job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputFormat</code> of the
 * job to:
 * <ol>
 * <li>
 * Validate the output-specification of the job. For example, check that the
 * output directory doesn't already exist.</li>
 * <li>
 * Provide the {@link RecordWriter} implementation to be used to write out
 * the output files of the job. Output files are stored in a
 * {@link FileSystem}.</li>
 * </ol>
 *
 * @see RecordWriter
 */
public abstract class OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given task.
   *
   * @param context
   *          the information about the current task.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  public abstract RecordWriter<K, V> getRecordWriter(
      TaskAttemptContext context) throws IOException, InterruptedException;

  /**
   * Check the validity of the output-specification for the job.
   *
   * <p>
   * This validates the output specification for the job when the job is
   * submitted. Typically it checks that the output does not already exist,
   * throwing an exception when it does, so that output is not overwritten.
   * </p>
   *
   * @param context
   *          information about the job
   * @throws IOException
   *           when output should not be attempted
   */
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Get the output committer for this output format. This is responsible
   * for ensuring the output is committed correctly.
   *
   * @param context
   *          the task context
   * @return an output committer
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract OutputCommitter getOutputCommitter(
      TaskAttemptContext context) throws IOException, InterruptedException;
}
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>RecordWriter</code> writes the output <key, value> pairs to an
 * output file.
 *
 * <p>
 * <code>RecordWriter</code> implementations write the job outputs to the
 * {@link FileSystem}.
 *
 * @see OutputFormat
 */
public abstract class RecordWriter<K, V> {

  /**
   * Writes a key/value pair.
   *
   * @param key
   *          the key to write.
   * @param value
   *          the value to write.
   * @throws IOException
   */
  public abstract void write(K key, V value) throws IOException,
      InterruptedException;

  /**
   * Close this <code>RecordWriter</code> to future operations.
   *
   * @param context
   *          the context of the task
   * @throws IOException
   */
  public abstract void close(TaskAttemptContext context) throws IOException,
      InterruptedException;
}
OutputCommitter
import java.io.IOException;

/**
 * <code>OutputCommitter</code> describes the commit of task output for a
 * Map-Reduce job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputCommitter</code> of
 * the job to:
 * <ol>
 * <li>
 * Setup the job during initialization. For example, create the temporary
 * output directory for the job during the initialization of the job.</li>
 * <li>
 * Cleanup the job after job completion. For example, remove the temporary
 * output directory after job completion.</li>
 * <li>
 * Setup the task temporary output.</li>
 * <li>
 * Check whether a task needs a commit. This is to avoid the commit
 * procedure if a task does not need it.</li>
 * <li>
 * Commit the task output.</li>
 * <li>
 * Discard the task commit.</li>
 * </ol>
 *
 * @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
 * @see JobContext
 * @see TaskAttemptContext
 */
public abstract class OutputCommitter {

  /**
   * For the framework to setup the job output during initialization.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   *           if temporary output could not be created
   */
  public abstract void setupJob(JobContext jobContext) throws IOException;

  /**
   * For cleaning up the job's output after job completion.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   */
  public abstract void cleanupJob(JobContext jobContext) throws IOException;

  /**
   * Sets up output for the task.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Check whether the task needs a commit.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @return true if the task needs a commit
   * @throws IOException
   */
  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Promote the task's temporary output to the final output location.
   *
   * The task's output is moved to the job's output directory.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   *           if the commit is not successful
   */
  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Discard the task output.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;
}
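The task-commit protocol described above can be sketched with local files standing in for a distributed FileSystem: each task attempt writes into its own temporary directory, commitTask promotes those files into the job output directory, and abortTask discards them. All names here (CommitSketch, attemptDir, jobTmp) are illustrative; the real FileOutputCommitter does this with rename operations on HDFS paths:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the setupTask/commitTask/abortTask lifecycle using local files:
// speculative or retried attempts each get a private directory, and only the
// committed attempt's files ever reach the job output.
public class CommitSketch {

    // setupTask: create the attempt-scoped temporary output directory.
    static Path setupTask(Path jobTmp, String attemptId) throws IOException {
        return Files.createDirectories(jobTmp.resolve(attemptId));
    }

    private static List<Path> children(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.collect(Collectors.toList());
        }
    }

    // commitTask: move the attempt's files into the job output directory,
    // then remove the now-empty attempt directory.
    static void commitTask(Path attemptDir, Path jobOutput) throws IOException {
        Files.createDirectories(jobOutput);
        for (Path p : children(attemptDir)) {
            Files.move(p, jobOutput.resolve(p.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(attemptDir);
    }

    // abortTask: discard the attempt's temporary output entirely.
    static void abortTask(Path attemptDir) throws IOException {
        for (Path p : children(attemptDir)) {
            Files.delete(p);
        }
        Files.delete(attemptDir);
    }
}
```

Because failed or speculative attempts only ever touch their private attempt directory, aborting them cannot corrupt the committed job output; that isolation is the whole point of the commit protocol.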