Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

/**
* InputFormat describes the input-specification for a Map-Reduce job.
*
* The Map-Reduce framework relies on the InputFormat of the job to:
*
* Validate the input-specification of the job.
*
* Split-up the input file(s) into logical InputSplits, each of which is then
* assigned to an individual Mapper.
*
* Provide the RecordReader implementation to be used to glean input records
* from the logical InputSplit for processing by the Mapper.
*
* The default behavior of file-based InputFormats, typically sub-classes of
* FileInputFormat, is to split the input into logical InputSplits based on the
* total size, in bytes, of the input files. However, the FileSystem blocksize
* of the input files is treated as an upper bound for input splits. A lower
* bound on the split size can be set via mapred.min.split.size.
*
* Clearly, logical splits based on input size are insufficient for many
* applications, since record boundaries must be respected. In such cases, the
* application also has to implement a RecordReader, which is responsible for
* respecting record boundaries and presenting a record-oriented view of the
* logical InputSplit to the individual task.
*
*/
public abstract class InputFormat<K, V> {

/**
* Logically split the set of input files for the job.
*
* <p>
* Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.
* </p>
*
* <p>
* <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For example, a split could
* be an <i><input-file-path, start, offset></i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context
* job configuration.
* @return a list of {@link InputSplit}s for the job.
*/
public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

/**
* Create a record reader for a given split. The framework will call
* {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
* the split is used.
*
* @param split
* the split to be read
* @param context
* the information about the task
* @return a new record reader
* @throws IOException
* @throws InterruptedException
*/
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
    TaskAttemptContext context) throws IOException, InterruptedException;
}
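The sizing rule described in the class comment (the FileSystem block size caps a split from above, the configured minimum bounds it from below) boils down to a one-line max/min clamp. The sketch below is standalone illustration code, not Hadoop's FileInputFormat itself, though it mirrors the same formula:

```java
// Standalone sketch of the split-size rule: the block size is the upper
// bound on a split, the configured minimum split size is the lower bound.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With default bounds, the split size equals the block size.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // Raising the minimum above the block size forces larger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```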
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
* <code>InputSplit</code> represents the data to be processed by an individual
* {@link Mapper}.
*
* <p>
* Typically, it presents a byte-oriented view on the input and is the
* responsibility of {@link RecordReader} of the job to process this and present
* a record-oriented view.
*
* @see InputFormat
* @see RecordReader
*/
public abstract class InputSplit {

/**
* Get the size of the split, so that the input splits can be sorted by
* size.
*
* @return the number of bytes in the split
* @throws IOException
* @throws InterruptedException
*/
public abstract long getLength() throws IOException, InterruptedException;

/**
* Get the list of nodes by name where the data for the split would be
* local. The locations do not need to be serialized.
*
* @return a new array of the node names.
* @throws IOException
* @throws InterruptedException
*/
public abstract String[] getLocations() throws IOException,
    InterruptedException;
}
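To make the "logical split" idea concrete, here is a standalone sketch with a hypothetical `Split` class (no Hadoop dependency) that divides a file length into `<path, start, length>` descriptors without touching the file's bytes:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of logical splitting: each split is only a
// <path, start, length> descriptor; the underlying file is never cut.
public class LogicalSplitDemo {
    static final class Split {
        final String path;
        final long start;
        final long length;
        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
        @Override public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    // Divide a file of fileLen bytes into splits of at most splitSize bytes.
    static List<Split> split(String path, long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            splits.add(new Split(path, start, Math.min(splitSize, fileLen - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte "file" with a 128-byte split size yields three splits,
        // the last one covering the 44-byte remainder.
        for (Split s : split("/data/input.txt", 300, 128)) {
            System.out.println(s);
        }
    }
}
```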
RecordReader
import java.io.Closeable;
import java.io.IOException;

/**
* The record reader breaks the data into key/value pairs for input to the
* {@link Mapper}.
*
* @param <KEYIN>
* @param <VALUEIN>
*/
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

/**
* Called once at initialization.
*
* @param split
* the split that defines the range of records to read
* @param context
* the information about the task
* @throws IOException
* @throws InterruptedException
*/
public abstract void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException;

/**
* Read the next key, value pair.
*
* @return true if a key/value pair was read
* @throws IOException
* @throws InterruptedException
*/
public abstract boolean nextKeyValue() throws IOException,
    InterruptedException;

/**
 * Get the current key.
*
* @return the current key or null if there is no current key
* @throws IOException
* @throws InterruptedException
*/
public abstract KEYIN getCurrentKey() throws IOException,
    InterruptedException;

/**
* Get the current value.
*
* @return the object that was read
* @throws IOException
* @throws InterruptedException
*/
public abstract VALUEIN getCurrentValue() throws IOException,
    InterruptedException;

/**
* The current progress of the record reader through its data.
*
* @return a number between 0.0 and 1.0 that is the fraction of the data
* read
* @throws IOException
* @throws InterruptedException
*/
public abstract float getProgress() throws IOException,
    InterruptedException;

/**
* Close the record reader.
*/
public abstract void close() throws IOException;
}
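The calling contract above (initialize once, then loop on nextKeyValue() and read the current key and value after each successful call) can be illustrated with a small standalone reader over an in-memory string that keys each line by its byte offset. This is illustration code only, loosely modeled on a line reader, not Hadoop's actual LineRecordReader:

```java
// Standalone illustration of the RecordReader call sequence: the caller
// invokes nextKeyValue() until it returns false, reading the current key
// (byte offset) and value (line text) after each successful call.
public class LineReaderDemo {
    private final String data;
    private int pos = 0;
    private Long key;
    private String value;

    LineReaderDemo(String data) { this.data = data; } // stands in for initialize()

    boolean nextKeyValue() {
        if (pos >= data.length()) return false;
        int nl = data.indexOf('\n', pos);
        int end = (nl == -1) ? data.length() : nl;
        key = (long) pos;
        value = data.substring(pos, end);
        pos = (nl == -1) ? data.length() : nl + 1;
        return true;
    }

    Long getCurrentKey() { return key; }
    String getCurrentValue() { return value; }
    float getProgress() { return data.isEmpty() ? 1.0f : (float) pos / data.length(); }

    public static void main(String[] args) {
        LineReaderDemo r = new LineReaderDemo("alpha\nbeta\ngamma");
        while (r.nextKeyValue()) {
            // prints the byte offsets 0, 6, 11 with their lines
            System.out.println(r.getCurrentKey() + "\t" + r.getCurrentValue());
        }
    }
}
```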
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>OutputFormat</code> describes the output-specification for a Map-Reduce
* job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputFormat</code> of the job
* to:
* <p>
* <ol>
* <li>
* Validate the output-specification of the job. For example, check that the
* output directory doesn't already exist.</li>
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out the
* output files of the job. Output files are stored in a {@link FileSystem}.</li>
* </ol>
*
* @see RecordWriter
*/
public abstract class OutputFormat<K, V> {

/**
* Get the {@link RecordWriter} for the given task.
*
* @param context
* the information about the current task.
* @return a {@link RecordWriter} to write the output for the job.
* @throws IOException
*/
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException;

/**
* Check for validity of the output-specification for the job.
*
* <p>
* This is called to validate the output specification for the job when the job
* is submitted. Typically it checks that the output does not already exist,
* throwing an exception when it does, so that output is not overwritten.
* </p>
*
* @param context
* information about the job
* @throws IOException
* when output should not be attempted
*/
public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

/**
* Get the output committer for this output format. This is responsible for
* ensuring the output is committed correctly.
*
* @param context
* the task context
* @return an output committer
* @throws IOException
* @throws InterruptedException
*/
public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException;
}
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>RecordWriter</code> writes the output <key, value> pairs to an
* output file.
*
* <p>
* <code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
public abstract class RecordWriter<K, V> {

/**
* Writes a key/value pair.
*
* @param key
* the key to write.
* @param value
* the value to write.
* @throws IOException
*/
public abstract void write(K key, V value) throws IOException,
    InterruptedException;

/**
* Close this <code>RecordWriter</code> to future operations.
*
* @param context
* the context of the task
* @throws IOException
*/
public abstract void close(TaskAttemptContext context) throws IOException,
    InterruptedException;
}
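A minimal standalone sketch of a text-style RecordWriter that emits one "key TAB value" line per write() call, loosely in the spirit of TextOutputFormat's line writer. It is generic over any java.io.Writer and carries no Hadoop dependency:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Standalone sketch of a text RecordWriter: each write() emits one
// "key<TAB>value<NEWLINE>" record; close() releases the underlying stream.
public class TextRecordWriterDemo<K, V> {
    private final Writer out;

    TextRecordWriterDemo(Writer out) { this.out = out; }

    void write(K key, V value) throws IOException {
        out.write(key + "\t" + value + "\n");
    }

    void close() throws IOException { out.close(); }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        TextRecordWriterDemo<String, Integer> writer = new TextRecordWriterDemo<>(sink);
        writer.write("apple", 3);
        writer.write("banana", 7);
        writer.close();
        System.out.print(sink); // apple\t3 and banana\t7, one per line
    }
}
```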
OutputCommitter
import java.io.IOException;

/**
* <code>OutputCommitter</code> describes the commit of task output for a
* Map-Reduce job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputCommitter</code> of the
* job to:
* <p>
* <ol>
* <li>
* Setup the job during initialization. For example, create the temporary output
* directory for the job during the initialization of the job.</li>
* <li>
* Cleanup the job after the job completion. For example, remove the temporary
* output directory after the job completion.</li>
* <li>
* Setup the task temporary output.</li>
* <li>
* Check whether a task needs a commit. This is to avoid the commit procedure if
* a task does not need commit.</li>
* <li>
* Commit of the task output.</li>
* <li>
* Discard the task commit.</li>
* </ol>
*
* @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
* @see JobContext
* @see TaskAttemptContext
*
*/
public abstract class OutputCommitter {

/**
 * For the framework to set up the job output during initialization.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
* if temporary output could not be created
*/
public abstract void setupJob(JobContext jobContext) throws IOException;

/**
 * For cleaning up the job's output after job completion.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
*/
public abstract void cleanupJob(JobContext jobContext) throws IOException;

/**
* Sets up output for the task.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
*/
public abstract void setupTask(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Check whether the task needs a commit.
*
* @param taskContext
* @return true if the task needs a commit of its output
* @throws IOException
*/
public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Promote the task's temporary output to the final output location.
*
* The task's output is moved to the job's output directory.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
* if commit is not successful
*/
public abstract void commitTask(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Discard the task output.
*
* @param taskContext
* @throws IOException
*/
public abstract void abortTask(TaskAttemptContext taskContext)
    throws IOException;
}