Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
* InputFormat describes the input-specification for a Map-Reduce job.
*
* The Map-Reduce framework relies on the InputFormat of the job to:
*
* Validate the input-specification of the job.
*
* Split-up the input file(s) into logical InputSplits, each of which is then
* assigned to an individual Mapper.
*
* Provide the RecordReader implementation to be used to glean input records
* from the logical InputSplit for processing by the Mapper.
*
* The default behavior of file-based InputFormats, typically sub-classes of
* FileInputFormat, is to split the input into logical InputSplits based on the
* total size, in bytes, of the input files. However, the FileSystem blocksize
* of the input files is treated as an upper bound for input splits. A lower
* bound on the split size can be set via mapred.min.split.size.
*
* Clearly, logical splits based on input size are insufficient for many
* applications, since record boundaries must be respected. In such cases, the
* application also has to implement a RecordReader, whose responsibility it is
* to respect record boundaries and present a record-oriented view of the
* logical InputSplit to the individual task.
*
*/
public abstract class InputFormat<K, V> {

/**
* Logically split the set of input files for the job.
*
* <p>
* Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.
* </p>
*
* <p>
* <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For example, a split could
* be an <i>&lt;input-file-path, start, offset&gt;</i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context
* job configuration.
* @return a list of {@link InputSplit}s for the job.
*/
public abstract List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException;

/**
* Create a record reader for a given split. The framework will call
* {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
* the split is used.
*
* @param split
* the split to be read
* @param context
* the information about the task
* @return a new record reader
* @throws IOException
* @throws InterruptedException
*/
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException;
}
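The split computation that FileInputFormat performs can be sketched in plain Java. The following is a self-contained toy model, not the real Hadoop class: it treats the block size as the upper bound on split size and a configurable minimum as the lower bound, and carves a file's byte range into (start, length) pairs. (The real FileInputFormat also applies a "slop" factor so the last split is not tiny; that detail is omitted here.)

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of FileInputFormat-style split computation (illustrative only).
// Split size is bounded above by the block size and below by minSplitSize.
class ToySplitter {
    // Each split is a (start, length) pair over the file's byte range.
    static List<long[]> getSplits(long fileLength, long blockSize, long minSplitSize) {
        long splitSize = Math.max(minSplitSize, blockSize);
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        while (start < fileLength) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] { start, length });
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte block size yields splits of 100, 100, 50 bytes.
        for (long[] s : getSplits(250, 100, 1)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that these are logical byte ranges only; no file is physically cut, which is exactly the point of the "logical split" remark in the Javadoc above.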
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
* <code>InputSplit</code> represents the data to be processed by an individual
* {@link Mapper}.
*
* <p>
* Typically, it presents a byte-oriented view of the input, and it is the
* responsibility of the {@link RecordReader} of the job to process this and
* present a record-oriented view.
*
* @see InputFormat
* @see RecordReader
*/
public abstract class InputSplit {

/**
* Get the size of the split, so that the input splits can be sorted by
* size.
*
* @return the number of bytes in the split
* @throws IOException
* @throws InterruptedException
*/
public abstract long getLength() throws IOException, InterruptedException;

/**
* Get the list of nodes by name where the data for the split would be
* local. The locations do not need to be serialized.
*
* @return a new array of the node names
* @throws IOException
* @throws InterruptedException
*/
public abstract String[] getLocations() throws IOException,
InterruptedException;
}
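A concrete InputSplit is essentially a small value object. Below is a self-contained toy modeled loosely on FileSplit (the class and field names here are illustrative, not the real org.apache.hadoop.mapreduce.lib.input.FileSplit): a logical &lt;path, start, length&gt; tuple plus the host names the scheduler uses for data locality.

```java
// Toy FileSplit-style value class (illustrative only): a logical
// <path, start, length> tuple plus the hosts holding the underlying data.
class ToyFileSplit {
    private final String path;
    private final long start;
    private final long length;
    private final String[] hosts;

    ToyFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    // Used by the framework to sort splits by size.
    long getLength() { return length; }

    // Used by the scheduler to place the map task near its data.
    String[] getLocations() { return hosts.clone(); }

    public static void main(String[] args) {
        ToyFileSplit s = new ToyFileSplit("/data/input.txt", 0, 100,
                new String[] { "node1", "node2" });
        System.out.println(s.getLength() + " bytes, " + s.getLocations().length + " preferred nodes");
    }
}
```

As the Javadoc above notes, the locations are hints only and are not serialized with the split.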
RecordReader
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
* The record reader breaks the data into key/value pairs for input to the
* {@link Mapper}.
*
* @param <KEYIN>
* @param <VALUEIN>
*/
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

/**
* Called once at initialization.
*
* @param split
* the split that defines the range of records to read
* @param context
* the information about the task
* @throws IOException
* @throws InterruptedException
*/
public abstract void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException;

/**
* Read the next key, value pair.
*
* @return true if a key/value pair was read
* @throws IOException
* @throws InterruptedException
*/
public abstract boolean nextKeyValue() throws IOException,
InterruptedException;

/**
* Get the current key.
*
* @return the current key or null if there is no current key
* @throws IOException
* @throws InterruptedException
*/
public abstract KEYIN getCurrentKey() throws IOException,
InterruptedException;

/**
* Get the current value.
*
* @return the object that was read
* @throws IOException
* @throws InterruptedException
*/
public abstract VALUEIN getCurrentValue() throws IOException,
InterruptedException;

/**
* The current progress of the record reader through its data.
*
* @return a number between 0.0 and 1.0 that is the fraction of the data
* read
* @throws IOException
* @throws InterruptedException
*/
public abstract float getProgress() throws IOException,
InterruptedException;

/**
* Close the record reader.
*/
public abstract void close() throws IOException;
}
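The nextKeyValue/getCurrentKey/getCurrentValue iteration contract is easiest to see in a working example. The following is a self-contained toy line reader over an in-memory String, in the spirit of TextInputFormat's LineRecordReader (which works over a byte range of a real file): keys are the starting offsets of each line and values are the line contents.

```java
// Toy line-oriented RecordReader over an in-memory String (illustrative only;
// the real LineRecordReader reads a byte range of a file). Keys are line
// offsets, values are line contents, mirroring TextInputFormat's contract.
class ToyLineReader {
    private final String data;
    private int pos = 0;
    private Long currentKey;
    private String currentValue;

    ToyLineReader(String data) { this.data = data; }

    // Advance to the next record; return false when the input is exhausted.
    boolean nextKeyValue() {
        if (pos >= data.length()) return false;
        int nl = data.indexOf('\n', pos);
        int end = (nl == -1) ? data.length() : nl;
        currentKey = (long) pos;
        currentValue = data.substring(pos, end);
        pos = (nl == -1) ? data.length() : nl + 1;
        return true;
    }

    Long getCurrentKey() { return currentKey; }
    String getCurrentValue() { return currentValue; }

    // Fraction of the input consumed, between 0.0 and 1.0.
    float getProgress() { return data.isEmpty() ? 1.0f : (float) pos / data.length(); }

    public static void main(String[] args) {
        ToyLineReader r = new ToyLineReader("alpha\nbeta\ngamma");
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey() + "\t" + r.getCurrentValue());
        }
    }
}
```

The framework calls nextKeyValue() before each getCurrentKey()/getCurrentValue() pair, which is why the reader caches the current record rather than recomputing it.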
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>OutputFormat</code> describes the output-specification for a Map-Reduce
* job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputFormat</code> of the job
* to:
* <p>
* <ol>
* <li>
* Validate the output-specification of the job. For example, check that the
* output directory doesn't already exist.</li>
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out the
* output files of the job. Output files are stored in a {@link FileSystem}.</li>
* </ol>
*
* @see RecordWriter
*/
public abstract class OutputFormat<K, V> {

/**
* Get the {@link RecordWriter} for the given task.
*
* @param context
* the information about the current task.
* @return a {@link RecordWriter} to write the output for the job.
* @throws IOException
*/
public abstract RecordWriter<K, V> getRecordWriter(
TaskAttemptContext context) throws IOException,
InterruptedException;

/**
* Check for validity of the output-specification for the job.
*
* <p>
* This is to validate the output specification for the job when the job is
* submitted. Typically it checks that the output does not already exist,
* throwing an exception when it does, so that output is not overwritten.
* </p>
*
* @param context
* information about the job
* @throws IOException
* when output should not be attempted
*/
public abstract void checkOutputSpecs(JobContext context)
throws IOException, InterruptedException;

/**
* Get the output committer for this output format. This is responsible for
* ensuring the output is committed correctly.
*
* @param context
* the task context
* @return an output committer
* @throws IOException
* @throws InterruptedException
*/
public abstract OutputCommitter getOutputCommitter(
TaskAttemptContext context) throws IOException,
InterruptedException;
}
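The checkOutputSpecs contract ("fail at submit time rather than overwrite output") can be sketched with a toy that checks a local path. This is illustrative only: the real FileOutputFormat resolves the output directory from the job configuration and checks it against the job's FileSystem, not java.nio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Toy version of the checkOutputSpecs contract (illustrative only):
// reject the job up front if the output directory already exists.
class ToyOutputCheck {
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        // A directory name that should not exist yet, so the check passes.
        Path fresh = Paths.get("toy-output-" + System.nanoTime());
        checkOutputSpecs(fresh);
        System.out.println("output spec ok for " + fresh);
    }
}
```

Failing at submission is deliberate: it is much cheaper to reject a misconfigured job immediately than to discover the collision after the map and reduce phases have run.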
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>RecordWriter</code> writes the output &lt;key, value&gt; pairs to an
* output file.
*
* <p>
* <code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
public abstract class RecordWriter<K, V> {

/**
* Writes a key/value pair.
*
* @param key
* the key to write.
* @param value
* the value to write.
* @throws IOException
*/
public abstract void write(K key, V value) throws IOException,
InterruptedException;

/**
* Close this <code>RecordWriter</code> to future operations.
*
* @param context
* the context of the task
* @throws IOException
*/
public abstract void close(TaskAttemptContext context) throws IOException,
InterruptedException;
}
OutputCommitter
import java.io.IOException;

/**
* <code>OutputCommitter</code> describes the commit of task output for a
* Map-Reduce job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputCommitter</code> of the
* job to:
* <p>
* <ol>
* <li>
* Setup the job during initialization. For example, create the temporary output
* directory for the job during the initialization of the job.</li>
* <li>
* Cleanup the job after the job completion. For example, remove the temporary
* output directory after the job completion.</li>
* <li>
* Setup the task temporary output.</li>
* <li>
* Check whether a task needs a commit. This is to avoid the commit procedure if
* a task does not need commit.</li>
* <li>
* Commit of the task output.</li>
* <li>
* Discard the task commit.</li>
* </ol>
*
* @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
* @see JobContext
* @see TaskAttemptContext
*
*/
public abstract class OutputCommitter {

/**
* For the framework to set up the job output during initialization.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
* if temporary output could not be created
*/
public abstract void setupJob(JobContext jobContext) throws IOException;

/**
* For cleaning up the job's output after job completion.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
*/
public abstract void cleanupJob(JobContext jobContext) throws IOException;

/**
* Sets up output for the task.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
*/
public abstract void setupTask(TaskAttemptContext taskContext)
throws IOException;

/**
* Check whether the task needs a commit.
*
* @param taskContext
* @return <code>true</code> if the task needs a commit
* @throws IOException
*/
public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
throws IOException;

/**
* To promote the task's temporary output to the final output location.
*
* The task's output is moved to the job's output directory.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
* if the commit fails
*/
public abstract void commitTask(TaskAttemptContext taskContext)
throws IOException;

/**
* Discard the task output.
*
* @param taskContext
* @throws IOException
*/
public abstract void abortTask(TaskAttemptContext taskContext)
throws IOException;
}