Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

/**
 * InputFormat describes the input-specification for a Map-Reduce job.
 *
 * The Map-Reduce framework relies on the InputFormat of the job to:
 *
 * 1. Validate the input-specification of the job.
 *
 * 2. Split up the input file(s) into logical InputSplits, each of which is
 * then assigned to an individual Mapper.
 *
 * 3. Provide the RecordReader implementation to be used to glean input
 * records from the logical InputSplit for processing by the Mapper.
 *
 * The default behavior of file-based InputFormats, typically sub-classes of
 * FileInputFormat, is to split the input into logical InputSplits based on
 * the total size, in bytes, of the input files. However, the FileSystem
 * blocksize of the input files is treated as an upper bound for input
 * splits. A lower bound on the split size can be set via
 * mapred.min.split.size.
 *
 * Clearly, logical splits based on input size are insufficient for many
 * applications, since record boundaries must be respected. In such cases,
 * the application also has to implement a RecordReader, which is
 * responsible for respecting record boundaries and presenting a
 * record-oriented view of the logical InputSplit to the individual task.
 */
public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>
   * Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.
   * </p>
   *
   * <p>
   * <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For example, a split
   * could be an <i><input-file-path, start, offset></i> tuple. The
   * InputFormat also creates the {@link RecordReader} to read the
   * {@link InputSplit}.
   *
   * @param context
   *          job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   *
   * @param split
   *          the split to be read
   * @param context
   *          the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
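The default split computation described above — the block size acts as an upper bound on split size, and splits are logical (offset, length) ranges rather than physical chunks — can be sketched without any Hadoop dependencies. The names below (SplitSketch, SimpleSplit, computeSplits) are illustrative, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of FileInputFormat-style split computation: divide the
// total file length into logical (start, length) ranges no larger than the
// block size. The last split simply takes the remainder.
public class SplitSketch {

    static final class SimpleSplit {
        public final long start;
        public final long length;
        SimpleSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Block size is the upper bound on split size, as described above.
    static List<SimpleSplit> computeSplits(long fileLength, long blockSize) {
        List<SimpleSplit> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            long size = Math.min(blockSize, fileLength - offset);
            splits.add(new SimpleSplit(offset, size));
            offset += size;
        }
        return splits;
    }
}
```

A 250-byte file with a 100-byte block size yields three logical splits: (0, 100), (100, 100), and (200, 50); a real RecordReader would then adjust the boundaries so records are not cut in half.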
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>
 * Typically, it presents a byte-oriented view on the input, and it is the
 * responsibility of the {@link RecordReader} of the job to process this and
 * present a record-oriented view.
 *
 * @see InputFormat
 * @see RecordReader
 */
public abstract class InputSplit {

  /**
   * Get the size of the split, so that the input splits can be sorted by
   * size.
   *
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be
   * local. The locations do not need to be serialized.
   *
   * @return a new array of the node names.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException,
      InterruptedException;
}
RecordReader
import java.io.Closeable;
import java.io.IOException;

/**
 * The record reader breaks the data into key/value pairs for input to the
 * {@link Mapper}.
 *
 * @param <KEYIN>
 * @param <VALUEIN>
 */
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  /**
   * Called once at initialization.
   *
   * @param split
   *          the split that defines the range of records to read
   * @param context
   *          the information about the task
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void initialize(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;

  /**
   * Read the next key/value pair.
   *
   * @return true if a key/value pair was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract boolean nextKeyValue() throws IOException,
      InterruptedException;

  /**
   * Get the current key.
   *
   * @return the current key or null if there is no current key
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract KEYIN getCurrentKey() throws IOException,
      InterruptedException;

  /**
   * Get the current value.
   *
   * @return the object that was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract VALUEIN getCurrentValue() throws IOException,
      InterruptedException;

  /**
   * The current progress of the record reader through its data.
   *
   * @return a number between 0.0 and 1.0 that is the fraction of the data
   *         read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract float getProgress() throws IOException,
      InterruptedException;

  /**
   * Close the record reader.
   */
  public abstract void close() throws IOException;
}
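The RecordReader contract above — initialize once, then advance with nextKeyValue and read the pair via getCurrentKey/getCurrentValue, with getProgress reporting the fraction consumed — can be mimicked without Hadoop. LineSketchReader below is an illustrative, Hadoop-free sketch that keys each line of text by its record number, similar in spirit to (but much simpler than) Hadoop's LineRecordReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hadoop-free sketch of the RecordReader contract: key = record number,
// value = one line of text. Progress is an approximate fraction of lines read.
public class LineSketchReader {
    private BufferedReader in;
    private long key = -1;
    private String value;
    private int totalLines;
    private int readLines;

    // Plays the role of initialize(split, context): hand the reader its data.
    public void initialize(String data) {
        in = new BufferedReader(new StringReader(data));
        totalLines = data.isEmpty() ? 0 : data.split("\n", -1).length;
    }

    // Advance to the next record; returns false once the input is exhausted.
    public boolean nextKeyValue() throws IOException {
        value = in.readLine();
        if (value == null) {
            return false;
        }
        key++;
        readLines++;
        return true;
    }

    public long getCurrentKey() { return key; }

    public String getCurrentValue() { return value; }

    // Fraction of records consumed, mirroring getProgress().
    public float getProgress() {
        return totalLines == 0 ? 1.0f : (float) readLines / totalLines;
    }

    public void close() throws IOException { in.close(); }
}
```

The framework-facing loop is always the same shape: while nextKeyValue() returns true, pass getCurrentKey()/getCurrentValue() to the Mapper.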
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>OutputFormat</code> describes the output-specification for a
 * Map-Reduce job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputFormat</code> of the
 * job to:
 * <ol>
 * <li>
 * Validate the output-specification of the job. For example, check that the
 * output directory doesn't already exist.</li>
 * <li>
 * Provide the {@link RecordWriter} implementation to be used to write out
 * the output files of the job. Output files are stored in a
 * {@link FileSystem}.</li>
 * </ol>
 *
 * @see RecordWriter
 */
public abstract class OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given task.
   *
   * @param context
   *          the information about the current task.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  public abstract RecordWriter<K, V> getRecordWriter(
      TaskAttemptContext context) throws IOException, InterruptedException;

  /**
   * Check the validity of the output-specification for the job.
   *
   * <p>
   * This validates the output specification for the job when the job is
   * submitted. Typically it checks that the output does not already exist,
   * throwing an exception when it does, so that output is not overwritten.
   * </p>
   *
   * @param context
   *          information about the job
   * @throws IOException
   *           when output should not be attempted
   */
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Get the output committer for this output format. This is responsible
   * for ensuring the output is committed correctly.
   *
   * @param context
   *          the task context
   * @return an output committer
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract OutputCommitter getOutputCommitter(
      TaskAttemptContext context) throws IOException, InterruptedException;
}
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>RecordWriter</code> writes the output <key, value> pairs to an
 * output file.
 *
 * <p>
 * <code>RecordWriter</code> implementations write the job outputs to the
 * {@link FileSystem}.
 *
 * @see OutputFormat
 */
public abstract class RecordWriter<K, V> {

  /**
   * Writes a key/value pair.
   *
   * @param key
   *          the key to write.
   * @param value
   *          the value to write.
   * @throws IOException
   */
  public abstract void write(K key, V value) throws IOException,
      InterruptedException;

  /**
   * Close this <code>RecordWriter</code> to future operations.
   *
   * @param context
   *          the context of the task
   * @throws IOException
   */
  public abstract void close(TaskAttemptContext context) throws IOException,
      InterruptedException;
}
OutputCommitter
import java.io.IOException;

/**
 * <code>OutputCommitter</code> describes the commit of task output for a
 * Map-Reduce job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputCommitter</code> of
 * the job to:
 * <ol>
 * <li>
 * Setup the job during initialization. For example, create the temporary
 * output directory for the job during the initialization of the job.</li>
 * <li>
 * Cleanup the job after job completion. For example, remove the temporary
 * output directory after job completion.</li>
 * <li>
 * Setup the task temporary output.</li>
 * <li>
 * Check whether a task needs a commit. This is to avoid the commit
 * procedure if a task does not need it.</li>
 * <li>
 * Commit the task output.</li>
 * <li>
 * Discard the task commit.</li>
 * </ol>
 *
 * @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
 * @see JobContext
 * @see TaskAttemptContext
 */
public abstract class OutputCommitter {

  /**
   * For the framework to setup the job output during initialization.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   *           if temporary output could not be created
   */
  public abstract void setupJob(JobContext jobContext) throws IOException;

  /**
   * For cleaning up the job's output after job completion.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   */
  public abstract void cleanupJob(JobContext jobContext) throws IOException;

  /**
   * Sets up output for the task.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Check whether the task needs a commit.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @return true if the task needs a commit
   * @throws IOException
   */
  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Promote the task's temporary output to the final output location.
   *
   * The task's output is moved to the job's output directory.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   *           if the commit is not successful
   */
  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Discard the task output.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;
}
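The task-commit protocol described above can be sketched with local files standing in for a distributed FileSystem: each task attempt writes into its own temporary directory, commitTask promotes those files into the job output directory, and abortTask discards them. All names here (CommitSketch, attemptDir, jobTmp) are illustrative; the real FileOutputCommitter does this with rename operations on HDFS paths:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the setupTask/commitTask/abortTask lifecycle using local files:
// speculative or retried attempts each get a private directory, and only the
// committed attempt's files ever reach the job output.
public class CommitSketch {

    // setupTask: create the attempt-scoped temporary output directory.
    static Path setupTask(Path jobTmp, String attemptId) throws IOException {
        return Files.createDirectories(jobTmp.resolve(attemptId));
    }

    private static List<Path> children(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.collect(Collectors.toList());
        }
    }

    // commitTask: move the attempt's files into the job output directory,
    // then remove the now-empty attempt directory.
    static void commitTask(Path attemptDir, Path jobOutput) throws IOException {
        Files.createDirectories(jobOutput);
        for (Path p : children(attemptDir)) {
            Files.move(p, jobOutput.resolve(p.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(attemptDir);
    }

    // abortTask: discard the attempt's temporary output entirely.
    static void abortTask(Path attemptDir) throws IOException {
        for (Path p : children(attemptDir)) {
            Files.delete(p);
        }
        Files.delete(attemptDir);
    }
}
```

Because failed or speculative attempts only ever touch their private attempt directory, aborting them cannot corrupt the committed job output; that isolation is the whole point of the commit protocol.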