Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

/**
* InputFormat describes the input-specification for a Map-Reduce job.
*
* The Map-Reduce framework relies on the InputFormat of the job to:
*
* Validate the input-specification of the job.
*
* Split-up the input file(s) into logical InputSplits, each of which is then
* assigned to an individual Mapper.
*
* Provide the RecordReader implementation to be used to glean input records
* from the logical InputSplit for processing by the Mapper.
*
* The default behavior of file-based InputFormats, typically sub-classes of
* FileInputFormat, is to split the input into logical InputSplits based on the
* total size, in bytes, of the input files. However, the FileSystem blocksize
* of the input files is treated as an upper bound for input splits. A lower
* bound on the split size can be set via mapred.min.split.size.
*
* Clearly, logical splits based on input size are insufficient for many
* applications, since record boundaries must be respected. In such cases, the
* application also has to implement a RecordReader, which is responsible for
* respecting record boundaries and presenting a record-oriented view of the
* logical InputSplit to the individual task.
*
*/
public abstract class InputFormat<K, V> {

/**
* Logically split the set of input files for the job.
*
* <p>
* Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.
* </p>
*
* <p>
* <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For example, a split could
* be an <i><input-file-path, start, offset></i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context
* job configuration.
* @return a list of {@link InputSplit}s for the job.
*/
public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

/**
* Create a record reader for a given split. The framework will call
* {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
* the split is used.
*
* @param split
* the split to be read
* @param context
* the information about the task
* @return a new record reader
* @throws IOException
* @throws InterruptedException
*/
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
    TaskAttemptContext context) throws IOException, InterruptedException;
}
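The sizing rule described in the class comment (the FileSystem block size caps a split from above, the configured minimum bounds it from below) boils down to a one-line max/min clamp. The sketch below is standalone illustration code, not Hadoop's FileInputFormat itself, though it mirrors the same formula:

```java
// Standalone sketch of the split-size rule: the block size is the upper
// bound on a split, the configured minimum split size is the lower bound.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With default bounds, the split size equals the block size.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // Raising the minimum above the block size forces larger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```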
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
* <code>InputSplit</code> represents the data to be processed by an individual
* {@link Mapper}.
*
* <p>
* Typically, it presents a byte-oriented view on the input and is the
* responsibility of {@link RecordReader} of the job to process this and present
* a record-oriented view.
*
* @see InputFormat
* @see RecordReader
*/
public abstract class InputSplit {

/**
* Get the size of the split, so that the input splits can be sorted by
* size.
*
* @return the number of bytes in the split
* @throws IOException
* @throws InterruptedException
*/
public abstract long getLength() throws IOException, InterruptedException;

/**
* Get the list of nodes by name where the data for the split would be
* local. The locations do not need to be serialized.
*
* @return a new array of the node names.
* @throws IOException
* @throws InterruptedException
*/
public abstract String[] getLocations() throws IOException,
    InterruptedException;
}
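To make the "logical split" idea concrete, here is a standalone sketch with a hypothetical `Split` class (no Hadoop dependency) that divides a file length into `<path, start, length>` descriptors without touching the file's bytes:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of logical splitting: each split is only a
// <path, start, length> descriptor; the underlying file is never cut.
public class LogicalSplitDemo {
    static final class Split {
        final String path;
        final long start;
        final long length;
        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
        @Override public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    // Divide a file of fileLen bytes into splits of at most splitSize bytes.
    static List<Split> split(String path, long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            splits.add(new Split(path, start, Math.min(splitSize, fileLen - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte "file" with a 128-byte split size yields three splits,
        // the last one covering the 44-byte remainder.
        for (Split s : split("/data/input.txt", 300, 128)) {
            System.out.println(s);
        }
    }
}
```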
RecordReader
import java.io.Closeable;
import java.io.IOException;

/**
* The record reader breaks the data into key/value pairs for input to the
* {@link Mapper}.
*
* @param <KEYIN>
* @param <VALUEIN>
*/
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

/**
* Called once at initialization.
*
* @param split
* the split that defines the range of records to read
* @param context
* the information about the task
* @throws IOException
* @throws InterruptedException
*/
public abstract void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException;

/**
* Read the next key, value pair.
*
* @return true if a key/value pair was read
* @throws IOException
* @throws InterruptedException
*/
public abstract boolean nextKeyValue() throws IOException,
    InterruptedException;

/**
 * Get the current key.
*
* @return the current key or null if there is no current key
* @throws IOException
* @throws InterruptedException
*/
public abstract KEYIN getCurrentKey() throws IOException,
    InterruptedException;

/**
* Get the current value.
*
* @return the object that was read
* @throws IOException
* @throws InterruptedException
*/
public abstract VALUEIN getCurrentValue() throws IOException,
    InterruptedException;

/**
* The current progress of the record reader through its data.
*
* @return a number between 0.0 and 1.0 that is the fraction of the data
* read
* @throws IOException
* @throws InterruptedException
*/
public abstract float getProgress() throws IOException,
    InterruptedException;

/**
* Close the record reader.
*/
public abstract void close() throws IOException;
}
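The calling contract above (initialize once, then loop on nextKeyValue() and read the current key and value after each successful call) can be illustrated with a small standalone reader over an in-memory string that keys each line by its byte offset. This is illustration code only, loosely modeled on a line reader, not Hadoop's actual LineRecordReader:

```java
// Standalone illustration of the RecordReader call sequence: the caller
// invokes nextKeyValue() until it returns false, reading the current key
// (byte offset) and value (line text) after each successful call.
public class LineReaderDemo {
    private final String data;
    private int pos = 0;
    private Long key;
    private String value;

    LineReaderDemo(String data) { this.data = data; } // stands in for initialize()

    boolean nextKeyValue() {
        if (pos >= data.length()) return false;
        int nl = data.indexOf('\n', pos);
        int end = (nl == -1) ? data.length() : nl;
        key = (long) pos;
        value = data.substring(pos, end);
        pos = (nl == -1) ? data.length() : nl + 1;
        return true;
    }

    Long getCurrentKey() { return key; }
    String getCurrentValue() { return value; }
    float getProgress() { return data.isEmpty() ? 1.0f : (float) pos / data.length(); }

    public static void main(String[] args) {
        LineReaderDemo r = new LineReaderDemo("alpha\nbeta\ngamma");
        while (r.nextKeyValue()) {
            // prints the byte offsets 0, 6, 11 with their lines
            System.out.println(r.getCurrentKey() + "\t" + r.getCurrentValue());
        }
    }
}
```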
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>OutputFormat</code> describes the output-specification for a Map-Reduce
* job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputFormat</code> of the job
* to:
* <p>
* <ol>
* <li>
* Validate the output-specification of the job. For example, check that the
* output directory doesn't already exist.</li>
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out the
* output files of the job. Output files are stored in a {@link FileSystem}.</li>
* </ol>
*
* @see RecordWriter
*/
public abstract class OutputFormat<K, V> {

/**
* Get the {@link RecordWriter} for the given task.
*
* @param context
* the information about the current task.
* @return a {@link RecordWriter} to write the output for the job.
* @throws IOException
*/
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException;

/**
* Check for validity of the output-specification for the job.
*
* <p>
* This is called to validate the output specification for the job when the job
* is submitted. Typically it checks that the output does not already exist,
* throwing an exception when it does, so that output is not overwritten.
* </p>
*
* @param context
* information about the job
* @throws IOException
* when output should not be attempted
*/
public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

/**
* Get the output committer for this output format. This is responsible for
* ensuring the output is committed correctly.
*
* @param context
* the task context
* @return an output committer
* @throws IOException
* @throws InterruptedException
*/
public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException;
}
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
* <code>RecordWriter</code> writes the output <key, value> pairs to an
* output file.
*
* <p>
* <code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
public abstract class RecordWriter<K, V> {

/**
* Writes a key/value pair.
*
* @param key
* the key to write.
* @param value
* the value to write.
* @throws IOException
*/
public abstract void write(K key, V value) throws IOException,
    InterruptedException;

/**
* Close this <code>RecordWriter</code> to future operations.
*
* @param context
* the context of the task
* @throws IOException
*/
public abstract void close(TaskAttemptContext context) throws IOException,
    InterruptedException;
}
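A minimal standalone sketch of a text-style RecordWriter that emits one "key TAB value" line per write() call, loosely in the spirit of TextOutputFormat's line writer. It is generic over any java.io.Writer and carries no Hadoop dependency:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Standalone sketch of a text RecordWriter: each write() emits one
// "key<TAB>value<NEWLINE>" record; close() releases the underlying stream.
public class TextRecordWriterDemo<K, V> {
    private final Writer out;

    TextRecordWriterDemo(Writer out) { this.out = out; }

    void write(K key, V value) throws IOException {
        out.write(key + "\t" + value + "\n");
    }

    void close() throws IOException { out.close(); }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        TextRecordWriterDemo<String, Integer> writer = new TextRecordWriterDemo<>(sink);
        writer.write("apple", 3);
        writer.write("banana", 7);
        writer.close();
        System.out.print(sink); // apple\t3 and banana\t7, one per line
    }
}
```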
OutputCommitter
import java.io.IOException;

/**
* <code>OutputCommitter</code> describes the commit of task output for a
* Map-Reduce job.
*
* <p>
* The Map-Reduce framework relies on the <code>OutputCommitter</code> of the
* job to:
* <p>
* <ol>
* <li>
* Setup the job during initialization. For example, create the temporary output
* directory for the job during the initialization of the job.</li>
* <li>
* Cleanup the job after the job completion. For example, remove the temporary
* output directory after the job completion.</li>
* <li>
* Setup the task temporary output.</li>
* <li>
* Check whether a task needs a commit. This is to avoid the commit procedure if
* a task does not need commit.</li>
* <li>
* Commit of the task output.</li>
* <li>
* Discard the task commit.</li>
* </ol>
*
* @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
* @see JobContext
* @see TaskAttemptContext
*
*/
public abstract class OutputCommitter {

/**
 * For the framework to set up the job output during initialization.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
* if temporary output could not be created
*/
public abstract void setupJob(JobContext jobContext) throws IOException;

/**
 * For cleaning up the job's output after job completion.
*
* @param jobContext
* Context of the job whose output is being written.
* @throws IOException
*/
public abstract void cleanupJob(JobContext jobContext) throws IOException;

/**
* Sets up output for the task.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
*/
public abstract void setupTask(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Check whether the task needs a commit.
*
* @param taskContext
* @return true if the task needs a commit of its output
* @throws IOException
*/
public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Promote the task's temporary output to the final output location.
*
* The task's output is moved to the job's output directory.
*
* @param taskContext
* Context of the task whose output is being written.
* @throws IOException
* if commit is not successful
*/
public abstract void commitTask(TaskAttemptContext taskContext)
    throws IOException;

/**
 * Discard the task output.
*
* @param taskContext
* @throws IOException
*/
public abstract void abortTask(TaskAttemptContext taskContext)
    throws IOException;
}