1. InputFormat

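InputFormat is the specification for MapReduce input on the Hadoop platform. It declares only two abstract methods.

List<InputSplit> getSplits(): computes the input splits (InputSplit) from the input files. This addresses how the data or files are divided into splits.
RecordReader<K,V> createRecordReader(): creates a RecordReader that reads data from an InputSplit. This addresses how the data inside a split is read.

InputFormat is mainly responsible for the following: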

1. Validate the input specification of the job.

2. Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper (that is, to one map task).

3. Provide the RecordReader implementation used to glean input records (key-value pairs) from the logical InputSplit for processing by the Mapper.
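The three responsibilities above can be sketched in plain Java without any Hadoop dependency. Everything in this block is hypothetical and exists only for illustration: `RangeSplit` stands in for InputSplit, an `Iterator` of key-value entries stands in for RecordReader, and the "input" is an in-memory list of lines rather than files. It is a minimal sketch of the contract, not the Hadoop implementation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for the InputFormat contract (not a Hadoop class). */
final class InputFormatContractSketch {

  /** A logical split: the half-open line range [start, end) of the input. */
  static final class RangeSplit {
    final int start;
    final int end;

    RangeSplit(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Steps 1 and 2: validate the input, then cut it into logical splits. */
  static List<RangeSplit> getSplits(List<String> input, int linesPerSplit) {
    if (input == null || linesPerSplit <= 0) {
      throw new IllegalArgumentException("invalid input specification");
    }
    List<RangeSplit> splits = new ArrayList<>();
    for (int start = 0; start < input.size(); start += linesPerSplit) {
      int end = Math.min(start + linesPerSplit, input.size());
      splits.add(new RangeSplit(start, end)); // the data itself is not copied
    }
    return splits;
  }

  /** Step 3: a reader that turns one split into (line index, line) records. */
  static Iterator<Map.Entry<Integer, String>> createRecordReader(
      final List<String> input, final RangeSplit split) {
    return new Iterator<Map.Entry<Integer, String>>() {
      private int pos = split.start;

      @Override
      public boolean hasNext() {
        return pos < split.end;
      }

      @Override
      public Map.Entry<Integer, String> next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Map.Entry<Integer, String> record =
            new AbstractMap.SimpleEntry<>(pos, input.get(pos));
        pos++;
        return record;
      }
    };
  }

  /** Conceptual driver: one "mapper" loop per split, each with its own reader. */
  static List<String> runAll(List<String> input, int linesPerSplit) {
    List<String> seen = new ArrayList<>();
    for (RangeSplit split : getSplits(input, linesPerSplit)) {
      Iterator<Map.Entry<Integer, String>> reader = createRecordReader(input, split);
      while (reader.hasNext()) {
        Map.Entry<Integer, String> kv = reader.next();
        seen.add(kv.getKey() + "=" + kv.getValue());
      }
    }
    return seen;
  }
}
```

Note that the split only records where the data is; the reader is the part that actually touches the data and produces key-value pairs.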

The source code of InputFormat:

public abstract class InputFormat<K, V> {

  /**
   * Each InputSplit is assigned to an individual Mapper.
   *
   * Notes:
   * 1. The split is a logical division of the input; the input files are not
   *    physically cut into chunks. A split is described by, for example, the
   *    input file path, the start offset, and the length.
   * 2. The RecordReader created by the InputFormat also operates on an InputSplit.
   */
  public abstract
    List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

  /**
   * Creates a record reader for a given split.
   */
  public abstract
    RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;

}

2. InputSplit

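The input to a Mapper is a split, called an InputSplit. The abstract class itself only describes a split: its length and the nodes where its data is local. Parsing the contents of a split into key-value pairs is the job of the RecordReader.

The source code of InputSplit: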

public abstract class InputSplit {

  /**
   * Returns the size of the split, so that input splits can be sorted by size.
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   */
  public abstract String[] getLocations() throws IOException, InterruptedException;

}
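The comment on getLength() says that splits can be sorted by size. A minimal sketch of such an ordering follows, with the largest split first so that the longest-running work can be started early. `SimpleSplit` and `largestFirst` are hypothetical names used only here; the sketch does not claim to reproduce how any particular Hadoop version orders splits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of ordering splits by getLength(). */
final class SplitSizeOrderSketch {

  /** Minimal stand-in exposing the same two accessors as InputSplit. */
  static final class SimpleSplit {
    private final long length;
    private final String[] locations;

    SimpleSplit(long length, String... locations) {
      this.length = length;
      this.locations = locations;
    }

    long getLength() {
      return length;
    }

    String[] getLocations() {
      return locations;
    }
  }

  /** Returns a new list sorted by descending length; the input is not modified. */
  static List<SimpleSplit> largestFirst(List<SimpleSplit> splits) {
    List<SimpleSplit> sorted = new ArrayList<>(splits);
    sorted.sort(new Comparator<SimpleSplit>() {
      @Override
      public int compare(SimpleSplit a, SimpleSplit b) {
        return Long.compare(b.getLength(), a.getLength());
      }
    });
    return sorted;
  }
}
```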

2.1 Next, consider FileSplit, a subclass of InputSplit:

public class FileSplit extends InputSplit implements Writable {

  private Path file;
  private long start;
  private long length;
  private String[] hosts;

  public FileSplit() {}

  /** Constructs a split with host information
   *
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   * @param hosts the list of hosts containing the block, possibly null
   */
  public FileSplit(Path file, long start, long length, String[] hosts) {
    this.file = file;
    this.start = start;
    this.length = length;
    this.hosts = hosts;
  }

  /** The file containing this split's data. */
  public Path getPath() { return file; }

  /** The position of the first byte in the file to process. */
  public long getStart() { return start; }

  /** The number of bytes in the file to process. */
  @Override
  public long getLength() { return length; }

  @Override
  public String toString() { return file + ":" + start + "+" + length; }

  ////////////////////////////////////////////
  // Writable methods
  ////////////////////////////////////////////

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, file.toString());
    out.writeLong(start);
    out.writeLong(length);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    file = new Path(Text.readString(in));
    start = in.readLong();
    length = in.readLong();
    // Host locations are not serialized, so they are unknown after deserialization.
    hosts = null;
  }

  @Override
  public String[] getLocations() throws IOException {
    if (this.hosts == null) {
      return new String[]{};
    } else {
      return this.hosts;
    }
  }

}

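As the source shows, FileSplit has four attributes: the file path, the start position of the split, the length of the split, and the hosts that store the split's data. These four values are enough to determine the data handed to each Mapper. Splits are constructed in the getSplits() method of InputFormat, which sets the four attributes by calling the FileSplit constructor.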
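One detail of the Writable methods is easy to miss: write() serializes only the path, start, and length, and readFields() resets hosts to null, so location information does not survive a round trip. The following self-contained sketch mirrors that behaviour. `SimpleFileSplit` is a hypothetical stand-in, and it uses the JDK's `writeUTF`/`readUTF` in place of Hadoop's `Text.writeString`/`Text.readString`, so the byte format is an assumption and differs from the real class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical re-creation of FileSplit's serialization behaviour. */
final class FileSplitRoundTripSketch {

  static final class SimpleFileSplit {
    String file;
    long start;
    long length;
    String[] hosts;

    SimpleFileSplit() {}

    SimpleFileSplit(String file, long start, long length, String[] hosts) {
      this.file = file;
      this.start = start;
      this.length = length;
      this.hosts = hosts;
    }

    /** Writes path, start, and length; hosts are deliberately left out. */
    void write(DataOutput out) throws IOException {
      out.writeUTF(file); // assumption: JDK string encoding instead of Text
      out.writeLong(start);
      out.writeLong(length);
    }

    /** Reads the three serialized fields back and clears the hosts. */
    void readFields(DataInput in) throws IOException {
      file = in.readUTF();
      start = in.readLong();
      length = in.readLong();
      hosts = null;
    }

    String[] getLocations() {
      return hosts == null ? new String[0] : hosts;
    }
  }

  /** Serializes a split to bytes and reads it back into a fresh instance. */
  static SimpleFileSplit roundTrip(SimpleFileSplit split) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        split.write(out);
      }
      SimpleFileSplit copy = new SimpleFileSplit();
      try (DataInputStream in =
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        copy.readFields(in);
      }
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

After a round trip, the copy still identifies the same byte range of the same file, but getLocations() returns an empty array.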

2.2 Another subclass of InputSplit: CombineFileSplit

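Why introduce this class? Because it handles small files very effectively, so understanding it well helps with the rest of this section.

The FileSplit introduced above always refers to a single input file. In other words, if FileInputFormat, the input format that produces FileSplits, is used, then even a very small file is handled as a separate InputSplit, and each InputSplit is processed by its own map task. When the input consists of a large number of small files, there is an equally large number of InputSplits and therefore an equally large number of map tasks. The cost of creating and tearing down that many tasks is substantial and can overwhelm a cluster.

CombineFileSplit is a split designed for small files: it wraps a series of small files in a single InputSplit, so that one Mapper can process several files, which effectively reduces process overhead. Like FileSplit, CombineFileSplit holds the file path, the start position, the length, and the hosts where the data resides. The difference is that the paths, start positions, and lengths are now lists rather than single values.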

One point to note: the getLength() method of CombineFileSplit returns the total length of all the data covered by the split.
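To make the idea concrete, here is a minimal sketch that packs small files into combined splits and reports each split's total length. `CombinedSplit` and `pack` are hypothetical names, and the greedy packing rule (start a new split once adding a file would exceed `maxSplitSize`) is a simplification chosen for illustration; it ignores data locality and is not Hadoop's actual grouping algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of combining many small files into few splits. */
final class CombineSplitSketch {

  /** One split that covers several files; per-file values are kept in lists. */
  static final class CombinedSplit {
    final List<String> paths = new ArrayList<>();
    final List<Long> lengths = new ArrayList<>();

    void add(String path, long length) {
      paths.add(path);
      lengths.add(length);
    }

    /** Total length of all files in this split. */
    long getLength() {
      long total = 0;
      for (long length : lengths) {
        total += length;
      }
      return total;
    }
  }

  /** Greedily fills each split up to maxSplitSize bytes, in input order. */
  static List<CombinedSplit> pack(String[] paths, long[] lengths, long maxSplitSize) {
    if (paths.length != lengths.length || maxSplitSize <= 0) {
      throw new IllegalArgumentException("invalid arguments");
    }
    List<CombinedSplit> splits = new ArrayList<>();
    CombinedSplit current = new CombinedSplit();
    for (int i = 0; i < paths.length; i++) {
      boolean wouldOverflow = current.getLength() + lengths[i] > maxSplitSize;
      if (!current.paths.isEmpty() && wouldOverflow) {
        splits.add(current);
        current = new CombinedSplit();
      }
      current.add(paths[i], lengths[i]);
    }
    if (!current.paths.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }
}
```

With five 10-byte files and a 25-byte limit, this yields three splits instead of five, so three map tasks would be needed rather than one per file.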

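We have now looked at the InputSplit concept, its source code, and its attributes, and we know that the actual splitting is implemented in InputFormat. Next, we look at FileInputFormat, a subclass of InputFormat, to see how splitting is carried out.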

3. FileInputFormat

In FileInputFormat, splits are computed by the getSplits() method, which is not explained in further detail here.
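As a rough illustration of how a single file can be cut into (start, length) ranges, the sketch below is loosely modelled on the commonly described FileInputFormat logic: the split size is the block size clamped between a minimum and a maximum, and a final remainder of up to 10% over the split size is folded into the last split instead of becoming a tiny split of its own. The class and method names, and the exact constant, are assumptions that should be checked against the Hadoop version in use. Empty files, non-splittable formats, and host selection are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified sketch of file split computation. */
final class FileSplitComputationSketch {

  /** Assumed tolerance: the last split may be up to 10% larger than splitSize. */
  static final double SPLIT_SLOP = 1.1;

  /** Clamps the block size into the range [minSize, maxSize]. */
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  /** Returns {start, length} pairs that together cover [0, fileLength). */
  static List<long[]> computeSplits(long fileLength, long splitSize) {
    if (fileLength < 0 || splitSize <= 0) {
      throw new IllegalArgumentException("invalid arguments");
    }
    List<long[]> splits = new ArrayList<>();
    long bytesRemaining = fileLength;
    // Emit full-size splits while clearly more than one split's worth remains.
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits.add(new long[] {fileLength - bytesRemaining, splitSize});
      bytesRemaining -= splitSize;
    }
    // Whatever is left (possibly slightly more than splitSize) is the last split.
    if (bytesRemaining != 0) {
      splits.add(new long[] {fileLength - bytesRemaining, bytesRemaining});
    }
    return splits;
  }
}
```

For example, with a split size of 128, a 300-byte file is cut into the ranges (0, 128), (128, 128), and (256, 44), while a 130-byte file stays in a single split of 130 bytes because the 2 extra bytes fall within the tolerance.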
