Hadoop Source Code Analysis: How TextInputFormat Handles Lines That Span Splits

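Before Hadoop hands data to map for processing, it uses an InputFormat to preprocess the data in two ways:

It divides the input data into a set of splits; each split is dispatched to one mapper for processing.
For each split, it creates a RecordReader that reads the data inside the split and organizes it into <key,value> records, which are passed to the map function one at a time.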

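The most common InputFormat is TextInputFormat. When reading a split, it reads the split line by line, using the byte offset of the start of each line within the file as the key and the line contents as the value passed to the map function. This logic is encapsulated and implemented by the RecordReader that it creates and uses: LineRecordReader. A common question when first working with Hadoop is: if a line is cut across two splits (which is almost certain to happen), how does TextInputFormat handle it? Rigidly cutting one line into two splits would corrupt the data and could affect the correctness of the analysis (word count is one example). Answering this question requires reading the source code to see exactly how TextInputFormat works. The notes below summarize it (this article refers to the Hadoop 1.1.2 source):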

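1. LineRecordReader creates an org.apache.hadoop.util.LineReader instance and relies on its readLine method to read one line at a time (see org.apache.hadoop.mapred.LineRecordReader.next(LongWritable, Text), line 176). The key logic therefore lives in readLine. Its source, with additional explanatory comments, is shown below. The method's main logic comes down to three points:

It always reads from the buffer; when the data in the buffer has been consumed, it first loads the next batch of data into the buffer.
It searches the buffer for the end of the line and copies the data from the start position up to the end of the line into str (which becomes the final value). If no end of line is found, it keeps loading new data into the buffer and continues searching.
The key point: the data loaded into the buffer is read directly from the file with no regard for the split boundary, and reading continues until the current line ends.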

  /**
   * Read one line from the InputStream into the given Text.  A line
   * can be terminated by one of the following: '\n' (LF) , '\r' (CR),
   * or '\r\n' (CR+LF).  EOF also terminates an otherwise unterminated
   * line.
   *
   * @param str the object to store the given line (without newline)
   * @param maxLineLength the maximum number of bytes to store into str;
   *  the rest of the line is silently discarded.
   * @param maxBytesToConsume the maximum number of bytes to consume
   *  in this call.  This is only a hint, because if the line cross
   *  this threshold, we allow it to happen.  It can overshoot
   *  potentially by as much as one buffer length.
   *
   * @return the number of bytes read including the (longest) newline
   * found.
   *
   * @throws IOException if the underlying stream throws
   */
  public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    /* We're reading data from in, but the head of the stream may be
     * already buffered in buffer, so we have several cases:
     * 1. No newline characters are in the buffer, so we need to copy
     *    everything and read another buffer from the stream.
     * 2. An unambiguously terminated line is in buffer, so we just
     *    copy to str.
     * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
     *    in CR.  In this case we copy everything up to CR to str, but
     *    we also need to see what follows CR: if it's LF, then we
     *    need consume LF as well, so next call to readLine will read
     *    from after that.
     * We use a flag prevCharCR to signal if previous character was CR
     * and, if it happens to be at the end of the buffer, delay
     * consuming it until we have a chance to look at the char that
     * follows.
     */
    str.clear();
    int txtLength = 0; //tracks str.getLength(), as an optimization
    int newlineLength = 0; //length of terminating newline
    boolean prevCharCR = false; //true of prev char was CR
    long bytesConsumed = 0;
    do {
      int startPosn = bufferPosn; //starting from where we left off the last time
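      // If the data in the buffer has been fully consumed, load the next batch into the buffer first
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        if (prevCharCR)
          ++bytesConsumed; //account for CR from previous read
        bufferLength = in.read(buffer);
        if (bufferLength <= 0)
          break; // EOF
      }
      // Note: the logic here is a little tricky, because operating systems
      // define the line terminator differently:
      //   UNIX:           '\n'   (LF)
      //   classic Mac OS: '\r'   (CR)
      //   Windows:        '\r\n' (CR)(LF)
      // To identify the end of a line reliably, the code decides as follows:
      // 1. If the current character is LF, this is definitely the end of a line, but the
      //    previous character still has to be checked: if it was CR, this is a Windows file
      //    and the terminator length (the variable newlineLength, a rather poorly chosen
      //    name) is 2; otherwise it is a UNIX file and the terminator length is 1.
      // 2. If the current character is not LF, check whether the previous character was CR.
      //    If so, the previous character was the end of the line, and this is a classic Mac file.
      // 3. If the current character is CR, the terminator length depends on whether the next
      //    character is LF, so the code only records prevCharCR = true for use when the
      //    next character is read.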
      for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
        if (buffer[bufferPosn] == LF) {
          newlineLength = (prevCharCR) ? 2 : 1;
          ++bufferPosn; // at next invocation proceed from following byte
          break;
        }
        if (prevCharCR) { //CR + notLF, we are at notLF
          newlineLength = 1;
          break;
        }
        prevCharCR = (buffer[bufferPosn] == CR);
      }
      int readLength = bufferPosn - startPosn;
      if (prevCharCR && newlineLength == 0)
        --readLength; //CR at the end of the buffer
      bytesConsumed += readLength;
      int appendLength = readLength - newlineLength;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
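      }
      // newlineLength == 0 means the end of the line has still not been reached, so the
      // loop keeps reading more data from the file through the input stream.
      // One very important detail: the instance `in` is created in the constructor
      // org.apache.hadoop.mapred.LineRecordReader.LineRecordReader(Configuration, FileSplit),
      // line 86: FSDataInputStream fileIn = fs.open(split.getPath());
      // From this we can see that when LineRecordReader reads "a line", it always reads a
      // complete line and is not affected by the FileSplit, because it reads from the file
      // that contains the FileSplit rather than being confined to the FileSplit's bounds.
      // So the "broken line" problem never occurs!
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);
    if (bytesConsumed > (long)Integer.MAX_VALUE)
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    return (int)bytesConsumed;
  }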
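The terminator detection above can be tried in isolation. The following is a minimal sketch, not Hadoop code: it assumes the whole input is already in memory as one byte array, so the buffer refill and the "CR at the end of the buffer" case disappear. The class name LineTerminatorSketch and the method splitLines are hypothetical names chosen for illustration; only the prevCharCR / newlineLength logic mirrors readLine.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class LineTerminatorSketch {
    private static final byte CR = '\r';
    private static final byte LF = '\n';

    /** Splits data into lines, accepting LF, CR, or CR+LF as the terminator. */
    static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int start = pos;
            int newlineLength = 0;      // length of the terminator that was found
            boolean prevCharCR = false; // true if the previous byte was CR
            for (; pos < data.length; ++pos) {
                if (data[pos] == LF) {
                    // LF ends the line; a preceding CR makes it a two-byte CR+LF
                    newlineLength = prevCharCR ? 2 : 1;
                    ++pos; // the next line starts after the LF
                    break;
                }
                if (prevCharCR) {
                    // CR followed by a non-LF byte: the CR alone was the terminator
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (data[pos] == CR);
            }
            // Simplification for in-memory data: a CR as the very last byte
            // is treated as a one-byte terminator.
            if (prevCharCR && newlineLength == 0) {
                newlineLength = 1;
            }
            int textLength = (pos - start) - newlineLength;
            lines.add(new String(data, start, textLength, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] mixed = "a\nb\r\nc\rd".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitLines(mixed));
    }
}
```

The input in main mixes all three terminator styles. Each of them ends exactly one line, and no CR or LF byte ends up in the returned text.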

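2. Given this behavior of readLine, when a line spans splits the reader continues into the next split until it reaches the end of the line. How, then, does the next split determine whether its first line has already been read by the previous split's LineRecordReader, so that the line is neither missed nor read twice? LineRecordReader uses a simple and clever approach: since there is no way to tell whether the first line of a split is a complete line or the tail of a line that was cut, it skips the first line of every split (except the first split, of course) and starts reading from the second line. Then, on reaching the end of the split, it always reads one extra line. This way the data joins up correctly and the trouble caused by broken lines is avoided. The relevant source follows: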

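In the LineRecordReader constructor org.apache.hadoop.mapred.LineRecordReader.LineRecordReader(Configuration, FileSplit), lines 108 to 113, where the start position is determined, the comment states explicitly that the first line is deliberately thrown away: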

    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }

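Correspondingly, in org.apache.hadoop.mapred.LineRecordReader.next(LongWritable, Text), lines 170 to 173, the method that determines whether there is another line, the while condition is that the current position is less than or equal to the end position of the split. In other words, when the current position is already exactly at the end of the split, the loop body still runs once more, and the line read on that pass is the first line of the next split, which is the same line that the next split's reader throws away: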

    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) {
      ...
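To see the two rules work together, here is a minimal simulation under simplifying assumptions: the "file" is an in-memory byte array, '\n' is the only terminator, and there is no buffering or compression. SplitReadSketch, lineLength, readSplit and readAll are hypothetical names for this sketch, not Hadoop APIs. Only the two conditions (start != 0 and pos <= end) are taken from the source quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class SplitReadSketch {
    /** Number of bytes from pos through the end of the line (including '\n'), or up to EOF. */
    static int lineLength(byte[] file, int pos) {
        int i = pos;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        if (i < file.length) {
            i++; // consume the newline itself
        }
        return i - pos;
    }

    /** Reads the records that belong to the split [start, end). */
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Rule 1: every split except the first throws away its first line,
        // whether that line is complete or the tail of a cut line.
        if (start != 0) {
            pos += lineLength(file, pos);
        }
        // Rule 2: the condition is <= end, so one extra line is read past the split.
        // Reading is never clipped at end; it stops only at EOF.
        while (pos <= end && pos < file.length) {
            int len = lineLength(file, pos);
            int textLen = (file[pos + len - 1] == '\n') ? len - 1 : len;
            records.add(new String(file, pos, textLen, StandardCharsets.UTF_8));
            pos += len;
        }
        return records;
    }

    /** Reads the whole file split by split and concatenates the records. */
    static List<String> readAll(byte[] file, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < file.length; start += splitSize) {
            int end = Math.min(start + splitSize, file.length);
            all.addAll(readSplit(file, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10;
        for (int start = 0; start < file.length; start += splitSize) {
            int end = Math.min(start + splitSize, file.length);
            System.out.println("split [" + start + ", " + end + ") -> "
                    + readSplit(file, start, end));
        }
    }
}
```

With a split size of 10, the first boundary cuts "bravo" in the middle and the second boundary falls exactly on the first byte of "delta". The three splits yield [alpha, bravo], [charlie, delta] and [], so every line is read exactly once. The last split comes back empty because its only line was already taken by the previous reader.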

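Summary:

This completes the logic for reading lines that span splits.
More generally, this is a common problem in the input-splitting stage of MapReduce: however we split a large data set and read a small part of it, including when we implement our own InputFormat, we have to parse the data continuously across the split points. The lesson is that the immediate practical purpose of a split is to hand a small part of a large data set to a mapper, but this is only a "logical", "macro-level" division. At the "micro" level, at the head and tail of each split, reading across the split boundary and stitching the data together to preserve continuity is entirely legitimate and reasonable.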
