TextInputFormat: an implementation of FileInputFormat

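Overview

By default TextInputFormat splits its input into records line by line. This post looks at what happens when a single record straddles two splits. First, getSplits only divides the file logically, not physically: each split just records the offset where it starts and how many bytes it covers, and the file itself stays in one piece. LineRecordReader then handles the boundary like this: every split reads one extra line, that is, the first line of the next split, and every split except the first one of each file skips its own first line. That way no record is lost and none is read twice.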
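
To make the boundary rule concrete, here is a minimal sketch in plain Java, with no Hadoop dependency. The class and method names (SplitBoundarySketch, readSplit, nextLineStart, readAll) are made up for illustration, and it assumes '\n' as the only line terminator. It models the rule described above, not the real LineRecordReader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the boundary rule; this is not Hadoop code.
final class SplitBoundarySketch {

    // Offset just past the next '\n' at or after pos, or data.length if there is none.
    static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Records owned by the logical split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: throw away the first (possibly partial) line.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // Every line that starts at or before 'end' belongs to this split,
        // so the last line read may run past the split boundary.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            boolean terminated = data[next - 1] == '\n';
            int lineEnd = terminated ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    // Cuts the data into fixed-size logical splits and reads them one by one.
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            all.addAll(readSplit(data, start, length));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start < data.length; start += 10) {
            int length = Math.min(10, data.length - start);
            System.out.println("split@" + start + " -> " + readSplit(data, start, length));
        }
    }
}
```

With a split size of 10 bytes, the first split yields alpha and bravo (bravo crosses the boundary), the second yields charlie and delta, and the third yields nothing because its only line was already consumed by the second split.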

FileInputFormat: getSplits

TextInputFormat extends FileInputFormat and does not override getSplits; it simply uses the parent's implementation. Here is the source of that method:

public List<InputSplit> getSplits(JobContext job) throws IOException {

    StopWatch sw = new StopWatch().start();

    // getFormatMinSplitSize() returns 1; getMinSplitSize(job) is the user-configured minimum split size (default 1): job.getConfiguration().getLong("mapreduce.input.fileinputformat.split.minsize", 1L);

    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));

    // getMaxSplitSize(job) is the user-configured maximum split size: context.getConfiguration().getLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE);

    long maxSize = getMaxSplitSize(job);

    // generate splits

    List<InputSplit> splits = new ArrayList<InputSplit>();

    List<FileStatus> files = listStatus(job);

    for (FileStatus file: files) {

      Path path = file.getPath();

      long length = file.getLen();

      if (length != 0) {

        BlockLocation[] blkLocations;

        // a LocatedFileStatus already carries the block location information

        if (file instanceof LocatedFileStatus) {

          blkLocations = ((LocatedFileStatus) file).getBlockLocations();

        } else {

          FileSystem fs = path.getFileSystem(job.getConfiguration());

          blkLocations = fs.getFileBlockLocations(file, 0, length);

        }

        // check whether the file can be split

        if (isSplitable(job, path)) {

          long blockSize = file.getBlockSize();

          // the actual split size is decided in computeSplitSize, which returns Math.max(minSize, Math.min(maxSize, blockSize));

          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;

          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {

            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);

            splits.add(makeSplit(path, length-bytesRemaining, splitSize,

                        blkLocations[blkIndex].getHosts(),

                        blkLocations[blkIndex].getCachedHosts()));

            bytesRemaining -= splitSize;

          }

          if (bytesRemaining != 0) {

            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);

            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,

                       blkLocations[blkIndex].getHosts(),

                       blkLocations[blkIndex].getCachedHosts()));

          }

        } else { // not splitable

          if (LOG.isDebugEnabled()) {

            // Log only if the file is big enough to be splitted

            if (length > Math.min(file.getBlockSize(), minSize)) {

              LOG.debug("File is not splittable so no parallelization "

                  + "is possible: " + file.getPath());

            }

          }

          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),

                      blkLocations[0].getCachedHosts()));

        }

      } else {

        //Create empty hosts array for zero length files

        splits.add(makeSplit(path, 0, length, new String[0]));

      }

    }

    // Save the number of input files for metrics/loadgen

    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());

    sw.stop();

    if (LOG.isDebugEnabled()) {

      LOG.debug("Total # of splits generated by getSplits: " + splits.size()

          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));

    }

    return splits;

  }
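
The split-size arithmetic in the listing above can be tried out in isolation. The following is a rough sketch in plain Java; the class name SplitPlanSketch and the method planSplits are invented for illustration. It assumes SPLIT_SLOP is 1.1, as defined in the Hadoop source, and it only models the splittable, non-empty branch, ignoring block locations and hosts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the split-size arithmetic in getSplits; not Hadoop code.
final class SplitPlanSketch {

    // Assumed to match FileInputFormat's constant (10% slop).
    static final double SPLIT_SLOP = 1.1;

    // Same formula as the computeSplitSize comment in the listing.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs for a splittable file of the given length.
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Keep cutting full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 split sizes) becomes the final split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : planSplits(260 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With a 128 MB block size and default min/max settings, a 260 MB file ends up as two splits (128 MB and 132 MB) rather than three, because the 132 MB tail is within 10% of the split size. A 300 MB file gets three splits.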

TextInputFormat: createRecordReader (the interesting part is LineRecordReader)

public RecordReader<LongWritable, Text>

    createRecordReader(InputSplit split,

                       TaskAttemptContext context) {

    // read the configured record delimiter, if any

    String delimiter = context.getConfiguration().get(

        "textinputformat.record.delimiter");

    byte[] recordDelimiterBytes = null;

    if (null != delimiter)

      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);

    return new LineRecordReader(recordDelimiterBytes);

  }
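
To see what the delimiter setting changes, here is a small stand-alone sketch of the two ways of cutting records: with no delimiter configured (null, as in the listing above) records end at \r, \n or \r\n, otherwise they end at the exact delimiter byte sequence. The names DelimiterSketch, splitRecords and delimiterLengthAt are hypothetical, and the code works on a whole in-memory buffer, which the real reader does not do.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of default vs. custom record delimiters; not Hadoop code.
final class DelimiterSketch {

    // Length of the delimiter starting at index i, or 0 if none starts there.
    static int delimiterLengthAt(byte[] data, int i, byte[] delimiter) {
        if (delimiter == null) {
            // Default mode: "\n", "\r" and "\r\n" all end a record.
            if (data[i] == '\n') {
                return 1;
            }
            if (data[i] == '\r') {
                return (i + 1 < data.length && data[i + 1] == '\n') ? 2 : 1;
            }
            return 0;
        }
        // Custom mode: the whole byte sequence must match.
        if (delimiter.length == 0 || i + delimiter.length > data.length) {
            return 0;
        }
        for (int k = 0; k < delimiter.length; k++) {
            if (data[i + k] != delimiter[k]) {
                return 0;
            }
        }
        return delimiter.length;
    }

    // Splits the buffer into records; a null delimiter selects the default mode.
    static List<String> splitRecords(byte[] data, byte[] delimiter) {
        List<String> records = new ArrayList<>();
        int recordStart = 0;
        int i = 0;
        while (i < data.length) {
            int delimLen = delimiterLengthAt(data, i, delimiter);
            if (delimLen > 0) {
                records.add(new String(data, recordStart, i - recordStart, StandardCharsets.UTF_8));
                i += delimLen;
                recordStart = i;
            } else {
                i++;
            }
        }
        if (recordStart < data.length) {
            // Trailing record without a terminating delimiter.
            records.add(new String(data, recordStart, data.length - recordStart, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] lines = "a\r\nb\nc\rd".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitRecords(lines, null));
        byte[] custom = "x=1||y=2||z=3".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitRecords(custom, "||".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that once a custom delimiter is set, newlines are ordinary data, so a record can contain line breaks.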

LineRecordReader: the initialize and nextKeyValue methods

public void initialize(InputSplit genericSplit,

                         TaskAttemptContext context) throws IOException {

    FileSplit split = (FileSplit) genericSplit;

    Configuration job = context.getConfiguration();

    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);

    start = split.getStart();

    end = start + split.getLength();

    final Path file = split.getPath();

    // open the file and seek to the start of the split

    final FileSystem fs = file.getFileSystem(job);

    fileIn = fs.open(file);

    // check whether the file is compressed and pick the matching SplitLineReader

    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);

    if (null!=codec) {

      isCompressedInput = true;

      decompressor = CodecPool.getDecompressor(codec);

      if (codec instanceof SplittableCompressionCodec) {

        final SplitCompressionInputStream cIn =

          ((SplittableCompressionCodec)codec).createInputStream(

            fileIn, decompressor, start, end,

            SplittableCompressionCodec.READ_MODE.BYBLOCK);

        in = new CompressedSplitLineReader(cIn, job,

            this.recordDelimiterBytes);

        start = cIn.getAdjustedStart();

        end = cIn.getAdjustedEnd();

        filePosition = cIn;

      } else {

        in = new SplitLineReader(codec.createInputStream(fileIn,

            decompressor), job, this.recordDelimiterBytes);

        filePosition = fileIn;

      }

    } else {

      fileIn.seek(start);

      in = new UncompressedSplitLineReader(

          fileIn, job, this.recordDelimiterBytes, split.getLength());

      filePosition = fileIn;

    }

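    // This is the key step. getSplits cannot guarantee that a record is not cut across two splits, so every split (except the last) reads one extra line past its end.

    // To avoid reading that line twice, every split that does not start at offset 0 skips its first line.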

    // If this is not the first split, we always throw away first record

    // because we always (except the last split) read one extra line in

    // next() method.

    if (start != 0) {

      start += in.readLine(new Text(), 0, maxBytesToConsume(start));

    }

    this.pos = start;

  }

Next comes nextKeyValue

public boolean nextKeyValue() throws IOException {

    if (key == null) {

      key = new LongWritable();

    }

    key.set(pos);

    if (value == null) {

      value = new Text();

    }

    int newSize = 0;

    // We always read one extra line, which lies outside the upper

    // split limit i.e. (end - 1)

    // in is a CompressedSplitLineReader or an UncompressedSplitLineReader depending on the input (or a plain SplitLineReader for non-splittable codecs); the subclasses override readLine

    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {

      if (pos == 0) {

        // skip the UTF-8 byte order mark (BOM) at the start of the file

        newSize = skipUtfByteOrderMark();

      } else {

        // readLine has two code paths: readCustomLine, used when a custom record delimiter is configured, and readDefaultLine, used otherwise, which splits on \r, \n or \r\n

        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));

        pos += newSize;

      }

      if ((newSize == 0) || (newSize < maxLineLength)) {

        break;

      }

      // line too long. try again

      LOG.info("Skipped line of size " + newSize + " at pos " +

               (pos - newSize));

    }

    if (newSize == 0) {

      key = null;

      value = null;

      return false;

    } else {

      return true;

    }

  }
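
The "line too long" branch in the loop above is easy to miss: a line is accepted only when the number of bytes consumed for it (terminator included) is below maxLineLength; otherwise it is logged and skipped, and the loop tries the next line. Here is a minimal sketch of that filter in plain Java. LongLineSkipSketch and readRecords are hypothetical names, and the sketch assumes ASCII text with '\n' terminators, so that characters and bytes coincide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the maxLineLength check in nextKeyValue; not Hadoop code.
final class LongLineSkipSketch {

    // Returns the lines that pass the "newSize < maxLineLength" check.
    static List<String> readRecords(String text, int maxLineLength) {
        List<String> kept = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int newline = text.indexOf('\n', pos);
            int next = (newline < 0) ? text.length() : newline + 1;
            // Plays the role of newSize: bytes consumed, including the terminator.
            int consumed = next - pos;
            if (consumed < maxLineLength) {
                kept.add(text.substring(pos, newline < 0 ? next : newline));
            }
            // Lines that are too long are dropped; reading resumes at the next line.
            pos = next;
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "ok\nthis line is far too long\nfine\n";
        System.out.println(readRecords(text, 10));
        System.out.println(readRecords(text, Integer.MAX_VALUE));
    }
}
```

With the default of Integer.MAX_VALUE, as set in initialize, nothing is skipped in practice.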
