Hadoop Source Code Series: Reading the Mapper Source (Output)

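1. Introduction

The previous post covered the input side of MapReduce; this one covers the output side. Keep the core MapReduce primitive in mind:

Records with the "same" key form one group; reduce is called once per group, and inside that call it iterates over the group's records to compute the result.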
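The primitive above can be sketched in plain Java. This is only an illustration, not Hadoop code: the class `GroupedReduceSketch` and its `reduce` and `run` methods are hypothetical names, and the input is assumed to be already sorted by key, as it is when it reaches a reducer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupedReduceSketch {
    // Hypothetical reduce function: sums the values of one key group.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Walks key-sorted records and calls reduce exactly once per run of equal keys.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> sortedRecords) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int i = 0;
        while (i < sortedRecords.size()) {
            String key = sortedRecords.get(i).getKey();
            List<Integer> group = new ArrayList<>();
            // Collect every record that shares the current key.
            while (i < sortedRecords.size() && sortedRecords.get(i).getKey().equals(key)) {
                group.add(sortedRecords.get(i).getValue());
                i++;
            }
            result.put(key, reduce(key, group));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
            Map.entry("a", 1), Map.entry("a", 1), Map.entry("b", 1));
        System.out.println(run(records)); // prints {a=2, b=1}
    }
}
```

Because the records arrive sorted, a group ends as soon as the key changes, so no hash table of all keys is needed.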

2. Code

We continue with the MapTask class.

private <INKEY,INVALUE,OUTKEY,OUTVALUE>

  void runNewMapper(final JobConf job,

                    final TaskSplitIndex splitIndex,

                    final TaskUmbilicalProtocol umbilical,

                    TaskReporter reporter

                    ) throws IOException, ClassNotFoundException,

                             InterruptedException {

    // make a task context so we can get the classes

    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =

      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,

                                                                  getTaskID(),

                                                                  reporter);

    // make a mapper

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =

      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)

        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);

    // make the input format

    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =

      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)

        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

    // rebuild the input split

    org.apache.hadoop.mapreduce.InputSplit split = null;

    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),

        splitIndex.getStartOffset());

    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =

      new NewTrackingRecordReader<INKEY,INVALUE>

        (split, inputFormat, reporter, taskContext);

    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

    org.apache.hadoop.mapreduce.RecordWriter output = null;

    // get an output object

    if (job.getNumReduceTasks() == 0) {

      output =

        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);

    } else {

      output = new NewOutputCollector(taskContext, job, umbilical, reporter); // see Analysis 1

    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>

    mapContext =

      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),

          input, output,

          committer,

          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context

        mapperContext =

          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(

              mapContext);

    try {

      input.initialize(split, mapperContext);

      mapper.run(mapperContext);

      mapPhase.complete();

      setPhase(TaskStatus.Phase.SORT);

      statusUpdate(umbilical);

      input.close();

      input = null;

      output.close(mapperContext);

      output = null;

    } finally {

      closeQuietly(input);

      closeQuietly(output, mapperContext);

    }

  }

Analysis 1. Constructing the output object:

 NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,

                       JobConf job,

                       TaskUmbilicalProtocol umbilical,

                       TaskReporter reporter

                       ) throws IOException, ClassNotFoundException {

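      collector = createSortingCollector(job, reporter); // see Analysis 1.2

      partitions = jobContext.getNumReduceTasks(); // the number of partitions equals the number of reduce tasks; a partition is coarser than a key group (one partition can hold many groups)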

      if (partitions > 1) {

        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)

          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job); // see Analysis 1.1

      } else {

        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {

          @Override

          public int getPartition(K key, V value, int numPartitions) {

            return partitions - 1; // when the user sets nothing the framework runs one reduce task, so every record goes to partition 0

          }

        };

      }

    }
    @Override
    public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions)); // each record written through the context is tagged with its partition number and placed in the collector's buffer
    }

Analysis 1.1

public Class<? extends Partitioner<?,?>> getPartitionerClass()

throws ClassNotFoundException {

return (Class<? extends Partitioner<?,?>>)

conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class); // use the user-configured partitioner if one is set, otherwise default to HashPartitioner; see Analysis 1.1.1

}

Analysis 1.2: implementation of the createSortingCollector method

 private <KEY, VALUE> MapOutputCollector<KEY, VALUE>

          createSortingCollector(JobConf job, TaskReporter reporter)

    throws IOException, ClassNotFoundException {

    MapOutputCollector.Context context =

      new MapOutputCollector.Context(this, job, reporter);

    Class<?>[] collectorClasses = job.getClasses(

      JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, MapOutputBuffer.class);

    int remainingCollectors = collectorClasses.length;

    for (Class clazz : collectorClasses) {

      try {

        if (!MapOutputCollector.class.isAssignableFrom(clazz)) {

          throw new IOException("Invalid output collector class: " + clazz.getName() +

            " (does not implement MapOutputCollector)");

        }

        Class<? extends MapOutputCollector> subclazz =

          clazz.asSubclass(MapOutputCollector.class);

        LOG.debug("Trying map output collector class: " + subclazz.getName());

        MapOutputCollector<KEY, VALUE> collector =

          ReflectionUtils.newInstance(subclazz, job);

        collector.init(context); // see Analysis 1.2.1

        LOG.info("Map output collector class = " + collector.getClass().getName());

        return collector;

      } catch (Exception e) {

        String msg = "Unable to initialize MapOutputCollector " + clazz.getName();

        if (--remainingCollectors > 0) {

          msg += " (" + remainingCollectors + " more collector(s) to try)";

        }

        LOG.warn(msg, e);

      }

    }

    throw new IOException("Unable to initialize any output collector");

  }

Analysis 1.2.1: initialization of the collector buffer

 public void init(MapOutputCollector.Context context

                    ) throws IOException, ClassNotFoundException {

      job = context.getJobConf();

      reporter = context.getReporter();

      mapTask = context.getMapTask();

      mapOutputFile = mapTask.getMapOutputFile();

      sortPhase = mapTask.getSortPhase();

      spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);

      partitions = job.getNumReduceTasks();

      rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

      //sanity checks

      final float spillper =

        job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8); // spill threshold of the buffer, 0.8 by default

      final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100); // buffer size in MB, 100 by default

      indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,

                                         INDEX_CACHE_MEMORY_LIMIT_DEFAULT);

      if (spillper > (float)1.0 || spillper <= (float)0.0) {

        throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +

            "\": " + spillper);

      }

      if ((sortmb & 0x7FF) != sortmb) {

        throw new IOException(

            "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);

      }

      sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",

            QuickSort.class, IndexedSorter.class), job); // records must be sorted when the map spills the buffer to disk; quicksort is the default

      // buffers and accounting

      int maxMemUsage = sortmb << 20;

      maxMemUsage -= maxMemUsage % METASIZE;

      kvbuffer = new byte[maxMemUsage];

      bufvoid = kvbuffer.length;

      kvmeta = ByteBuffer.wrap(kvbuffer)

         .order(ByteOrder.nativeOrder())

         .asIntBuffer();

      setEquator(0);

      bufstart = bufend = bufindex = equator;

      kvstart = kvend = kvindex;

      maxRec = kvmeta.capacity() / NMETA;

      softLimit = (int)(kvbuffer.length * spillper);

      bufferRemaining = softLimit;

      LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);

      LOG.info("soft limit at " + softLimit);

      LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);

      LOG.info("kvstart = " + kvstart + "; length = " + maxRec);
      comparator = job.getOutputKeyComparator(); // comparator used for sorting; see Analysis 1.2.1.1
      keyClass = (Class<K>)job.getMapOutputKeyClass();
      valClass = (Class<V>)job.getMapOutputValueClass();
      serializationFactory = new SerializationFactory(job);
      keySerializer = serializationFactory.getSerializer(keyClass);
      keySerializer.open(bb);
      valSerializer = serializationFactory.getSerializer(valClass);
      valSerializer.open(bb);
      // combiner
      final Counters.Counter combineInputCounter =
        reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
      combinerRunner = CombinerRunner.create(job, getTaskID(), // map-side combiner
                                             combineInputCounter,
                                             reporter, null);
      if (combinerRunner != null) {
        final Counters.Counter combineOutputCounter =
          reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
        combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
      } else {
        combineCollector = null;
      }

      spillInProgress = false;
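      minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3); // the combiner runs again during the merge only when there are at least 3 spill files
      spillThread.setDaemon(true); // spilling is done by a separate thread; see Analysis 1.2.1.2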
      spillThread.setName("SpillThread");
      spillLock.lock();

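Summary: the mapper writes its output into an in-memory buffer that is 100 MB by default, and a spill starts once the buffer is 80% full. Both values can be tuned. One way to tune the threshold is a bisection-style search: try 50% first, then step down from 80% to 70%, then 60%, halving the interval each time and comparing the results.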
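The arithmetic that init() applies to these two settings can be reproduced in isolation. The sketch below is plain Java with hypothetical names (`SpillThresholdSketch` and its methods); the value `METASIZE = 16` is an assumption taken from Hadoop 2.x, where each record has four 4-byte metadata integers.

```java
final class SpillThresholdSketch {
    // Assumption: 16 bytes of metadata per record (4 ints of 4 bytes), as in Hadoop 2.x.
    static final int METASIZE = 16;

    // Mirrors the sanity check in init(): sortmb must fit in 11 bits (0..2047),
    // so that sortmb << 20 cannot overflow a signed 32-bit int.
    static boolean isValidSortMb(int sortmb) {
        return (sortmb & 0x7FF) == sortmb;
    }

    // Buffer size in bytes, rounded down to a multiple of METASIZE.
    static int bufferBytes(int sortmb) {
        int maxMemUsage = sortmb << 20; // megabytes to bytes
        return maxMemUsage - (maxMemUsage % METASIZE);
    }

    // Number of bytes that may be filled before a spill is triggered.
    static int softLimit(int sortmb, float spillper) {
        return (int) (bufferBytes(sortmb) * spillper);
    }

    public static void main(String[] args) {
        System.out.println(bufferBytes(100));      // prints 104857600
        System.out.println(softLimit(100, 0.8f));  // prints 83886080
        System.out.println(isValidSortMb(2048));   // prints false
    }
}
```

With the defaults, the map can fill 80 MB of the 100 MB buffer before the spill thread is woken; the remaining 20 MB lets the map keep writing while the spill runs.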

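Be careful with the combiner: it may run zero, one, or several times, so it must not change the final result. Computing an average is the classic trap, because the average of partial averages is generally not the overall average.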
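A small plain-Java sketch of the averaging problem follows. The class and method names are hypothetical, and the nested lists merely stand in for the records of two spills; a common remedy, shown here, is to have the combiner emit a (sum, count) pair and divide only once at the end.

```java
import java.util.List;

final class AverageCombinerSketch {
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Wrong: combine each spill down to its mean, then average the means.
    static double meanOfMeans(List<List<Double>> spills) {
        double sumOfMeans = 0;
        for (List<Double> spill : spills) {
            sumOfMeans += mean(spill);
        }
        return sumOfMeans / spills.size();
    }

    // Safe: combine each spill down to (sum, count), merge the pairs, divide once.
    static double meanFromSumAndCount(List<List<Double>> spills) {
        double sum = 0;
        long count = 0;
        for (List<Double> spill : spills) {
            for (double v : spill) {
                sum += v;
            }
            count += spill.size();
        }
        return sum / count;
    }

    public static void main(String[] args) {
        List<List<Double>> spills = List.of(List.of(1.0, 2.0, 3.0), List.of(10.0));
        System.out.println(meanOfMeans(spills));         // prints 6.0 (wrong)
        System.out.println(meanFromSumAndCount(spills)); // prints 4.0 (correct)
    }
}
```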

Analysis 1.2.1.1: implementation of the sort comparator

 public RawComparator getOutputKeyComparator() {

    Class<? extends RawComparator> theClass = getClass(

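      JobContext.KEY_COMPARATOR, null, RawComparator.class); // user-configured comparator, or null; without one, the key type's own comparator applies (dictionary order for Text keys)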

    if (theClass != null)

      return ReflectionUtils.newInstance(theClass, this);

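    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this); // if the user has not set a sort comparator, the key type's own comparator is used, so the key type must implement serialization, deserialization, and comparison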

  }

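Summary: by default the framework sorts with the key type's own comparator, which gives dictionary order for text keys. The user can override it with a custom comparator.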
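The effect of overriding the comparator can be illustrated without Hadoop. In the sketch below, plain `java.util.Comparator` objects stand in for Hadoop's RawComparator, and the class and method names are hypothetical; in a real job the override would be registered on the job configuration rather than passed to `List.sort`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class KeyOrderSketch {
    // Default behaviour for string keys: dictionary (lexicographic) order.
    static List<String> sortDefault(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // A custom comparator that interprets the same keys as integers.
    static List<String> sortNumeric(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("10", "9", "2");
        System.out.println(sortDefault(keys)); // prints [10, 2, 9]
        System.out.println(sortNumeric(keys)); // prints [2, 9, 10]
    }
}
```

Dictionary order puts "10" before "2", which is why numeric data stored as text usually needs either a numeric key type or a custom comparator.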

Analysis 1.2.1.2: what the spill thread does

protected class SpillThread extends Thread {

      @Override

      public void run() {

        spillLock.lock();

        spillThreadRunning = true;

        try {

          while (true) {

            spillDone.signal();

            while (!spillInProgress) {

              spillReady.await();

            }

            try {

              spillLock.unlock();

              sortAndSpill(); // sort, then spill to disk

            } catch (Throwable t) {

              sortSpillException = t;

            } finally {

              spillLock.lock();

              if (bufend < bufstart) {

                bufvoid = kvbuffer.length;

              }

              kvstart = kvend;

              bufstart = bufend;

              spillInProgress = false;

            }

          }

        } catch (InterruptedException e) {

          Thread.currentThread().interrupt();

        } finally {

          spillLock.unlock();

          spillThreadRunning = false;

        }

      }

    }

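Summary: the map method writes records into the buffer, and a separate thread spills the buffer contents to disk, sorting them first with quicksort. The buffer, the sort, and the combiner all work in memory, just like the map method itself. The first combine happens when the buffer is spilled into small files. A second combine can run when those small files are merged, but only under a condition: there must be at least 3 spill files (minSpillsForCombine, default 3).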
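The hand-off between the collecting thread and the spill thread can be sketched with the same lock and condition names that SpillThread uses. Everything else here is a simplifying assumption: the class `SpillHandshakeSketch` is hypothetical, the buffer is a list of integers with a record-count threshold instead of a byte array with a soft limit, and "spilling" means sorting a batch and appending it to an in-memory list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class SpillHandshakeSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition(); // "a batch is waiting"
    private final Condition spillDone = spillLock.newCondition();  // "the batch was spilled"
    private boolean spillInProgress = false;

    private final int threshold;                                   // hypothetical record-count limit
    private List<Integer> buffer = new ArrayList<>();              // records being collected
    private List<Integer> spilling;                                // batch handed to the spill thread
    private final List<List<Integer>> spills = new ArrayList<>();  // stand-in for spill files

    SpillHandshakeSketch(int threshold) {
        this.threshold = threshold;
        Thread spillThread = new Thread(this::spillLoop, "SpillThread");
        spillThread.setDaemon(true);
        spillThread.start();
    }

    // Same shape as SpillThread.run(): wait for work, release the lock while sorting.
    private void spillLoop() {
        spillLock.lock();
        try {
            while (true) {
                spillDone.signal();
                while (!spillInProgress) {
                    spillReady.await();
                }
                List<Integer> batch = spilling;
                spillLock.unlock();
                try {
                    Collections.sort(batch); // stands in for sortAndSpill()
                } finally {
                    spillLock.lock();
                    spills.add(batch);
                    spillInProgress = false;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            spillLock.unlock();
        }
    }

    // Caller must hold spillLock. Hands the current buffer to the spill thread.
    private void startSpill() {
        while (spillInProgress) {
            spillDone.awaitUninterruptibly();
        }
        spilling = buffer;
        buffer = new ArrayList<>();
        spillInProgress = true;
        spillReady.signal();
    }

    void collect(int record) {
        spillLock.lock();
        try {
            buffer.add(record);
            if (buffer.size() >= threshold) {
                startSpill();
            }
        } finally {
            spillLock.unlock();
        }
    }

    // Spills whatever is left and waits until the spill thread is idle.
    List<List<Integer>> flush() {
        spillLock.lock();
        try {
            if (!buffer.isEmpty()) {
                startSpill();
            }
            while (spillInProgress) {
                spillDone.awaitUninterruptibly();
            }
            return new ArrayList<>(spills);
        } finally {
            spillLock.unlock();
        }
    }

    public static void main(String[] args) {
        SpillHandshakeSketch sketch = new SpillHandshakeSketch(3);
        for (int record : new int[] {5, 3, 1, 4, 2}) {
            sketch.collect(record);
        }
        System.out.println(sketch.flush()); // prints [[1, 3, 5], [2, 4]]
    }
}
```

The important detail carried over from the original is that the lock is released while the batch is sorted, so the collecting thread can keep adding records during a spill.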

Analysis 1.1.1

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */

  public int getPartition(K key, V value,

                          int numReduceTasks) {

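    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; // important: the idiom for computing the partition number

  }

}

The key line is return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; this is the idiom to remember for computing a partition. Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.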
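To see why the mask matters, the expression can be compared with a plain modulo. The sketch is standalone Java with hypothetical names (`HashPartitionSketch`, `partitionFor`, `naivePartitionFor`); only the masked expression comes from HashPartitioner.

```java
final class HashPartitionSketch {
    // Same expression as HashPartitioner.getPartition: clear the sign bit, then take the modulo.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Naive variant for comparison: Java's % keeps the sign of the dividend,
    // so a negative hash code yields a negative (invalid) partition number.
    static int naivePartitionFor(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() returns the value itself
        System.out.println(naivePartitionFor(key, 3)); // prints -1
        System.out.println(partitionFor(key, 3));      // prints 1
    }
}
```

The mask is also safer than `Math.abs`, which returns a negative number for `Integer.MIN_VALUE`.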

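Summary 1. All of the source above is reached from output = new NewOutputCollector(taskContext, job, umbilical, reporter); so constructing the output also constructs a partitioner: either the one that always returns partition 0, the one set by the user, or the default HashPartitioner.
Summary 2. Constructing the output also sets up the buffer.
Summary 3. All of the methods above belong to the initialization of the output.
Summary 4. Each (K, V) pair emitted by the map becomes (K, V, P) and is written to the circular buffer in memory. When the buffer is 80% full, its contents are sorted (first by partition, then by key within each partition) and spilled to disk as small files. The small files are then merged with a merge sort: each file is already sorted internally, so a single pass of I/O over them is enough.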
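The final merge step can be sketched as a k-way merge over already-sorted spills. This is a plain-Java illustration with hypothetical names (`SpillMergeSketch`, `Rec`), using in-memory lists in place of spill files and a priority queue in place of Hadoop's own merger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class SpillMergeSketch {
    // Hypothetical record: a map output key tagged with its partition number.
    static final class Rec {
        final int partition;
        final String key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }

        @Override
        public String toString() {
            return partition + ":" + key;
        }
    }

    // Sort order used for spills: by partition first, then by key.
    static int compare(Rec a, Rec b) {
        int byPartition = Integer.compare(a.partition, b.partition);
        return byPartition != 0 ? byPartition : a.key.compareTo(b.key);
    }

    // Merges spills that are each already sorted, reading every record exactly once.
    static List<Rec> merge(List<List<Rec>> sortedSpills) {
        // Each queue entry is {spill index, position within that spill}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> compare(sortedSpills.get(a[0]).get(a[1]),
                              sortedSpills.get(b[0]).get(b[1])));
        for (int s = 0; s < sortedSpills.size(); s++) {
            if (!sortedSpills.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }
        List<Rec> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<Rec> spill = sortedSpills.get(head[0]);
            merged.add(spill.get(head[1]));
            // Advance within the spill the smallest record came from.
            if (head[1] + 1 < spill.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Rec>> spills = List.of(
            List.of(new Rec(0, "b"), new Rec(1, "a")),
            List.of(new Rec(0, "a"), new Rec(1, "c")));
        System.out.println(merge(spills)); // prints [0:a, 0:b, 1:a, 1:c]
    }
}
```

Because only the current head of each spill is compared, the merge never needs to hold a whole spill in memory, which is what makes the single sequential pass possible.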

To be continued. You are welcome to follow my WeChat official account, LHWorld.
