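Hadoop Source Code Series: Client Source Code

I. Introduction

Starting today we will dissect the Hadoop source code, beginning with the Client, because the Client takes on many important responsibilities in a MapReduce job.

II. The MapReduce driver class

The code is as follows: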

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(true);

    // Create a new Job
    Job job = Job.getInstance(conf);
//  Job job = Job.getInstance();
    job.setJarByClass(MyWC.class);

    // Specify various job-specific parameters
    job.setJobName("myjob");
//  job.setInputPath(new Path("in"));
//  job.setOutputPath(new Path("out"));

    Path input = new Path("/user/root");
    FileInputFormat.addInputPath(job, input);

    // Delete the output directory if it already exists;
    // otherwise the output check at submission time fails
    Path output = new Path("/output/wordcount");
    if (output.getFileSystem(conf).exists(output)) {
        output.getFileSystem(conf).delete(output, true);
    }
    FileOutputFormat.setOutputPath(job, output);

    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    job.waitForCompletion(true);
}

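Step 1: start with Job. In the source, Job is declared as public class Job extends JobContextImpl implements JobContext.

JobContext in turn extends MRJobConfig, which defines a large number of configuration keys and default values.

Because conf is passed in when the job is created, these settings correspond to the property values in our configuration files.

  Job job = Job.getInstance(conf);

Here is one of the important ones:

public static final int DEFAULT_MAP_MEMORY_MB = 1024; // default memory size of a mapper task, in MB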

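Step 2: analyze the submission process, job.waitForCompletion(true). Tracing the source shows that the main work happens in this method of JobSubmitter:

JobStatus submitJobInternal(Job job, Cluster cluster)
  throws ClassNotFoundException, InterruptedException, IOException

Its Javadoc lists the steps of job submission:

Checking the input and output specifications of the job. // validate the input and output paths
Computing the InputSplits for the job. // compute the splits
Setup the requisite accounting information for the DistributedCache of the job, if necessary.
Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
Submitting the job to the JobTracker and optionally monitoring its status.

Within this method, the call to focus on is int maps = writeSplits(job, submitJobDir);

Tracing it further (writeSplits delegates to writeNewSplits when the new mapreduce API is in use) leads to the following implementation: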

private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
    ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}
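
The sorting step in writeNewSplits puts the largest splits first, so the map tasks with the most data are handed out early. Below is a minimal sketch of that ordering. It uses plain split lengths instead of Hadoop's InputSplit objects, and the names SplitOrderDemo and sortLargestFirst are made up for illustration; they are not part of Hadoop.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative stand-in for the size-based ordering applied in writeNewSplits.
// SplitOrderDemo and sortLargestFirst are hypothetical names, not Hadoop APIs.
class SplitOrderDemo {

    // Returns a copy of the split lengths ordered from largest to smallest.
    static Long[] sortLargestFirst(Long[] splitLengths) {
        Long[] sorted = Arrays.copyOf(splitLengths, splitLengths.length);
        Arrays.sort(sorted, Comparator.reverseOrder());
        return sorted;
    }

    public static void main(String[] args) {
        Long[] lengths = {64L, 128L, 32L, 128L};
        // prints [128, 128, 64, 32]
        System.out.println(Arrays.toString(sortLargestFirst(lengths)));
    }
}
```

The length of the sorted array is what writeNewSplits returns, so it is also the number of map tasks.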

Tracing job.getInputFormatClass() leads to the following code:

public Class<? extends InputFormat<?,?>> getInputFormatClass()
     throws ClassNotFoundException {
    // Use the input format class from the user's configuration if one is set;
    // otherwise fall back to the default input format, TextInputFormat
    return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }
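
The "configured value, else default" lookup can be illustrated without Hadoop on the classpath. The sketch below models the configuration as a plain Map. The names InputFormatLookupDemo and resolveClass are hypothetical, java.lang.String merely stands in for TextInputFormat, and the key string is assumed to be the value of INPUT_FORMAT_CLASS_ATTR. The real lookup is the conf.getClass call shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of "look up a class name in the configuration, else use a default".
// InputFormatLookupDemo and resolveClass are hypothetical names, not Hadoop APIs.
class InputFormatLookupDemo {

    static Class<?> resolveClass(Map<String, String> conf, String key, Class<?> defaultClass) {
        String className = conf.get(key);
        if (className == null) {
            return defaultClass;              // nothing configured: use the default
        }
        try {
            return Class.forName(className);  // configured: load the named class
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Class not found: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Assumed to match INPUT_FORMAT_CLASS_ATTR; treat the literal as an assumption.
        String key = "mapreduce.job.inputformat.class";
        Map<String, String> conf = new HashMap<>();

        // No user setting, so the default is returned: prints java.lang.String
        System.out.println(resolveClass(conf, key, String.class).getName());

        // A user setting overrides the default: prints java.util.ArrayList
        conf.put(key, "java.util.ArrayList");
        System.out.println(resolveClass(conf, key, String.class).getName());
    }
}
```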

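So the default input format is TextInputFormat, with the following inheritance chain:

TextInputFormat --> FileInputFormat --> InputFormat

Tracing List<InputSplit> splits = input.getSplits(job); leads to the source below.

This is the most important piece of source code in the whole flow!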

public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  // user-configured minimum if set; otherwise 1
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // user-configured maximum if set; otherwise Long.MAX_VALUE
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    // path and length of the input file
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        // locations of all blocks of the file
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // compute the split size
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          // pass in the split's start offset; get back the index of the block containing it
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          // use that index to look up the block's hosts (replica hosts included) and
          // hand them to the split, so the split knows where its computation should run
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits;
}
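
The while loop in getSplits is easier to follow with concrete numbers. The sketch below is a stand-alone model of just that loop for a single splittable, non-empty file. SplitLoopDemo and planSplits are hypothetical names, and the slop factor of 1.1 is stated here as an assumption about the value of SPLIT_SLOP, which the listing above does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of the splitting loop in getSplits for one splittable file.
// SplitLoopDemo and planSplits are hypothetical names; only the loop logic
// mirrors the source shown above.
class SplitLoopDemo {
    // Assumed value of SPLIT_SLOP: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs for a non-empty file of the given length.
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // The tail becomes the final split, even if it is slightly larger than splitSize.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 260 MB file with 128 MB splits yields two splits, not three:
        // offset=0MB length=128MB
        // offset=128MB length=132MB
        for (long[] s : planSplits(260 * mb, 128 * mb)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Under this assumption, a remainder of up to 1.1 times the split size is kept as one split, which avoids launching a mapper for a tiny leftover piece.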

1. long splitSize = computeSplitSize(blockSize, minSize, maxSize); Tracing the source gives:

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

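By default the split size equals the block size, because minSize defaults to 1 and maxSize defaults to Long.MAX_VALUE.

To make the split size smaller than the block size, lower the configured maximum (maxSize) below blockSize.

To make the split size larger than the block size, raise the configured minimum (minSize) above blockSize.

These are set with FileInputFormat.setMaxInputSplitSize and FileInputFormat.setMinInputSplitSize respectively.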
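
To make the three cases concrete, here is a small worked example of the formula with a 128 MB block. The method body is the same one-line expression as computeSplitSize above; the wrapper class SplitSizeDemo and the sample sizes are illustrative only.

```java
// Reproduces the max(minSize, min(maxSize, blockSize)) formula with sample values.
// SplitSizeDemo is a hypothetical wrapper class used only for this illustration.
class SplitSizeDemo {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / mb); // prints 128

        // maxSize below the block size: smaller splits, so more mappers.
        System.out.println(computeSplitSize(block, 1L, 64 * mb) / mb);        // prints 64

        // minSize above the block size: larger splits, so fewer mappers.
        System.out.println(computeSplitSize(block, 256 * mb, Long.MAX_VALUE) / mb); // prints 256
    }
}
```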

2. int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining). Tracing the source gives:

protected int getBlockIndex(BlockLocation[] blkLocations,
                            long offset) {
  for (int i = 0 ; i < blkLocations.length; i++) {
    // is the offset inside this block?
    // i.e. the split's start offset is >= the block's start and < the block's end
    if ((blkLocations[i].getOffset() <= offset) &&
        (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
      return i; // index of the block that contains the offset
    }
  }
  BlockLocation last = blkLocations[blkLocations.length -1];
  long fileLength = last.getOffset() + last.getLength() -1;
  throw new IllegalArgumentException("Offset " + offset +
                                     " is outside of file (0.." +
                                     fileLength + ")");
}
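
The lookup can be tried out without a cluster. The sketch below models each block as an {offset, length} pair instead of a BlockLocation; BlockIndexDemo is a hypothetical class name and the block sizes are made-up sample values.

```java
// Stand-alone version of the offset-to-block lookup. BlockIndexDemo is a hypothetical
// name; blocks are modelled as {offset, length} pairs instead of BlockLocation objects.
class BlockIndexDemo {

    static int getBlockIndex(long[][] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            long start = blocks[i][0];
            long end = start + blocks[i][1]; // exclusive end of the block
            if (start <= offset && offset < end) {
                return i;
            }
        }
        long[] last = blocks[blocks.length - 1];
        long fileLength = last[0] + last[1] - 1;
        throw new IllegalArgumentException(
            "Offset " + offset + " is outside of file (0.." + fileLength + ")");
    }

    public static void main(String[] args) {
        // Three blocks covering bytes 0..299 of a sample file.
        long[][] blocks = {{0, 128}, {128, 128}, {256, 44}};
        System.out.println(getBlockIndex(blocks, 0));   // prints 0
        System.out.println(getBlockIndex(blocks, 128)); // prints 1
        System.out.println(getBlockIndex(blocks, 299)); // prints 2
    }
}
```

Note that only the split's start offset is looked up, so a split that is larger than one block still takes its host list from the block where it begins.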

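3. splits.add(makeSplit(path, length-bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));

Each split corresponds to one mapper task, so a split is created from four pieces of information, for example (path, 0, 10, hosts): the file path, the start offset, the length, and the hosts that hold the data.

The hosts tell the framework where the mapper task should run, which is exactly how computation moves to the data. Splits are logical: they only describe byte ranges of a file, and the data itself is never physically cut.

4. The getSplits method above ultimately produces a list of splits, and the number of entries in that list is the number of mappers. That count is the return value of the call we started from, int maps = writeSplits(job, submitJobDir);

5. When computation moves to the data, each mapper pulls only the part of the file that belongs to its own split.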

To be continued. Feel free to follow my WeChat official account, LHWorld.
