Hadoop 2.7.2 MapReduce Job Submission and Split Computation Source Code Analysis

  1. Start from the waitForCompletion() method
boolean result = job.waitForCompletion(true);
/**
 * Submit the job to the cluster and wait for it to finish.
 * @param verbose print the progress to the user
 * @return true if the job succeeded
 * @throws IOException thrown if the communication with the
 *         <code>JobTracker</code> is lost
 */
public boolean waitForCompletion(boolean verbose
                                 ) throws IOException, InterruptedException,
                                          ClassNotFoundException {
  // Check the state first: the job can only be submitted while it is in the DEFINE state,
  // in which case we step into the submit() method
  if (state == JobState.DEFINE) {
    submit();
  }
  if (verbose) {
    monitorAndPrintJob();
  } else {
    // get the completion poll interval from the client.
    int completionPollIntervalMillis =
      Job.getCompletionPollInterval(cluster.getConf());
    while (!isComplete()) {
      try {
        Thread.sleep(completionPollIntervalMillis);
      } catch (InterruptedException ie) {
      }
    }
  }
  return isSuccessful();
}
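For context, the call above comes from a job driver. The following minimal sketch (class name and paths are placeholders; mapper, reducer, and output types are left at their identity defaults for brevity) shows where the submission flow analyzed in this post starts:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "submit-demo");
    job.setJarByClass(SubmitDemo.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not exist yet
    // Entry point of the flow analyzed below: submit the job and block until it finishes
    boolean result = job.waitForCompletion(true);
    System.exit(result ? 0 : 1);
  }
}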
  2. Step into the submit() method
/**
 * Submit the job to the cluster and return immediately.
 * @throws IOException
 */
public void submit()
       throws IOException, InterruptedException, ClassNotFoundException {
  // Make sure the JobState is DEFINE, i.e. the job is still submittable; otherwise it cannot be submitted
  ensureState(JobState.DEFINE);
  // Switch to the new API
  setUseNewAPI();
  // Step into connect(): when a MapReduce job is submitted, the connection to the cluster
  // is established by Job's connect() method, which actually constructs the Cluster instance cluster
  connect();
  // After connect() returns, create the submitter
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
    ClassNotFoundException {
      // The core call is submitJobInternal(), which, as the name suggests, is the internal
      // submission method that implements all of the job submission logic.
      // Step into submitJobInternal()
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  // After submission the state changes
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
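Because waitForCompletion() only calls submit() while the state is still DEFINE, a client can also submit asynchronously and poll on its own. A minimal sketch of that pattern (assuming the Job has already been fully configured):

import org.apache.hadoop.mapreduce.Job;

public class AsyncSubmit {
  // Mirrors the polling branch of waitForCompletion(), but with our own progress output
  static boolean submitAndPoll(Job job) throws Exception {
    job.submit();                      // JobState: DEFINE -> RUNNING, returns immediately
    while (!job.isComplete()) {        // poll the cluster instead of blocking
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    return job.isSuccessful();
  }
}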
  1. 进入connect()方法
  • When a MapReduce job is submitted, the connection to the cluster is made through Job's connect() method, which actually constructs the Cluster instance cluster
  • cluster is the tool used to talk to the MapReduce cluster; it provides methods for retrieving information about the cluster
  • Inside Cluster there is a member client, an instance of the client communication protocol ClientProtocol, constructed by the create() method of a ClientProtocolProvider
  • Inside create(), Hadoop 2.x provides two kinds of ClientProtocol: YARNRunner for YARN mode and LocalJobRunner for local mode; it is these two that Cluster actually uses to communicate with the cluster
  private synchronized void connect()
          throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) { // cluster provides the means to reach the remote MapReduce cluster
      cluster =
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run()
                          throws IOException, InterruptedException,
                                 ClassNotFoundException {
                     // The part to focus on is the Cluster() constructor, which builds the cluster instance
                     return new Cluster(getConfiguration());
                   }
                 });
    }
  }
  4. Step into the Cluster() constructor
// The one-argument constructor is called first; it delegates to the two-argument constructor
public Cluster(Configuration conf) throws IOException {
  this(null, conf);
}

public Cluster(InetSocketAddress jobTrackAddr, Configuration conf)
    throws IOException {
  this.conf = conf;
  this.ugi = UserGroupInformation.getCurrentUser();
  // The key call is initialize()
  initialize(jobTrackAddr, conf);
}

// The two Cluster members to focus on are the client protocol provider
// (ClientProtocolProvider) and the client protocol instance client (ClientProtocol)
private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
    throws IOException {
  synchronized (frameworkLoader) {
    for (ClientProtocolProvider provider : frameworkLoader) {
      LOG.debug("Trying ClientProtocolProvider : "
          + provider.getClass().getName());
      ClientProtocol clientProtocol = null;
      try {
        // If no YARN settings are present in the configuration, a LocalJobRunner is built
        // and the MR job runs locally; if YARN is configured, a YARNRunner is built and
        // the MR job runs on the YARN cluster
        if (jobTrackAddr == null) {
          // The client protocol is created by the provider's create() method
          clientProtocol = provider.create(conf);
        } else {
          clientProtocol = provider.create(jobTrackAddr, conf);
        }

        if (clientProtocol != null) {
          clientProtocolProvider = provider;
          client = clientProtocol;
          LOG.debug("Picked " + provider.getClass().getName()
              + " as the ClientProtocolProvider");
          break;
        }
        else {
          LOG.debug("Cannot pick " + provider.getClass().getName()
              + " as the ClientProtocolProvider - returned null protocol");
        }
      }
      catch (Exception e) {
        LOG.info("Failed to use " + provider.getClass().getName()
            + " due to error: ", e);
      }
    }
  }

  if (null == clientProtocolProvider || null == client) {
    throw new IOException(
        "Cannot initialize Cluster. Please check your configuration for "
            + MRConfig.FRAMEWORK_NAME
            + " and the correspond server addresses.");
  }
}
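Which provider wins the loop above is driven purely by configuration. The sketch below (host names and addresses are placeholder values, and in practice these settings normally live in mapred-site.xml and yarn-site.xml rather than in code) illustrates the switch: with mapreduce.framework.name left at its default of local, LocalClientProtocolProvider returns a LocalJobRunner; with it set to yarn, YarnClientProtocolProvider returns a YARNRunner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRConfig;

public class FrameworkChoice {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set(MRConfig.FRAMEWORK_NAME, "yarn");             // "mapreduce.framework.name"
    conf.set("yarn.resourcemanager.hostname", "rm-host");  // placeholder ResourceManager host
    conf.set("fs.defaultFS", "hdfs://nn-host:8020");       // placeholder NameNode address
    Job job = Job.getInstance(conf, "framework-choice");
    // new Cluster(conf) inside connect() will now pick YarnClientProtocolProvider / YARNRunner
    System.out.println(job.getConfiguration().get(MRConfig.FRAMEWORK_NAME)); // prints "yarn"
  }
}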
  5. Step into submitJobInternal(), the internal method that submits the job to the cluster
JobStatus submitJobInternal(Job job, Cluster cluster)
    throws ClassNotFoundException, InterruptedException, IOException {

  //validate the jobs output specs
  // Check the output path: if it already exists, an exception is thrown
  checkSpecs(job);

  // conf holds the cluster's xml configuration
  Configuration conf = job.getConfiguration();
  // Add the MR framework to the distributed cache
  addMRFrameworkToDistributedCache(conf);

  // Get the staging directory where submission resources are temporarily stored.
  // When not configured it defaults to /tmp/hadoop-yarn/staging/<submitting user>/.staging
  // under the root of the job's working filesystem
  Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
  //configure the command line options correctly on the submitting dfs
  InetAddress ip = InetAddress.getLocalHost();
  if (ip != null) {
    // Record the submitting host's IP address and host name and store them in conf
    submitHostAddress = ip.getHostAddress();
    submitHostName = ip.getHostName();
    conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
    conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
  }
  // Get a new JobId
  JobID jobId = submitClient.getNewJobID();
  // Set the jobId on the job
  job.setJobID(jobId);
  // The submit path: Path(Path parent, String child) joins its two arguments into one path
  Path submitJobDir = new Path(jobStagingArea, jobId.toString());
  // The job's status
  JobStatus status = null;
  try {
    conf.set(MRJobConfig.USER_NAME,
        UserGroupInformation.getCurrentUser().getShortUserName());
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
    conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
    LOG.debug("Configuring job " + jobId + " with " + submitJobDir
        + " as the submit dir");
    // get delegation token for the dir
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { submitJobDir }, conf);

    populateTokenCache(conf, job.getCredentials());

    // generate a secret to authenticate shuffle transfers
    if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
      KeyGenerator keyGen;
      try {
        keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
        keyGen.init(SHUFFLE_KEY_LENGTH);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("Error generating shuffle secret key", e);
      }
      SecretKey shuffleKey = keyGen.generateKey();
      TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
          job.getCredentials());
    }
    if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
      conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
      LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
          "data spill is enabled");
    }

    // Copy the job's jar to the cluster.
    // This method calls rUploader.uploadFiles(job, jobSubmitDir),
    // which uploads the jar to the cluster
    copyAndConfigureFiles(job, submitJobDir);

    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);

    // Create the splits for the job
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    // Compute the splits and write the split files
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);

    // write "queue admins of the queue to which job is being submitted"
    // to job file.
    String queue = conf.get(MRJobConfig.QUEUE_NAME,
        JobConf.DEFAULT_QUEUE_NAME);
    AccessControlList acl = submitClient.getQueueAdmins(queue);
    conf.set(toFullPropertyName(queue,
        QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());

    // removing jobtoken referrals before copying the jobconf to HDFS
    // as the tasks don't need this setting, actually they may break
    // because of it if present as the referral will point to a
    // different job.
    TokenCache.cleanUpTokenReferral(conf);

    if (conf.getBoolean(
        MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
        MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
      // Add HDFS tracking ids
      ArrayList<String> trackingIds = new ArrayList<String>();
      for (Token<? extends TokenIdentifier> t :
          job.getCredentials().getAllTokens()) {
        trackingIds.add(t.decodeIdentifier().getTrackingId());
      }
      conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
          trackingIds.toArray(new String[trackingIds.size()]));
    }

    // Set reservation info if it exists
    ReservationId reservationId = job.getReservationId();
    if (reservationId != null) {
      conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
    }

    // Write job file to submit dir
    writeConf(conf, submitJobFile);

    //
    // Now, actually submit the job (using the submit name)
    //
    printTokens(jobId, job.getCredentials());
    status = submitClient.submitJob(
        jobId, submitJobDir.toString(), job.getCredentials());
    if (status != null) {
      return status;
    } else {
      throw new IOException("Could not launch job");
    }
  } finally {
    if (status == null) {
      LOG.info("Cleaning up the staging area " + submitJobDir);
      if (jtFs != null && submitJobDir != null)
        jtFs.delete(submitJobDir, true);
    }
  }
}
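To make the path handling above concrete, the following standalone sketch (user name and job id are made-up example values, and the file list reflects what copyAndConfigureFiles(), writeSplits() and writeConf() leave behind) shows how Path(parent, child) builds the per-job submit directory:

import org.apache.hadoop.fs.Path;

public class SubmitDirLayout {
  public static void main(String[] args) {
    // Default staging area: /tmp/hadoop-yarn/staging/<submitting user>/.staging
    Path jobStagingArea = new Path("/tmp/hadoop-yarn/staging/hadoop/.staging"); // example user "hadoop"
    Path submitJobDir = new Path(jobStagingArea, "job_1568887879652_0001");     // example jobId
    // After submission this directory holds:
    //   job.jar            - the job's jar, uploaded by copyAndConfigureFiles()
    //   job.split          - the split descriptions written by writeSplits()
    //   job.splitmetainfo  - the split metadata index
    //   job.xml            - the merged job configuration written by writeConf()
    System.out.println(submitJobDir);
    // /tmp/hadoop-yarn/staging/hadoop/.staging/job_1568887879652_0001
  }
}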
  6. Step into writeSplits(job, submitJobDir), which computes the splits and writes the split files
  • Internally it calls writeNewSplits(job, jobSubmitDir)
  • writeNewSplits(job, jobSubmitDir) creates an instance input of type InputFormat
  • The main responsibilities of InputFormat are:
    • Validate the input specification of the job
    • Split the input files into InputSplits, with each InputSplit corresponding to one map task (MapTask)
    • Provide the RecordReader that turns the data of each split into key/value pairs
  • input then computes the splits: List<InputSplit> splits = input.getSplits(job); (a configuration sketch follows below)
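From the user side, the InputFormat that writeNewSplits() instantiates is whatever the job was configured with (TextInputFormat by default). A small sketch (the input path is a placeholder) showing how that choice, and the min/max split sizes used by getSplits() in the next step, are set:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-config");
    job.setInputFormatClass(TextInputFormat.class);                // explicit, same as the default
    FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder input path
    FileInputFormat.setMinInputSplitSize(job, 1);                  // read back by getMinSplitSize(job)
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // read back by getMaxSplitSize(job)
    // Each InputSplit produced by getSplits() becomes one MapTask, and TextInputFormat's
    // RecordReader (LineRecordReader) turns it into <LongWritable, Text> pairs.
  }
}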
  7. Step into the getSplits(job) method of the FileInputFormat class
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  // getFormatMinSplitSize() always returns 1; getMinSplitSize(job) returns the configured
  // minimum split size (mapreduce.input.fileinputformat.split.minsize, default 1)
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // getMaxSplitSize(job) returns the configured maximum split size, Long.MAX_VALUE by default
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  // Iterate over all of the job's input files
  for (FileStatus file: files) {
    // File path
    Path path = file.getPath();
    // File length
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      // Check whether the file is splittable
      if (isSplitable(job, path)) {
        // Get the block size.
        // On the local filesystem the default block size is 32 MB; on HDFS it is 128 MB
        // in current Hadoop 2.x releases (64 MB in older releases)
        long blockSize = file.getBlockSize();
        // Compute the logical split size, which by default equals the block size.
        // The return value is Math.max(minSize, Math.min(maxSize, blockSize));
        // with minSize = 1 and maxSize = Long.MAX_VALUE this reduces to blockSize
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        // Before cutting each split, check whether the remaining bytes are more than
        // SPLIT_SLOP (1.1 by default) times the split size; if not, stop splitting and
        // keep the remainder as a single split
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}
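The arithmetic above is easy to verify with concrete numbers. Below is a standalone sketch (not Hadoop code) that reproduces computeSplitSize() and the SPLIT_SLOP loop: with a 128 MB block size, a 130 MB file yields a single split (130/128 ≈ 1.02 < 1.1), while a 150 MB file yields two.

public class SplitMath {
  private static final double SPLIT_SLOP = 1.1;

  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  static int countSplits(long length, long splitSize) {
    int splits = 0;
    long bytesRemaining = length;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits++;
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) {
      splits++;                 // the tail becomes the last (smaller) split
    }
    return splits;
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;                            // 128 MB block
    long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE); // = blockSize
    System.out.println(countSplits(130L * 1024 * 1024, splitSize));  // 1: 130/128 < 1.1
    System.out.println(countSplits(150L * 1024 * 1024, splitSize));  // 2: 150/128 > 1.1
  }
}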
