Hadoop 2.7.2 MapReduce Job Submission and Input Split Source Code Analysis

  1. Start from the waitForCompletion() method
boolean result = job.waitForCompletion(true);
/**
* Submit the job to the cluster and wait for it to finish.
* @param verbose print the progress to the user
* @return true if the job succeeded
* @throws IOException thrown if the communication with the
* <code>JobTracker</code> is lost
*/
public boolean waitForCompletion(boolean verbose
                                 ) throws IOException, InterruptedException,
                                          ClassNotFoundException {
  // First check state: the job can only be submitted while state is
  // DEFINE; enter the submit() method
  if (state == JobState.DEFINE) {
    submit();
  }
  if (verbose) {
    monitorAndPrintJob();
  } else {
    // get the completion poll interval from the client.
    int completionPollIntervalMillis =
      Job.getCompletionPollInterval(cluster.getConf());
    while (!isComplete()) {
      try {
        Thread.sleep(completionPollIntervalMillis);
      } catch (InterruptedException ie) {
      }
    }
  }
  return isSuccessful();
}
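For context, waitForCompletion(true) is typically the last call in a user's driver class. Here is a minimal driver sketch; WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical placeholders, not classes from the source analyzed here:

// Hypothetical driver showing where waitForCompletion() is invoked
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCountDriver.class);    // placeholder driver class
job.setMapperClass(WordCountMapper.class);   // placeholder mapper
job.setReducerClass(WordCountReducer.class); // placeholder reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// The entry point traced in this article: submit the job and poll progress
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);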
  2. Enter the submit() method
/**
* Submit the job to the cluster and return immediately.
* @throws IOException
*/
public void submit()
       throws IOException, InterruptedException, ClassNotFoundException {
  // Verify the job is in DEFINE state; otherwise it cannot be submitted
  ensureState(JobState.DEFINE);
  // Use the new (mapreduce) API
  setUseNewAPI();
  // Enter connect(): connecting to the cluster at submit time is done in
  // Job's connect() method, which actually constructs the Cluster instance cluster
  connect();
  // Once connect() has returned, create the submitter
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
    ClassNotFoundException {
      // The core method here is submitJobInternal(), which, as the name
      // says, implements all the business logic of submitting the job
      // Enter submitJobInternal()
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  // After submission the state changes
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
  3. Enter the connect() method
  • When a MapReduce job is submitted, the connection to the cluster is made through Job's connect() method, which actually constructs the Cluster instance cluster
  • cluster is the tool for connecting to the MapReduce cluster; it provides methods for retrieving information about the cluster
  • Inside Cluster there is client, an instance of the client communication protocol ClientProtocol, which is built by a ClientProtocolProvider's create() method
  • Inside create(), Hadoop 2.x provides two kinds of ClientProtocol: YARNRunner for YARN mode and LocalJobRunner for local mode; it is these two that actually communicate with the cluster on Cluster's behalf (a configuration sketch follows the connect() code below)
private synchronized void connect()
        throws IOException, InterruptedException, ClassNotFoundException {
  if (cluster == null) { // cluster provides the means of talking to the MapReduce cluster
    cluster =
      ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                 public Cluster run()
                        throws IOException, InterruptedException,
                               ClassNotFoundException {
                   // The call to focus on: the Cluster() constructor
                   // builds the cluster instance
                   return new Cluster(getConfiguration());
                 }
               });
  }
}
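Which ClientProtocol is picked is driven by configuration rather than code. A minimal sketch, assuming the standard MRConfig constants (the underlying property is mapreduce.framework.name):

// Selecting the runner via mapreduce.framework.name (MRConfig.FRAMEWORK_NAME)
Configuration conf = new Configuration();
conf.set(MRConfig.FRAMEWORK_NAME, MRConfig.YARN_FRAMEWORK_NAME);    // "yarn" -> YARNRunner
// conf.set(MRConfig.FRAMEWORK_NAME, MRConfig.LOCAL_FRAMEWORK_NAME); // "local" -> LocalJobRunner
Job job = Job.getInstance(conf);

In practice the same choice is usually made once in mapred-site.xml by setting mapreduce.framework.name to yarn or local.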
  4. Enter the Cluster() constructor
// The single-argument constructor delegates to the two-argument one
public Cluster(Configuration conf) throws IOException {
  this(null, conf);
}

public Cluster(InetSocketAddress jobTrackAddr, Configuration conf)
    throws IOException {
  this.conf = conf;
  this.ugi = UserGroupInformation.getCurrentUser();
  // The important call: initialize()
  initialize(jobTrackAddr, conf);
}

// The two members of Cluster to watch are the client protocol provider
// ClientProtocolProvider and the client protocol instance client
private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
    throws IOException {
  synchronized (frameworkLoader) {
    for (ClientProtocolProvider provider : frameworkLoader) {
      LOG.debug("Trying ClientProtocolProvider : "
          + provider.getClass().getName());
      ClientProtocol clientProtocol = null;
      try {
        // With no YARN configuration, a LocalJobRunner is built and the MR
        // job runs locally; with YARN configured, a YARNRunner is built
        // and the MR job runs on the YARN cluster
        if (jobTrackAddr == null) {
          // The client protocol is created by the provider's create() method
          clientProtocol = provider.create(conf);
        } else {
          clientProtocol = provider.create(jobTrackAddr, conf);
        }
        if (clientProtocol != null) {
          clientProtocolProvider = provider;
          client = clientProtocol;
          LOG.debug("Picked " + provider.getClass().getName()
              + " as the ClientProtocolProvider");
          break;
        }
        else {
          LOG.debug("Cannot pick " + provider.getClass().getName()
              + " as the ClientProtocolProvider - returned null protocol");
        }
      }
      catch (Exception e) {
        LOG.info("Failed to use " + provider.getClass().getName()
            + " due to error: ", e);
      }
    }
  }
  if (null == clientProtocolProvider || null == client) {
    throw new IOException(
        "Cannot initialize Cluster. Please check your configuration for "
            + MRConfig.FRAMEWORK_NAME
            + " and the correspond server addresses.");
  }
}
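frameworkLoader above is a java.util.ServiceLoader<ClientProtocolProvider>, so providers are discovered from the classpath via META-INF/services entries. A minimal standalone sketch of the same discovery mechanism (output depends on which Hadoop jars are on the classpath):

import java.util.ServiceLoader;
import org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider;

public class ListProviders {
  public static void main(String[] args) {
    // Same mechanism Cluster uses: iterate all registered providers
    ServiceLoader<ClientProtocolProvider> loader =
        ServiceLoader.load(ClientProtocolProvider.class);
    for (ClientProtocolProvider p : loader) {
      // Typically prints LocalClientProtocolProvider and, when the YARN
      // client jars are present, YarnClientProtocolProvider
      System.out.println(p.getClass().getName());
    }
  }
}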
  5. Enter submitJobInternal(), the internal method that submits the job to the cluster
JobStatus submitJobInternal(Job job, Cluster cluster)
    throws ClassNotFoundException, InterruptedException, IOException {

  // validate the jobs output specs
  // Checks the output path; an exception is thrown if it already exists
  checkSpecs(job);

  // conf holds the cluster's configuration (the *-site.xml files)
  Configuration conf = job.getConfiguration();
  // Add the MR framework to the distributed cache
  addMRFrameworkToDistributedCache(conf);

  // Get the temporary staging path for job resources; when not configured
  // it defaults to /tmp/hadoop-yarn/staging/<submitting user>/.staging
  // under the filesystem's working root
  Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);

  // configure the command line options correctly on the submitting dfs
  InetAddress ip = InetAddress.getLocalHost();
  if (ip != null) {
    // Record the submitting host's IP and hostname in conf
    submitHostAddress = ip.getHostAddress();
    submitHostName = ip.getHostName();
    conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
    conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
  }

  // Get a new JobID
  JobID jobId = submitClient.getNewJobID();
  // Set the jobId
  job.setJobID(jobId);
  // Submission path: Path(Path parent, String child) joins both arguments into one path
  Path submitJobDir = new Path(jobStagingArea, jobId.toString());
  // Job status
  JobStatus status = null;
  try {
    conf.set(MRJobConfig.USER_NAME,
        UserGroupInformation.getCurrentUser().getShortUserName());
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
    conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
    LOG.debug("Configuring job " + jobId + " with " + submitJobDir
        + " as the submit dir");
    // get delegation token for the dir
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { submitJobDir }, conf);

    populateTokenCache(conf, job.getCredentials());

    // generate a secret to authenticate shuffle transfers
    if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
      KeyGenerator keyGen;
      try {
        keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
        keyGen.init(SHUFFLE_KEY_LENGTH);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("Error generating shuffle secret key", e);
      }
      SecretKey shuffleKey = keyGen.generateKey();
      TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
          job.getCredentials());
    }
    if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
      conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
      LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
          "data spill is enabled");
    }

    // Copy the job jar and other resources to the cluster; internally this
    // calls rUploader.uploadFiles(job, jobSubmitDir), which uploads the jar
    copyAndConfigureFiles(job, submitJobDir);

    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);

    // Create the splits for the job
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    // Compute the input splits and write the split files
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);

    // write "queue admins of the queue to which job is being submitted"
    // to job file.
    String queue = conf.get(MRJobConfig.QUEUE_NAME,
        JobConf.DEFAULT_QUEUE_NAME);
    AccessControlList acl = submitClient.getQueueAdmins(queue);
    conf.set(toFullPropertyName(queue,
        QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());

    // removing jobtoken referrals before copying the jobconf to HDFS
    // as the tasks don't need this setting, actually they may break
    // because of it if present as the referral will point to a
    // different job.
    TokenCache.cleanUpTokenReferral(conf);

    if (conf.getBoolean(
        MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
        MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
      // Add HDFS tracking ids
      ArrayList<String> trackingIds = new ArrayList<String>();
      for (Token<? extends TokenIdentifier> t :
          job.getCredentials().getAllTokens()) {
        trackingIds.add(t.decodeIdentifier().getTrackingId());
      }
      conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
          trackingIds.toArray(new String[trackingIds.size()]));
    }

    // Set reservation info if it exists
    ReservationId reservationId = job.getReservationId();
    if (reservationId != null) {
      conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
    }

    // Write job file to submit dir
    writeConf(conf, submitJobFile);

    //
    // Now, actually submit the job (using the submit name)
    //
    printTokens(jobId, job.getCredentials());
    status = submitClient.submitJob(
        jobId, submitJobDir.toString(), job.getCredentials());
    if (status != null) {
      return status;
    } else {
      throw new IOException("Could not launch job");
    }
  } finally {
    if (status == null) {
      LOG.info("Cleaning up the staging area " + submitJobDir);
      if (jtFs != null && submitJobDir != null)
        jtFs.delete(submitJobDir, true);
    }
  }
}
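By the time submitClient.submitJob() is called, the submit directory typically holds job.jar (from copyAndConfigureFiles), job.split and job.splitmetainfo (from writeSplits), and job.xml (from writeConf). A minimal sketch to inspect it, assuming the staged job directory path is passed as args[0]; ListStagingDir is a hypothetical helper, not part of Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStagingDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path submitJobDir = new Path(args[0]); // e.g. .../.staging/job_xxx
    FileSystem fs = submitJobDir.getFileSystem(conf);
    // Print the name and size of each staged file
    for (FileStatus st : fs.listStatus(submitJobDir)) {
      System.out.println(st.getPath().getName() + "\t" + st.getLen());
    }
  }
}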
  6. Enter writeSplits(job, submitJobDir), which computes the splits and writes the split files
  • Internally it calls writeNewSplits(job, jobSubmitDir); a condensed sketch follows this list
  • writeNewSplits(job, jobSubmitDir) creates an instance input of type InputFormat
  • The main responsibilities of InputFormat are:
    • Validate the input specification of the job
    • Split the input files into multiple InputSplits, each of which corresponds to one map task (MapTask)
    • Provide the RecordReader that turns a split's data into key/value pairs
  • input then computes the splits: List<InputSplit> splits = input.getSplits(job);
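A condensed sketch of writeNewSplits(), paraphrased from the 2.7.x source with error handling trimmed:

@SuppressWarnings("unchecked")
private <T extends InputSplit> int writeNewSplits(JobContext job, Path jobSubmitDir)
    throws IOException, InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  // Instantiate the configured InputFormat (TextInputFormat by default)
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
  // Compute the splits (see FileInputFormat.getSplits() below)
  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
  // Sort the splits so the biggest go first, then write job.split and
  // job.splitmetainfo into the submit directory
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}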
  7. Enter the getSplits(job) method of the FileInputFormat class
/**
* Generate the list of files and make them into FileSplits.
* @param job the job context
* @throws IOException
*/
public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  // getFormatMinSplitSize() always returns 1; getMinSplitSize(job) returns
  // the configured mapreduce.input.fileinputformat.split.minsize (default 1)
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // getMaxSplitSize(job) returns the configured
  // mapreduce.input.fileinputformat.split.maxsize (default Long.MAX_VALUE)
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  // Iterate over every input file of the job
  for (FileStatus file: files) {
    // File path
    Path path = file.getPath();
    // File length in bytes
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      // Check whether the file is splittable
      if (isSplitable(job, path)) {
        // Block size: 32 MB by default on the local filesystem; on HDFS the
        // default is 128 MB in Hadoop 2.x (64 MB in older releases)
        long blockSize = file.getBlockSize();
        // Compute the logical split size; it defaults to the block size:
        // return Math.max(minSize, Math.min(maxSize, blockSize));
        // with the defaults minSize = 1 and maxSize = Long.MAX_VALUE,
        // the result is blockSize
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        // Keep splitting only while the remainder is more than SPLIT_SLOP
        // (default 1.1) times the split size; otherwise the remainder
        // becomes a single split
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              blkLocations[blkIndex].getHosts(),
              blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
              blkLocations[blkIndex].getHosts(),
              blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
            blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}
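A worked example of the SPLIT_SLOP rule: with a 128 MB split size, a 260 MB file first yields a 128 MB split (260/128 ≈ 2.03 > 1.1), leaving 132 MB; since 132/128 ≈ 1.03 is not greater than 1.1, the remaining 132 MB becomes one final split, so the file produces two splits rather than three. Because splitSize = max(minSize, min(maxSize, blockSize)), the split size can also be tuned from the driver. A minimal sketch using FileInputFormat's setters (pick one direction; the values are illustrative):

// Make splits LARGER than the block size by raising the minimum
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
// Or make splits SMALLER than the block size by lowering the maximum
// FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB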
