Spark Shuffle (3): How the Executor Fetches the Shuffle Data Files (reposted)
1. Introduction
The previous posts discussed how Executors and the Driver report and track the shuffle data files generated by the Executors, and how an Executor obtains the distribution of those shuffle data files. So how does an Executor actually fetch the shuffle data files in order to compute an action operator?
In the ResultTask, the Executor asks the Driver, via MapOutputTracker, for the layout of the shuffle blocks belonging to the shuffleId, and reorganizes it into a structure keyed by BlockManagerId. This makes it easy to tell whether a shuffle block is local or lives on a remote Executor.
2. Fetching the data
Seq[(BlockManagerId, Seq[(BlockId, Long)])]
The BlockManagerId structure:
class BlockManagerId private (
    private var executorId_ : String,
    private var host_ : String,
    private var port_ : Int,
    private var topologyInfo_ : Option[String])
  extends Externalizable
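To make the grouping described in the introduction concrete, below is a minimal sketch of how the per-map locations returned by MapOutputTracker could be turned into the Seq[(BlockManagerId, Seq[(BlockId, Long)])] structure above. groupByLocation is a hypothetical helper written only for illustration, not Spark's actual code (Spark builds this structure internally when converting MapStatus entries):
import org.apache.spark.storage.{BlockId, BlockManagerId, ShuffleBlockId}

// statuses(mapId) = (location of map output mapId, size of its block for this reduceId)
def groupByLocation(
    statuses: Seq[(BlockManagerId, Long)],
    shuffleId: Int,
    reduceId: Int): Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
  statuses.zipWithIndex
    .map { case ((location, size), mapId) =>
      (location, (ShuffleBlockId(shuffleId, mapId, reduceId): BlockId, size))
    }
    .groupBy(_._1)              // key by BlockManagerId
    .mapValues(_.map(_._2))     // keep only the (BlockId, size) pairs
    .toSeq
}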
2.1 Reading files local to this Executor
for ((address, blockInfos) <- blocksByAddress) {
  totalBlocks += blockInfos.size
  if (address.executorId == blockManager.blockManagerId.executorId) {
    // Filter out zero-sized blocks
    localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
    numBlocksToFetch += localBlocks.size
  }
}
- A single Executor runs multiple Tasks; a Task running inside that Executor can read the local shuffle files directly, without going through the network (see the sketch after this list).
- Multiple Executors on the same machine: in this case an Executor that needs files belonging to another Executor on the same machine still has to fetch them over the network.
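A simplified sketch of the local read path, modeled on ShuffleBlockFetcherIterator.fetchLocalBlocks (the fields results, localBlocks, shuffleMetrics and blockManager belong to the iterator; error handling is omitted here):
private[this] def fetchLocalBlocks(): Unit = {
  for (blockId <- localBlocks) {
    // Served straight by the local BlockManager: the index/data files already sit on
    // this Executor's disk, so no network request is needed.
    val buf = blockManager.getBlockData(blockId)
    shuffleMetrics.incLocalBlocksFetched(1)
    shuffleMetrics.incLocalBytesRead(buf.size)
    buf.retain()
    results.put(new SuccessFetchResult(blockId, blockManager.blockManagerId, 0, buf, false))
  }
}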
2.2 Reading files from other Executors
2.2.1 Building the FetchRequest
The maximum total size of data in flight (maxBytesInFlight) is controlled by:
spark.reducer.maxSizeInFlight
val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
For a given remote Executor, as long as the accumulated size of the requested blocks has not reached targetRequestSize, they are packed into the same FetchRequest; this batches multiple blocks per request instead of issuing a separate FetchRequest for every block.
val iterator = blockInfos.iterator
var curRequestSize = 0L
var curBlocks = new ArrayBuffer[(BlockId, Long)]
while (iterator.hasNext) {
  val (blockId, size) = iterator.next()
  // Skip empty blocks
  if (size > 0) {
    curBlocks += ((blockId, size))
    remoteBlocks += blockId
    numBlocksToFetch += 1
    curRequestSize += size
  } else if (size < 0) {
    throw new BlockException(blockId, "Negative block size " + size)
  }
  if (curRequestSize >= targetRequestSize) {
    // Add this FetchRequest
    remoteRequests += new FetchRequest(address, curBlocks)
    curBlocks = new ArrayBuffer[(BlockId, Long)]
    logDebug(s"Creating fetch request of $curRequestSize at $address")
    curRequestSize = 0
  }
}
// Add in the final request
if (curBlocks.nonEmpty) {
  remoteRequests += new FetchRequest(address, curBlocks)
}
2.2.2 Sending the FetchRequest
FetchRequests are not all submitted in parallel. For a given Task, the combine step on the Executor merges one block at a time, and the Task itself runs as a single thread, so there is no need to submit FetchRequests in parallel: only after a block has been processed does the iterator check whether the next FetchRequest should be sent. Because one FetchRequest covers multiple blocks, the submission rate is throttled to limit the bandwidth the Executor uses for pulling the shuffle data of many blocks at once.
The next FetchRequest is submitted only when both of the following conditions hold:
- The total size of all blocks currently in flight plus the size of the next FetchRequest is below maxBytesInFlight.
- The number of requests currently in flight is below the configured maximum number of requests, controlled by the parameter below (a simplified sketch of this admission check follows):
spark.reducer.maxReqsInFlight
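A simplified sketch of the admission check, assuming bytesInFlight, reqsInFlight and the queued fetchRequests are fields of the fetch iterator and that FetchRequest.size is the total size of its blocks (Spark's own fetchUpToMaxBytes is more involved):
def isRemoteBlockFetchable(next: FetchRequest): Boolean = {
  bytesInFlight == 0 ||                                  // always allow the first request
    (reqsInFlight + 1 <= maxReqsInFlight &&
      bytesInFlight + next.size <= maxBytesInFlight)
}

// Drain the queue only as long as both limits permit another request.
while (fetchRequests.nonEmpty && isRemoteBlockFetchable(fetchRequests.front)) {
  sendRequest(fetchRequests.dequeue())
}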
2.2.3 The complete FetchRequest flow
1. Executor A calls fetchBlocks through the ExternalShuffleClient; if the io.maxRetries parameter (the maximum number of retries) is configured, a RetryingBlockFetcher capable of retrying the fetch is started.
2. A TransportClient and a OneForOneBlockFetcher are initialized.
3. The OneForOneBlockFetcher first sends an OpenBlocks request to the other Executor B, carrying the executorId, the appId and the set of blockIds.
4. Executor B receives the blockIds and resolves the corresponding files through its BlockManager (locating the index and data files via mapId and reduceId), building a FileSegmentManagedBuffer for each block.
5. Through the StreamManager (OneForOneStreamManager), registerStream generates a streamId and caches a StreamState (the ManagedBuffers plus the appId); the generated streamId is returned.
6. Executor B replies with a StreamHandle message containing the streamId and the number of chunks; the chunk count is simply the number of blocks.
7. On receiving the StreamHandle, Executor A sends ChunkFetchRequest messages one by one, each carrying the streamId and a chunk index, to actually pull Executor B's shuffle data files.
8. Executor B takes the streamId and chunk index from the ChunkFetchRequest, looks up the corresponding FileSegmentManagedBuffer in its cache, and replies with a ChunkFetchSuccess message containing the streamId and the FileSegmentManagedBuffer.
9. Steps 3-6 block inside the Task thread, whereas step 7 sends the ChunkFetchRequests one after another without waiting for the results; the results come back through a callback function that is registered before the call:
client.fetchChunk(streamHandle.streamId, i, chunkCallback);
private class ChunkCallback implements ChunkReceivedCallback {
  @Override
  public void onSuccess(int chunkIndex, ManagedBuffer buffer) {
    // On receipt of a chunk, pass it upwards as a block.
    listener.onBlockFetchSuccess(blockIds[chunkIndex], buffer);
  }

  @Override
  public void onFailure(int chunkIndex, Throwable e) {
    // On receipt of a failure, fail every block from chunkIndex onwards.
    String[] remainingBlockIds = Arrays.copyOfRange(blockIds, chunkIndex, blockIds.length);
    failRemainingBlocks(remainingBlockIds, e);
  }
}
The listener here is the BlockFetchingListener injected earlier in fetchBlocks:
new BlockFetchingListener {
  override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
    // Only add the buffer to results queue if the iterator is not zombie,
    // i.e. cleanup() has not been called yet.
    ShuffleBlockFetcherIterator.this.synchronized {
      if (!isZombie) {
        // Increment the ref count because we need to pass this to a different thread.
        // This needs to be released after use.
        buf.retain()
        remainingBlocks -= blockId
        results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf,
          remainingBlocks.isEmpty))
        logDebug("remainingBlocks: " + remainingBlocks)
      }
    }
    logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
  }

  override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
    logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
    results.put(new FailureFetchResult(BlockId(blockId), address, e))
  }
}
- On success, the result is wrapped in a SuccessFetchResult that holds the blockId, the address, the data size and the ManagedBuffer, and is placed onto the results queue.
2.2.4 Iterating over the fetched data
override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch
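hasNext simply compares how many blocks have been processed against how many were scheduled. The interesting part is next(): it blocks on the results queue until a fetch result arrives, and after consuming one it tries to issue more FetchRequests now that some in-flight budget has been freed. A simplified sketch, modeled on ShuffleBlockFetcherIterator.next with metrics, retries and stream-wrapping details trimmed:
override def next(): (BlockId, InputStream) = {
  numBlocksProcessed += 1
  // Blocks the Task thread until the callback thread puts a result into the queue.
  val result = results.take()
  result match {
    case SuccessFetchResult(blockId, address, size, buf, _) =>
      if (address != blockManager.blockManagerId) {
        bytesInFlight -= size          // free up in-flight budget for remote blocks
      }
      // Now that budget was released, try to send the next queued FetchRequest(s).
      fetchUpToMaxBytes()
      (blockId, buf.createInputStream())
    case FailureFetchResult(blockId, address, e) =>
      throwFetchFailedException(blockId, address, e)
  }
}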
3. The fetch wire protocol
Earlier posts described many exchanges that rely on Java's built-in serialization, but the fetch protocol described above is a protocol standard that Spark defines itself, with its own encoder and decoder.
3.1 Message Encoder
public interface Message extends Encodable{}
The core is Encodable, which is a bit like Java's Serializable interface, except that each message has to provide its own encode and decode implementation:
public interface Encodable {
  /** Number of bytes of the encoded form of this object. */
  int encodedLength();

  /**
   * Serializes this object by writing into the given ByteBuf.
   * This method must write exactly encodedLength() bytes.
   */
  void encode(ByteBuf buf);
}
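For illustration, here is roughly what a concrete protocol component implementing Encodable looks like. This sketch mirrors Spark's StreamChunkId (a streamId plus a chunk index) but is written out here by hand in Scala, so treat the exact layout as an approximation:
import io.netty.buffer.ByteBuf
import org.apache.spark.network.protocol.Encodable

case class StreamChunkId(streamId: Long, chunkIndex: Int) extends Encodable {
  // 8 bytes for the long streamId plus 4 bytes for the int chunkIndex.
  override def encodedLength(): Int = 8 + 4

  override def encode(buf: ByteBuf): Unit = {
    buf.writeLong(streamId)
    buf.writeInt(chunkIndex)
  }
}

object StreamChunkId {
  // The decoder reads the fields back in exactly the order encode() wrote them.
  def decode(buf: ByteBuf): StreamChunkId =
    StreamChunkId(buf.readLong(), buf.readInt())
}
The same pattern, an encodedLength, an encode and a matching decode, is repeated for every message type in the protocol.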
Netty's MessageToMessageEncoder declares an abstract encode method:
protected abstract void encode(ChannelHandlerContext ctx, I msg, List<Object> out) throws Exception;
Spark implements this encode method in its own MessageEncoder:
public final class MessageEncoder extends MessageToMessageEncoder<Message> {

  private static final Logger logger = LoggerFactory.getLogger(MessageEncoder.class);

  /***
   * Encodes a Message by invoking its encode() method. For non-data messages, we will add one
   * ByteBuf to 'out' containing the total frame length, the message type, and the message itself.
   * In the case of a ChunkFetchSuccess, we will also add the ManagedBuffer corresponding to the
   * data to 'out', in order to enable zero-copy transfer.
   */
  @Override
  public void encode(ChannelHandlerContext ctx, Message in, List<Object> out) throws Exception {
    Object body = null;
    long bodyLength = 0;
    boolean isBodyInFrame = false;

    // If the message has a body, take it out to enable zero-copy transfer for the payload.
    if (in.body() != null) {
      try {
        bodyLength = in.body().size();
        body = in.body().convertToNetty();
        isBodyInFrame = in.isBodyInFrame();
      } catch (Exception e) {
        in.body().release();
        if (in instanceof AbstractResponseMessage) {
          AbstractResponseMessage resp = (AbstractResponseMessage) in;
          // Re-encode this message as a failure response.
          String error = e.getMessage() != null ? e.getMessage() : "null";
          logger.error(String.format("Error processing %s for client %s",
            in, ctx.channel().remoteAddress()), e);
          encode(ctx, resp.createFailureResponse(error), out);
        } else {
          throw e;
        }
        return;
      }
    }

    Message.Type msgType = in.type();
    // All messages have the frame length, message type, and message itself. The frame length
    // may optionally include the length of the body data, depending on what message is being
    // sent.
    int headerLength = 8 + msgType.encodedLength() + in.encodedLength();
    long frameLength = headerLength + (isBodyInFrame ? bodyLength : 0);
    ByteBuf header = ctx.alloc().heapBuffer(headerLength);
    header.writeLong(frameLength);
    msgType.encode(header);
    in.encode(header);
    assert header.writableBytes() == 0;

    if (body != null) {
      // We transfer ownership of the reference on in.body() to MessageWithHeader.
      // This reference will be freed when MessageWithHeader.deallocate() is called.
      out.add(new MessageWithHeader(in.body(), header, body, bodyLength));
    } else {
      out.add(header);
    }
  }
}
3.2 Message Decoder
public final class MessageDecoder extends MessageToMessageDecoder<ByteBuf>
In the decode method, the ByteBuf is decoded directly into a Message:
public void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
  Message.Type msgType = Message.Type.decode(in);
  Message decoded = decode(msgType, in);
  assert decoded.type() == msgType;
  logger.trace("Received message {}: {}", msgType, decoded);
  out.add(decoded);
}
3.3 Transferring the file
3.3.1 Sending the file
respond(new ChunkFetchSuccess(req.streamChunkId, buf));
The ManagedBuffer inside buf is a FileSegmentManagedBuffer, and in the encode method we just saw:
body = in.body().convertToNetty();
For a ChunkFetchSuccess, in.body() is the FileSegmentManagedBuffer, whose convertToNetty method is:
public Object convertToNetty() throws IOException {
  if (conf.lazyFileDescriptor()) {
    return new DefaultFileRegion(file, offset, length);
  } else {
    FileChannel fileChannel = new FileInputStream(file).getChannel();
    return new DefaultFileRegion(fileChannel, offset, length);
  }
}
It returns a DefaultFileRegion, which is Netty's way of transferring a file with zero copy: a FileRegion is written out by calling transferTo, which copies the file without passing it through user space. Zero copy itself is not covered here.
public abstract long transferTo(WritableByteChannel target, long position)
    throws IOException;
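As a standalone illustration of the primitive that DefaultFileRegion ultimately relies on, the sketch below pushes a file to a WritableByteChannel with FileChannel.transferTo, letting the kernel copy the bytes instead of bouncing them through user space (the path and the target channel are placeholders supplied by the caller):
import java.io.FileInputStream
import java.nio.channels.{FileChannel, WritableByteChannel}

def sendFileZeroCopy(path: String, target: WritableByteChannel): Long = {
  val channel: FileChannel = new FileInputStream(path).getChannel
  try {
    val size = channel.size()
    var position = 0L
    // transferTo may move fewer bytes than requested, so loop until the file is fully sent.
    while (position < size) {
      position += channel.transferTo(position, size - position, target)
    }
    position
  } finally {
    channel.close()
  }
}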
The catch is that the encode method adds a MessageWithHeader object to the output, not the DefaultFileRegion itself:
if (body != null) {
  // We transfer ownership of the reference on in.body() to MessageWithHeader.
  // This reference will be freed when MessageWithHeader.deallocate() is called.
  out.add(new MessageWithHeader(in.body(), header, body, bodyLength));
}
So what exactly is MessageWithHeader?
class MessageWithHeader extends AbstractReferenceCounted implements FileRegion
It turns out to be a FileRegion itself, and for Netty a FileRegion is ultimately written out through transferTo:
public long transferTo(final WritableByteChannel target, final long position) throws IOException {
  Preconditions.checkArgument(position == totalBytesTransferred, "Invalid position.");
  // Bytes written for header in this call.
  long writtenHeader = 0;
  if (header.readableBytes() > 0) {
    writtenHeader = copyByteBuf(header, target);
    totalBytesTransferred += writtenHeader;
    if (header.readableBytes() > 0) {
      return writtenHeader;
    }
  }

  // Bytes written for body in this call.
  long writtenBody = 0;
  if (body instanceof FileRegion) {
    writtenBody = ((FileRegion) body).transferTo(target, totalBytesTransferred - headerLength);
  } else if (body instanceof ByteBuf) {
    writtenBody = copyByteBuf((ByteBuf) body, target);
  }
  totalBytesTransferred += writtenBody;

  return writtenHeader + writtenBody;
}
MessageWithHeader cleverly wraps the header and the file into a single FileRegion: inside transferTo it first writes the header, and then delegates to the body:
writtenBody = ((FileRegion) body).transferTo(target, totalBytesTransferred - headerLength);
3.3.2 Receiving the file
public static ChunkFetchSuccess decode(ByteBuf buf) {
  StreamChunkId streamChunkId = StreamChunkId.decode(buf);
  buf.retain();
  NettyManagedBuffer managedBuf = new NettyManagedBuffer(buf.duplicate());
  return new ChunkFetchSuccess(streamChunkId, managedBuf);
}
4. Summary
- Fetching shuffle data distinguishes local data from remote data; the distinction is made by comparing ExecutorIDs.
- A single Task thread fetches and processes shuffle data serially, with a block as the smallest unit.
- Remote fetches covering multiple blocks send their requests asynchronously; the results come back through callbacks and are pushed into a blocking queue, from which the Task thread takes them for reading and computation.
- The fetch protocol does not use Java's default serialization; instead Spark wraps its own Encode/Decode logic to encode and decode the messages.