Spark 2.0 shuffle service

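The shuffle service is at the core of Spark. This article looks at the overall architecture of the block transfer service for the path that does not use ExternalShuffleClient. ShuffleClient is the foundation of the whole framework and declares two methods, init and fetchBlocks.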

import java.io.Closeable;

/** Provides an interface for reading shuffle files, either from an Executor or external service. */
public abstract class ShuffleClient implements Closeable {

  /**
   * Initializes the ShuffleClient, specifying this Executor's appId.
   * Must be called before any other method on the ShuffleClient.
   */
  public void init(String appId) { }

  /**
   * Fetch a sequence of blocks from a remote node asynchronously,
   *
   * Note that this API takes a sequence so the implementation can batch requests, and does not
   * return a future so the underlying implementation can invoke onBlockFetchSuccess as soon as
   * the data of a block is fetched, rather than waiting for all blocks to be fetched.
   */
  public abstract void fetchBlocks(
      String host,
      int port,
      String execId,
      String[] blockIds,
      BlockFetchingListener listener);
}
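The per-block callback contract described in the fetchBlocks comment can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: SketchListener, InMemoryShuffleClient, and FetchBlocksSketch are hypothetical names, the blocks live in a local map instead of on a remote node, ByteBuffer stands in for ManagedBuffer, and the callbacks run inline rather than asynchronously. It only shows that init must come first and that the listener is invoked once per block instead of once per batch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for BlockFetchingListener; uses ByteBuffer instead of
// ManagedBuffer so the sketch compiles without Spark on the classpath.
interface SketchListener {
  void onBlockFetchSuccess(String blockId, ByteBuffer data);
  void onBlockFetchFailure(String blockId, Throwable exception);
}

// Hypothetical in-memory client with the same call shape as ShuffleClient.
class InMemoryShuffleClient {
  private final Map<String, byte[]> store = new HashMap<>();
  private String appId;

  void init(String appId) { this.appId = appId; }

  void put(String blockId, String contents) {
    store.put(blockId, contents.getBytes(StandardCharsets.UTF_8));
  }

  // host, port and execId are ignored here; a real client would use them to
  // locate the remote executor. Callbacks run inline, one per block.
  void fetchBlocks(String host, int port, String execId,
                   String[] blockIds, SketchListener listener) {
    if (appId == null) {
      throw new IllegalStateException("init(appId) must be called first");
    }
    for (String blockId : blockIds) {
      byte[] bytes = store.get(blockId);
      if (bytes == null) {
        listener.onBlockFetchFailure(blockId, new IOException("block not found: " + blockId));
      } else {
        listener.onBlockFetchSuccess(blockId, ByteBuffer.wrap(bytes));
      }
    }
  }
}

class FetchBlocksSketch {
  // Fetches three blocks (one missing) and records each callback in order.
  static List<String> run() {
    InMemoryShuffleClient client = new InMemoryShuffleClient();
    client.init("app-1");
    client.put("shuffle_0_0_0", "abc");
    client.put("shuffle_0_1_0", "defgh");

    List<String> events = new ArrayList<>();
    client.fetchBlocks("localhost", 7337, "exec-1",
        new String[] {"shuffle_0_0_0", "missing", "shuffle_0_1_0"},
        new SketchListener() {
          @Override
          public void onBlockFetchSuccess(String blockId, ByteBuffer data) {
            events.add("ok " + blockId + " " + data.remaining());
          }

          @Override
          public void onBlockFetchFailure(String blockId, Throwable exception) {
            events.add("fail " + blockId);
          }
        });
    return events;
  }

  public static void main(String[] args) {
    run().forEach(System.out::println);
  }
}
```

Because each block is reported on its own, one missing block does not prevent the caller from consuming the blocks that did arrive.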

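The BlockFetchingListener interface has two callbacks. onBlockFetchSuccess is called once for each successfully fetched block. After it returns, the data is released automatically; if the data is passed to another thread, the receiver must call retain() and release() itself, or copy the data to a new buffer. onBlockFetchFailure is called at least once for each block that fails to fetch.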

import java.util.EventListener;

import org.apache.spark.network.buffer.ManagedBuffer;

public interface BlockFetchingListener extends EventListener {

  /**
   * Called once per successfully fetched block. After this call returns, data will be released
   * automatically. If the data will be passed to another thread, the receiver should retain()
   * and release() the buffer on their own, or copy the data to a new buffer.
   */
  void onBlockFetchSuccess(String blockId, ManagedBuffer data);

  /**
   * Called at least once per block upon failures.
   */
  void onBlockFetchFailure(String blockId, Throwable exception);
}
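The retain()/release() rule in the onBlockFetchSuccess comment is easier to see with a toy reference count. The sketch below is only an illustration under assumptions: RefCountedBuffer and RetainReleaseSketch are hypothetical classes, not Spark's ManagedBuffer, and the "framework" is simulated by a plain release() call after the callback body. It shows a receiver taking its own reference before handing the buffer to a worker thread, so the buffer is still valid when the worker reads it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted buffer standing in for ManagedBuffer.
class RefCountedBuffer {
  private final byte[] bytes;
  private final AtomicInteger refCount = new AtomicInteger(1);

  RefCountedBuffer(byte[] bytes) { this.bytes = bytes; }

  RefCountedBuffer retain() { refCount.incrementAndGet(); return this; }

  void release() { refCount.decrementAndGet(); }

  // Reading a buffer whose count has dropped to zero is treated as an error.
  int size() {
    if (refCount.get() <= 0) {
      throw new IllegalStateException("buffer already released");
    }
    return bytes.length;
  }
}

class RetainReleaseSketch {
  // Returns the size observed by the worker thread after the simulated
  // framework has already dropped its own reference.
  static int run() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      RefCountedBuffer data = new RefCountedBuffer(new byte[] {1, 2, 3, 4});
      CountDownLatch callbackReturned = new CountDownLatch(1);

      // Simulated body of onBlockFetchSuccess: take our own reference
      // before handing the buffer to another thread.
      data.retain();
      Future<Integer> seen = pool.submit(() -> {
        callbackReturned.await(); // read only after the callback has returned
        try {
          return data.size();
        } finally {
          data.release(); // drop the reference taken above
        }
      });

      // The callback has returned, so the framework releases its reference.
      data.release();
      callbackReturned.countDown();
      return seen.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("worker saw size " + run());
  }
}
```

Without the retain() call, the worker in this sketch would find the count at zero and size() would throw, which mirrors why the comment asks receivers to retain or copy.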

BlockTransferService extends ShuffleClient and provides common implementations of some of its methods.

package org.apache.spark.network

import java.io.Closeable
import java.nio.ByteBuffer

import scala.concurrent.{Future, Promise}
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag

import org.apache.spark.internal.Logging
import org.apache.spark.network.buffer.{ManagedBuffer, NioManagedBuffer}
import org.apache.spark.network.shuffle.{BlockFetchingListener, ShuffleClient}
import org.apache.spark.storage.{BlockId, StorageLevel}
import org.apache.spark.util.ThreadUtils

private[spark]
abstract class BlockTransferService extends ShuffleClient with Closeable with Logging {

  /**
   * Initialize the transfer service by giving it the BlockDataManager that can be used to fetch
   * local blocks or put local blocks.
   */
  def init(blockDataManager: BlockDataManager): Unit

  /**
   * Tear down the transfer service.
   */
  def close(): Unit

  /**
   * Port number the service is listening on, available only after [[init]] is invoked.
   */
  def port: Int

  /**
   * Host name the service is listening on, available only after [[init]] is invoked.
   */
  def hostName: String

  /**
   * Fetch a sequence of blocks from a remote node asynchronously,
   * available only after [[init]] is invoked.
   *
   * Note that this API takes a sequence so the implementation can batch requests, and does not
   * return a future so the underlying implementation can invoke onBlockFetchSuccess as soon as
   * the data of a block is fetched, rather than waiting for all blocks to be fetched.
   */
  override def fetchBlocks(
      host: String,
      port: Int,
      execId: String,
      blockIds: Array[String],
      listener: BlockFetchingListener): Unit

  /**
   * Upload a single block to a remote node, available only after [[init]] is invoked.
   */
  def uploadBlock(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel,
      classTag: ClassTag[_]): Future[Unit]

  /**
   * A special case of [[fetchBlocks]], as it fetches only one block and is blocking.
   *
   * It is also only available after [[init]] is invoked.
   */
  def fetchBlockSync(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
    // A monitor for the thread to wait on.
    val result = Promise[ManagedBuffer]()
    fetchBlocks(host, port, execId, Array(blockId),
      new BlockFetchingListener {
        override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
          result.failure(exception)
        }
        override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
          // Copy the bytes, because data is released once this callback returns.
          val ret = ByteBuffer.allocate(data.size.toInt)
          ret.put(data.nioByteBuffer())
          ret.flip()
          result.success(new NioManagedBuffer(ret))
        }
      })
    ThreadUtils.awaitResult(result.future, Duration.Inf)
  }

  /**
   * Upload a single block to a remote node, available only after [[init]] is invoked.
   *
   * This method is similar to [[uploadBlock]], except this one blocks the thread
   * until the upload finishes.
   */
  def uploadBlockSync(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel,
      classTag: ClassTag[_]): Unit = {
    val future = uploadBlock(hostname, port, execId, blockId, blockData, level, classTag)
    ThreadUtils.awaitResult(future, Duration.Inf)
  }
}
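fetchBlockSync turns the callback-based fetchBlocks into a blocking call by completing a Promise from the listener and waiting on it. A rough Java analogue of that pattern is sketched below, using CompletableFuture in place of Scala's Promise. FetchCallback, AsyncFetcher, FetchSyncSketch, and demoFetcher are hypothetical names introduced only for this illustration, and the demo fetcher zeroes its buffer after the callback to mimic the data being released.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

// Hypothetical callback and fetcher types, standing in for
// BlockFetchingListener and the asynchronous fetchBlocks call.
interface FetchCallback {
  void onSuccess(String blockId, ByteBuffer data);
  void onFailure(String blockId, Throwable exception);
}

interface AsyncFetcher {
  void fetch(String blockId, FetchCallback callback);
}

class FetchSyncSketch {
  // Blocks the calling thread until the single block arrives or fails.
  static ByteBuffer fetchSync(AsyncFetcher fetcher, String blockId) {
    CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
    fetcher.fetch(blockId, new FetchCallback() {
      @Override
      public void onSuccess(String id, ByteBuffer data) {
        // Copy inside the callback: the source may be reused after it returns.
        ByteBuffer copy = ByteBuffer.allocate(data.remaining());
        copy.put(data.duplicate());
        copy.flip();
        result.complete(copy);
      }

      @Override
      public void onFailure(String id, Throwable exception) {
        result.completeExceptionally(exception);
      }
    });
    // join() waits without a timeout and throws CompletionException on failure.
    return result.join();
  }

  // Demo fetcher: answers on another thread, then zeroes its own buffer to
  // simulate the data being released once the callback has returned.
  static AsyncFetcher demoFetcher() {
    return (blockId, callback) -> new Thread(() -> {
      if (blockId.startsWith("missing")) {
        callback.onFailure(blockId, new IOException("no such block: " + blockId));
        return;
      }
      byte[] shared = blockId.getBytes(StandardCharsets.UTF_8);
      callback.onSuccess(blockId, ByteBuffer.wrap(shared));
      Arrays.fill(shared, (byte) 0);
    }).start();
  }

  public static void main(String[] args) {
    ByteBuffer block = fetchSync(demoFetcher(), "b1");
    System.out.println(new String(block.array(), StandardCharsets.UTF_8));
  }
}
```

The copy in onSuccess plays the same role as the ByteBuffer.allocate / put / flip sequence in fetchBlockSync: the caller receives bytes it owns, independent of when the original buffer is released.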

NettyBlockTransferService extends BlockTransferService.
