Spark Notes 16: BlockManager

Start with an article on the underlying principles: http://jerryshao.me/architecture/2013/10/08/spark-storage-module-analysis/ . In addition, Spark's storage uses the concept of a segmented file (http://en.wikipedia.org/wiki/Segmented_file_transfer ). In short, a file is divided into multiple segments that are stored on different servers, and a read fetches from those servers at the same time. (This is also the basis of BitTorrent.)

The earlier analysis of the shuffle call chain already covered much of the BlockManager flow, but it is still worth reading through its code systematically.

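The function getLocalFromDisk was the end point of the earlier ShuffleManager walkthrough, but it is the starting point for BlockManager. Even fetching a block from a remote node works by sending a message to the remote server, which executes getLocalFromDisk there and sends the result back.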

->diskStore.getValues(blockId, serializer)

============================BlockManager============================

-> BlockManager::getLocalFromDisk

->diskStore.getValues(blockId, serializer)

->getBytes(blockId).map(bytes => blockManager.dataDeserialize(blockId, bytes, serializer))

->val segment = diskManager.getBlockLocation(blockId) -- a DiskBlockManager method; returns the block's location as a segment within a file

->if blockId.isShuffle and env.shuffleManager.isInstanceOf[SortShuffleManager] -- if this is a sort-based shuffle

->sortShuffleManager.getBlockLocation(blockId.asInstanceOf[ShuffleBlockId], this) --For sort-based shuffle, let it figure out its blocks

->else if blockId.isShuffle and shuffleBlockManager.consolidateShuffleFiles -- consolidated-files mode

->shuffleBlockManager.getBlockLocation(blockId.asInstanceOf[ShuffleBlockId]) --For hash-based shuffle with consolidated files

->val shuffleState = shuffleStates(id.shuffleId)

->for (fileGroup <- shuffleState.allFileGroups)

->val segment = fileGroup.getFileSegmentFor(id.mapId, id.reduceId) -- this function is analyzed separately below

->if (segment.isDefined) { return segment.get }

->else

->val file = getFile(blockId.name)--getFile(filename: String): File

->val hash = Utils.nonNegativeHash(filename)

->val dirId = hash % localDirs.length

->val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

->var subDir = subDirs(dirId)(subDirId)

->new File(subDir, filename)

->new FileSegment(file, 0, file.length())

->val channel = new RandomAccessFile(segment.file, "r").getChannel

->if (segment.length < minMemoryMapBytes)

->channel.position(segment.offset)

->channel.read(buf)

->return buf

->else

->return Some(channel.map(MapMode.READ_ONLY, segment.offset, segment.length))
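The last two stages of the trace above (choosing a directory from the file-name hash, then reading a segment either by copying or by memory-mapping) can be sketched as follows. This is a minimal sketch, not Spark's code: `Segment`, `locate` and `read` are hypothetical names, and the hash is a simplified stand-in for `Utils.nonNegativeHash`.

```scala
import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

object SegmentReadSketch {
  // Hypothetical stand-in for Spark's FileSegment: a byte range inside a file.
  final case class Segment(file: File, offset: Long, length: Long)

  // Map a file name to (dirId, subDirId) with the same arithmetic as the trace.
  // Assumption: this hash is a simplified stand-in for Utils.nonNegativeHash.
  def locate(filename: String, numLocalDirs: Int, subDirsPerLocalDir: Int): (Int, Int) = {
    val h = filename.hashCode
    val hash = if (h == Int.MinValue) 0 else math.abs(h)
    val dirId = hash % numLocalDirs
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
    (dirId, subDirId)
  }

  // Copy small segments into a heap buffer; memory-map large ones.
  def read(segment: Segment, minMemoryMapBytes: Long): ByteBuffer = {
    val channel = new RandomAccessFile(segment.file, "r").getChannel
    try {
      if (segment.length < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(segment.length.toInt)
        channel.position(segment.offset)
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (buf.hasRemaining && channel.read(buf) != -1) {}
        buf.flip()
        buf
      } else {
        // A mapping stays valid after the channel that created it is closed.
        channel.map(MapMode.READ_ONLY, segment.offset, segment.length)
      }
    } finally {
      channel.close()
    }
  }
}
```

The threshold exists because mapping has a fixed setup cost, so for small segments a plain copy is likely cheaper.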

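ShuffleFileGroup: how ShuffleBlockManager locates data by mapId and reduceId (the getFileSegmentFor function)

->Use reduceId to pick the reducer's file handle fd from the ShuffleFileGroup field val files: Array[File]

->Use mapId to look up index in mapIdToIndex

->Use the reducer id to pick that reducer's blockOffsets and blockLengths vectors

->Use index to read offset and len from those vectors

->Finally, use offset and len to read the required data from fd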

Fetching data locally

->BlockManager::doGetLocal

->val info = blockInfo.get(blockId).orNull

->val level = info.level

->if (level.useMemory) --Look for the block in memory

->val result = if (asBlockResult)

->memoryStore.getValues(blockId).map(new BlockResult(_, DataReadMethod.Memory, info.size))

->else

->memoryStore.getBytes(blockId)

->if (level.useOffHeap) -- Look for the block in Tachyon

->tachyonStore.getBytes(blockId)

->if (level.useDisk)

->val bytes: ByteBuffer = diskStore.getBytes(blockId)

->if (!level.useMemory) // If the block shouldn't be stored in memory, we can just return it

->if (asBlockResult)

->return Some(new BlockResult(dataDeserialize(blockId, bytes), DataReadMethod.Disk, info.size))

->else

->return Some(bytes)

->else -- Otherwise, we also have to store something in the memory store

->if (!level.deserialized || !asBlockResult) -- store the bytes in memory if the storage level includes "memory serialized", or if the block should be cached as objects in memory but only its serialized bytes were requested

->val copyForMemory = ByteBuffer.allocate(bytes.limit)

->copyForMemory.put(bytes)

->memoryStore.putBytes(blockId, copyForMemory, level)

->if (!asBlockResult)

->return Some(bytes)

->else -- the bytes must be deserialized before being written to memory

->val values = dataDeserialize(blockId, bytes)

->if (level.deserialized) // Cache the values before returning them

->val putResult = memoryStore.putIterator(blockId, values, level, returnValues = true, allowPersistToDisk = false)

->putResult.data match case Left(it) return Some(new BlockResult(it, DataReadMethod.Disk, info.size))

->else

->return Some(new BlockResult(values, DataReadMethod.Disk, info.size))

->val values = dataDeserialize(blockId, bytes)
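The lookup order in doGetLocal (memory first, then disk, caching a copy in memory after a disk hit when the level allows it) can be modeled with two in-memory maps. This is a simplified sketch under stated assumptions: `Level`, `Store` and `getLocal` are hypothetical names, the Tachyon tier is omitted, and blocks are plain byte arrays.

```scala
import scala.collection.mutable

object TieredLookupSketch {
  // Hypothetical, simplified storage level: only the flags used in the trace.
  final case class Level(useMemory: Boolean, useDisk: Boolean)

  final class Store {
    val memory = mutable.Map.empty[String, Array[Byte]]
    val disk = mutable.Map.empty[String, Array[Byte]]

    // Look in memory first, then on disk. On a disk hit, cache a copy in
    // memory when the level allows it, so the next read is served from memory.
    def getLocal(id: String, level: Level): Option[Array[Byte]] = {
      val fromMemory = if (level.useMemory) memory.get(id) else None
      fromMemory.orElse {
        val fromDisk = if (level.useDisk) disk.get(id) else None
        fromDisk.foreach { bytes =>
          if (level.useMemory) memory(id) = bytes.clone()
        }
        fromDisk
      }
    }
  }
}
```

The copy mirrors the `copyForMemory` step in the trace: the cached buffer is kept separate from the one handed back to the caller.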

Fetching data from a remote node

->BlockManager::doGetRemote

->val locations = Random.shuffle(master.getLocations(blockId)) -- shuffle the locations randomly

->for (loc <- locations) -- iterate over all locations

->val data = BlockManagerWorker.syncGetBlock(GetBlock(blockId), ConnectionManagerId(loc.host, loc.port))

->val blockMessage = BlockMessage.fromGetBlock(msg)

->val newBlockMessage = new BlockMessage()

->newBlockMessage.set(getBlock)

->typ = BlockMessage.TYPE_GET_BLOCK

->id = getBlock.id

->val blockMessageArray = new BlockMessageArray(blockMessage)

-> val responseMessage = Try(Await.result(connectionManager.sendMessageReliably(toConnManagerId, blockMessageArray.toBufferMessage), Duration.Inf))

->responseMessage match {case Success(message) => val bufferMessage = message.asInstanceOf[BufferMessage]

->logDebug("Response message received " + bufferMessage)

->BlockMessageArray.fromBufferMessage(bufferMessage).foreach(blockMessage =>

->logDebug("Found " + blockMessage)

->return blockMessage.getData

->return Some(data)
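The core of doGetRemote, trying the block's locations in random order and returning the first successful result, can be sketched without any networking. This is an illustration only: `getRemote` and the `fetch` callback are hypothetical stand-ins for the blocking request to one peer.

```scala
import scala.util.Random

object RemoteFetchSketch {
  // `fetch` is a hypothetical stand-in for the blocking request to one peer;
  // it returns None when that peer cannot supply the block.
  def getRemote[A](locations: Seq[String], fetch: String => Option[A],
                   rng: Random = new Random): Option[A] = {
    // Shuffle so that requests are spread across replicas, then take the
    // first hit. The iterator is lazy, so peers after the hit are not contacted.
    rng.shuffle(locations).iterator.map(fetch).collectFirst { case Some(data) => data }
  }
}
```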

===========================end=================================

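Referring to this figure again: multiple map tasks can correspond to a single file, with each map task owning certain segments of that file. This is done to reduce the number of files.

(Image source: http://jerryshao.me/architecture/2014/01/04/spark-shuffle-detail-investigation/ )

The data structure returned when fetching block data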

/* Class for returning a fetched block and associated metrics. */
private[spark] class BlockResult(
    val data: Iterator[Any],
    readMethod: DataReadMethod.Value,
    bytes: Long) {
  val inputMetrics = new InputMetrics(readMethod)
  inputMetrics.bytesRead = bytes
}

private[spark] class BlockManager(
    executorId: String,
    actorSystem: ActorSystem,
    val master: BlockManagerMaster,
    defaultSerializer: Serializer,
    maxMemory: Long,
    val conf: SparkConf,
    securityManager: SecurityManager,
    mapOutputTracker: MapOutputTracker,
    shuffleManager: ShuffleManager)
  extends BlockDataProvider with Logging {

Shuffle state: mainly holds two fields, unusedFileGroups (a pool of ShuffleFileGroups not currently in use) and allFileGroups (every ShuffleFileGroup created for the shuffle)

/**
 * Contains all the state related to a particular shuffle. This includes a pool of unused
 * ShuffleFileGroups, as well as all ShuffleFileGroups that have been created for the shuffle.
 */
private class ShuffleState(val numBuckets: Int) {
  val nextFileId = new AtomicInteger(0)
  val unusedFileGroups = new ConcurrentLinkedQueue[ShuffleFileGroup]()
  val allFileGroups = new ConcurrentLinkedQueue[ShuffleFileGroup]()

  /**
   * The mapIds of all map tasks completed on this Executor for this shuffle.
   * NB: This is only populated if consolidateShuffleFiles is FALSE. We don't need it otherwise.
   */
  val completedMapTasks = new ConcurrentLinkedQueue[Int]()
}
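The two queues suggest a simple pooling scheme: a task takes an idle group if one is available, otherwise a new group is created, and the group goes back to the pool when the task finishes. The sketch below illustrates that idea only; `Group`, `Pool`, `acquire` and `release` are hypothetical names rather than Spark's API.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

object FileGroupPoolSketch {
  // Hypothetical stand-in for ShuffleFileGroup: only the id matters here.
  final case class Group(fileId: Int)

  final class Pool {
    private val nextFileId = new AtomicInteger(0)
    val unused = new ConcurrentLinkedQueue[Group]()
    val all = new ConcurrentLinkedQueue[Group]()

    // Reuse an idle group if one exists; otherwise create and remember a new one.
    def acquire(): Group = {
      val idle = unused.poll() // poll() returns null when the queue is empty
      if (idle != null) idle
      else {
        val created = Group(nextFileId.getAndIncrement())
        all.add(created)
        created
      }
    }

    // Return a group to the pool once its task has finished writing.
    def release(group: Group): Unit = unused.add(group)
  }
}
```

With this scheme the number of groups is bounded by the number of concurrently running tasks rather than by the total number of map tasks.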

shuffleStates is a timestamp-based hash table

private val shuffleStates = new TimeStampedHashMap[ShuffleId, ShuffleState]

private val metadataCleaner =
  new MetadataCleaner(MetadataCleanerType.SHUFFLE_BLOCK_MANAGER, this.cleanup, conf)
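A timestamp-based map paired with a periodic cleaner can be modeled as below. This is a simplified sketch and not the TimeStampedHashMap implementation: `TimestampedMap` and `clearOlderThan` are hypothetical names, and the clock is injectable so that the behavior can be exercised deterministically.

```scala
import scala.collection.mutable

object TimestampedMapSketch {
  // Simplified model: each value is stored with the time it was inserted.
  final class TimestampedMap[K, V](now: () => Long = () => System.currentTimeMillis()) {
    private val entries = mutable.Map.empty[K, (V, Long)]

    def put(key: K, value: V): Unit = entries.update(key, (value, now()))
    def get(key: K): Option[V] = entries.get(key).map(_._1)
    def size: Int = entries.size

    // Drop every entry inserted strictly before `threshold`. A periodic
    // cleaner would call this with (current time - time-to-live).
    def clearOlderThan(threshold: Long): Unit = {
      val stale = entries.collect { case (k, (_, ts)) if ts < threshold => k }.toList
      entries --= stale
    }
  }
}
```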

Used by sort-based shuffle: when a map task completes, it is registered in shuffleStates

/**
 * Register a completed map without getting a ShuffleWriterGroup. Used by sort-based shuffle
 * because it just writes a single file by itself.
 */
def addCompletedMap(shuffleId: Int, mapId: Int, numBuckets: Int): Unit = {
  shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
  val shuffleState = shuffleStates(shuffleId)
  shuffleState.completedMapTasks.add(mapId)
}

Registers itself with the master

/**
 * Initialize the BlockManager. Register to the BlockManagerMaster, and start the
 * BlockManagerWorker actor.
 */
private def initialize(): Unit = {
  master.registerBlockManager(blockManagerId, maxMemory, slaveActor)
  BlockManagerWorker.startBlockManagerWorker(this)
}

Gets a block's data from local disk. This is a convenience shortcut.

/**
 * A short-circuited method to get blocks directly from disk. This is used for getting
 * shuffle blocks. It is safe to do so without a lock on block info since disk store
 * never deletes (recent) items.
 */
def getLocalFromDisk(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]] = {
  diskStore.getValues(blockId, serializer).orElse {
    throw new BlockException(blockId, s"Block $blockId not found on disk, though it should be")
  }
}

ShuffleWriterGroup: each ShuffleMapTask has a group of shuffle writers, one writer per reducer. Currently only hash-based shuffle uses it. The only instantiation is the one returned by forMapTask, which HashShuffleWriter holds in its shuffle field:

/** A group of writers for a ShuffleMapTask, one writer per reducer. */
private[spark] trait ShuffleWriterGroup {
  val writers: Array[BlockObjectWriter]

  /** @param success Indicates all writes were successful. If false, no blocks will be recorded. */
  def releaseWriters(success: Boolean)
}

/**
 * Manages assigning disk-based block writers to shuffle tasks. Each shuffle task gets one file
 * per reducer (this set of files is called a ShuffleFileGroup).
 *
 * As an optimization to reduce the number of physical shuffle files produced, multiple shuffle
 * blocks are aggregated into the same file. There is one "combined shuffle file" per reducer
 * per concurrently executing shuffle task. As soon as a task finishes writing to its shuffle
 * files, it releases them for another task.
 * Regarding the implementation of this feature, shuffle files are identified by a 3-tuple:
 *   - shuffleId: The unique id given to the entire shuffle stage.
 *   - bucketId: The id of the output partition (i.e., reducer id)
 *   - fileId: The unique id identifying a group of "combined shuffle files." Only one task at a
 *       time owns a particular fileId, and this id is returned to a pool when the task finishes.
 * Each shuffle file is then mapped to a FileSegment, which is a 3-tuple (file, offset, length)
 * that specifies where in a given file the actual block data is located.
 *
 * Shuffle file metadata is stored in a space-efficient manner. Rather than simply mapping
 * ShuffleBlockIds directly to FileSegments, each ShuffleFileGroup maintains a list of offsets for
 * each block stored in each file. In order to find the location of a shuffle block, we search the
 * files within a ShuffleFileGroups associated with the block's reducer.
 */
// TODO: Factor this into a separate class for each ShuffleManager implementation
private[spark]
class ShuffleBlockManager(blockManager: BlockManager,
                          shuffleManager: ShuffleManager) extends Logging {

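ShuffleFileGroup is a group of files, one per reducer. Each map task is assigned one such group (but multiple map tasks can share the same file). When multiple map tasks share a file, each writes its own segment, and the block for a given (mapId, reduceId) pair is located with getFileSegmentFor.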

private[spark]
object ShuffleBlockManager {
  /**
   * A group of shuffle files, one per reducer.
   * A particular mapper will be assigned a single ShuffleFileGroup to write its output to.
   */
  private class ShuffleFileGroup(val shuffleId: Int, val fileId: Int, val files: Array[File]) {
    private var numBlocks: Int = 0

    /**
     * Stores the absolute index of each mapId in the files of this group. For instance,
     * if mapId 5 is the first block in each file, mapIdToIndex(5) = 0.
     */
    private val mapIdToIndex = new PrimitiveKeyOpenHashMap[Int, Int]()

    /**
     * Stores consecutive offsets and lengths of blocks into each reducer file, ordered by
     * position in the file.
     * Note: mapIdToIndex(mapId) returns the index of the mapper into the vector for every
     * reducer.
     */
    private val blockOffsetsByReducer = Array.fill[PrimitiveVector[Long]](files.length) {
      new PrimitiveVector[Long]()
    }
    private val blockLengthsByReducer = Array.fill[PrimitiveVector[Long]](files.length) {
      new PrimitiveVector[Long]()
    }

    def apply(bucketId: Int) = files(bucketId)

    def recordMapOutput(mapId: Int, offsets: Array[Long], lengths: Array[Long]) {
      assert(offsets.length == lengths.length)
      mapIdToIndex(mapId) = numBlocks
      numBlocks += 1
      for (i <- 0 until offsets.length) {
        blockOffsetsByReducer(i) += offsets(i)
        blockLengthsByReducer(i) += lengths(i)
      }
    }

    /** Returns the FileSegment associated with the given map task, or None if no entry exists. */
    def getFileSegmentFor(mapId: Int, reducerId: Int): Option[FileSegment] = {
      val file = files(reducerId)
      val blockOffsets = blockOffsetsByReducer(reducerId)
      val blockLengths = blockLengthsByReducer(reducerId)
      val index = mapIdToIndex.getOrElse(mapId, -1)
      if (index >= 0) {
        val offset = blockOffsets(index)
        val length = blockLengths(index)
        Some(new FileSegment(file, offset, length))
      } else {
        None
      }
    }
  }
}
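To see how two map tasks end up in different segments of the same reducer file, here is a small standalone model of the bookkeeping above, using ordinary Scala collections instead of Spark's primitive collections. The names `FileGroup`, `Segment` and `segmentFor` are hypothetical, and unlike the original this sketch computes the offsets itself from a running end-of-file position per reducer.

```scala
import scala.collection.mutable

object ConsolidatedFileSketch {
  // (offset, length) of one map task's block inside one reducer's file.
  final case class Segment(reducerId: Int, offset: Long, length: Long)

  final class FileGroup(numReducers: Int) {
    private val mapIdToIndex = mutable.Map.empty[Int, Int]
    private val offsets = Array.fill(numReducers)(mutable.ArrayBuffer.empty[Long])
    private val lengths = Array.fill(numReducers)(mutable.ArrayBuffer.empty[Long])
    // Current end of each reducer file; the next map task appends from here.
    private val fileEnds = Array.fill(numReducers)(0L)

    // Append one map task's output: blockLengths(r) bytes go to reducer r's file.
    def recordMapOutput(mapId: Int, blockLengths: Array[Long]): Unit = {
      require(blockLengths.length == numReducers)
      mapIdToIndex(mapId) = mapIdToIndex.size
      for (r <- 0 until numReducers) {
        offsets(r) += fileEnds(r)
        lengths(r) += blockLengths(r)
        fileEnds(r) += blockLengths(r)
      }
    }

    // Look up where a given map task's block for a given reducer lives.
    def segmentFor(mapId: Int, reducerId: Int): Option[Segment] =
      mapIdToIndex.get(mapId).map { i =>
        Segment(reducerId, offsets(reducerId)(i), lengths(reducerId)(i))
      }
  }
}
```

For example, if map 5 writes 10 bytes for reducer 0 and map 7 then writes 3 bytes, map 7's block for reducer 0 starts at offset 10 in the shared file. Only one index per map task is stored, which is what makes the metadata compact.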
