Spark DiskBlockManager

An RDD can be persisted to local disk. Persistence at the disk storage level is implemented as follows:

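DiskBlockManager manages and maintains the mapping between blocks and their on-disk locations. Each block is stored in a file named after its blockId, and when several local directories are configured, files are distributed across them by the hash of that name.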

It covers creating directories, deleting and reading files, and a few mechanisms for deleting files on exit.
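The hashing scheme described above can be sketched on its own. This is a minimal illustration, not Spark's code: `BlockPlacementSketch` and `place` are hypothetical names, and the local `nonNegativeHash` is assumed to behave like Spark's `Utils.nonNegativeHash` for non-null strings.

```scala
object BlockPlacementSketch {
  // Assumption: stands in for Spark's Utils.nonNegativeHash on a non-null String.
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  /** Returns (index of the local dir, index of the subdirectory inside it). */
  def place(filename: String, numLocalDirs: Int, subDirsPerLocalDir: Int): (Int, Int) = {
    val hash = nonNegativeHash(filename)
    // The low part of the hash picks the local dir, the rest picks the subdirectory
    val dirId = hash % numLocalDirs
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
    (dirId, subDirId)
  }

  def main(args: Array[String]): Unit = {
    val (dirId, subDirId) = place("rdd_0_1", numLocalDirs = 2, subDirsPerLocalDir = 64)
    // Subdirectories are named with two hex digits, as in "%02x".format(subDirId)
    println(f"rdd_0_1 -> local dir $dirId, subdirectory $subDirId%02x")
  }
}
```

Because the placement depends only on the file name and the two counts, any process that knows the same configuration can recompute where a block file lives without consulting an index.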

/**
  * Creates and maintains the logical mapping between logical blocks and physical on-disk
  * locations. One block is mapped to one file with a name given by its BlockId.
  *
  * Block files are hashed among the directories listed in spark.local.dir (or in
  * SPARK_LOCAL_DIRS, if it's set).
  */
private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean) extends Logging {

  private[spark] val subDirsPerLocalDir = conf.getInt("spark.diskStore.subDirectories", 64)

  /* Create one local directory for each path mentioned in spark.local.dir; then, inside this
   * directory, create multiple subdirectories that we will hash files into, in order to avoid
   * having really large inodes at the top level. */
  private[spark] val localDirs: Array[File] = createLocalDirs(conf)
  if (localDirs.isEmpty) {
    logError("Failed to create any local dir.")
    System.exit(ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR)
  }
  // The content of subDirs is immutable but the content of subDirs(i) is mutable. And the content
  // of subDirs(i) is protected by the lock of subDirs(i)
  private val subDirs = Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  private val shutdownHook = addShutdownHook()

  /** Looks up a file by hashing it into one of our local subdirectories. */
  // This method should be kept in sync with
  // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getFile().
  // Annotation: the file is located in the directories by the hash of its file name.
  def getFile(filename: String): File = {
    // Figure out which local directory it hashes to, and which subdirectory in that
    val hash = Utils.nonNegativeHash(filename)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

    // Create the subdirectory if it doesn't already exist
    val subDir = subDirs(dirId).synchronized {
      val old = subDirs(dirId)(subDirId)
      if (old != null) {
        old
      } else {
        val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
        if (!newDir.exists() && !newDir.mkdir()) {
          throw new IOException(s"Failed to create local dir in $newDir.")
        }
        subDirs(dirId)(subDirId) = newDir
        newDir
      }
    }

    new File(subDir, filename)
  }

  def getFile(blockId: BlockId): File = getFile(blockId.name)

  /** Check if disk block manager has a block. */
  def containsBlock(blockId: BlockId): Boolean = {
    getFile(blockId.name).exists()
  }

  /** List all the files currently stored on disk by the disk manager. */
  def getAllFiles(): Seq[File] = {
    // Get all the files inside the array of array of directories
    subDirs.flatMap { dir =>
      dir.synchronized {
        // Copy the content of dir because it may be modified in other threads
        dir.clone()
      }
    }.filter(_ != null).flatMap { dir =>
      val files = dir.listFiles()
      if (files != null) files else Seq.empty
    }
  }

  /** List all the blocks currently stored on disk by the disk manager. */
  def getAllBlocks(): Seq[BlockId] = {
    getAllFiles().map(f => BlockId(f.getName))
  }

  /** Produces a unique block id and File suitable for storing local intermediate results. */
  def createTempLocalBlock(): (TempLocalBlockId, File) = {
    var blockId = new TempLocalBlockId(UUID.randomUUID())
    while (getFile(blockId).exists()) {
      blockId = new TempLocalBlockId(UUID.randomUUID())
    }
    (blockId, getFile(blockId))
  }

  /** Produces a unique block id and File suitable for storing shuffled intermediate results. */
  def createTempShuffleBlock(): (TempShuffleBlockId, File) = {
    var blockId = new TempShuffleBlockId(UUID.randomUUID())
    while (getFile(blockId).exists()) {
      blockId = new TempShuffleBlockId(UUID.randomUUID())
    }
    (blockId, getFile(blockId))
  }

  /**
    * Create local directories for storing block data. These directories are
    * located inside configured local directories and won't
    * be deleted on JVM exit when using the external shuffle service.
    *
    * Annotation: a blockmgr directory is created under each rootDir to hold block data.
    */
  private def createLocalDirs(conf: SparkConf): Array[File] = {
    Utils.getConfiguredLocalDirs(conf).flatMap { rootDir =>
      try {
        val localDir = Utils.createDirectory(rootDir, "blockmgr")
        logInfo(s"Created local directory at $localDir")
        Some(localDir)
      } catch {
        case e: IOException =>
          logError(s"Failed to create local dir in $rootDir. Ignoring this directory.", e)
          None
      }
    }
  }

  private def addShutdownHook(): AnyRef = {
    logDebug("Adding shutdown hook") // force eager creation of logger
    ShutdownHookManager.addShutdownHook(ShutdownHookManager.TEMP_DIR_SHUTDOWN_PRIORITY + 1) { () =>
      logInfo("Shutdown hook called")
      DiskBlockManager.this.doStop()
    }
  }

  /** Cleanup local dirs and stop shuffle sender. */
  private[spark] def stop() {
    // Remove the shutdown hook.  It causes memory leaks if we leave it around.
    try {
      ShutdownHookManager.removeShutdownHook(shutdownHook)
    } catch {
      case e: Exception =>
        logError(s"Exception while removing shutdown hook.", e)
    }
    doStop()
  }

  // Delete the local directories
  private def doStop(): Unit = {
    if (deleteFilesOnStop) {
      localDirs.foreach { localDir =>
        if (localDir.isDirectory() && localDir.exists()) {
          try {
            if (!ShutdownHookManager.hasRootAsShutdownDeleteDir(localDir)) {
              Utils.deleteRecursively(localDir)
            }
          } catch {
            case e: Exception =>
              logError(s"Exception while deleting local spark dir: $localDir", e)
          }
        }
      }
    }
  }
}
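The delete-on-exit mechanism in the listing above (register a shutdown hook, unregister it in stop(), and share one doStop() between both paths) can be tried outside Spark. The sketch below is an assumption-laden simplification: it uses the plain JVM `Runtime` hook API instead of Spark's `ShutdownHookManager`, and `CleanupSketch`, `ScratchDir` and the local `deleteRecursively` are hypothetical stand-ins.

```scala
import java.io.File
import java.nio.file.Files

object CleanupSketch {
  /** Deletes a file or directory tree; a simplified stand-in for Utils.deleteRecursively. */
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      // listFiles returns null on I/O error, so guard against it
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }

  /** Hypothetical owner of one scratch directory, removed on stop() or at JVM exit. */
  final class ScratchDir(deleteFilesOnStop: Boolean) {
    val dir: File = Files.createTempDirectory("blockmgr-sketch").toFile

    // Plain JVM hook; Spark goes through ShutdownHookManager with a priority instead
    private val hook = new Thread(new Runnable {
      def run(): Unit = doStop()
    })
    Runtime.getRuntime.addShutdownHook(hook)

    def stop(): Unit = {
      // Unregister first so the hook does not keep this object reachable
      Runtime.getRuntime.removeShutdownHook(hook)
      doStop()
    }

    private def doStop(): Unit = {
      if (deleteFilesOnStop && dir.isDirectory) deleteRecursively(dir)
    }
  }

  def main(args: Array[String]): Unit = {
    val scratch = new ScratchDir(deleteFilesOnStop = true)
    new File(scratch.dir, "rdd_0_1").createNewFile()
    scratch.stop()
    println(s"directory exists after stop: ${scratch.dir.exists()}")
  }
}
```

Keeping the deletion logic in a single private method means an explicit stop() and an abrupt JVM exit clean up in the same way, which is the design the original class follows.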

The actual caller is DiskStore: its put method writes the specified block to local disk.
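Before reading the full source, the contract of put can be sketched in isolation: refuse to overwrite an existing file, hand the output stream to a caller-supplied function, and remove the partial file if that function throws. This is a simplified sketch; `PutSketch` is a hypothetical name, it works on a `File` directly rather than a `BlockId`, and it closes the stream with a plain `close()` where the original uses Guava's `Closeables.close`.

```scala
import java.io.{File, FileOutputStream, IOException}
import java.nio.file.Files

object PutSketch {
  /** Writes `file` via `writeFunc`; never overwrites, and removes partial output on failure. */
  def put(file: File)(writeFunc: FileOutputStream => Unit): Unit = {
    if (file.exists()) {
      throw new IllegalStateException(s"$file is already present")
    }
    val out = new FileOutputStream(file)
    var succeeded = false
    try {
      writeFunc(out)
      succeeded = true
    } finally {
      try {
        out.close()
      } finally {
        // A failed write must not leave a truncated block behind
        if (!succeeded) file.delete()
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("put-sketch").toFile
    val ok = new File(dir, "rdd_0_1")
    put(ok)(_.write("hello".getBytes("UTF-8")))
    println(s"${ok.getName}: ${ok.length()} bytes")

    val bad = new File(dir, "rdd_0_2")
    try put(bad)(_ => throw new IOException("simulated failure"))
    catch { case e: IOException => println(s"write failed: ${e.getMessage}") }
    println(s"partial file left behind: ${bad.exists()}")
  }
}
```

The curried second parameter list is what lets callers pass the write logic as a block, which is how putBytes in the listing below uses put.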

private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging {

  private val minMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m")

  def getSize(blockId: BlockId): Long = {
    diskManager.getFile(blockId.name).length
  }

  /**
    * Invokes the provided callback function to write the specific block.
    *
    * @throws IllegalStateException if the block already exists in the disk store.
    */
  def put(blockId: BlockId)(writeFunc: FileOutputStream => Unit): Unit = {
    if (contains(blockId)) {
      throw new IllegalStateException(s"Block $blockId is already present in the disk store")
    }
    logDebug(s"Attempting to put block $blockId")
    val startTime = System.currentTimeMillis
    // Resolve the block file, named after the blockId; this may create its subdirectory
    val file = diskManager.getFile(blockId)
    val fileOutputStream = new FileOutputStream(file)
    var threwException: Boolean = true
    try {
      writeFunc(fileOutputStream)
      threwException = false
    } finally {
      try {
        Closeables.close(fileOutputStream, threwException)
      } finally {
        if (threwException) {
          remove(blockId)
        }
      }
    }
    val finishTime = System.currentTimeMillis
    logDebug("Block %s stored as %s file on disk in %d ms".format(
      file.getName,
      Utils.bytesToString(file.length()),
      finishTime - startTime))
  }

  def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = {
    put(blockId) { fileOutputStream =>
      val channel = fileOutputStream.getChannel
      Utils.tryWithSafeFinally {
        bytes.writeFully(channel)
      } {
        channel.close()
      }
    }
  }

  // Read the specified block's data into memory
  def getBytes(blockId: BlockId): ChunkedByteBuffer = {
    val file = diskManager.getFile(blockId.name)
    val channel = new RandomAccessFile(file, "r").getChannel
    Utils.tryWithSafeFinally {
      // For small files, directly read rather than memory map
      if (file.length < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(file.length.toInt)
        channel.position(0)
        while (buf.remaining() != 0) {
          if (channel.read(buf) == -1) {
            throw new IOException("Reached EOF before filling buffer\n" +
              s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
          }
        }
        buf.flip()
        new ChunkedByteBuffer(buf)
      } else {
        new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))
      }
    } {
      channel.close()
    }
  }

  // Delete the block's data
  def remove(blockId: BlockId): Boolean = {
    val file = diskManager.getFile(blockId.name)
    if (file.exists()) {
      val ret = file.delete()
      if (!ret) {
        logWarning(s"Error deleting ${file.getPath()}")
      }
      ret
    } else {
      false
    }
  }

  def contains(blockId: BlockId): Boolean = {
    val file = diskManager.getFile(blockId.name)
    file.exists()
  }
}
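The read path in getBytes chooses between two strategies based on spark.storage.memoryMapThreshold: small files are copied into a heap buffer, larger ones are memory-mapped. A minimal sketch of that decision follows, using only JDK classes. `ReadSketch` and `read` are hypothetical names, the result is a plain `ByteBuffer` rather than Spark's `ChunkedByteBuffer`, and the threshold is assumed to be small enough that files below it fit in an `Int`-sized buffer.

```scala
import java.io.{File, IOException, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode
import java.nio.file.Files

object ReadSketch {
  /** Copies small files into a heap buffer; memory-maps files at or above the threshold. */
  def read(file: File, minMemoryMapBytes: Long): ByteBuffer = {
    val raf = new RandomAccessFile(file, "r")
    try {
      val channel = raf.getChannel
      val length = channel.size()
      if (length < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(length.toInt)
        // A single read may return fewer bytes than requested, so loop until full
        while (buf.remaining() != 0) {
          if (channel.read(buf) == -1) {
            throw new IOException(s"Reached EOF before filling buffer for $file")
          }
        }
        buf.flip()
        buf
      } else {
        // The mapping stays valid after the channel is closed
        channel.map(MapMode.READ_ONLY, 0, length)
      }
    } finally {
      raf.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val f = Files.createTempFile("read-sketch", ".bin").toFile
    f.deleteOnExit()
    Files.write(f.toPath, Array.fill[Byte](4096)(7))
    val small = read(f, minMemoryMapBytes = 1L << 20) // below threshold: heap copy
    val mapped = read(f, minMemoryMapBytes = 1024L)   // above threshold: memory map
    println(s"heap copy is direct: ${small.isDirect}, mapping is direct: ${mapped.isDirect}")
  }
}
```

Memory mapping avoids copying a large block onto the heap, but setting up a mapping has a fixed cost, which is presumably why the original reads small files directly.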
