Spark DiskBlockManager
An RDD can be persisted using a disk-based storage level; the implementation behind this local, disk-level persistence is described below.
DiskBlockManager manages and maintains the mapping between blocks and their on-disk storage: each block is stored in a file named after its blockId, and when several local directories are configured the file is assigned to one of them by hashing the blockId.
It also handles creating directories, deleting and looking up files, and cleaning files up on exit via a shutdown hook.
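Before diving into the source, here is a minimal driver-side sketch (illustrative only, not part of the code analyzed below) of what sends data down this path: persisting an RDD with StorageLevel.DISK_ONLY makes the BlockManager hand each partition to DiskStore, which asks DiskBlockManager for the file backing the corresponding "rdd_<rddId>_<partitionId>" block.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskPersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-persist").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    rdd.persist(StorageLevel.DISK_ONLY) // partitions will be written under spark.local.dir/blockmgr-*/
    println(rdd.count()) // first action computes the RDD and writes its blocks to disk
    println(rdd.count()) // second action reads the blocks back from disk instead of recomputing
    sc.stop()
  }
}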
/**
 * Creates and maintains the logical mapping between logical blocks and physical on-disk
 * locations. One block is mapped to one file with a name given by its BlockId.
 *
 * Block files are hashed among the directories listed in spark.local.dir (or in
 * SPARK_LOCAL_DIRS, if it's set), i.e. the directory a block file lands in is chosen by
 * hashing its file name.
 */
private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean) extends Logging {

  private[spark] val subDirsPerLocalDir = conf.getInt("spark.diskStore.subDirectories", 64)

  /* Create one local directory for each path mentioned in spark.local.dir; then, inside this
   * directory, create multiple subdirectories that we will hash files into, in order to avoid
   * having really large inodes at the top level. */
  private[spark] val localDirs: Array[File] = createLocalDirs(conf)
  if (localDirs.isEmpty) {
    logError("Failed to create any local dir.")
    System.exit(ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR)
  }

  // The content of subDirs is immutable but the content of subDirs(i) is mutable. And the content
  // of subDirs(i) is protected by the lock of subDirs(i)
  private val subDirs = Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  private val shutdownHook = addShutdownHook()

  /** Looks up a file by hashing its name into one of our local subdirectories. */
  // This method should be kept in sync with
  // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getFile().
  def getFile(filename: String): File = {
    // Figure out which local directory it hashes to, and which subdirectory in that
    val hash = Utils.nonNegativeHash(filename)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

    // Create the subdirectory if it doesn't already exist
    val subDir = subDirs(dirId).synchronized {
      val old = subDirs(dirId)(subDirId)
      if (old != null) {
        old
      } else {
        val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
        if (!newDir.exists() && !newDir.mkdir()) {
          throw new IOException(s"Failed to create local dir in $newDir.")
        }
        subDirs(dirId)(subDirId) = newDir
        newDir
      }
    }

    new File(subDir, filename)
  }

  def getFile(blockId: BlockId): File = getFile(blockId.name)

  /** Check if disk block manager has a block. */
  def containsBlock(blockId: BlockId): Boolean = {
    getFile(blockId.name).exists()
  }

  /** List all the files currently stored on disk by the disk manager. */
  def getAllFiles(): Seq[File] = {
    // Get all the files inside the array of array of directories
    subDirs.flatMap { dir =>
      dir.synchronized {
        // Copy the content of dir because it may be modified in other threads
        dir.clone()
      }
    }.filter(_ != null).flatMap { dir =>
      val files = dir.listFiles()
      if (files != null) files else Seq.empty
    }
  }

  /** List all the blocks currently stored on disk by the disk manager. */
  def getAllBlocks(): Seq[BlockId] = {
    getAllFiles().map(f => BlockId(f.getName))
  }

  /** Produces a unique block id and File suitable for storing local intermediate results. */
  def createTempLocalBlock(): (TempLocalBlockId, File) = {
    var blockId = new TempLocalBlockId(UUID.randomUUID())
    while (getFile(blockId).exists()) {
      blockId = new TempLocalBlockId(UUID.randomUUID())
    }
    (blockId, getFile(blockId))
  }

  /** Produces a unique block id and File suitable for storing shuffled intermediate results. */
  def createTempShuffleBlock(): (TempShuffleBlockId, File) = {
    var blockId = new TempShuffleBlockId(UUID.randomUUID())
    while (getFile(blockId).exists()) {
      blockId = new TempShuffleBlockId(UUID.randomUUID())
    }
    (blockId, getFile(blockId))
  }

  /**
   * Create local directories for storing block data. These directories are
   * located inside configured local directories and won't
   * be deleted on JVM exit when using the external shuffle service.
   *
   * A "blockmgr" subdirectory is created under each configured root dir to hold the block data.
   */
  private def createLocalDirs(conf: SparkConf): Array[File] = {
    Utils.getConfiguredLocalDirs(conf).flatMap { rootDir =>
      try {
        val localDir = Utils.createDirectory(rootDir, "blockmgr")
        logInfo(s"Created local directory at $localDir")
        Some(localDir)
      } catch {
        case e: IOException =>
          logError(s"Failed to create local dir in $rootDir. Ignoring this directory.", e)
          None
      }
    }
  }

  private def addShutdownHook(): AnyRef = {
    logDebug("Adding shutdown hook") // force eager creation of logger
    ShutdownHookManager.addShutdownHook(ShutdownHookManager.TEMP_DIR_SHUTDOWN_PRIORITY + 1) { () =>
      logInfo("Shutdown hook called")
      DiskBlockManager.this.doStop()
    }
  }

  /** Cleanup local dirs and stop shuffle sender. */
  private[spark] def stop() {
    // Remove the shutdown hook. It causes memory leaks if we leave it around.
    try {
      ShutdownHookManager.removeShutdownHook(shutdownHook)
    } catch {
      case e: Exception =>
        logError(s"Exception while removing shutdown hook.", e)
    }
    doStop()
  }

  // Delete the local directories if requested.
  private def doStop(): Unit = {
    if (deleteFilesOnStop) {
      localDirs.foreach { localDir =>
        if (localDir.isDirectory() && localDir.exists()) {
          try {
            if (!ShutdownHookManager.hasRootAsShutdownDeleteDir(localDir)) {
              Utils.deleteRecursively(localDir)
            }
          } catch {
            case e: Exception =>
              logError(s"Exception while deleting local spark dir: $localDir", e)
          }
        }
      }
    }
  }
}
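To make the placement logic in getFile concrete, here is a small standalone sketch that reproduces the dirId/subDirId arithmetic outside Spark (the directory paths and block name are made up; the hash helper mirrors what Utils.nonNegativeHash does, i.e. a hashCode clamped to a non-negative value):

import java.io.File

object BlockPlacementSketch {
  // Mirrors Utils.nonNegativeHash: hashCode clamped to a non-negative value.
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  def main(args: Array[String]): Unit = {
    val localDirs = Array(new File("/tmp/spark-local-0"), new File("/tmp/spark-local-1"))
    val subDirsPerLocalDir = 64

    val filename = "rdd_0_7" // hypothetical block file name
    val hash = nonNegativeHash(filename)
    val dirId = hash % localDirs.length // which spark.local.dir entry
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir // which "%02x" subdirectory

    val file = new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
    println(file) // the same file name always maps to the same directory and subdirectory
  }
}

Spreading files over 64 subdirectories per root directory keeps any single directory from accumulating a huge number of entries, which is the "really large inodes" concern mentioned in the comment at the top of the class.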
The actual call site is DiskStore: its put method (and the putBytes wrapper around it) writes a given block to local disk.
private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging {

  private val minMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m")

  def getSize(blockId: BlockId): Long = {
    diskManager.getFile(blockId.name).length
  }

  /**
   * Invokes the provided callback function to write the specific block to disk.
   *
   * @throws IllegalStateException if the block already exists in the disk store.
   */
  def put(blockId: BlockId)(writeFunc: FileOutputStream => Unit): Unit = {
    if (contains(blockId)) {
      throw new IllegalStateException(s"Block $blockId is already present in the disk store")
    }
    logDebug(s"Attempting to put block $blockId")
    val startTime = System.currentTimeMillis
    // Resolve the block file (named by the blockId); this may create the hashed subdirectory
    val file = diskManager.getFile(blockId)
    val fileOutputStream = new FileOutputStream(file)
    var threwException: Boolean = true
    try {
      writeFunc(fileOutputStream)
      threwException = false
    } finally {
      try {
        Closeables.close(fileOutputStream, threwException)
      } finally {
        if (threwException) {
          remove(blockId)
        }
      }
    }
    val finishTime = System.currentTimeMillis
    logDebug("Block %s stored as %s file on disk in %d ms".format(
      file.getName,
      Utils.bytesToString(file.length()),
      finishTime - startTime))
  }

  def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = {
    put(blockId) { fileOutputStream =>
      val channel = fileOutputStream.getChannel
      Utils.tryWithSafeFinally {
        bytes.writeFully(channel)
      } {
        channel.close()
      }
    }
  }

  // Read the given block's data from disk into memory
  def getBytes(blockId: BlockId): ChunkedByteBuffer = {
    val file = diskManager.getFile(blockId.name)
    val channel = new RandomAccessFile(file, "r").getChannel
    Utils.tryWithSafeFinally {
      // For small files, directly read rather than memory map
      if (file.length < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(file.length.toInt)
        channel.position(0)
        while (buf.remaining() != 0) {
          if (channel.read(buf) == -1) {
            throw new IOException("Reached EOF before filling buffer\n" +
              s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
          }
        }
        buf.flip()
        new ChunkedByteBuffer(buf)
      } else {
        new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))
      }
    } {
      channel.close()
    }
  }

  // Delete the file backing the given block
  def remove(blockId: BlockId): Boolean = {
    val file = diskManager.getFile(blockId.name)
    if (file.exists()) {
      val ret = file.delete()
      if (!ret) {
        logWarning(s"Error deleting ${file.getPath()}")
      }
      ret
    } else {
      false
    }
  }

  def contains(blockId: BlockId): Boolean = {
    val file = diskManager.getFile(blockId.name)
    file.exists()
  }
}
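Putting the two classes together, the internal write/read flow can be sketched roughly as below. Both classes are private[spark] and are normally driven by BlockManager, so this is illustrative only; names such as conf are assumed to be in scope and are not defined here.

// write: BlockManager -> DiskStore.putBytes/put -> DiskBlockManager.getFile -> FileOutputStream
// read:  BlockManager -> DiskStore.getBytes -> DiskBlockManager.getFile -> heap read or mmap
val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true) // `conf` assumed to exist
val diskStore = new DiskStore(conf, diskBlockManager)

val blockId = BlockId("rdd_0_7") // parsed into RDDBlockId(0, 7)
val data = new ChunkedByteBuffer(ByteBuffer.wrap("payload".getBytes("UTF-8")))

diskStore.putBytes(blockId, data)          // getFile picks the dir/subdir, then writes the bytes
val readBack = diskStore.getBytes(blockId) // below the 2 MB threshold: heap read; above: memory map
assert(diskStore.contains(blockId))
diskStore.remove(blockId)                  // deletes the backing file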