Spark Core (4): Using the LogQuery Example to Show How the Executor Evaluates RDD Operators (repost)
1. How does it actually run?
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Log Query")
  val sc = new SparkContext(sparkConf)

  val dataSet =
    if (args.length == 1) sc.textFile(args(0)) else sc.parallelize(exampleApacheLogs)
  // scalastyle:off
  val apacheLogRegex =
    """^([\d.]+) (\S+) (\S+) \[([\w\d:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) ([\d\-]+) "([^"]+)" "([^"]+)".*""".r
  // scalastyle:on

  /** Tracks the total query count and number of aggregate bytes for a particular group. */
  class Stats(val count: Int, val numBytes: Int) extends Serializable {
    def merge(other: Stats): Stats = {
      new Stats(count + other.count, numBytes + other.numBytes)
    }
    override def toString: String = "bytes=%s\tn=%s".format(numBytes, count)
  }

  def extractKey(line: String): (String, String, String) = {
    apacheLogRegex.findFirstIn(line) match {
      case Some(apacheLogRegex(ip, _, user, dateTime, query, status, bytes, referer, ua)) =>
        if (user != "\"-\"") (ip, user, query)
        else (null, null, null)
      case _ => (null, null, null)
    }
  }

  def extractStats(line: String): Stats = {
    apacheLogRegex.findFirstIn(line) match {
      case Some(apacheLogRegex(ip, _, user, dateTime, query, status, bytes, referer, ua)) =>
        new Stats(1, bytes.toInt)
      case _ => new Stats(1, 0)
    }
  }

  dataSet.map(line => (extractKey(line), extractStats(line)))
    .reduceByKey((c, d) => c.merge(d))
    .collect().foreach {
      case (user, query) => println("%s\t%s".format(user, query))
    }

  sc.stop()
}
1.1 RDD and ShuffleDependency
The map stage runs on the executor as a ShuffleMapTask. Its runTask first deserializes the (RDD, ShuffleDependency) pair from the broadcast taskBinary, then writes the partition out through a ShuffleWriter:
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val threadMXBean = ManagementFactory.getThreadMXBean
val deserializeStartTime = System.currentTimeMillis()
val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
threadMXBean.getCurrentThreadCpuTime
} else 0L
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
    } else 0L

    var writer: ShuffleWriter[Any, Any] = null
try {
val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch {
case e: Exception =>
try {
if (writer != null) {
writer.stop(success = false)
}
} catch {
case e: Exception =>
log.debug("Could not stop writer", e)
}
throw e
}
}
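For context, here is a hedged sketch of the driver-side counterpart: the DAGScheduler serializes the (rdd, shuffleDep) pair once and broadcasts it as taskBinary, which is exactly what runTask deserializes above. This is paraphrased from DAGScheduler.submitMissingTasks in Spark 2.x; names and details vary by version.

// Driver side (paraphrased sketch of DAGScheduler.submitMissingTasks, Spark 2.x;
// details vary by version). The serialized pair is what taskBinary.value holds on the executor.
val taskBinaryBytes: Array[Byte] = stage match {
  case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
  case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
val taskBinary = sc.broadcast(taskBinaryBytes)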
1.1.1 ShuffleWriter
The writer used above comes from the shuffle manager; getWriter picks a concrete ShuffleWriter implementation based on the handle:
/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
handle: ShuffleHandle,
mapId: Int,
context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
handle match {
case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
new UnsafeShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
context.taskMemoryManager(),
unsafeShuffleHandle,
mapId,
context,
env.conf)
case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
new BypassMergeSortShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
bypassMergeSortHandle,
mapId,
context,
env.conf)
case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}
- On the driver, registerShuffle decides which ShuffleHandle to use based on the dependency (see the sketch below).
- On the executor, the shuffle manager decides which ShuffleWriter to build based on the ShuffleHandle carried in the dependency.
Side note: the Dependency by itself is enough to determine the ShuffleWriter. The ShuffleHandle is only used to get the dependency back inside SortShuffleWriter, and the executor-side SortShuffleWriter could reach the Dependency anyway, so the ShuffleHandle feels rather redundant.
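For reference, a sketch of how the driver picks the handle, paraphrased from SortShuffleManager.registerShuffle in Spark 2.x; the conditions and names may differ slightly between versions.

// Paraphrased sketch of SortShuffleManager.registerShuffle (Spark 2.x):
// the ShuffleDependency alone decides which handle, and therefore which writer, is used.
override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) {
    // Few reduce partitions and no map-side combine: write per-partition files and concatenate them.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Serialized (tungsten) shuffle: sort record pointers instead of deserialized objects.
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Everything else, including LogQuery's reduceByKey with map-side combine.
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}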
1.1.2 RDD.iterator
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
The call to writer.write is passed rdd.iterator, that is, an iterator built from the RDD:
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
getOrCompute(split, context)
} else {
computeOrReadCheckpoint(split, context)
}
}
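As an aside, the cached branch is only taken when an RDD has been explicitly persisted. A hypothetical variant of the LogQuery pipeline (not in the original example) illustrates the difference:

import org.apache.spark.storage.StorageLevel

// Hypothetical variant (not in the original example): persisting the mapped RDD
// makes its storageLevel != NONE, so on the executor iterator() would take the
// getOrCompute() branch and ask the BlockManager for a cached block before recomputing.
val mapped = dataSet.map(line => (extractKey(line), extractStats(line)))
  .persist(StorageLevel.MEMORY_ONLY)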
In this job, however, the iterator is built on a MapPartitionsRDD that has no caching or storage configured (its StorageLevel is NONE), so computeOrReadCheckpoint is called:
/**
* Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
*/
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
if (isCheckpointedAndMaterialized) {
firstParent[T].iterator(split, context)
} else {
compute(split, context)
}
}
No checkpoint has been taken either, so compute is called:
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
First, look at firstParent:
/** Returns the first parent RDD */
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
dependencies.head.rdd.asInstanceOf[RDD[U]]
}
Every RDD keeps a list of Dependency objects, and each Dependency holds a reference to its RDD. The RDD of the first dependency in that list is the upstream RDD that supplies the data, that is, the dataSet in the following code:
val dataSet =
  if (args.length == 1) sc.textFile(args(0)) else sc.parallelize(exampleApacheLogs)
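Why is dataSet at the head of that list? MapPartitionsRDD is built with RDD's one-parent constructor, which wraps the parent in a OneToOneDependency. A sketch, paraphrased from RDD.scala:

// Paraphrased from RDD.scala (Spark 2.x): the auxiliary constructor used when an RDD
// has a single parent registers that parent as a OneToOneDependency,
// so dependencies.head.rdd is the upstream RDD (dataSet in this job).
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))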
firstParent[T].iterator(split, context)
This iterator call is the same RDD.iterator shown earlier. The parent's StorageLevel is also NONE and it has not been checkpointed, so once again compute is called, this time ParallelCollectionRDD.compute:
override def compute(s: Partition, context: TaskContext): Iterator[T] = {
new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
}
This creates an InterruptibleIterator, which is essentially just a delegating iterator:
@DeveloperApi
class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T])
  extends Iterator[T] {

  def hasNext: Boolean = {
// TODO(aarondav/rxin): Check Thread.interrupted instead of context.interrupted if interrupt
// is allowed. The assumption is that Thread.interrupted does not have a memory fence in read
// (just a volatile field in C), while context.interrupted is a volatile in the JVM, which
// introduces an expensive read fence.
if (context.isInterrupted) {
throw new TaskKilledException
} else {
delegate.hasNext
}
  }

  def next(): T = delegate.next()
}
When an interrupt has been requested, it throws a TaskKilledException; otherwise it forwards everything to the iterator it delegates to, which here is:
s.asInstanceOf[ParallelCollectionPartition[T]].iterator
The Partition type of a ParallelCollectionRDD is ParallelCollectionPartition:
private[spark] class ParallelCollectionPartition[T: ClassTag](
var rddId: Long,
var slice: Int,
var values: Seq[T]
  ) extends Partition with Serializable {

  def iterator: Iterator[T] = values.iterator
.......
}
values is a Seq that must be serializable. On the driver, ParallelCollectionRDD slices the data into ParallelCollectionPartition objects, and each slice's values are stored inside the ParallelCollectionPartition, not in the ParallelCollectionRDD itself. The data being computed is therefore not carried over through the RDD; it arrives by deserializing the ShuffleMapTask, whose partition field travels over the direct RPC channel (the ShuffleMapTask signature below carries the Partition; a sketch of the driver-side slicing follows it):
private[spark] class ShuffleMapTask(
stageId: Int,
stageAttemptId: Int,
taskBinary: Broadcast[Array[Byte]],
partition: Partition,
@transient private var locs: Seq[TaskLocation],
metrics: TaskMetrics,
localProperties: Properties,
jobId: Option[Int] = None,
appId: Option[String] = None,
appAttemptId: Option[String] = None)
extends Task[MapStatus](stageId, stageAttemptId, partition.index, metrics, localProperties, jobId,
appId, appAttemptId)
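A sketch of the driver-side slicing mentioned above, paraphrased from ParallelCollectionRDD.getPartitions in Spark 2.x; exact helper names may differ by version.

// Paraphrased sketch of ParallelCollectionRDD.getPartitions (Spark 2.x):
// the driver slices the in-memory data and stores each slice inside its
// ParallelCollectionPartition, which later ships with the serialized task.
override def getPartitions: Array[Partition] = {
  val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}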
Back to the original MapPartitionsRDD compute function:
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
What is f? It comes from RDD.map:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
Now let's see how map is called in this job:
dataSet.map(line => (extractKey(line), extractStats(line)))
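Combining the two snippets, the f stored in the MapPartitionsRDD for this job is, in effect, the following (a sketch, not literal Spark source):

// Sketch: the closure captured by MapPartitionsRDD for dataSet.map(...) is equivalent to
val f: (TaskContext, Int, Iterator[String]) => Iterator[((String, String, String), Stats)] =
  (context, pid, iter) => iter.map(line => (extractKey(line), extractStats(line)))

The iter.map inside that closure is plain scala.collection.Iterator.map: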
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
def hasNext = self.hasNext
def next() = f(self.next())
}
Now look at the code in ExternalSorter.scala that pulls the partition's records through this iterator and does the actual work:
while (records.hasNext) {
addElementsRead()
kv = records.next()
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true)
}
- hasNext: AbstractIterator.hasNext -> InterruptibleIterator.hasNext -> Elements (the Seq's iterator).hasNext, which is simply def hasNext: Boolean = index < end
- next(): AbstractIterator.next() = f(InterruptibleIterator.next()) -> f(Elements (the Seq's iterator).next()), i.e. (extractKey(line), extractStats(line)) for each raw log line (a plain-Scala analogue of this pull chain follows)
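The whole chain is pull-based and lazy. A plain-Scala analogue with hypothetical log lines and no Spark involved shows the same pattern: nothing runs until the consumer calls hasNext/next.

// Plain-Scala analogue of the pull chain above (hypothetical data, no Spark).
object IteratorChainDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the ParallelCollectionPartition's values.iterator.
    val source: Iterator[String] =
      Iterator("10.0.0.1 alice /index 1234", "10.0.0.2 bob /search 5678")

    // Stand-in for MapPartitionsRDD.compute: wrap the source with map; still lazy.
    val mapped: Iterator[(String, Int)] = source.map { line =>
      val fields = line.split(" ")
      (fields(1), fields(3).toInt)            // "key" and "bytes"
    }

    // Stand-in for ExternalSorter's while loop: the consumer drives the computation.
    while (mapped.hasNext) {
      val kv = mapped.next()                  // the map function runs here, one record at a time
      println(s"key=${kv._1}\tbytes=${kv._2}")
    }
  }
}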
1.1.3 The reduceByKey operator
.reduceByKey((c, d) => c.merge(d))
Let's look at reduceByKey in PairRDDFunctions.scala (why this lives in PairRDDFunctions rather than on RDD was covered in an earlier post):
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
Inside combineByKeyWithClassTag:
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
- createCombiner: when a new K/V pair arrives from the map stage and the key has not been seen yet, turns the V into a combiner C
- mergeValue: when a new K/V pair arrives and the key already exists, folds the new V into the existing C
- mergeCombiners: when the per-partition results are finally merged, combines, for the same key, the C already aggregated in one partition with the C aggregated in another (a small example follows this list)
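To make the three roles concrete, here is a small hypothetical example (not part of LogQuery) that computes a per-key average with C = (sum, count), using the simpler overload of combineByKeyWithClassTag that picks the default partitioner:

// Hypothetical example (not from LogQuery): per-key average via combineByKeyWithClassTag.
// The three functions play exactly the roles listed above.
val scores = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30)))
val avg = scores.combineByKeyWithClassTag[(Int, Int)](
    (v: Int) => (v, 1),                                       // createCombiner: first V seen for a key
    (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),          // mergeValue: fold another V into C
    (c1: (Int, Int), c2: (Int, Int)) =>
      (c1._1 + c2._1, c1._2 + c2._2)                          // mergeCombiners: merge per-partition Cs
  ).mapValues { case (sum, n) => sum.toDouble / n }
avg.collect().foreach(println)                                // e.g. (a,15.0), (b,30.0)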
In the LogQuery example, mergeValue and mergeCombiners are both (c, d) => c.merge(d), and createCombiner simply returns the Stats value unchanged. ExternalSorter pulls these functions out of the Aggregator and applies them to every record it reads:
val mergeValue = aggregator.get.mergeValue
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
val update = (hadValue: Boolean, oldValue: C) => {
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
while (records.hasNext) {
addElementsRead()
kv = records.next()
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true)
}
As the loop shows, map.changeValue updates entries that share the same key through the update closure:
val update = (hadValue: Boolean, oldValue: C) => {
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
mergeValue and createCombiner are taken from the Aggregator, and the Aggregator is stored in the ShuffledRDD and its ShuffleDependency. The ShuffleDependency travels from the driver to the executor (it is part of the (rdd, dep) pair deserialized in runTask above), so the executor can read the Aggregator out of the ShuffleDependency and apply the functions it specifies to each K/V pair. Here mergeValue is the driver-side (c, d) => c.merge(d), and c is a Stats object:
class Stats(val count: Int, val numBytes: Int) extends Serializable {
def merge(other: Stats): Stats = {
new Stats(count + other.count, numBytes + other.numBytes)
}
override def toString: String = "bytes=%s\tn=%s".format(numBytes, count)
}
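Putting it together, the Aggregator that reduceByKey((c, d) => c.merge(d)) builds and ships inside the ShuffleDependency for this job is effectively the following (a sketch derived from the reduceByKey/combineByKeyWithClassTag code above, not copied from Spark source):

// Sketch: the Aggregator[K, V, C] for LogQuery, where K is the (ip, user, query)
// tuple and both V and C are Stats. Derived from the code above, not Spark source.
val createCombiner = (v: Stats) => v                          // identity: first Stats seen for a key
val mergeValue     = (c: Stats, v: Stats) => c.merge(v)       // the user function (c, d) => c.merge(d)
val mergeCombiners = (c1: Stats, c2: Stats) => c1.merge(c2)   // same function, across partitions

val aggregator = new Aggregator[(String, String, String), Stats, Stats](
  createCombiner, mergeValue, mergeCombiners)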
2. Summary
- By deserializing the map-side RDD (not the ShuffledRDD) and following its Dependency list, the executor obtains iterator A over the RDD that originally holds the data.
- The map operator wraps iterator A in a new AbstractIterator; every element pulled from A is run through the map function line => (extractKey(line), extractStats(line)), producing K/V pairs.
- The function passed to reduceByKey reaches the executor inside the Aggregator stored in the ShuffleDependency.
- The executor only has to iterate over the AbstractIterator to pull K/V pairs and apply the Aggregator's functions to values sharing the same key, which completes the RDD computation defined in the driver's main function.