spark 2.1.1

spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码:

org.apache.spark.rdd.RDD

  /**
* Return this RDD sorted by the given key function.
*/
def sortBy[K](
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
} /**
* Creates tuples of the elements in this RDD by applying `f`.
*/
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
val cleanedF = sc.clean(f)
map(x => (cleanedF(x), x))
} /**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

代码比较简单:sort是一个transformation操作,需要定义一个keyBy,即根据什么排序,然后会做一步map,即 item -> (keyBy(item), item),然后定义一个Partitioner,即分区策略(多少个分区,升序降序等),最后返回一个ShuffledRDD;

ShuffledRDD原理详见 https://www.cnblogs.com/barneywill/p/10158457.html

这里重点说下RangePartitioner:

org.apache.spark.RangePartitioner

/**
* A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
* equal ranges. The ranges are determined by sampling the content of the RDD passed in.
*
* @note The actual number of partitions created by the RangePartitioner might not be the same
* as the `partitions` parameter, in the case where the number of sampled records is less than
* the value of `partitions`.
*/
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true)
extends Partitioner { // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.") private var ordering = implicitly[Ordering[K]] // An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) {
Array.empty
} else {
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
val sampleSize = math.min(20.0 * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
if (numItems == 0L) {
Array.empty
} else {
// If a partition contains much more than the average number of items, we re-sample from it
// to ensure that enough items are collected from that partition.
val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
val candidates = ArrayBuffer.empty[(K, Float)]
val imbalancedPartitions = mutable.Set.empty[Int]
sketched.foreach { case (idx, n, sample) =>
if (fraction * n > sampleSizePerPartition) {
imbalancedPartitions += idx
} else {
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.length).toFloat
for (key <- sample) {
candidates += ((key, weight))
}
}
}
if (imbalancedPartitions.nonEmpty) {
// Re-sample imbalanced partitions with the desired sampling probability.
val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
val seed = byteswap32(-rdd.id - 1)
val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
val weight = (1.0 / fraction).toFloat
candidates ++= reSampled.map(x => (x, weight))
}
RangePartitioner.determineBounds(candidates, partitions)
}
}
} def numPartitions: Int = rangeBounds.length + 1 private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K] def getPartition(key: Any): Int = {
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128) {
// If we have less than 128 partitions naive search
while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
partition += 1
}
} else {
// Determine which binary search method to use only once.
partition = binarySearch(rangeBounds, k)
// binarySearch either returns the match location or -[insertion point]-1
if (partition < 0) {
partition = -partition-1
}
if (partition > rangeBounds.length) {
partition = rangeBounds.length
}
}
if (ascending) {
partition
} else {
rangeBounds.length - partition
}
}

这里会根据partition的数量确定rangeBounds,rangeBounds很像QuickSort中的pivot,

举例来说:集群现在有10个节点,对1亿数据做排序,partition数量是100,最理想的情况是1亿数据平均分成100份,然后每个节点存放10份,然后各自排序就好,没有数据倾斜;
但是这个很难实现,要注意的是这里平分的过程实际上也是划分边界的过程,即确定每份的最小值和最大值边界,需要对全部数据遍历统计之后才能精确实现;

spark中采用的是一种通过对数据采样了解数据分布并最终达到近似精确的方式,具体实现为在从全部数据中采样sampleSize个数据,每个分区采样sampleSizePerPartition个,如果某些分区很大,会追加采样个数,这样保证采样过程尽可能的平均,然后针对采样数据进行探测划分边界,得到rangeBounds,有了rangeBounds之后就可以知道1亿数据中的每一条具体在哪个新的分区;

还有一个问题:在sort之后如果collect到driver,array数据还会保持排序状态吗?

org.apache.spark.rdd.RDD

  /**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}

答案是肯定的;

【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理的更多相关文章

  1. 【原创】大数据基础之Hadoop(1)HA实现原理

    有些工作只能在一台server上进行,比如master,这时HA(High Availability)首先要求部署多个server,其次要求多个server自动选举出一个active状态server, ...

  2. 大数据学习系列之七 ----- Hadoop+Spark+Zookeeper+HBase+Hive集群搭建 图文详解

    引言 在之前的大数据学习系列中,搭建了Hadoop+Spark+HBase+Hive 环境以及一些测试.其实要说的话,我开始学习大数据的时候,搭建的就是集群,并不是单机模式和伪分布式.至于为什么先写单 ...

  3. CentOS6安装各种大数据软件 第十章:Spark集群安装和部署

    相关文章链接 CentOS6安装各种大数据软件 第一章:各个软件版本介绍 CentOS6安装各种大数据软件 第二章:Linux各个软件启动命令 CentOS6安装各种大数据软件 第三章:Linux基础 ...

  4. 大数据平台搭建(hadoop+spark)

    大数据平台搭建(hadoop+spark) 一.基本信息 1. 服务器基本信息 主机名 ip地址 安装服务 spark-master 172.16.200.81 jdk.hadoop.spark.sc ...

  5. 大数据系列之并行计算引擎Spark部署及应用

    相关博文: 大数据系列之并行计算引擎Spark介绍 之前介绍过关于Spark的程序运行模式有三种: 1.Local模式: 2.standalone(独立模式) 3.Yarn/mesos模式 本文将介绍 ...

  6. 大数据系列之并行计算引擎Spark介绍

    相关博文:大数据系列之并行计算引擎Spark部署及应用 Spark: Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎. Spark是UC Berkeley AMP lab ( ...

  7. 【原创】大数据基础之Zookeeper(2)源代码解析

    核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...

  8. 【原创】大数据基础之Spark(4)RDD原理及代码解析

    一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...

  9. 【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程

    Spark2.1.1 一 Spark Submit本地解析 1.1 现象 提交命令: spark-submit --master local[10] --driver-memory 30g --cla ...

  10. 【原创】大数据基础之Hive(5)hive on spark

    hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...

随机推荐

  1. 保护 .NET Core 项目的敏感信息

    我们的项目中几乎都会有配置文件,里面可能会存储一些敏感信息,比如数据库连接字符串.第三方 API 的 AppKey 和 SecretKey 等. 对于开源项目,这些敏感信息肯定不能随着源代码一起提交到 ...

  2. [Alpha阶段]第五次Scrum Meeting

    Scrum Meeting博客目录 [Alpha阶段]第五次Scrum Meeting 基本信息 名称 时间 地点 时长 第五次Scrum Meeting 19/04/09 教1_2楼教室 65min ...

  3. mysql监控每一条执行的sql语句

    查看日志配置是否打开 SHOW VARIABLES LIKE "general_log%"; SET GLOBAL general_log = 'ON';   打开日志 SET G ...

  4. android调用plus报错plus is not defined

    由于plus 加载需要时间,我们使用plus 的时候应该提前判断是否已经加载好,添加监听 document.addEventListener("plusready",functio ...

  5. 时间插件--daterangepicker使用和配置详解

    1.序言: daterangepicker是Bootstrap的一个时间组件,使用很方便 用于选择日期范围的JavaScript组件. 设计用于Bootstrap CSS框架. 它最初是为了改善报表而 ...

  6. Python学习之路——函数的参数分类

    今日内容 '''实参:调用函数,在括号内传入的实际值,值可以为常量.变量.表达式或三者的组合​*****形参:定义函数,在括号内声明的变量名,用来接受外界传来的值​'''​'''注:形参随着函数的调用 ...

  7. ORM之自关联、add、set方法、聚合函数、F、Q查询和事务

    一.外键自关联(一对多) 1.建表 # 评论表 class Comment(models.Model): id = models.AutoField(primary_key=True) content ...

  8. 【XSY2962】作业 数学

    题目描述 有一个递推式: \[ \begin{align} f_0&=1-\frac{1}{e}\\ f_n&=1-nf_{i-1} \end{align} \] 求 \(f_n\) ...

  9. [CTSC2008]网络管理 [整体二分]

    题面 bzoj luogu 所有事件按时间排序 按值划分下放 把每一个修改 改成一个删除一个插入 对于一个查询 直接查这个段区间有多少合法点 如果查询值大于等于目标值 进入左区间 如果一个查询无解 那 ...

  10. linux下sort命令详解大全

    工作原理: Sort将文件的每一行作为一个单位,相互比较,比较原则是从首字符向后,依次按ASCII码值进行比较,最后将他们按升序输出. 第一部分: 1. sort:(不带参数) [rocrocket@ ...