Spark 2.1.1

In Spark, distributed data can be sorted with RDD.sortBy. How is it implemented? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }

org.apache.spark.rdd.OrderedRDDFunctions

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

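The code is fairly simple. Sorting is a transformation. You supply a key function f that says what to sort by; keyBy then runs a map that turns each item into (f(item), item); sortByKey builds a Partitioner that defines the partitioning strategy (number of partitions, ascending or descending, and so on) and returns a ShuffledRDD; finally, values drops the keys again.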
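The steps above can be sketched on plain Scala collections, without Spark. This is only an illustration under stated assumptions: `sortByLocal` is a hypothetical helper, the `bounds` argument stands in for the range bounds that RangePartitioner would compute by sampling, and the "shuffle" is a local groupBy.

```scala
object SortBySketch {
  // Hypothetical local stand-in for RDD.sortBy; each partition is a Vector.
  // `bounds` are assumed upper bounds for all partitions except the last one.
  def sortByLocal[T, K](data: Seq[T], bounds: Seq[K])(f: T => K)(
      implicit ord: Ordering[K]): Vector[Vector[T]] = {
    // Step 1 (keyBy): item -> (f(item), item)
    val keyed = data.map(x => (f(x), x))

    // Step 2 (partitioner): first bound that is >= key; past all bounds -> last partition
    def partitionOf(k: K): Int = {
      val i = bounds.indexWhere(b => ord.lteq(k, b))
      if (i < 0) bounds.length else i
    }

    // Step 3 (shuffle): move each pair to its target partition
    val grouped = keyed.groupBy { case (k, _) => partitionOf(k) }

    // Step 4 (key ordering + values): sort inside each partition, then drop the keys
    Vector.tabulate(bounds.length + 1) { p =>
      grouped.getOrElse(p, Seq.empty[(K, T)]).sortBy(_._1).map(_._2).toVector
    }
  }
}
```

For example, `SortBySketch.sortByLocal(Seq("pear", "fig", "banana", "kiwi", "apple"), Seq(3, 4))(_.length)` sorts the words by length into three partitions; reading the partitions in index order gives the words in ascending length.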

For how ShuffledRDD works, see https://www.cnblogs.com/barneywill/p/10158457.html

The focus here is RangePartitioner:

org.apache.spark.RangePartitioner

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * @note The actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  // ... (rest of the class omitted)
}

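Here rangeBounds is determined from the number of partitions. The bounds play much the same role as the pivot in QuickSort.

For example: the cluster has 10 nodes and sorts 100 million records with 100 partitions. In the ideal case the 100 million records are split evenly into 100 parts, each node holds 10 of them, and each part is then sorted on its own, with no data skew.
This is hard to achieve, though. Note that splitting the data evenly is really the same thing as choosing boundaries, that is, fixing the minimum and maximum of each part, and doing that exactly requires a full pass over all of the data to collect statistics.

Spark instead samples the data to learn its distribution and settles for an approximately even split. Concretely, it aims for about sampleSize sampled keys over the whole data set, taking up to sampleSizePerPartition from each input partition. If some partitions are much larger than average, they are re-sampled so that the sample stays as representative as possible. The boundaries are then derived from the sampled keys, which gives rangeBounds. With rangeBounds in hand, Spark can tell which new partition each of the 100 million records belongs to.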
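The following is a simplified sketch of the two halves of that idea: deriving bounds from weighted sample keys, and looking up the target partition for a key. It is modeled on the description above and on the getPartition excerpt, not copied from Spark's `RangePartitioner.determineBounds`; the object name, the use of `Double` weights and `Vector` results are assumptions made for illustration, and the sampling step itself is left out.

```scala
object RangeBoundsSketch {
  // Pick up to (partitions - 1) upper bounds from (key, weight) candidates.
  // A weight is the number of records one sampled key stands for.
  def determineBounds[K](candidates: Seq[(K, Double)], partitions: Int)(
      implicit ord: Ordering[K]): Vector[K] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2).sum / partitions // target weight per partition
    var cumWeight = 0.0
    var target = step
    var last = Option.empty[K]
    var count = 0
    val bounds = Vector.newBuilder[K]
    ordered.foreach { case (key, weight) =>
      if (count < partitions - 1) {
        cumWeight += weight
        // Emit a bound once enough weight has accumulated; skip duplicate keys.
        if (cumWeight >= target && last.forall(ord.lt(_, key))) {
          bounds += key
          last = Some(key)
          target += step
          count += 1
        }
      }
    }
    bounds.result()
  }

  // Linear-scan lookup, mirroring the small-bounds branch of getPartition.
  def getPartition[K](bounds: Vector[K], key: K, ascending: Boolean = true)(
      implicit ord: Ordering[K]): Int = {
    var p = 0
    while (p < bounds.length && ord.gt(key, bounds(p))) p += 1
    if (ascending) p else bounds.length - p
  }
}
```

As a usage example, suppose one input partition holds 4 records that were all sampled (weight 1.0 each) and another holds 8 records of which 4 were sampled (weight 2.0 each). Calling `determineBounds` on those eight candidates with `partitions = 3` yields two bounds, and `getPartition` then maps any key to partition 0, 1 or 2 (or the mirrored index when `ascending = false`).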

One more question: if the sorted RDD is collected to the driver, does the resulting array stay sorted?

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

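The answer is yes. sc.runJob returns one array per partition, in partition index order, and Array.concat joins them in that same order. Since RangePartitioner assigns key ranges to partitions in order and each partition is sorted by key, the collected array is sorted as well.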
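A minimal local sketch of that last step, assuming the per-partition results are already available as plain sequences (`collectLocal` is a hypothetical stand-in for the `sc.runJob` call plus `Array.concat`):

```scala
import scala.reflect.ClassTag

object CollectOrderSketch {
  // One array per partition, kept in partition-index order, then concatenated.
  def collectLocal[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    val results: Seq[Array[T]] = partitions.map(_.toArray)
    Array.concat(results: _*)
  }
}
```

With range-ordered, internally sorted partitions such as `Seq(Seq(1, 2), Seq(3, 5), Seq(8))` the concatenation is globally sorted. If the partitions were not in range order, for example `Seq(Seq(3, 5), Seq(1, 2))`, the result would simply keep that partition order, which is why the ordering guarantee depends on the range partitioning.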
