[Original] Big Data Basics: Spark (6) How Spark RDD Sort Is Implemented

Spark 2.1.1

In Spark, RDD.sortBy sorts a distributed dataset. How is it actually implemented? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }

org.apache.spark.rdd.OrderedRDDFunctions

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

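The code is fairly simple. sortBy is a transformation: the caller supplies a key function f that decides what to sort by; keyBy then maps each item to (f(item), item); sortByKey builds a Partitioner that fixes the partitioning strategy (how many partitions, ascending or descending); and the result is a ShuffledRDD with the key ordering set.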
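The three steps (keyBy, sort on the key, drop the key) can be sketched on a local collection. This is only an illustration under stated assumptions: the object SortBySketch and its two methods are hypothetical local stand-ins, and a plain in-memory sort takes the place of the RangePartitioner shuffle.

```scala
object SortBySketch {
  // Local stand-in for RDD.keyBy: pair every element with its sort key.
  def keyBy[T, K](items: Seq[T])(f: T => K): Seq[(K, T)] =
    items.map(x => (f(x), x))

  // Local stand-in for RDD.sortBy: keyBy, sort on the key, then keep the values.
  // Assumption: an in-memory sort replaces the shuffle that sortByKey performs.
  def sortBy[T, K](items: Seq[T], ascending: Boolean = true)(f: T => K)(
      implicit ord: Ordering[K]): Seq[T] = {
    val ordering = if (ascending) ord else ord.reverse
    keyBy(items)(f).sortBy(_._1)(ordering).map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("pear", "fig", "banana")
    println(sortBy(words)(_.length))                    // shortest word first
    println(sortBy(words, ascending = false)(_.length)) // longest word first
  }
}
```

As in the real method, descending order is obtained by reversing the Ordering rather than by reversing the output.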

For how ShuffledRDD works, see https://www.cnblogs.com/barneywill/p/10158457.html

The part worth a closer look here is RangePartitioner:

org.apache.spark.RangePartitioner

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * @note The actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
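The lookup done by getPartition can be tried out on a small array of bounds. The sketch below mirrors only the linear-scan branch quoted above; GetPartitionSketch and partitionFor are hypothetical names, and the bounds 10, 20, 30 are made-up sample values.

```scala
object GetPartitionSketch {
  // Walk the upper bounds until the key is no longer greater than the current one.
  // A key equal to a bound stays in the partition that the bound closes.
  def partitionFor[K](bounds: Array[K], key: K, ascending: Boolean = true)(
      implicit ord: Ordering[K]): Int = {
    var partition = 0
    while (partition < bounds.length && ord.gt(key, bounds(partition))) {
      partition += 1
    }
    // For a descending sort the partition index is mirrored.
    if (ascending) partition else bounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20, 30) // 3 bounds describe 4 partitions
    for (key <- Seq(5, 10, 11, 31)) {
      println(s"key $key -> partition ${partitionFor(bounds, key)}")
    }
  }
}
```

With three bounds there are four partitions, which matches numPartitions = rangeBounds.length + 1.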

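Here rangeBounds is computed from the number of partitions; the bounds play much the same role as the pivots in quicksort.

For example, suppose the cluster has 10 nodes and we sort 100 million records into 100 partitions. Ideally the records split evenly into 100 ranges, each node holds 10 of them, and each range is sorted locally with no data skew.
That is hard to achieve, though. Splitting the data evenly really means choosing boundaries, that is, the minimum and maximum key of every range, and choosing them exactly requires a full pass over all the data.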
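To make the cost of the exact approach concrete, here is a minimal sketch that computes exact boundaries for a local collection. It is not something Spark does; ExactBoundsSketch and exactBounds are hypothetical names, and the point is only that every key has to be sorted before a single boundary is known.

```scala
object ExactBoundsSketch {
  // Exact upper bounds for the first (partitions - 1) ranges:
  // sort every key, then take the last key of each equal-sized slice.
  def exactBounds(keys: Seq[Int], partitions: Int): Seq[Int] = {
    require(partitions >= 1, "need at least one partition")
    require(keys.length >= partitions, "need at least one key per partition")
    val sorted = keys.sorted // touches the whole data set
    val n = sorted.length
    (1 until partitions).map(i => sorted(i * n / partitions - 1))
  }

  def main(args: Array[String]): Unit = {
    // 100 keys in reverse order, split into 4 ranges of 25 keys each.
    println(exactBounds((1 to 100).reverse, 4))
  }
}
```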

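Spark instead samples the data to learn its distribution and settles for ranges that are approximately even. It aims for sampleSize samples overall (20 per output partition, capped at 1e6) and draws up to sampleSizePerPartition from each input partition. An input partition that turns out to hold far more than its share is re-sampled at the overall sampling fraction, which keeps the sample as evenly spread as possible. determineBounds then picks the boundaries from the weighted samples to produce rangeBounds, and with rangeBounds in hand getPartition can tell which new partition each of the 100 million records belongs to.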
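Plugging the example into the formulas from the rangeBounds initializer gives a feel for the numbers. Assuming the input RDD also has 100 partitions (the example does not say), sampleSize is 2000, sampleSizePerPartition is 60, and the fraction is 2000 / 100,000,000 = 0.00002, so an input partition is re-sampled only if it holds more than 3,000,000 records. The sketch below repeats that arithmetic and adds a simplified weighted boundary pick; SamplingSketch and pickBounds are hypothetical names, and pickBounds is an approximation of the idea behind determineBounds, not a copy of Spark's code.

```scala
object SamplingSketch {
  // Same formulas as in the rangeBounds initializer.
  def sampleSize(partitions: Int): Double = math.min(20.0 * partitions, 1e6)

  def sampleSizePerPartition(partitions: Int, inputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize(partitions) / inputPartitions).toInt

  // Simplified weighted pick: walk the candidates in key order and emit a bound
  // each time the accumulated weight crosses another 1/partitions of the total.
  def pickBounds(candidates: Seq[(Int, Float)], partitions: Int): Seq[Int] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2.toDouble).sum / partitions
    val bounds = scala.collection.mutable.ArrayBuffer.empty[Int]
    var cumWeight = 0.0
    var target = step
    ordered.foreach { case (key, weight) =>
      cumWeight += weight
      val stillNeeded = bounds.length < partitions - 1
      val isNewKey = bounds.lastOption.forall(_ < key) // skip duplicate bounds
      if (stillNeeded && cumWeight >= target && isNewKey) {
        bounds += key
        target += step
      }
    }
    bounds.toList
  }

  def main(args: Array[String]): Unit = {
    println(sampleSize(100))
    println(sampleSizePerPartition(100, 100))
    // Eight equally weighted samples split into four ranges.
    println(pickBounds((1 to 8).map(k => (k, 1.0f)), 4))
  }
}
```

The weight attached to each sample stands for how many real records that sample represents, which is why samples from a sparsely sampled partition count for more when the boundaries are chosen.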

One more question: if the sorted RDD is collected to the driver, is the resulting array still in sorted order?

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

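The answer is yes. sc.runJob returns one array per partition, in partition order, and Array.concat joins them in that same order. Since every partition is sorted internally and the ranges themselves are ordered, the concatenated array is sorted as well.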
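The concatenation step can be reproduced locally. In this sketch the per-partition arrays are hand-written stand-ins for what each task would return after a range-partitioned sort; CollectOrderSketch and its collect method are hypothetical names.

```scala
import scala.reflect.ClassTag

object CollectOrderSketch {
  // Same final step as RDD.collect: join the per-partition arrays in partition order.
  def collect[T: ClassTag](partitions: Seq[Array[T]]): Array[T] =
    Array.concat(partitions: _*)

  def main(args: Array[String]): Unit = {
    // Each array is sorted, and every key in partition i is <= every key in partition i + 1.
    val partitions = Seq(Array(1, 2), Array(3, 5), Array(8))
    println(collect(partitions).mkString(", "))
  }
}
```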
