[Original] Big Data Basics: Spark (6) How Spark RDD Sort Is Implemented

Spark 2.1.1

In Spark, distributed data can be sorted with RDD.sortBy. How is this actually implemented? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }

org.apache.spark.rdd.OrderedRDDFunctions

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

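The code is fairly simple. sortBy is a transformation: the caller supplies a key function f (what to sort by), keyBy maps each item to (f(item), item), sortByKey builds a Partitioner that encodes the partitioning strategy (number of partitions, ascending or descending), and the result is a ShuffledRDD. One caveat: constructing the RangePartitioner samples the input right away, so sortByKey launches a job even before any action is called on the result.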
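The keyBy, sortByKey, values chain can be sketched on an ordinary local collection. This is only an illustration of the data flow, not Spark code: `SortBySketch` and `sortByLocal` are hypothetical names, and a local `sortBy` stands in for the distributed shuffle.

```scala
object SortBySketch {
  // Local stand-in for RDD.sortBy: keyBy -> sort by key -> values.
  def sortByLocal[T, K](xs: Seq[T], ascending: Boolean = true)(f: T => K)(
      implicit ord: Ordering[K]): Seq[T] = {
    // keyBy: pair every element with its sort key.
    val keyed = xs.map(x => (f(x), x))
    // sortByKey: Spark reverses the ordering for descending sorts in the same way.
    val keyOrdering = if (ascending) ord else ord.reverse
    val sorted = keyed.sortBy(_._1)(keyOrdering)
    // values: drop the keys again.
    sorted.map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("pear", "fig", "banana")
    println(sortByLocal(words)(_.length))
    println(sortByLocal(words, ascending = false)(_.length))
  }
}
```

The key is computed once per element and carried along with it, which is why the real implementation can shuffle (key, item) pairs and discard the keys at the end.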

For how ShuffledRDD works, see https://www.cnblogs.com/barneywill/p/10158457.html

The part worth a closer look here is RangePartitioner:

org.apache.spark.RangePartitioner

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * @note The actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

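rangeBounds are determined from the number of partitions. They play a role much like the pivots in QuickSort: each bound is the upper limit of one output partition.

For example, suppose the cluster has 10 nodes, 100 million records need sorting, and the number of partitions is 100. Ideally the records would be split into 100 equal ranges, with each node holding 10 of them and sorting them locally, so that there is no data skew.
That is hard to achieve, however. Splitting evenly really means choosing the boundaries (the minimum and maximum key of each range), and doing that exactly requires a full pass over all the data.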
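Once the boundaries exist, routing a key to its range is a simple lookup. The sketch below mirrors the linear-scan branch of getPartition on a plain Scala collection; `RangeLookupSketch`, `partitionOf`, and the bounds 25/50/75 are made up for illustration.

```scala
object RangeLookupSketch {
  // `bounds` holds the sorted upper bounds of the first (n - 1) partitions.
  def partitionOf[K](bounds: IndexedSeq[K], key: K, ascending: Boolean = true)(
      implicit ord: Ordering[K]): Int = {
    var p = 0
    // Advance while the key is strictly greater than the current upper bound.
    while (p < bounds.length && ord.gt(key, bounds(p))) p += 1
    // For a descending sort the partition index is mirrored.
    if (ascending) p else bounds.length - p
  }

  def main(args: Array[String]): Unit = {
    val bounds = Vector(25, 50, 75) // three bounds, hence four partitions
    Seq(10, 25, 26, 99).foreach { k =>
      println(s"$k -> ${partitionOf(bounds, k)} (descending: ${partitionOf(bounds, k, ascending = false)})")
    }
  }
}
```

A key equal to a bound stays in the lower partition, because the comparison is strictly greater-than.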

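Spark instead samples the data to learn its distribution and arrive at approximately balanced ranges. It aims for sampleSize = min(20 × partitions, 1e6) sampled keys in total and takes up to sampleSizePerPartition = ceil(3 × sampleSize / number of input partitions) keys from each input partition. Any input partition that turns out to hold far more than its share of records is re-sampled, so that the samples represent the data as evenly as possible. The boundaries are then derived from the weighted samples, which gives rangeBounds. With rangeBounds in hand, getPartition can tell which new partition each of the 100 million records belongs to.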
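How weighted samples turn into boundaries can be sketched as follows. This is a simplified illustration of the idea behind RangePartitioner.determineBounds, not a copy of Spark's code: `BoundsSketch` is a hypothetical name, and weights are plain Doubles here. Each sampled key carries a weight that estimates how many records it stands for, and a bound is emitted each time the cumulative weight crosses another 1/partitions share of the total.

```scala
object BoundsSketch {
  // candidates: (sampled key, weight); returns at most (partitions - 1) upper bounds.
  def determineBounds[K](candidates: Seq[(K, Double)], partitions: Int)(
      implicit ord: Ordering[K]): Vector[K] = {
    val ordered = candidates.sortBy(_._1)
    // Target weight per output partition.
    val step = ordered.map(_._2).sum / partitions
    var cumWeight = 0.0
    var target = step
    var count = 0
    var previous = Option.empty[K]
    val bounds = Vector.newBuilder[K]
    ordered.foreach { case (key, weight) =>
      if (count < partitions - 1) {
        cumWeight += weight
        // Emit a bound once enough weight has accumulated; skip duplicate keys.
        if (cumWeight >= target && previous.forall(p => ord.gt(key, p))) {
          bounds += key
          target += step
          count += 1
          previous = Some(key)
        }
      }
    }
    bounds.result()
  }

  def main(args: Array[String]): Unit = {
    // 100 uniformly weighted sample keys split into 4 ranges.
    val candidates = (1 to 100).map(k => (k, 1.0))
    println(determineBounds(candidates, 4))
  }
}
```

Because duplicate keys are skipped, heavily repeated keys can yield fewer bounds than requested, which is consistent with the @note in the class comment about the actual partition count.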

One more question: if the sorted RDD is collected to the driver, is the resulting array still in sorted order?

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

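The answer is yes. sc.runJob returns one result array per partition, in partition index order, and Array.concat joins them in that same order. Since each partition is sorted internally and covers a key range that precedes the next partition's range, the concatenated array is globally sorted.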
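The ordering guarantee can be checked with a small local simulation: assign keys to range partitions, sort each partition on its own, then concatenate in partition index order. `CollectOrderSketch`, `rangeSort`, and the sample data are hypothetical; no Spark is involved.

```scala
object CollectOrderSketch {
  def rangeSort(data: Seq[Int], bounds: Vector[Int]): Seq[Int] = {
    // Partition index = number of bounds strictly below the key
    // (the same result as a linear scan over sorted bounds).
    val parts = data.groupBy(k => bounds.count(b => k > b))
    // Sort every partition independently, as each task would.
    val sortedParts = (0 to bounds.length).map(i => parts.getOrElse(i, Seq.empty[Int]).sorted)
    // Concatenate in partition index order, as collect() does.
    sortedParts.flatten
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(42, 7, 99, 25, 60, 3, 75, 51)
    println(rangeSort(data, Vector(25, 50, 75)))
  }
}
```

No merge step is needed on the driver, because the range partitioning already fixed the relative order of the partitions.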
