Notes on the RangePartitioner Implementation

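Summary:

    1. Background

    2. Walkthrough of the rangeBounds upper-bound array source

    3. Walkthrough of RangePartitioner's sketch source

    4. Walkthrough of the determineBounds source

    5. Experiments with RangePartitioner and sortByKey

Content: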

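1. Background: this post fills a gap about RangePartitioner left open in my earlier post "Spark RDD 核心总结" (Spark RDD core summary). While writing it I ran into another gap (XORShiftRandom, an algorithm for generating random numbers), which I will summarize when I have time.

RangePartitioner is one of Spark's Partitioner implementations and is used by the sort operator (sortByKey). It assigns contiguous key ranges to partitions and, compared with HashPartitioner, tries to keep the amount of data in each partition roughly even.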

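2. Walkthrough of the rangeBounds upper-bound array source

rangeBounds is an Array that holds the upper bounds of the first (partitions - 1) partitions.

The target sample size is 20 keys per output partition, capped at 1M. The per-input-partition sample size is then over-sampled by a factor of 3, on the assumption that the input partitions are roughly balanced.

Next, sketch(rdd.map(_._1), sampleSizePerPartition) is called to sample the keys; it is described in detail below.

If an input partition holds far more items than average (fraction * n > sampleSizePerPartition), it is marked as imbalanced and re-sampled with sample at the desired sampling probability, so that enough keys are collected from it.

Finally, determineBounds(candidates, partitions) is called to return the rangeBounds for the partitions; this method is also covered in detail below.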

// An array of upper bounds for the first (partitions - 1) partitions

  private var rangeBounds: Array[K] = {

    if (partitions <= 1) {

      Array.empty

    } else {

      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.

      val sampleSize = math.min(20.0 * partitions, 1e6)

      // Assume the input partitions are roughly balanced and over-sample a little bit.

      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt

      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)

      if (numItems == 0L) {

        Array.empty

      } else {

        // If a partition contains much more than the average number of items, we re-sample from it

        // to ensure that enough items are collected from that partition.

        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)

        val candidates = ArrayBuffer.empty[(K, Float)]

        val imbalancedPartitions = mutable.Set.empty[Int]

        sketched.foreach { case (idx, n, sample) =>

          if (fraction * n > sampleSizePerPartition) {

            imbalancedPartitions += idx

          } else {

            // The weight is 1 over the sampling probability.

            val weight = (n.toDouble / sample.length).toFloat

            for (key <- sample) {

              candidates += ((key, weight))

            }

          }

        }

        if (imbalancedPartitions.nonEmpty) {

          // Re-sample imbalanced partitions with the desired sampling probability.

          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)

          val seed = byteswap32(-rdd.id - 1)

          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()

          val weight = (1.0 / fraction).toFloat

          candidates ++= reSampled.map(x => (x, weight))

        }

        RangePartitioner.determineBounds(candidates, partitions)

      }

    }

  }
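To make the sizing arithmetic above concrete, here is a minimal sketch in plain Scala (no Spark required). The object name SampleSizeDemo, its helper names, and the item counts are made up for illustration; only the formulas are taken from the listing above.

```scala
object SampleSizeDemo {
  // Target number of sampled keys overall: 20 per output partition, capped at 1M.
  def sampleSize(partitions: Int): Double = math.min(20.0 * partitions, 1e6)

  // Per-input-partition sample size, over-sampled by a factor of 3.
  def sampleSizePerPartition(partitions: Int, numInputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize(partitions) / numInputPartitions).toInt

  // Sampling probability that would yield roughly sampleSize keys overall.
  def fraction(partitions: Int, numItems: Long): Double =
    math.min(sampleSize(partitions) / math.max(numItems, 1L), 1.0)

  // Indices of input partitions that would be flagged as imbalanced and re-sampled.
  def imbalanced(partitions: Int, counts: Seq[Long]): Seq[Int] = {
    val perPartition = sampleSizePerPartition(partitions, counts.length)
    val f = fraction(partitions, counts.sum)
    counts.zipWithIndex.collect { case (n, idx) if f * n > perPartition => idx }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: 4 input partitions, the last one heavily skewed.
    val counts = Seq(1000L, 1000L, 1000L, 97000L)
    println(s"sampleSize             = ${sampleSize(10)}")
    println(s"sampleSizePerPartition = ${sampleSizePerPartition(10, counts.length)}")
    println(s"fraction               = ${fraction(10, counts.sum)}")
    println(s"imbalanced partitions  = ${imbalanced(10, counts)}")
  }
}
```

With 10 output partitions and these counts, the target is 200 keys, each input partition is asked for at most 150 keys, and only the skewed partition (index 3) exceeds the threshold, since 0.002 * 97000 is about 194, which is greater than 150.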

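3. Walkthrough of RangePartitioner's sketch source

The code below follows the call into the RangePartitioner companion object, which mainly contains the following two methods:

sketch(rdd.map(_._1), sampleSizePerPartition) returns the total number of items together with an Array of triples (partition id, number of items in that partition, all keys sampled from it). It uses reservoir sampling; see the post "蓄水池（Reservoir_sampling）抽样算法简记" (notes on the reservoir sampling algorithm).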

private[spark] object RangePartitioner {

  /**

   * Sketches the input RDD via reservoir sampling on each partition.

   *

   * @param rdd the input RDD to sketch

   * @param sampleSizePerPartition max sample size per partition

   * @return (total number of items, an array of (partitionId, number of items, sample))

   */

  def sketch[K : ClassTag](

      rdd: RDD[K],

      sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {

    val shift = rdd.id

    val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>

      val seed = byteswap32(idx ^ (shift << 16))

      val (sample, n) = SamplingUtils.reservoirSampleAndCount(

        iter, sampleSizePerPartition, seed)

      Iterator((idx, n, sample))

    }.collect()

    val numItems = sketched.map(_._2).sum

    (numItems, sketched)

  }
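The body of SamplingUtils.reservoirSampleAndCount is not shown in this post. As a rough idea of what such a helper does, here is a minimal reservoir-sampling sketch in plain Scala. It is an assumption-laden stand-in, not Spark's code: it uses scala.util.Random instead of XORShiftRandom, and the object name ReservoirDemo is made up.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ReservoirDemo {
  // Returns up to k uniformly sampled items plus the total number of items seen.
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
    val reservoir = new Array[T](k)
    // Fill the reservoir with the first k items.
    var i = 0
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }
    if (i < k) {
      // Fewer than k items in total: return everything that was read.
      (reservoir.take(i), i.toLong)
    } else {
      var n = i.toLong
      val rand = new Random(seed)
      while (input.hasNext) {
        val item = input.next()
        n += 1
        // The n-th item replaces a random slot with probability k / n.
        val replacementIndex = (rand.nextDouble() * n).toLong
        if (replacementIndex < k) reservoir(replacementIndex.toInt) = item
      }
      (reservoir, n)
    }
  }

  def main(args: Array[String]): Unit = {
    val (sample, n) = reservoirSampleAndCount((1 to 1000).iterator, 10, seed = 42L)
    println(s"saw $n items, kept ${sample.length}: ${sample.mkString(", ")}")
  }
}
```

Because the count n comes back alongside the sample, the caller can compute the weight n / sample.length for every sampled key, which is exactly what the rangeBounds code does.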

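4. Walkthrough of the determineBounds source:

determineBounds(candidates, partitions) returns the actual keys used as partition upper bounds. candidates holds each sampled key together with its weight, i.e. roughly how many items that key represents (usually 1 over the probability with which it was sampled).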

/**

   * Determines the bounds for range partitioning from candidates with weights indicating how many

   * items each represents. Usually this is 1 over the probability used to sample this candidate.

   *

   * @param candidates unordered candidates with weights

   * @param partitions number of partitions

   * @return selected bounds

   */

  def determineBounds[K : Ordering : ClassTag](

      candidates: ArrayBuffer[(K, Float)],

      partitions: Int): Array[K] = {

    val ordering = implicitly[Ordering[K]]

    val ordered = candidates.sortBy(_._1)

    val numCandidates = ordered.size

    val sumWeights = ordered.map(_._2.toDouble).sum

    val step = sumWeights / partitions

    var cumWeight = 0.0

    var target = step

    val bounds = ArrayBuffer.empty[K]

    var i = 0

    var j = 0

    var previousBound = Option.empty[K]

    while ((i < numCandidates) && (j < partitions - 1)) {

      val (key, weight) = ordered(i)

      cumWeight += weight

      if (cumWeight >= target) {

        // Skip duplicate values.

        if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {

          bounds += key

          target += step

          j += 1

          previousBound = Some(key)

        }

      }

      i += 1

    }

    bounds.toArray

  }
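A small worked example may help here. The sketch below restates the loop above in condensed form for Int keys and adds a hypothetical partitionOf helper showing how upper bounds can map a key to a partition index. The names BoundsDemo and partitionOf are made up; Spark's own lookup method is not shown in this post.

```scala
import scala.collection.mutable.ArrayBuffer

object BoundsDemo {
  // Condensed restatement of the determineBounds loop, specialised to Int keys.
  def determineBounds(candidates: Seq[(Int, Float)], partitions: Int): Array[Int] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2.toDouble).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[Int]
    var i = 0
    while (i < ordered.size && bounds.size < partitions - 1) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      // Emit a bound once the cumulative weight reaches the target, skipping duplicates.
      if (cumWeight >= target && (bounds.isEmpty || key > bounds.last)) {
        bounds += key
        target += step
      }
      i += 1
    }
    bounds.toArray
  }

  // Hypothetical helper: index of the first upper bound that is >= key,
  // or the last partition if the key is larger than every bound.
  def partitionOf(key: Int, bounds: Array[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx == -1) bounds.length else idx
  }

  def main(args: Array[String]): Unit = {
    // Keys 1..10, each representing one item, split into 3 partitions.
    val bounds = determineBounds((1 to 10).map(k => (k, 1.0f)), 3)
    println(bounds.mkString("bounds = [", ", ", "]"))
    (1 to 10).foreach(k => println(s"key $k -> partition ${partitionOf(k, bounds)}"))
  }
}
```

For keys 1 to 10 with weight 1 each, the total weight is 10 and step is about 3.33, so the cumulative weight first reaches the target at key 4 and then at key 7. The bounds are therefore 4 and 7: keys up to 4 go to partition 0, keys 5 to 7 to partition 1, and the rest to partition 2.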

5. Experiments with RangePartitioner and sortByKey

How RangePartitioner is used in sortByKey:

What is returned is a ShuffledRDD that uses RangePartitioner as its partitioner.

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)] = self.withScope

  {

    val part = new RangePartitioner(numPartitions, self, ascending)

    new ShuffledRDD[K, V, V](self, part)

      .setKeyOrdering(if (ascending) ordering else ordering.reverse)

  }
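The reason this produces a globally sorted result can be simulated without Spark: route each record to a partition by key range, sort inside each partition, and read the partitions in order. The sketch below is a plain-Scala illustration under that assumption; RangeSortDemo, rangeSort, and the hard-coded bounds are made up and do not come from Spark.

```scala
object RangeSortDemo {
  // Splits records into bounds.length + 1 partitions by key range,
  // then sorts each partition by key.
  def rangeSort(data: Seq[(Int, String)], bounds: Array[Int]): Seq[Seq[(Int, String)]] = {
    val numPartitions = bounds.length + 1
    // Index of the first upper bound >= key, or the last partition otherwise.
    def partitionOf(key: Int): Int = {
      val idx = bounds.indexWhere(key <= _)
      if (idx == -1) bounds.length else idx
    }
    val grouped = data.groupBy { case (k, _) => partitionOf(k) }
    (0 until numPartitions).map { p =>
      grouped.getOrElse(p, Seq.empty[(Int, String)]).sortBy(_._1)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(5 -> "e", 1 -> "a", 9 -> "i", 3 -> "c", 7 -> "g")
    val partitions = rangeSort(data, Array(3, 6))
    partitions.zipWithIndex.foreach { case (part, idx) =>
      println(s"partition $idx: ${part.mkString(", ")}")
    }
    // Concatenating the partitions in index order yields a globally sorted sequence.
    println(partitions.flatten.map(_._1).mkString(" < "))
  }
}
```

Since every key in partition i is no larger than every key in partition i + 1, sorting within each partition is enough; no merge step across partitions is needed.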

Below are the experiments I ran on RangePartitioner and sortByKey:

A hand-written sortByKey
