repartition

def repartition(numPartitions: Int): JavaRDD[T]

   /**
    * Return a new RDD that has exactly numPartitions partitions.
    *
    * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
    * a shuffle to redistribute data.
    *
    * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
    * which can avoid performing a shuffle.
    */
   def repartition(numPartitions: Int): JavaRDD[T] = rdd.repartition(numPartitions)

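In other words, repartition returns a new RDD that has exactly numPartitions partitions.

It can either increase or decrease the level of parallelism of the RDD. Internally, it uses a shuffle to redistribute the data.

If you only need to reduce the number of partitions, consider using coalesce, which can avoid performing a shuffle.

This method is defined in org.apache.spark.api.java.JavaRDD.

What it actually calls is repartition in org.apache.spark.rdd.RDD: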

   /**
    * Return a new RDD that has exactly numPartitions partitions.
    *
    * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
    * a shuffle to redistribute data.
    *
    * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
    * which can avoid performing a shuffle.
    */
   def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
     coalesce(numPartitions, shuffle = true)
   }

As the code above shows, this is still not the final implementation: it calls coalesce(numPartitions, shuffle = true), which is implemented as follows:

   /**
    * Return a new RDD that is reduced into `numPartitions` partitions.
    *
    * This results in a narrow dependency, e.g. if you go from 1000 partitions
    * to 100 partitions, there will not be a shuffle, instead each of the 100
    * new partitions will claim 10 of the current partitions.
    *
    * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
    * this may result in your computation taking place on fewer nodes than
    * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
    * you can pass shuffle = true. This will add a shuffle step, but means the
    * current upstream partitions will be executed in parallel (per whatever
    * the current partitioning is).
    *
    * Note: With shuffle = true, you can actually coalesce to a larger number
    * of partitions. This is useful if you have a small number of partitions,
    * say 100, potentially with a few partitions being abnormally large. Calling
    * coalesce(1000, shuffle = true) will result in 1000 partitions with the
    * data distributed using a hash partitioner.
    */
   def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
       : RDD[T] = withScope {
     if (shuffle) {
       /** Distributes elements evenly across output partitions, starting from a random partition. */
       val distributePartition = (index: Int, items: Iterator[T]) => {
         var position = (new Random(index)).nextInt(numPartitions)
         items.map { t =>
           // Note that the hash code of the key will just be the key itself. The HashPartitioner
           // will mod it with the number of total partitions.
           position = position + 1
           (position, t)
         }
       } : Iterator[(Int, T)]

       // include a shuffle step so that our upstream tasks are still distributed
       new CoalescedRDD(
         new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
         new HashPartitioner(numPartitions)),
         numPartitions).values
     } else {
       new CoalescedRDD(this, numPartitions)
     }
   }

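This method returns a new RDD that is reduced into numPartitions partitions.

Without a shuffle this is a narrow dependency: going from 1000 partitions to 100, for example, involves no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions.

However, a drastic coalesce, for example to numPartitions = 1, may cause the computation to run on fewer nodes than you would like (a single node when numPartitions = 1). Parallelism drops and the job can no longer take full advantage of the distributed environment.

To avoid this, pass shuffle = true. This adds a shuffle step, but the upstream partitions are still computed in parallel, according to whatever the current partitioning is.

Note: with shuffle = true you can also coalesce to a larger number of partitions.

This is useful when you have a small number of partitions, say 100, a few of which may be abnormally large: calling coalesce(1000, shuffle = true) produces 1000 partitions, with the data distributed by a hash partitioner.

As the source above shows, def repartition(numPartitions: Int): JavaRDD[T] simply calls coalesce(numPartitions, shuffle = true). It always triggers a shuffle, and records are placed by a HashPartitioner. Because the keys are consecutive integers whose hash code is the key itself, the partitioner's modulo spreads the elements roughly evenly across the output partitions.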
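To make the distribution step more concrete, here is a minimal sketch in plain Scala (no Spark dependency) that imitates distributePartition and the modulo performed by HashPartitioner. The object name RepartitionSketch and the helpers assignKeys and targetPartition are hypothetical names chosen for illustration; they are not part of Spark.

```scala
import scala.util.Random

object RepartitionSketch {
  // Imitates distributePartition from the quoted source: each input partition starts at a
  // pseudo-random position (seeded by its index) and gives consecutive keys to its items.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Stand-in for HashPartitioner (an assumption of this sketch): an Int key hashes to
  // itself, so the target partition is key mod numPartitions. Keys are never negative here.
  def targetPartition(key: Int, numPartitions: Int): Int = key % numPartitions

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    // Ten items from input partition 0, counted per output partition.
    val counts = assignKeys(0, (1 to 10).iterator, numPartitions)
      .toList
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (partition, pairs) => partition -> pairs.size }
    println(counts.toList.sortBy(_._1))
  }
}
```

Because the keys are consecutive, the per-partition counts in this sketch differ by at most one, which is the round-robin effect the source comment describes as distributing elements evenly.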

coalesce

def coalesce(numPartitions: Int): JavaRDD[T]

   /**
    * Return a new RDD that is reduced into `numPartitions` partitions.
    */
   def coalesce(numPartitions: Int): JavaRDD[T] = rdd.coalesce(numPartitions)

This method calls def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null) : RDD[T] in org.apache.spark.rdd.RDD.

It is the same method that repartition calls above, except that here the shuffle parameter keeps its default value of false.

What actually runs is new CoalescedRDD(this, numPartitions), which does not trigger a shuffle.
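The narrow dependency can be pictured as each new partition claiming a contiguous range of parent partitions. The following plain-Scala sketch illustrates that idea only; NarrowCoalesceSketch and groupParents are hypothetical names, and the real CoalescedRDD may also take data locality into account when grouping.

```scala
object NarrowCoalesceSketch {
  // Assigns parent partition indices to at most `target` groups in contiguous ranges.
  // Simplifying assumption: locality is ignored, and a target larger than the parent
  // count is capped, because without a shuffle a parent partition cannot be split.
  def groupParents(parentCount: Int, target: Int): Vector[Vector[Int]] = {
    val groups = math.min(parentCount, target)
    (0 until groups).map { g =>
      val start = (g.toLong * parentCount / groups).toInt
      val end = ((g + 1).toLong * parentCount / groups).toInt
      (start until end).toVector
    }.toVector
  }

  def main(args: Array[String]): Unit = {
    val grouped = groupParents(1000, 100)
    // Each of the 100 groups claims 10 of the 1000 parent partitions.
    println(s"${grouped.size} groups, first group: ${grouped.head}")
  }
}
```

No data moves between groups in this model, which is why no shuffle is needed and why asking for more groups than there are parent partitions cannot increase the count.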

def coalesce(numPartitions: Int, shuffle: Boolean): JavaRDD[T]

   /**
    * Return a new RDD that is reduced into `numPartitions` partitions.
    */
   def coalesce(numPartitions: Int, shuffle: Boolean): JavaRDD[T] =
     rdd.coalesce(numPartitions, shuffle)

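This overload is similar to coalesce(numPartitions: Int) above, except that shuffle no longer defaults to false but is specified by the caller: when shuffle is true a shuffle is triggered, otherwise it is not.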

Demo

 scala> var rdd1=sc.textFile("hdfs://file.txt")
 rdd1: org.apache.spark.rdd.RDD[String] = hdfs://file.txt MapPartitionsRDD[20] at textFile at <console>:27

 // the default number of partitions is 177
 scala> rdd1.partitions.size
 res12: Int = 177

 // call coalesce(10) to reduce the number of partitions
 scala> var rdd2 = rdd1.coalesce(10)
 rdd2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[21] at coalesce at <console>:29

 // the number of partitions drops to 10
 scala> rdd2.partitions.size
 res13: Int = 10

 // try to increase the number of partitions directly to 200
 scala> var rdd2 = rdd1.coalesce(200)
 rdd2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[22] at coalesce at <console>:29

 // the call has no effect
 scala> rdd2.partitions.size
 res14: Int = 177

 // set shuffle to true and increase the number of partitions to 200
 scala> var rdd2 = rdd1.coalesce(200,true)
 rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at coalesce at <console>:29

 // the repartitioning takes effect
 scala> rdd2.partitions.size
 res15: Int = 200

 ------------------------------------------------------------------------------------------------

 // use repartition to increase the number of partitions to 200
 scala> var rdd2 = rdd1.repartition(200)
 rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[30] at repartition at <console>:29

 // the increase takes effect
 scala> rdd2.partitions.size
 res16: Int = 200

 // use repartition to reduce the number of partitions to 10
 scala> var rdd2 = rdd1.repartition(10)
 rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[34] at repartition at <console>:29

 // the reduction takes effect
 scala> rdd2.partitions.size
 res17: Int = 10

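Summary

coalesce(numPartitions: Int)

When the new partition count is smaller than the current one, the change takes effect and no shuffle is triggered;

When the new partition count is larger than the current one, the call has no effect and the partition count stays the same.

coalesce(numPartitions: Int, shuffle: Boolean)

When shuffle is true, the new partition count takes effect whether it is larger or smaller than the current one, and a shuffle is triggered; this is equivalent to repartition(numPartitions: Int);

When shuffle is false, it is equivalent to coalesce(numPartitions: Int).

def repartition(numPartitions: Int)

The new partition count takes effect whether it is larger or smaller than the current one, and a shuffle is triggered;

In short, repartition is simply coalesce(numPartitions: Int, shuffle: Boolean) with shuffle set to true.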
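The rules in this summary can be condensed into a small model of the resulting partition count. This is a simplified sketch based on the demo session, not Spark code; PartitionCountSketch, resultingPartitions and repartitionCount are hypothetical names used only for illustration.

```scala
object PartitionCountSketch {
  // Simplified model: with a shuffle the requested count always wins; without one the
  // count can only shrink, so a larger request leaves the current count unchanged.
  def resultingPartitions(current: Int, requested: Int, shuffle: Boolean): Int =
    if (shuffle) requested else math.min(current, requested)

  // Per the quoted source, repartition(n) is coalesce(n, shuffle = true).
  def repartitionCount(current: Int, requested: Int): Int =
    resultingPartitions(current, requested, shuffle = true)

  def main(args: Array[String]): Unit = {
    println(resultingPartitions(177, 10, shuffle = false))  // prints 10
    println(resultingPartitions(177, 200, shuffle = false)) // prints 177
    println(resultingPartitions(177, 200, shuffle = true))  // prints 200
    println(repartitionCount(177, 200))                     // prints 200
    println(repartitionCount(177, 10))                      // prints 10
  }
}
```

The five calls reproduce the partition counts observed in the demo (res13 through res17), starting from the 177 partitions of rdd1.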

This is a summary of my own study and work; please credit the source when reposting.
