The difference between repartition and partitionBy in Spark

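repartition and partitionBy both redistribute data across partitions, and both typically rely on HashPartitioner (repartition uses one internally, and partitionBy is usually passed one). The difference is that partitionBy can only be used on a PairRDD. Yet when both are applied to the same PairRDD, the results are not the same.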
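
The comparison can be reproduced with a short local-mode program. This is only a sketch: the object name, the sample data and the layout helper are made up for illustration, and it assumes Spark is on the classpath. It prints which records land in each partition after partitionBy and after repartition.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionVsPartitionBy {
  // Collects the records of every partition to the driver so the layout can be inspected.
  def layout(rdd: RDD[(String, Int)]): Array[List[(String, Int)]] =
    rdd.glom().collect().map(_.toList)

  // Hypothetical sample data: three keys with three records each, in two input partitions.
  def sample(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq(
      "a" -> 1, "b" -> 2, "c" -> 3,
      "a" -> 4, "b" -> 5, "c" -> 6,
      "a" -> 7, "b" -> 8, "c" -> 9), 2)

  // Prints one line per partition.
  def show(label: String, rdd: RDD[(String, Int)]): Unit = {
    println(label)
    layout(rdd).zipWithIndex.foreach { case (records, i) =>
      println(s"  partition $i: $records")
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-vs-partitionBy")
    val sc = new SparkContext(conf)
    try {
      val pairs = sample(sc)
      // Placement is decided by the hash of each record's key.
      show("partitionBy(new HashPartitioner(3)):", pairs.partitionBy(new HashPartitioner(3)))
      // Placement is decided without looking at the record's key.
      show("repartition(3):", pairs.repartition(3))
    } finally {
      sc.stop()
    }
  }
}
```

With partitionBy, all records of a given key sit in a single partition. With repartition, records of the same key may show up in several partitions.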

It is easy to see that the partitionBy result is the one we expect, so let's open the source of repartition and take a look:

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]
      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

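Even for a PairRDD, repartition does not use the record's own key. It actually uses a generated number (a counter that starts from a random position) as the key, not the original key!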
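
To make this concrete, the key assignment can be modelled in plain Scala without a cluster. The sketch below is a simplified model, not Spark's code: the object and function names are hypothetical, and input partitions are represented as ordinary sequences. It copies the arithmetic of distributePartition from the listing above and compares it with hashing the record's real key.

```scala
import java.util.Random
import scala.util.hashing.byteswap32

object RepartitionKeyModel {
  // Same arithmetic a hash partitioner applies: hashCode mod n, shifted to be non-negative.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Mirrors distributePartition: each record gets a synthetic Int key taken from a
  // counter that starts at a position derived only from the input partition index.
  def syntheticKeys[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Output partition of every record when the synthetic key is hash-partitioned.
  def repartitionTargets[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] =
    syntheticKeys(index, items, numPartitions).map { case (k, t) =>
      (hashPartition(k, numPartitions), t)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical input partition 0: four records that all share the key "a".
    val records = Seq("a" -> 1, "a" -> 2, "a" -> 3, "a" -> 4)
    // Hashing the real key sends every record to the same partition.
    println(records.map { case (k, _) => hashPartition(k, 3) }.distinct)
    // The synthetic key walks round-robin over the output partitions.
    println(repartitionTargets(0, records, 3).map(_._1))
  }
}
```

With four records sharing the key "a" and three output partitions, hashing the real key maps all four to one partition, while the round-robin counter necessarily touches all three.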
