The difference between repartition and partitionBy in Spark

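repartition and partitionBy both redistribute data across partitions, and both end up using a HashPartitioner: repartition creates one internally, and partitionBy is usually called with one. The obvious difference is that partitionBy is only available on a PairRDD. Yet when both are applied to the same PairRDD, the resulting partition layouts are not the same.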
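For reference, the key-based placement that partitionBy performs with a HashPartitioner can be modeled in plain Scala, without a cluster. The sketch below is not Spark's code: `HashPartitionSketch`, `partitionFor`, and the sample pairs are hypothetical names chosen for illustration, and the arithmetic assumes the usual "hashCode modulo partition count, kept non-negative" rule.

```scala
object HashPartitionSketch {
  // Model of hash partitioning: a null key goes to partition 0, any other key
  // goes to key.hashCode modulo numPartitions, shifted so it is never negative.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ =>
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: three distinct keys, two records per key.
    val pairs = Seq((1, "a"), (2, "b"), (3, "c"), (1, "d"), (2, "e"), (3, "f"))

    // Group the records by the partition their key maps to.
    val byPartition = pairs.groupBy { case (key, _) => partitionFor(key, 3) }

    byPartition.toSeq.sortBy(_._1).foreach { case (partition, items) =>
      println(s"partition $partition: ${items.mkString(", ")}")
    }
  }
}
```

Because the target partition depends only on the key, records that share a key always end up together. That is the layout one would expect from `rdd.partitionBy(new HashPartitioner(3))`.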

The partitionBy layout is the one we would expect, with records grouped by key. To see why repartition behaves differently, look at its source:

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

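Even on a PairRDD, repartition does not use the records' own keys. It wraps every record in a new pair whose key is a synthetic Int, a counter that starts from a pseudo-random offset in each input partition, and the HashPartitioner works on that synthetic key rather than on the original one.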
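To make this concrete, the tagging step can be replayed on a plain Scala iterator. This is a minimal sketch that copies the counter logic from the `distributePartition` closure in the source above. `RepartitionKeySketch`, `targetPartition`, and the sample records are hypothetical names for illustration, and `targetPartition` only approximates what a hash partitioner does with an Int key.

```scala
import scala.util.Random
import scala.util.hashing

object RepartitionKeySketch {
  // Tag each record of one input partition with a running counter. The counter
  // starts at a pseudo-random offset derived from the partition index, so the
  // original key of the record (if it has one) plays no part in the tag.
  def distributePartition[T](index: Int, items: Iterator[T],
                             numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Hash partitioning of an Int key: the hash code is the Int itself, so the
  // target is the key modulo numPartitions, kept non-negative.
  def targetPartition(syntheticKey: Int, numPartitions: Int): Int = {
    val raw = syntheticKey % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    // Hypothetical records that all share the same original key, 1.
    val records = Seq((1, "a"), (1, "b"), (1, "c"), (1, "d"))

    distributePartition(0, records.iterator, numPartitions).foreach {
      case (synthetic, record) =>
        val target = targetPartition(synthetic, numPartitions)
        println(s"record $record -> synthetic key $synthetic -> partition $target")
    }
  }
}
```

All four records carry the original key 1, yet their synthetic keys are consecutive integers, so they are dealt out across the output partitions in round-robin fashion. This keeps partition sizes even, which is the purpose of repartition, but it means records with the same key are not kept together.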
