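In Spark development, repartitioning data can improve performance, especially when joins are involved. (The speedup often far outweighs the cost of the repartition itself, so it is usually worth doing.)
Spark SQL offers two main methods for repartitioning a DataFrame, repartition and coalesce. This post compares the two.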

repartition

repartition has three overloads:

def repartition(numPartitions: Int): DataFrame

    /**
     * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
     * @group dfops
     * @since 1.3.0
     */
    def repartition(numPartitions: Int): DataFrame = withPlan {
      Repartition(numPartitions, shuffle = true, logicalPlan)
    }

This overload returns a new DataFrame that has exactly `numPartitions` partitions.

def repartition(partitionExprs: Column*): DataFrame

    /**
     * Returns a new [[DataFrame]] partitioned by the given partitioning expressions preserving
     * the existing number of partitions. The resulting DataFrame is hash partitioned.
     *
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     *
     * @group dfops
     * @since 1.6.0
     */
    @scala.annotation.varargs
    def repartition(partitionExprs: Column*): DataFrame = withPlan {
      RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, numPartitions = None)
    }

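This overload returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned. Although the doc comment says the existing number of partitions is preserved, the plan is built with numPartitions = None, and according to the RepartitionByExpression comment quoted later in this post, the partition count in that case is the value of `spark.sql.shuffle.partitions`.

This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).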

def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

    /**
     * Returns a new [[DataFrame]] partitioned by the given partitioning expressions into
     * `numPartitions`. The resulting DataFrame is hash partitioned.
     *
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     *
     * @group dfops
     * @since 1.6.0
     */
    @scala.annotation.varargs
    def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame = withPlan {
      RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, Some(numPartitions))
    }

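This overload returns a new DataFrame partitioned by the given partitioning expressions into `numPartitions` partitions. The resulting DataFrame is hash partitioned.

This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).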

coalesce

coalesce(numPartitions: Int): DataFrame

    /**
     * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
     * Similar to coalesce defined on an [[RDD]], this operation results in a narrow dependency, e.g.
     * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
     * the 100 new partitions will claim 10 of the current partitions.
     * @group rdd
     * @since 1.4.0
     */
    def coalesce(numPartitions: Int): DataFrame = withPlan {
      Repartition(numPartitions, shuffle = false, logicalPlan)
    }

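Returns a new DataFrame with `numPartitions` partitions. Like coalesce defined on an RDD, this operation results in a narrow dependency. For example:

Going from 1000 partitions to 100 does not cause a shuffle. Instead, each of the 100 new partitions claims 10 of the current partitions.

Going the other way, from 100 partitions to 1000, does not cause a shuffle either. Because coalesce is planned with shuffle = false, it cannot increase the partition count, so the result stays at 100 partitions. To increase the number of partitions, use repartition.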
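The grouping behaviour described above can be modelled in plain Scala without Spark. This is a minimal sketch, not Spark's implementation: `CoalesceSketch` and `coalesceGroups` are hypothetical names, and the model ignores data locality, which the real coalesce takes into account.

```scala
object CoalesceSketch {
  // Simplified model of a shuffle-free coalesce: every new partition is a
  // contiguous group of parent partition indices. Locality is ignored.
  def coalesceGroups(numParents: Int, target: Int): Seq[Seq[Int]] = {
    require(numParents > 0 && target > 0, "partition counts must be positive")
    // Without a shuffle the partition count cannot grow, so cap the target.
    val n = math.min(numParents, target)
    (0 until n).map { i =>
      val start = ((i.toLong * numParents) / n).toInt
      val end = (((i.toLong + 1) * numParents) / n).toInt
      start until end
    }
  }
}
```

In this model `coalesceGroups(1000, 100)` gives 100 groups of 10 parent partitions each, while `coalesceGroups(10, 100)` gives only 10 groups, one per parent partition.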

Note: coalesce(numPartitions: Int): DataFrame and repartition(numPartitions: Int): DataFrame both build the same logical plan node, class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan). They differ only in the value passed for shuffle.

    /**
     * Returns a new RDD that has exactly `numPartitions` partitions. Differs from
     * [[RepartitionByExpression]] as this method is called directly by DataFrame's, because the user
     * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is used when the consumer
     * of the output requires some specific ordering or distribution of the data.
     */
    case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
      extends UnaryNode {
      override def output: Seq[Attribute] = child.output
    }

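Returns a new RDD that has exactly `numPartitions` partitions. Unlike RepartitionByExpression, this node is created directly by DataFrame methods because the user asked for `coalesce` or `repartition`.

RepartitionByExpression is used when the consumer of the output requires a specific ordering or distribution of the data. (The source comment says RDD while the API returns a DataFrame; for this discussion the difference does not seem to matter.)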

In contrast, repartition(partitionExprs: Column*): DataFrame and repartition(numPartitions: Int, partitionExprs: Column*): DataFrame both build

class RepartitionByExpression(partitionExpressions: Seq[Expression], child: LogicalPlan, numPartitions: Option[Int] = None) extends RedistributeData

    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */
    case class RepartitionByExpression(
        partitionExpressions: Seq[Expression],
        child: LogicalPlan,
        numPartitions: Option[Int] = None) extends RedistributeData {
      numPartitions match {
        case Some(n) => require(n > 0, "numPartitions must be greater than 0.")
        case None => // Ok
      }
    }

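This node repartitions data into `numPartitions` partitions using the given expressions, and receives information about the number of partitions during execution. It is used when the consumer of the query result expects a specific ordering or distribution. For RDD-like `coalesce` and `repartition`, Repartition is used instead.

If `numPartitions` is not specified, the number of partitions is the value of `spark.sql.shuffle.partitions`.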
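To make "hash partitioned" concrete, here is a small plain-Scala sketch of the idea: rows with the same key always land in the same partition. It is an illustration under assumptions, not Spark's code. `HashPartitionSketch` and its methods are hypothetical names, and String.hashCode stands in for whatever hash function Spark applies internally.

```scala
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition id.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition id for a key. String.hashCode is a stand-in (assumption),
  // not the hash function Spark itself uses.
  def partitionFor(key: String, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // Group keys by the partition they would be assigned to.
  def distribute(keys: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    keys.groupBy(partitionFor(_, numPartitions))
}
```

Calling `HashPartitionSketch.distribute(Seq("alice", "bob", "alice"), 10)` puts both "alice" rows in the same partition, which is what lets a join on that key find matching rows without looking at other partitions.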

Usage examples

def repartition(numPartitions: Int): DataFrame

    // Get a test DataFrame that contains a `user` column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Returns a new DataFrame with 10 partitions
    testDataFrame.repartition(10)

def repartition(partitionExprs: Column*): DataFrame

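    // Needed for the $"user" column syntax
    import sqlContext.implicits._
    // Get a test DataFrame that contains a `user` column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Partition by the `user` column; the partition count is set by spark.sql.shuffle.partitions
    testDataFrame.repartition($"user")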

def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

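    // Needed for the $"user" column syntax
    import sqlContext.implicits._
    // Get a test DataFrame that contains a `user` column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Partition by the `user` column into 10 partitions. This can speed up joins considerably,
    // but watch out for data skew.
    testDataFrame.repartition(10, $"user")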
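The data skew warning can be illustrated without Spark. The sketch below counts how many rows each partition would receive when keys are placed by hash modulo the partition count, then reports how uneven the result is. `SkewSketch`, `partitionSizes` and `skewRatio` are hypothetical names, and String.hashCode is only a stand-in for Spark's internal hash.

```scala
object SkewSketch {
  // Rows per partition when each key is placed by hash modulo numPartitions.
  // String.hashCode is a stand-in (assumption) for Spark's own hash function.
  def partitionSizes(keys: Seq[String], numPartitions: Int): Vector[Int] = {
    val sizes = Array.fill(numPartitions)(0)
    keys.foreach { k =>
      val id = ((k.hashCode % numPartitions) + numPartitions) % numPartitions
      sizes(id) += 1
    }
    sizes.toVector
  }

  // Largest partition divided by the mean partition size; 1.0 means perfectly even.
  def skewRatio(sizes: Seq[Int]): Double = {
    require(sizes.nonEmpty && sizes.sum > 0, "need at least one row")
    sizes.max.toDouble / (sizes.sum.toDouble / sizes.length)
  }
}
```

If 90 out of 100 rows share one `user` value, at least 90 rows end up in a single partition out of 10, so the ratio is at least 9 and one task does most of the work while the others sit nearly idle.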

coalesce(numPartitions: Int): DataFrame

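    val testDataFrame1: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    val testDataFrame2 = testDataFrame1.repartition(10)
    // No shuffle: the 10 partitions are merged into 5
    testDataFrame2.coalesce(5)
    // No shuffle either: coalesce cannot increase the partition count,
    // so the result still has 10 partitions, not 100
    testDataFrame2.coalesce(100)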

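As for how many partitions to use, that depends on your own workload: too many wastes resources, and too few hurts performance.

This is only a first look. I plan to study the underlying implementation later.

These are my personal notes from work and study. Please credit the source when reposting!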
