Spark partitionBy
/**
* An object that defines how the elements in a key-value pair RDD are partitioned by key.
* Maps each key to a partition ID, from 0 to `numPartitions - 1`.
*/
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Print the elements held by each partition of the RDD
object PartitionBy_Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName(this.getClass.getSimpleName).getOrCreate()
    // Example key-value pairs; any small pairs and partition count will do here.
    val rdd = spark.sparkContext.parallelize(
      Array(("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("e", 1)), 4)
    val result = rdd.mapPartitionsWithIndex {
      (partIdx, iter) => {
        val part_map = scala.collection.mutable.Map[String, List[(String, Int)]]()
        while (iter.hasNext) {
          val part_name = "part_" + partIdx
          val elem = iter.next()
          if (part_map.contains(part_name)) {
            var elems = part_map(part_name)
            elems ::= elem
            part_map(part_name) = elems
          } else {
            part_map(part_name) = List[(String, Int)](elem)
          }
        }
        part_map.iterator
      }
    }.collect
    result.foreach(x => println(x._1 + ":" + x._2.toString()))
  }
}
The partitioning method here is configurable; the default is hash partitioning with HashPartitioner.
Note that if the RDD will be reused several times or joined with other RDDs, persist it after partitioning so the shuffle is not repeated.
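As a rough sketch of that advice (the partition count and storage level below are illustrative choices, not values from the original listing), repartitioning the pair RDD by key and caching the result looks like this:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Hash-partition the pair RDD by key, then persist it so that later
// joins or repeated actions reuse the shuffled layout instead of re-shuffling.
val partitioned = rdd
  .partitionBy(new HashPartitioner(4))      // 4 partitions is an arbitrary example value
  .persist(StorageLevel.MEMORY_AND_DISK)

partitioned.count()                         // first action materializes the cache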
/**
* A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
* Java's `Object.hashCode`.
*
* Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
* so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
* produce an unexpected or incorrect result.
*/
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
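The partition ID is simply the key's hashCode taken modulo numPartitions and shifted into the non-negative range. A minimal stand-alone sketch of that rule (this nonNegativeMod is a local re-implementation for illustration, not Spark's private Utils helper):

// Same arithmetic as hash partitioning: hashCode mod numPartitions, made non-negative.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 3
Seq("a", "b", "c", "e").foreach { key =>
  println(s"$key -> partition ${nonNegativeMod(key.hashCode, numPartitions)}")
}

Because Java arrays hash by identity, two arrays with the same contents can land in different partitions, which is exactly the caveat in the scaladoc above.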
Range partitioning with RangePartitioner: the keys are ordered, a sample size is determined, the keys are sampled without replacement, and the sampled keys are used to assign key ranges to partitions; sampling the keys this way helps avoid data skew.
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20)
  extends Partitioner {

  // A constructor declared in order to maintain backward compatibility for Java, when we add the
  // 4th constructor parameter samplePointsPerPartitionHint. See SPARK-22160.
  // This is added to make sure from a bytecode point of view, there is still a 3-arg ctor.
  def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
    this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
  }

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
  require(samplePointsPerPartitionHint > 0,
    s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      // Cast to double to avoid overflowing ints or longs
      val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}
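A quick usage sketch (the sample pairs and the partition count 3 are made up for illustration; RangePartitioner itself is the public class shown above):

import org.apache.spark.RangePartitioner

val pairs = spark.sparkContext.parallelize(
  Seq(("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4), ("fig", 5)))

// Build range bounds from a sample of the keys and repartition with them;
// keys end up in sorted, roughly equal-sized ranges. sortByKey uses the same mechanism.
val ranged = pairs.partitionBy(new RangePartitioner(3, pairs))

ranged.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}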
Custom partitioner: partition by your own business logic to mitigate data skew.
To implement a custom partitioner, extend org.apache.spark.Partitioner and implement the methods below (overriding equals and hashCode as well is good practice, so Spark can tell whether two RDDs are partitioned the same way):
- numPartitions: Int: returns the number of partitions to create.
- getPartition(key: Any): Int: returns the partition ID (0 to numPartitions - 1) for the given key.
// Custom partitioner: extend the Partitioner class
class UsridPartitioner(numParts: Int) extends Partitioner {
  // Override the number of partitions
  override def numPartitions: Int = numParts

  // Override the partition-ID lookup
  override def getPartition(key: Any): Int = {
    if (key.toString == "A")
      0                               // pin the skewed key "A" to a fixed partition
    else
      key.toString.toInt % numParts   // spread the remaining numeric user ids by id
  }
}
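A hypothetical run of the custom partitioner (the user-id pairs and the partition count 4 are invented for the example):

val userPairs = spark.sparkContext.parallelize(
  Seq(("A", 100), ("A", 101), ("1", 1), ("2", 2), ("37", 3)))

// Repartition with the custom partitioner defined above and print each partition,
// so the effect on the skewed key "A" is visible.
val custom = userPairs.partitionBy(new UsridPartitioner(4))
custom.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}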