Spark partitionBy
/**
* An object that defines how the elements in a key-value pair RDD are partitioned by key.
* Maps each key to a partition ID, from 0 to `numPartitions - 1`.
*/
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Print the elements held by each partition of the RDD
object PartitionBy_Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName(this.getClass.getSimpleName).getOrCreate()
    // Example key-value pairs; any small pairs and partition count will do here.
    val rdd = spark.sparkContext.parallelize(
      Array(("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("e", 1)), 4)
    val result = rdd.mapPartitionsWithIndex {
      (partIdx, iter) => {
        val part_map = scala.collection.mutable.Map[String, List[(String, Int)]]()
        while (iter.hasNext) {
          val part_name = "part_" + partIdx
          val elem = iter.next()
          if (part_map.contains(part_name)) {
            var elems = part_map(part_name)
            elems ::= elem
            part_map(part_name) = elems
          } else {
            part_map(part_name) = List[(String, Int)](elem)
          }
        }
        part_map.iterator
      }
    }.collect
    result.foreach(x => println(x._1 + ":" + x._2.toString()))
  }
}
The partitioning method here is configurable; the default is hash partitioning with HashPartitioner.
Note that if the RDD will be reused several times or joined with other RDDs, persist it after partitioning so the shuffle is not repeated.
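As a rough sketch of that advice (the partition count and storage level below are illustrative choices, not values from the original listing), repartitioning the pair RDD by key and caching the result looks like this:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Hash-partition the pair RDD by key, then persist it so that later
// joins or repeated actions reuse the shuffled layout instead of re-shuffling.
val partitioned = rdd
  .partitionBy(new HashPartitioner(4))      // 4 partitions is an arbitrary example value
  .persist(StorageLevel.MEMORY_AND_DISK)

partitioned.count()                         // first action materializes the cache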
/**
* A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
* Java's `Object.hashCode`.
*
* Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
* so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
* produce an unexpected or incorrect result.
*/
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
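The partition ID is simply the key's hashCode taken modulo numPartitions and shifted into the non-negative range. A minimal stand-alone sketch of that rule (this nonNegativeMod is a local re-implementation for illustration, not Spark's private Utils helper):

// Same arithmetic as hash partitioning: hashCode mod numPartitions, made non-negative.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 3
Seq("a", "b", "c", "e").foreach { key =>
  println(s"$key -> partition ${nonNegativeMod(key.hashCode, numPartitions)}")
}

Because Java arrays hash by identity, two arrays with the same contents can land in different partitions, which is exactly the caveat in the scaladoc above.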
Range partitioning with RangePartitioner: the keys are ordered, a sample size is determined, the keys are sampled without replacement, and the sampled keys are used to assign key ranges to partitions; sampling the keys this way helps avoid data skew.
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20)
  extends Partitioner {

  // A constructor declared in order to maintain backward compatibility for Java, when we add the
  // 4th constructor parameter samplePointsPerPartitionHint. See SPARK-22160.
  // This is added to make sure from a bytecode point of view, there is still a 3-arg ctor.
  def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
    this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
  }

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
  require(samplePointsPerPartitionHint > 0,
    s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      // Cast to double to avoid overflowing ints or longs
      val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}
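A quick usage sketch (the sample pairs and the partition count 3 are made up for illustration; RangePartitioner itself is the public class shown above):

import org.apache.spark.RangePartitioner

val pairs = spark.sparkContext.parallelize(
  Seq(("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4), ("fig", 5)))

// Build range bounds from a sample of the keys and repartition with them;
// keys end up in sorted, roughly equal-sized ranges. sortByKey uses the same mechanism.
val ranged = pairs.partitionBy(new RangePartitioner(3, pairs))

ranged.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}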
Custom partitioner: partition by your own business logic to mitigate data skew.
To implement a custom partitioner, extend org.apache.spark.Partitioner and implement the methods below (overriding equals and hashCode as well is good practice, so Spark can tell whether two RDDs are partitioned the same way):
- numPartitions: Int: returns the number of partitions to create.
- getPartition(key: Any): Int: returns the partition ID (0 to numPartitions - 1) for the given key.
// Custom partitioner: extend the Partitioner class
class UsridPartitioner(numParts: Int) extends Partitioner {
  // Override the number of partitions
  override def numPartitions: Int = numParts

  // Override the partition-ID lookup
  override def getPartition(key: Any): Int = {
    if (key.toString == "A")
      0                               // pin the skewed key "A" to a fixed partition
    else
      key.toString.toInt % numParts   // spread the remaining numeric user ids by id
  }
}
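A hypothetical run of the custom partitioner (the user-id pairs and the partition count 4 are invented for the example):

val userPairs = spark.sparkContext.parallelize(
  Seq(("A", 100), ("A", 101), ("1", 1), ("2", 2), ("37", 3)))

// Repartition with the custom partitioner defined above and print each partition,
// so the effect on the skewed key "A" is visible.
val custom = userPairs.partitionBy(new UsridPartitioner(4))
custom.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}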