spark中产生shuffle的算子

Spark中产生shuffle的算子

作用	算子名	能否替换，由谁替换
去重	distinct()	不能
聚合	reduceByKey()	groupByKey
	groupBy()
	groupByKey()	reduceByKey
	aggregateByKey()
	combineByKey()
排序	sortByKey()
排序	sortBy()
重分区	coalesce()
重分区	repartition()
集合或者表操作	Intersection()
	Substract()
	SubstractByKey()
	Join()
	LeftOutJoin()

https://www.cnblogs.com/Alex-zqzy/p/9949117.html

去重

def distinct()

def distinct(numPartitions: Int)

聚合

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

def groupBy[K](f: T => K, p: Partitioner):RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner):RDD[(K, Iterable[V])]

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

排序

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

重分区

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)

集合或者表操作

def intersection(other: RDD[T]): RDD[T]

def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def intersection(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

spark中产生shuffle的算子的更多相关文章

Spark中的各种action算子操作（java版）
在我看来,Spark编程中的action算子的作用就像一个触发器,用来触发之前的transformation算子.transformation操作具有懒加载的特性,你定义完操作之后并不会立即加载,只有 ...
Spark会产生shuffle的算子
去重 def distinct() def distinct(numPartitions: Int) 聚合 def reduceByKey(func: (V, V) => V, numParti ...
spark中map和mapPartitions算子的区别
区别: 1.map是对rdd中每一个元素进行操作 2.mapPartitions是对rdd中每个partition的迭代器进行操作 mapPartitions优点: 1.若是普通map,比如一个par ...
[Spark性能调优] 第三章 : Spark 2.1.0 中 Sort-Based Shuffle 产生的内幕
本課主題 Sorted-Based Shuffle 的诞生和介绍 Shuffle 中六大令人费解的问题 Sorted-Based Shuffle 的排序和源码鉴赏 Shuffle 在运行时的内存管理 ...
Spark 2.x 中 Sort-Based Shuffle 产生的内幕
本课主题 Sorted-Based Shuffle 的诞生和介绍 Shuffle 中六大令人费解的问题 Sorted-Based Shuffle 的排序和源码鉴赏 Shuffle 在运行时的内存管理 ...
Spark中shuffle的触发和调度
Spark中的shuffle是在干嘛? Shuffle在Spark中即是把父RDD中的KV对按照Key重新分区,从而得到一个新的RDD.也就是说原本同属于父RDD同一个分区的数据需要进入到子RDD的不 ...
spark性能调优（二）彻底解密spark的Hash Shuffle
装载:http://www.cnblogs.com/jcchoiling/p/6431969.html 引言 Spark HashShuffle 是它以前的版本,现在1.6x 版本默应是 Sort-B ...
spark中数据倾斜解决方案
数据倾斜导致的致命后果: 1 数据倾斜直接会导致一种情况:OOM. 2 运行速度慢,特别慢,非常慢,极端的慢,不可接受的慢. 搞定数据倾斜需要: 1.搞定shuffle 2.搞定业务场景 3 搞定 c ...
spark教程(13)-shuffle介绍
shuffle 简介 shuffle 描述了数据从 map task 输出到 reduce task 输入的过程,shuffle 是连接 map 和 reduce 的桥梁: shuffle 性能的高低 ...

随机推荐

ubuntu16.04 + CUDA 9.0 + opencv3.3 安装
安装前的准备 CUDA 9.0 安装,可以参看Ubuntu16.04 + cuda9.0 + cudnn7.1.4 + tensorflow安装 opencv 3.3.0 下载 ippicv_2017 ...
POJ2274(后缀数组应用)
Long Long Message Time Limit: 4000MS Memory Limit: 131072K Total Submissions: 25272 Accepted: 10 ...
杂项：UI
ylbtech-杂项:UI 1.返回顶部 1. UI即User Interface(用户界面)的简称.泛指用户的操作界面,包含移动APP,网页,智能穿戴设备等.UI设计主要指界面的样式,美观程度.而使 ...
py-day17-jquery
jquery实现案例案例: 1.点赞 <!DOCTYPE html> <html lang="en"> <head> <meta cha ...
二、Chrome开发者工具详解(2)-Network面板
摘自: http://www.cnblogs.com/charliechu/p/5981346.html
Kmeans算法--python实现
一:Kmeans算法基本思想: k-means算法是一种很常见的聚类算法,它的基本思想是:通过迭代寻找k个聚类的一种划分方案,使得用这k个聚类的均值来代表相应各类样本时所得的总体误差最小. k-mea ...
纯java config配置Spring MVC实例
1.首先创建一个Maven工程,项目结构如下: pom.xml添加Spring和servlet依赖,配置如下 <project xmlns="http://maven.apache.o ...
vsftpd总结
1 vsftps配置文件详解 (1)/user/sbin/vsftpd 主程序 (2)/etc/rc.d/init.d/vsftpd 启动脚本 (3)/etc/pam.d/vsftpd (file= ...
11.ClientCredential模式总结
11.ClientCredential模式总结服务端定义的Resource叫做api Resource API默认是被保护的第三方的客户端先去请求 Server拿到access token.带着t ...
51nod 1099【贪心】
思路: 我们可以思考对于仅仅两个元素来说,A,B; 先选A的话是会 A.b+B.a; 先选B的话是会 B.b+A.a; 所以哪个小哪个就放前面; #include <cstdio> #i ...

spark中产生shuffle的算子

spark中产生shuffle的算子的更多相关文章

随机推荐

热门专题