grouped differently across partitions
【熵增】
由无序到有序
http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:
mapPartitionsto sort each partition using, for example,.sortedrepartitionAndSortWithinPartitionsto efficiently sort partitions while simultaneously repartitioningsortByto make a globally ordered RDD
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dirconfiguration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.
grouped differently across partitions的更多相关文章
- 【原创】大数据基础之Spark(5)Shuffle实现原理及代码解析
一 简介 Shuffle,简而言之,就是对数据进行重新分区,其中会涉及大量的网络io和磁盘io,为什么需要shuffle,以词频统计reduceByKey过程为例, serverA:partition ...
- Spark 学习笔记:(二)编程指引(Scala版)
参考: http://spark.apache.org/docs/latest/programming-guide.html 后面懒得翻译了,英文记的,以后复习时再翻. 摘要:每个Spark appl ...
- HDU-3280 Equal Sum Partitions
http://acm.hdu.edu.cn/showproblem.php?pid=3280 用了简单的枚举. Equal Sum Partitions Time Limit: 2000/1000 M ...
- HDU 3280 Equal Sum Partitions(二分查找)
Equal Sum Partitions Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Oth ...
- Partitioning & Archiving tables in SQL Server (Part 2: Split, Merge and Switch partitions)
Reference: http://blogs.msdn.com/b/felixmar/archive/2011/08/29/partitioning-amp-archiving-tables-in- ...
- Rotate partitions in DB2 on z
Rotating partitions You can use the ALTER TABLE statement to rotate any logical partition to becom ...
- How to choose the number of topics/partitions in a Kafka cluster?
This is a common question asked by many Kafka users. The goal of this post is to explain a few impor ...
- rabbitmq之partitions
集群为了保证数据一致性,在同步数据的同时也会通过节点之间的心跳通信来保证对方存活.那如果集群节点通信异常会发生什么,系统如何保障正常提供服务,使用何种策略回复呢? rabbitmq提供的处理脑裂的方法 ...
- 8 ways rich people view the world differently than the average person
Self-made millionaire Steve Siebold spent 26 years interviewing some of the wealthiest people in the ...
随机推荐
- 转 使用putty从linux主机上面往windows主机下面拷贝文件
更新一下,把putty的包解压以后,想要在dos窗口中直接使用,必须把putty解压的文件的路径添加到环境变量中,这样使用起来就会非常简单了. 郁闷了好久,终于搞定了putty的上传下载文件命令psc ...
- Echarts-之显示百分比
对于使用echarts要显示百分比,要改两个地方,第一个地方时坐标轴显示为百分比的格式,第二个是让值以百分比的形式显示,如50,在图上面显示为50%. yAxis: [ { type: 'value' ...
- 洛谷—— P1875 佳佳的魔法药水
https://www.luogu.org/problemnew/show/1875 题目背景 发完了 k 张照片,佳佳却得到了一个坏消息:他的 MM 得病了!佳佳和大家一样焦急 万分!治好 MM 的 ...
- 搭建高可用服务注册中心-Spring Cloud学习第一天(非原创)
文章大纲 一.Spring Cloud基础知识介绍二.创建单一的服务注册中心三.创建一个服务提供者四.搭建高可用服务注册中心五.项目源码与参考资料下载六.参考文章 一.Spring Cloud基础 ...
- JS中原型链中的prototype与_proto_的个人理解与详细总结(**************************************************************)
一直认为原型链太过复杂,尤其看过某图后被绕晕了一整子,今天清理硬盘空间(渣电脑),偶然又看到这图,勾起了点回忆,于是索性复习一下原型链相关的内容,表达能力欠缺逻辑混乱别见怪(为了防止新人__(此处指我 ...
- spring管理事务
2.1 事务管理器 Spring并不直接管理事务,而是提供了多种事务管理器,他们将事务管理的职责委托给Hibernate或者JTA等持久化机制所提供的相关平台框架的事务来实现. Spring事务管理器 ...
- tensorflow搭建神经网络基本流程
定义添加神经层的函数 1.训练的数据2.定义节点准备接收数据3.定义神经层:隐藏层和预测层4.定义 loss 表达式5.选择 optimizer 使 loss 达到最小 然后对所有变量进行初始化,通过 ...
- standford情感分析代码开源地址
http://nlp.stanford.edu/sentiment/code.html
- centos 升级内核失败回救
在升级 centos6.3上使用, yum -y update ... 灾难出现了!!! 解决方法: 1. 在机器启动的时候, 按F1, 会出现选择内核,选一个原来的. 2. vim /etc/gr ...
- Spring核心(ioc控制反转)
IoC,Inversion Of Control 即控制反转,由容器来管理业务对象之间的依赖关系,而非传统方式中的由代码来管理. 其本质.即将控制权由应用程序代码转到了外部容器,控制权的转移就是 ...