reduce & fold in Spark
fold and reduce both aggregate over a collection by applying an operation you specify. The main difference is the starting point of the aggregation: for fold(), you have to specify the starting value, and for reduce() the starting value is the first (or possibly an arbitrary) element in the collection.
A simple example: we can sum the numbers in a collection using either function (1 until 10 is the integers 1 through 9, so both return 45):
(1 until 10).reduce( (a,b) => a+b )
(1 until 10).fold(0)( (a,b) => a+b )
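One practical consequence of the different starting points shows up on an empty collection. The following is a minimal sketch using plain Scala collections (not Spark); the object name EmptyCollectionDemo and its fields are made up for illustration.

```scala
import scala.util.Try

object EmptyCollectionDemo {
  val empty: Seq[Int] = Seq.empty

  // fold has a starting value, so an empty collection simply yields it.
  val foldResult: Int = empty.fold(0)((a, b) => a + b)

  // reduce has no starting value; on an empty collection it throws
  // an UnsupportedOperationException, captured here as a Failure.
  val reduceResult: Try[Int] = Try(empty.reduce((a, b) => a + b))

  def main(args: Array[String]): Unit = {
    println(foldResult)             // 0
    println(reduceResult.isFailure) // true
  }
}
```

So if the collection might be empty, fold with a sensible starting value is the safer choice.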
With fold, we want to start at 0 and cumulatively add each element. In this case, the operations passed to fold() and reduce() are identical, but it is helpful to think about fold in the following way: for the operation we pass to fold(), imagine its two arguments are (i) the current accumulated value and (ii) the next value in the collection:
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + next_value )
So the result of the operation, accumulated_so_far + next_value, will be passed to the operation again as the first argument, and so on.
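To watch the accumulated value being fed back in, one option is scanLeft, which keeps every intermediate result of a left-to-right fold. This is only a sketch on a local Scala collection; the object name FoldTrace and the field runningTotals are made up for illustration.

```scala
object FoldTrace {
  val numbers = 1 until 10 // the integers 1 through 9

  // scanLeft applies the same operation as a left-to-right fold, but keeps
  // every intermediate accumulated value, beginning with the starting value.
  val runningTotals: Seq[Int] =
    numbers.scanLeft(0)((accumulated_so_far, next_value) => accumulated_so_far + next_value)

  def main(args: Array[String]): Unit =
    println(runningTotals.mkString(", ")) // 0, 1, 3, 6, 10, 15, 21, 28, 36, 45
}
```

The last entry, 45, is what fold itself returns; each earlier entry is the first argument passed to the next call of the operation.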
In the same way, we could count the number of elements in a collection using fold, by ignoring the next value and adding 1 to the accumulated value each time:
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + 1 )
When it comes to Spark, here’s another thing to keep in mind. For both reduce and fold, you need to make sure your operation is both commutative and associative. For RDDs, reduce and fold are applied to each partition separately, and then the per-partition results are combined using the same operation. With fold, this could get you into trouble: the starting value is used once for every partition (so an empty partition simply emits it) and once more when the partition results are combined, so the number of partitions might erroneously affect the result of the calculation if you’re not careful about the operation and the starting value. This would occur with the ( (a,b) => a+1 ) operation from above: when the partition results are combined, each one just adds 1, so the answer reflects the number of partitions rather than the number of elements (see http://stackoverflow.com/questions/29150202/pyspark-fold-method-output).
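The partition effect can be imitated without a cluster. The sketch below uses plain Scala collections and a hypothetical helper, partitionedFold, that folds each "partition" separately and then merges the results with the same operation. It is only a simulation of the behaviour described above, not Spark's actual implementation, and the fixed left-to-right merge order is an assumption made to keep it simple.

```scala
object PartitionedFoldSketch {
  // Hypothetical helper (not a Spark API): folds each partition on its own,
  // then folds the per-partition results, reusing the same zero and op.
  def partitionedFold[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A = {
    val perPartition = partitions.map(_.foldLeft(zero)(op))
    perPartition.foldLeft(zero)(op)
  }

  val data: Seq[Int] = 1 until 10 // the integers 1 through 9

  // The same nine elements, split two different ways.
  val threePartitions: Seq[Seq[Int]] = data.grouped(3).toList // sizes 3, 3, 3
  val twoPartitions: Seq[Seq[Int]] = data.grouped(5).toList   // sizes 5, 4

  val sum: (Int, Int) => Int = (a, b) => a + b
  val countLike: (Int, Int) => Int = (a, b) => a + 1

  def main(args: Array[String]): Unit = {
    // Addition is associative and commutative, and 0 is neutral for it.
    println(partitionedFold(threePartitions, 0)(sum)) // 45
    println(partitionedFold(twoPartitions, 0)(sum))   // 45

    // The counting operation is not: the merge step adds 1 per partition.
    println(partitionedFold(threePartitions, 0)(countLike)) // 3
    println(partitionedFold(twoPartitions, 0)(countLike))   // 2
  }
}
```

In this simulation the sum is the same for both splits, while the counting operation returns the number of partitions instead of 9. To count elements on an RDD, count() is the direct choice, or each element can be mapped to 1 and the results summed.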