reduce & fold in Spark
fold and reduce both aggregate over a collection by applying an operation you specify; the major difference is the starting point of the aggregation. For fold(), you have to specify the starting value, and for reduce() the starting value is the first (or possibly an arbitrary) element in the collection.
Simple examples - we can sum the numbers in a collection using both functions:
(1 until 10).reduce( (a,b) => a+b )
(1 until 10).fold(0)( (a,b) => a+b )
With fold, we want to start at 0 and cumulatively add each element. In this case, the operations passed to fold() and reduce() were identical, but it is helpful to think about fold in the following way. For the operation we pass to fold(), imagine its two arguments are (i) the current accumulated value and (ii) the next value in the collection,
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + next_value ).
So the result of the operation, accumulated_so_far + next_value, will be passed to the operation again as the first argument, and so on.
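To make the hand-off of the accumulated value visible, here is a small sketch on a plain Scala collection (no Spark involved). It uses scanLeft, which applies the operation the same way as a left-to-right fold but keeps every intermediate accumulator. The object name FoldTrace and its fields are made up for this illustration.

```scala
object FoldTrace {
  val numbers = 1 until 10 // the values 1 to 9

  // scanLeft records each accumulated value, beginning with the starting value 0.
  val steps: Seq[Int] =
    numbers.scanLeft(0)((accumulatedSoFar, nextValue) => accumulatedSoFar + nextValue)

  // fold returns only the final accumulated value.
  val total: Int = numbers.fold(0)(_ + _)
}
```

Here FoldTrace.steps holds 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, and its last entry equals FoldTrace.total.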
In this way, we could count the number of elements in a collection using fold,
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + 1 ).
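As a rough sketch of the counting idea on ordinary Scala collections (not RDDs): fold requires the accumulator and the elements to share a type, so counting this way only type-checks neatly for a collection of Int. For other element types, foldLeft lets the accumulator have its own type. The names FoldCount and words are hypothetical, chosen only for this example.

```scala
object FoldCount {
  val numbers = 1 until 10 // the values 1 to 9

  // Ignore the next value and add 1 to the accumulated count.
  val countWithFold: Int =
    numbers.fold(0)((accumulatedSoFar, _) => accumulatedSoFar + 1)

  // With foldLeft the accumulator (Int) may differ from the element type (String).
  val words = Seq("reduce", "fold", "aggregate")
  val wordCount: Int =
    words.foldLeft(0)((accumulatedSoFar, _) => accumulatedSoFar + 1)
}
```

On a local, sequential collection both counts agree with size: 9 for the numbers and 3 for the words.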
When it comes to Spark, here’s another thing to keep in mind. For both reduce and fold, you need to make sure your operation is both commutative and associative. For RDDs, reduce and fold are applied to each partition separately, and then the per-partition results are combined using the same operation. With fold, this can get you into trouble: every partition, including an empty one, starts from fold’s starting value, so if that value is not neutral for the operation, or the operation is not commutative and associative, the number of partitions can change the result. This is what happens with the ( (a,b) => a+1 ) operation from above: when the partition results are combined, the operation discards each partition’s count and simply adds 1, so the answer depends on the number of partitions rather than the number of elements (see http://stackoverflow.com/questions/29150202/pyspark-fold-method-output).
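The partition effect can be reproduced without a cluster. The following is a simplified model in plain Scala, not Spark's actual code: it assumes each partition is folded from the starting value and that the per-partition results are then merged with the same operation, again beginning from the starting value. The helpers split, simulatedFold and simulatedAggregate are hypothetical names written for this sketch.

```scala
object PartitionedFold {
  // Cut `data` into exactly `numPartitions` contiguous slices; some may be empty.
  def split[T](data: Seq[T], numPartitions: Int): Seq[Seq[T]] =
    (0 until numPartitions).map { i =>
      val start = (i.toLong * data.length / numPartitions).toInt
      val end = ((i + 1).toLong * data.length / numPartitions).toInt
      data.slice(start, end)
    }

  // Model of a partitioned fold: fold every slice from `zero`,
  // then merge the slice results with the same `op`, again from `zero`.
  def simulatedFold[T](data: Seq[T], numPartitions: Int, zero: T)(op: (T, T) => T): T =
    split(data, numPartitions).map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  // Model of an aggregate: one function inside a slice, another to merge slices.
  def simulatedAggregate[T, U](data: Seq[T], numPartitions: Int, zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    split(data, numPartitions).map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  val data: Seq[Int] = 1 until 10 // the values 1 to 9

  // Addition with a neutral starting value: independent of the partitioning.
  val sumOnePartition = simulatedFold(data, 1, 0)(_ + _)
  val sumFourPartitions = simulatedFold(data, 4, 0)(_ + _)

  // A starting value that is not neutral is added once per slice and once more in the merge.
  val sumFromOneFourPartitions = simulatedFold(data, 4, 1)(_ + _)

  // The counting operation: the merge step throws away each slice's count and adds 1.
  val countOnePartition = simulatedFold(data, 1, 0)((acc, _) => acc + 1)
  val countFourPartitions = simulatedFold(data, 4, 0)((acc, _) => acc + 1)

  // Counting inside slices and summing across slices gives the element count.
  val countViaAggregate = simulatedAggregate(data, 4, 0)((acc, _) => acc + 1, _ + _)
}
```

In this model the sums are 45 for either partitioning, the sum started from 1 over four slices is 50, the fold-based counts come out as 1 and 4 (the number of slices), and the aggregate-based count is 9. On a real RDD, aggregate likewise takes separate within-partition and merge functions, and count() avoids the problem altogether.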