reduce & fold in Spark

fold and reduce both aggregate a collection by repeatedly applying an operation you specify. The major difference is the starting point of the aggregation: for fold(), you have to specify the starting value, whereas for reduce() the starting value is the first (or possibly an arbitrary) element of the collection.

Simple examples: we can sum the numbers in a collection using either function:
(1 until 10).reduce( (a,b) => a+b )
(1 until 10).fold(0)( (a,b) => a+b )
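The difference in starting point shows most clearly on an empty collection. The following is a minimal sketch using plain Scala collections (not Spark); the object and value names are made up for illustration.

```scala
import scala.util.Try

object EmptyCollectionDemo {
  val empty: Seq[Int] = Seq.empty

  // fold has a starting value, so an empty collection simply yields it.
  val folded: Int = empty.fold(0)((a, b) => a + b)

  // reduce has nothing to start from and throws on an empty collection;
  // Try captures the exception instead of letting it propagate.
  val reduced: Try[Int] = Try(empty.reduce((a, b) => a + b))

  // reduceOption is the safe variant: it returns None when there are no elements.
  val reducedOpt: Option[Int] = empty.reduceOption((a, b) => a + b)
}
```

Here folded is 0, reduced is a Failure, and reducedOpt is None.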

With fold, we start at 0 and cumulatively add each element. In this case the operations passed to fold() and reduce() are identical, but it is helpful to think about fold in the following way: imagine that the two arguments of the operation we pass to fold() are (i) the value accumulated so far and (ii) the next value in the collection:

(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + next_value )

So the result of the operation, accumulated_so_far + next_value, is passed to the operation again as the first argument, together with the following element, and so on.
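One way to see this step-by-step accumulation is scanLeft, which applies the same operation from left to right but keeps every intermediate value. This is only an illustrative sketch on a local Scala collection; the object and value names are hypothetical.

```scala
object FoldTrace {
  // scanLeft keeps each accumulated value, beginning with the starting value 0.
  val steps: Seq[Int] =
    (1 until 10).scanLeft(0)((accumulatedSoFar, nextValue) => accumulatedSoFar + nextValue)

  // fold returns only the final accumulated value.
  val total: Int = (1 until 10).fold(0)((a, b) => a + b)
}
```

FoldTrace.steps is 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, and its last element equals FoldTrace.total.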

In this way, we could count the number of elements in a collection using fold:

(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + 1 )
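The counting trick works with fold here because the elements are themselves Ints: on Scala collections, fold requires the starting value and the result to share a type with (or be a supertype of) the elements. For other element types, foldLeft allows a differently typed accumulator. A small sketch with made-up names:

```scala
object CountWithFold {
  val words: Seq[String] = Seq("spark", "fold", "reduce")

  // foldLeft lets the accumulator (Int) differ from the element type (String).
  val wordCount: Int =
    words.foldLeft(0)((accumulatedSoFar, _) => accumulatedSoFar + 1)

  // The same count on the numeric range used in the text, via fold.
  val rangeCount: Int =
    (1 until 10).fold(0)((accumulatedSoFar, _) => accumulatedSoFar + 1)
}
```

Here wordCount is 3 and rangeCount is 9, matching words.size and (1 until 10).size.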

When it comes to Spark, there is another thing to keep in mind. For both reduce and fold, you need to make sure your operation is both commutative and associative. For RDDs, reduce and fold are applied to each partition separately, and then the per-partition results are combined using the same operation. With fold, this can get you into trouble: the starting value is used once for every partition (so an empty partition emits just the starting value) and again when the partition results are combined, so the number of partitions can erroneously affect the result of the calculation if you’re not careful about the operation. This happens with the ( (a,b) => a+1 ) operation from above, which ends up counting partitions rather than elements (see http://stackoverflow.com/questions/29150202/pyspark-fold-method-output).
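The partition effect can be reproduced without a cluster. The sketch below imitates the behaviour described above using plain Scala collections: each "partition" is folded from the starting value, and the partial results are then folded again. The helper partitionedFold is hypothetical and is not part of Spark's API; it only approximates what an RDD fold does.

```scala
object PartitionedFoldSketch {
  // Hypothetical helper (not a Spark API): fold every partition from `zero`,
  // then merge the per-partition results, again starting from `zero`.
  def partitionedFold[T](partitions: Seq[Seq[T]], zero: T)(op: (T, T) => T): T = {
    val perPartition = partitions.map(_.fold(zero)(op))
    perPartition.fold(zero)(op)
  }

  val data: Seq[Int] = 1 until 10

  // The same nine elements, split into 3 and into 5 partitions.
  val threeParts: Seq[Seq[Int]] = data.grouped(3).toSeq
  val fiveParts: Seq[Seq[Int]] = data.grouped(2).toSeq

  // Addition is commutative and associative: the partitioning does not matter.
  val sumThree: Int = partitionedFold(threeParts, 0)(_ + _)
  val sumFive: Int = partitionedFold(fiveParts, 0)(_ + _)

  // (a, b) => a + 1 is neither: the merge step adds 1 per partition,
  // so the result is the number of partitions, not the number of elements.
  val countThree: Int = partitionedFold(threeParts, 0)((a, _) => a + 1)
  val countFive: Int = partitionedFold(fiveParts, 0)((a, _) => a + 1)
}
```

Both sums are 45, but countThree is 3 and countFive is 5, whereas the collection actually holds 9 elements. On a real RDD, a count is safer with count() or with aggregate, which takes separate per-partition and merge operations.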
