reduce & fold in Spark
fold and reduce both aggregate over a collection by implementing an operation you specify, the major different is the starting point of the aggregation. For fold(), you have to specify the starting value, and for reduce() the starting value is the first (or possibly an arbitrary) element in the collection.
Simple examples - we can sum the numbers in a collection using both functions:
(1 until 10).reduce( (a,b) => a+b )
(1 until 10).fold(0)( (a,b) => a+b )
With fold, we want to start at 0 and cumulatively add each element. In this case, the operation passed to fold() and reduce() were very similar, but it is helpful to think about fold in the following way. For the operation we pass to fold(), imagine its two arguments are (i) the current accumulated value and (ii) the next value in the collection,
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + next_value ).
So the result of the operation, accumulated_so_far + next_value, will be passed to the operation again as the first argument, and so on.
In this way, we could count the number of elements in a collection using fold,
(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far + 1 ).
When it comes to Spark, here’s another thing to keep in mind. For both reduce and fold, you need to make sure your operation is both commutative and associative. For RDDs, reduce and fold are implemented on each partition separately, and then the results are combined using the operation. With fold, this could get you into trouble because an empty partition will emit fold’s starting value, so the number of partitions might erroneously affect the result of the calculation, if you’re not careful about the operation. This would occur with the ( (a,b) => a+1) operation from above (see http://stackoverflow.com/questions/29150202/pyspark-fold-method-output).
reduce & fold in Spark的更多相关文章
- Spark计算模型
[TOC] Spark计算模型 Spark程序模型 一个经典的示例模型 SparkContext中的textFile函数从HDFS读取日志文件,输出变量file var file = sc.textF ...
- 【原】Learning Spark (Python版) 学习笔记(一)----RDD 基本概念与命令
<Learning Spark>这本书算是Spark入门的必读书了,中文版是<Spark快速大数据分析>,不过豆瓣书评很有意思的是,英文原版评分7.4,评论都说入门而已深入不足 ...
- (转)Spark 算子系列文章
http://lxw1234.com/archives/2015/07/363.htm Spark算子:RDD基本转换操作(1)–map.flagMap.distinct Spark算子:RDD创建操 ...
- zhihu spark集群,书籍,论文
spark集群中的节点可以只处理自身独立数据库里的数据,然后汇总吗? 修改 我将spark搭建在两台机器上,其中一台既是master又是slave,另一台是slave,两台机器上均装有独立的mongo ...
- [转]Spark学习之路 (三)Spark之RDD
Spark学习之路 (三)Spark之RDD https://www.cnblogs.com/qingyunzong/p/8899715.html 目录 一.RDD的概述 1.1 什么是RDD? ...
- 【spark 深入学习 06】RDD编程之旅基础篇02-Spaek shell
--------------------- 本节内容: · Spark转换 RDD操作实例 · Spark行动 RDD操作实例 · 参考资料 --------------------- 关于学习编程方 ...
- Spark学习之路 (三)Spark之RDD
一.RDD的概述 1.1 什么是RDD? RDD(Resilient Distributed Dataset)叫做弹性分布式数据集,是Spark中最基本的数据抽象,它代表一个不可变.可分区.里面的元素 ...
- <Spark><Programming><RDDs>
Introduction to Core Spark Concepts driver program: 在集群上启动一系列的并行操作 包含应用的main函数,定义集群上的分布式数据集,操作数据集 通过 ...
- Spark(三)RDD与广播变量、累加器
一.RDD的概述 1.1 什么是RDD RDD(Resilient Distributed Dataset)叫做弹性分布式数据集,是Spark中最基本的数据抽象,它代表一个不可变.可分区.里面的元素可 ...
随机推荐
- 前端html之------>Table实现表头固定
文章来源于:https://www.cnblogs.com/dacuotecuo/p/3657779.html,请尊重原创,转载请注明出处. 说明:这里主要实现了表头的固定和上下滚动的滑动实现:时间的 ...
- Python笔记18-----函数收集参数
1.收集参数(参数前面加*): def test1(param1,*params): print(param1) print(params) 调用:test1(1,2,3,4) 结果:1 (2,3,4 ...
- 佛祖保佑,永不宕机,永无 Bug
转自:http://top.jobbole.com/17580/ 佛祖保佑,永不宕机,永无 Bug 为何服务器频遭黑客攻击?为何系统频频宕机,别人家系统却稳如泰山,坚如磐石?为何运维人员和系统管理员行 ...
- [tyvj-1194]划分大理石 二进制优化多重背包
突然发现这个自己还不会... 其实也不难,就和快速幂感觉很像,把物品数量二进制拆分一下,01背包即可 我是咸鱼 #include <cstdio> #include <cstring ...
- Nginx的反向代理的配置
1.linux下的方向代理(前提域名和P已经映射好了的) 在linux中的输入命令:whereis nginx 查看当前nginx的安装目录 显示 nginx: /usr/local/nginx 命令 ...
- 搞定PHP面试 - 函数知识点整理
一.函数的定义 1. 函数的命名规则 函数名可以包含字母.数字.下划线,不能以数字开头. function Func_1(){ } //合法 function func1(){ } //合法 func ...
- vue 如何动态切换组件,使用is进行切换
日常项目中需要动态去切换组件进行页面展示. 例如:登陆用户是“管理员”或者“普通用户”,需要根据登陆的用户角色切换页面展示的内容.则需要使用 :is 属性进行绑定切换 <template> ...
- 启动 Appium 自带模拟器
1.先在sclipse中新建并打开一个设备 2.启动appium 3.安装apk 打开cmd 并在sdk安装目录的tools文件夹下输入安装命令adb install xxx.apk(在这之前需要把 ...
- ISAM Indexed Sequential Access Method 索引顺序存取方法
ISAM Indexed Sequential Access Method 索引顺序存取方法 学习了:https://baike.baidu.com/item/ISAM/3013855 是IBM发展起 ...
- mybatis批量插入、批量删除
mybatis 批量插入 int addBatch(@Param("list")List<CustInfo> list); <insert id="ad ...