Spark aggregate

The official API documentation for this function does not explain it very clearly:

aggregate(zeroValue, seqOp, combOp)

Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.”

The functions op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into an U and one operation for merging two U
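
The last two paragraphs of the documentation can be made concrete with a minimal pure-Python sketch that needs no Spark installation. Here the element type T is int while the accumulator type U is a set, and both functions mutate and return their first argument. The names `seq_op`, `comb_op`, and `partitions` are illustrative only and not part of the Spark API; the hand-made partitioning and the `deepcopy` of the zero value are assumptions that stand in for each task working on its own copy of the zero value.

```python
from copy import deepcopy
from functools import reduce

def seq_op(acc, x):
    # Merge one element (T = int) into the accumulator (U = set).
    # Mutating and returning the first argument avoids a new allocation.
    acc.add(x)
    return acc

def comb_op(acc1, acc2):
    # Merge two accumulators; only the first argument is modified.
    acc1 |= acc2
    return acc1

# Hypothetical partitioning chosen by hand for illustration.
partitions = [[1, 2, 2], [3, 1]]
zero = set()

# Each partition starts from its own copy of the zero value.
per_partition = [reduce(seq_op, part, deepcopy(zero)) for part in partitions]
distinct = reduce(comb_op, per_partition, deepcopy(zero))
print(distinct)
```

The result is the set of distinct elements, a different type from the integers stored in the input, and the original `zero` object is left untouched because every fold starts from a copy.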

>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))

>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))

>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)

(10, 4)

>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)

(0, 0)

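The execution flow of this code is traced below.

Assume [1,2,3,4] is split into two partitions: partition 1 ([1,2]) and partition 2 ([3,4]).

First, seqOp is applied to partition 1:

x=(0,0) y=1 -----> (1,1) # on the first seqOp call within a partition, x is the zero value

x=(1,1) y=2 -----> (3,2) # on the second and later seqOp calls, x is the result of the previous seqOp call

Partition 2 is processed the same way: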

x=(0,0) y=3 -----> (3,1)

x=(3,1) y=4 -----> (7,2)
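
The per-partition steps above can be reproduced in plain Python, without Spark, by letting `itertools.accumulate` record every intermediate value of x. This is only a sketch: the two lists are the hypothetical partitions assumed in this walkthrough, and the `initial` keyword requires Python 3.8 or later.

```python
from itertools import accumulate

seqOp = lambda x, y: (x[0] + y, x[1] + 1)

# Hypothetical partitions, matching the assumption in the walkthrough.
partition1, partition2 = [1, 2], [3, 4]

# accumulate(..., initial=...) yields the zero value followed by every intermediate x.
trace1 = list(accumulate(partition1, seqOp, initial=(0, 0)))
trace2 = list(accumulate(partition2, seqOp, initial=(0, 0)))
print(trace1)  # [(0, 0), (1, 1), (3, 2)]
print(trace2)  # [(0, 0), (3, 1), (7, 2)]
```

The last entry of each trace is that partition's seqOp result.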

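Then combOp merges the seqOp results of the two partitions:

x=(0,0) y=(3,2) ------> (3,2) # when the first partition's result is merged, x is the zero value

x=(3,2) y=(7,2) ------> (10,4) # for the second and later partitions, x is the result of the previous combOp call

As this shows, the example effectively computes (rdd.sum(), rdd.count()).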
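
The whole flow traced in this post can be sketched as a small pure-Python simulation. `simulate_aggregate` is a hypothetical helper written for this illustration, not a Spark function, and the partitioning is done by hand; it only mirrors the two phases described here (fold each partition with seqOp, then fold the partial results with combOp, both starting from the zero value).

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Mimic the two phases of the walkthrough on plain Python lists."""
    # Phase 1: fold each partition with seq_op, starting from the zero value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Phase 2: fold the per-partition results with comb_op, again from the zero value.
    return reduce(comb_op, partials, zero_value)

seqOp = lambda x, y: (x[0] + y, x[1] + 1)
combOp = lambda x, y: (x[0] + y[0], x[1] + y[1])

data = [1, 2, 3, 4]
partitions = [data[:2], data[2:]]

result = simulate_aggregate(partitions, (0, 0), seqOp, combOp)
print(result)  # (10, 4)
print(result == (sum(data), len(data)))  # True

# A non-neutral zero value is counted once per partition plus once in the merge.
skewed = simulate_aggregate(partitions, (1, 0), seqOp, combOp)
print(skewed)  # (13, 4)
```

In this simulation the zero value enters the computation once per partition and once more in the merge step, which is why it should be neutral for both functions: with (1,0) and two partitions the sum comes out 3 too high.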
