Spark aggregation operator: aggregateByKey

Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

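In short: aggregateByKey aggregates the values of each key using the given combine functions and a neutral "zero value", and its result type U may differ from the value type V of this RDD.

The first function (seqOp) merges values within a partition, and the second (combOp) merges results between partitions. To avoid memory allocation, both functions may modify and return their first argument instead of creating a new U.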
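The two-stage behaviour described above can be sketched without Spark, using plain Scala collections. This is only an illustration of the semantics, not how Spark implements it: `aggregateByKeyLocal` is a hypothetical helper, and each inner `Seq` stands in for one partition. Here V is `Int` and U is a `(sum, count)` pair, so the result type differs from the value type.

```scala
object AggregateByKeySketch {
  // Hypothetical helper (not a Spark API): simulates aggregateByKey
  // over in-memory "partitions", each represented by an inner Seq.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Stage 1: inside each partition, fold every value into the
    // accumulator of its key, starting from the zero value.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Stage 2: merge the per-partition accumulators of the same key.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3), // partition 0
      Seq("a" -> 4, "b" -> 5)            // partition 1
    )
    val result = aggregateByKeyLocal(partitions, (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: merge a V into a U
      (x, y) => (x._1 + y._1, x._2 + y._2))  // combOp: merge two U's
    println(result("a")) // (8,3)
    println(result("b")) // (7,2)
  }
}
```

Note that the zero value is used once per key per partition, which is why it should be neutral with respect to combOp.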

  def aggregateByKey[U: ClassTag](zeroValue: U)(
        seqOp: (U, V) => U,
        combOp: (U, U) => U
        ): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }

  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(
        seqOp: (U, V) => U,
        combOp: (U, U) => U
        ): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }
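Why the zero value is serialized and then deserialized for each key: seqOp is allowed to mutate its first argument, so sharing one zero object across keys would let one key's values leak into another's accumulator. The sketch below shows the same idea in isolation. It is an assumption-laden illustration: it uses plain Java serialization rather than Spark's serializer, and `zeroFactory` is a hypothetical helper name.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

object ZeroValueCloneSketch {
  // Hypothetical helper: serialize `zero` once, then hand out a
  // freshly deserialized copy on every call.
  def zeroFactory[U <: java.io.Serializable](zero: U): () => U = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(zero)
    oos.close()
    val zeroArray = bos.toByteArray
    () => {
      val ois = new ObjectInputStream(new ByteArrayInputStream(zeroArray))
      try ois.readObject().asInstanceOf[U] finally ois.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val createZero = zeroFactory(new JArrayList[String]())
    val forKeyA = createZero()
    val forKeyB = createZero()
    forKeyA.add("a1")        // mutate key A's accumulator in place
    println(forKeyA.size())  // 1
    println(forKeyB.size())  // 0: key B received its own copy
  }
}
```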

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  ...
}

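import org.apache.spark.{SparkConf, SparkContext}

/**
  * Demo: aggregate by key
  */
object AggregateByKeyDemo {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("wcDemo")
        conf.setMaster("local[4]")
        val sc = new SparkContext(conf)
        // Read the input file with a minimum of 3 partitions
        val rdd1 = sc.textFile("file:///e:/wc/1.txt", 3)
        // Tag every word with the index of its partition: (word, word_idx)
        val rdd2 = rdd1.flatMap(_.split(" ")).mapPartitionsWithIndex((idx, it) => {
            var list: List[(String, String)] = Nil
            for (e <- it) {
                list = (e, e + "_" + idx) :: list
            }
            list.iterator
        })
        rdd2.collect().foreach(println)
        println("=======================")
        // Zero value: the starting accumulator for each key within a partition
        val zeroU: String = "[]"
        // seqOp: append a value to the accumulator inside a partition
        def seqOp(a: String, b: String) = {
            a + b + " ,"
        }
        // combOp: join the results of different partitions
        def comOp(a: String, b: String) = {
            a + "$" + b
        }
        val rdd3 = rdd2.aggregateByKey(zeroU)(seqOp, comOp)
        rdd3.collect().foreach(println)
    }
}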

How the records for the key hello are aggregated. Within each partition, seqOp starts from the zero value "[]" and appends each value followed by " ,":

(hello,hello_0)		=> []hello_0 ,hello_0 ,hello_0 ,hello_0 ,
(hello,hello_0)
(hello,hello_0)
(hello,hello_0)
(hello,hello_1)		=> []hello_1 ,hello_1 ,hello_1 ,
(hello,hello_1)
(hello,hello_1)
(hello,hello_2)		=> []hello_2 ,hello_2 ,hello_2 ,
(hello,hello_2)
(hello,hello_2)

combOp then joins the per-partition results with "$", which gives the record printed from rdd3:

(hello,[]hello_0 ,hello_0 ,hello_0 ,hello_0 ,$[]hello_1 ,hello_1 ,hello_1 ,$[]hello_2 ,hello_2 ,hello_2 ,)

Other records printed from rdd2:

(tom2,tom2_0)
(world,world_0)
(tom1,tom1_0)
(world,world_0)
(tom7,tom7_1)
(world,world_1)
(tom6,tom6_1)
(world,world_1)
(tom5,tom5_1)
(world,world_1)
(tom10,tom10_2)
(world,world_2)
(tom9,tom9_2)
(world,world_2)
(tom8,tom8_2)
(world,world_2)

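Spark PairRDDFunctions aggregation functions
------------------------------
1. reduceByKey
The value type V does not change; map-side combine is performed.
2. groupByKey
Groups by key; the resulting value is a collection, and no map-side combine is possible.
3. aggregateByKey
Can change the value type, and map-side combine is still performed.
4. combineByKeyWithClassTag
Combines by key; lets you specify whether to do map-side combine, plus an arbitrary combiner-creation function, value-merge function, and combiner-merge function.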
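
The four functions listed above can be compared side by side on one small RDD. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name, method names, and sample data are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairRddAggregationSketch {
  // 1. reduceByKey: the value type stays Int.
  def sumByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  // 2. groupByKey: the value becomes a collection of all values of the key.
  def groupValues(rdd: RDD[(String, Int)]): Map[String, List[Int]] =
    rdd.groupByKey().mapValues(_.toList.sorted).collect().toMap

  // 3. aggregateByKey: the result type (sum, count) differs from V = Int.
  def sumAndCount(rdd: RDD[(String, Int)]): Map[String, (Int, Int)] =
    rdd.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    ).collect().toMap

  // 4. combineByKeyWithClassTag: all three functions are spelled out and
  //    map-side combine is switched off explicitly.
  def sumAndCountNoMapSide(rdd: RDD[(String, Int)]): Map[String, (Int, Int)] =
    rdd.combineByKeyWithClassTag[(Int, Int)](
      (v: Int) => (v, 1),
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
      (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2),
      new HashPartitioner(2),
      mapSideCombine = false
    ).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pairRddSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "b" -> 4, "a" -> 5), 2)
      println(sumByKey(rdd))
      println(groupValues(rdd))
      println(sumAndCount(rdd))
      println(sumAndCountNoMapSide(rdd))
    } finally {
      sc.stop()
    }
  }
}
```

With or without map-side combine the final result per key is the same; the setting only changes how much data is shuffled.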
