Spark UDAF

Thanks to my colleague Li Zhen for explaining UDAFs to me.

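Most of what I found online is code with no explanation. The official API docs do explain things, but I found them hard to follow. So I am writing down my own notes, which may help others too.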

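The rest of this post uses a geometric mean as an example to show how to implement your own UDAF (user-defined aggregate function).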
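For reference, the geometric mean of n positive values is the n-th root of their product, which is why the buffer defined further down only needs two fields, a row count and a running product:

```latex
\[
\mathrm{GM}(x_1, \dots, x_n) = \Bigl( \prod_{i=1}^{n} x_i \Bigr)^{1/n},
\qquad \text{e.g. } \mathrm{GM}(2, 8) = (2 \cdot 8)^{1/2} = 4 .
\]
```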

First, import these packages:

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

Then extend this abstract class and implement its methods:

class GeometricMean extends UserDefinedAggregateFunction {

  // The input fields of your aggregate function, i.e. the types of the input
  // columns. There can be more than one column, for example:
  //   StructType(StructField("slot", IntegerType) :: StructField("score", IntegerType) :: Nil)
  override def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  // The internal fields you keep while computing the aggregate: the
  // intermediate buffer that stores the partial result (row count and product).
  override def bufferSchema: StructType = StructType(
    StructField("count", LongType) ::
    StructField("product", DoubleType) :: Nil
  )

  // The output type of your aggregation function. The geometric mean
  // returns a single Double value.
  override def dataType: DataType = DoubleType


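  // Whether the function is deterministic, i.e. whether the same input always
  // produces the same output. That holds for the geometric mean, so return true.
  override def deterministic: Boolean = true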

  // The initial value of the buffer: zero rows seen so far and a product of 1.0
  // (the neutral element of multiplication).
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }


  // How to update the buffer when a new input row arrives: increment the
  // count and multiply the running product by the new value.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) + 1
    buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
  }


  // Spark splits the data into multiple partitions, aggregates each partition
  // separately, and then merges the partial results. This is how two buffers
  // with the bufferSchema type are merged.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
    buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
  }


  // Return the final result, computed from the final buffer: the n-th root
  // of the product, where n is the row count.
  override def evaluate(buffer: Row): Any = {
    math.pow(buffer.getDouble(1), 1.0 / buffer.getLong(0))
  }

}
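To make the roles of initialize, update, merge and evaluate concrete, here is a minimal Spark-free sketch that walks through the same lifecycle on two hand-made partitions. The names Buf, LifecycleSketch and LifecycleDemo are hypothetical and exist only for illustration; in Spark the buffer is a MutableAggregationBuffer and Spark itself decides how rows are split into partitions.

```scala
// Plain-Scala sketch of the UDAF lifecycle; no Spark required.
// Buf, LifecycleSketch and LifecycleDemo are illustrative names, not Spark APIs.
final case class Buf(count: Long, product: Double)

object LifecycleSketch {
  // initialize: zero rows seen, product starts at the neutral element 1.0
  val initial: Buf = Buf(0L, 1.0)

  // update: fold one input value into a partition's buffer
  def update(b: Buf, value: Double): Buf =
    Buf(b.count + 1, b.product * value)

  // merge: combine the partial buffers of two partitions
  def merge(a: Buf, b: Buf): Buf =
    Buf(a.count + b.count, a.product * b.product)

  // evaluate: n-th root of the product, n being the total row count
  def evaluate(b: Buf): Double =
    math.pow(b.product, 1.0 / b.count)

  // Aggregate each partition on its own, then merge the partial results.
  def geometricMean(partitions: Seq[Seq[Double]]): Double = {
    val partials = partitions.map(_.foldLeft(initial)(update))
    evaluate(partials.foldLeft(initial)(merge))
  }
}

object LifecycleDemo extends App {
  // Two "partitions" holding the values 1, 2, 4 and 8.
  val partitions = Seq(Seq(1.0, 2.0), Seq(4.0, 8.0))
  println(LifecycleSketch.geometricMean(partitions))
}
```

The product of the four values is 64, so the printed value is the fourth root of 64, roughly 2.83. One caveat of keeping a raw product: over many rows it can overflow or underflow a Double, and summing logarithms instead is a common way around that.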

Usage:

Register the function first:

sqlContext.udf.register("gm", new GeometricMean)

Then call the custom UDAF from SQL (in a notebook, %sql marks a SQL cell):

%sql
-- Use a GROUP BY clause and call the UDAF.
select group_id, gm(id) from simple group by group_id
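The query assumes that a table named simple with columns group_id and id already exists, which this post does not show. The sketch below, modeled on the Databricks page listed in the references, builds such a table and calls the UDAF from SQL and from the DataFrame API. It assumes Spark 2.x or later (SparkSession rather than sqlContext) and a local master. The object name GmUsage, the result column name and the sample data are my own assumptions. Note that UserDefinedAggregateFunction is deprecated since Spark 3.0 in favor of Aggregator.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Same logic as the class above, repeated so this file compiles on its own.
class GeometricMean extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)
  override def bufferSchema: StructType = StructType(
    StructField("count", LongType) :: StructField("product", DoubleType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) + 1
    buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
    buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
  }
  override def evaluate(buffer: Row): Any =
    math.pow(buffer.getDouble(1), 1.0 / buffer.getLong(0))
}

object GmUsage {
  // Returns group_id -> geometric mean of id for the sample data.
  def run(spark: SparkSession): Map[Long, Double] = {
    // Sample data (an assumption): ids 1..10, grouped by id % 3.
    val df = spark.range(1, 11).selectExpr("id", "id % 3 as group_id")
    df.createOrReplaceTempView("simple")

    // Register the UDAF under the name used in the SQL query.
    spark.udf.register("gm", new GeometricMean)
    val viaSql = spark.sql(
      "select group_id, gm(id) as geometric_mean from simple group by group_id")

    // The same aggregation through the DataFrame API.
    val gm = new GeometricMean
    df.groupBy("group_id").agg(gm(col("id")).as("geometric_mean")).show()

    viaSql.collect().map(r => r.getLong(0) -> r.getDouble(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("gm-usage").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

The id column is a Long while inputSchema declares a Double. Spark casts the input to the declared type before calling update, which is why the same query works in the referenced Databricks example.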

References:

https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
