Repost: Spark User Defined Aggregate Function (UDAF) using Java
Sometimes the aggregate functions built into Spark are not enough, so Spark lets you plug in custom user defined aggregate functions. Before diving into the code, let's first look at the methods of the UserDefinedAggregateFunction class.
1. inputSchema()
In this method you need to define a StructType that represents the input arguments of this aggregate function.
2. bufferSchema()
In this method you need to define a StructType that represents values in the aggregation buffer. This schema is used to hold the aggregate function value at the time of processing.
3. dataType()
This method returns the DataType of the value returned by this aggregate function.
4. initialize(MutableAggregationBuffer buffer)
Spark calls this method to set up a fresh aggregation buffer, which happens whenever a new "key" (group) is started. Use it to reset every buffer field to its initial value.
5. evaluate(Row buffer)
This method calculates the final value by reading the aggregation buffer.
6. update(MutableAggregationBuffer buffer, Row input)
This method updates the aggregation buffer. It is invoked every time a new input row arrives for the same key.
7. merge(MutableAggregationBuffer buffer, Row input)
This method merges two aggregation buffers: the partial result held in the second buffer is folded into the first.
Below is a pictorial representation of how the methods work in Spark, assuming there are 2 aggregation buffers for your task.
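The interplay of the methods with 2 aggregation buffers can also be traced in plain Java, without Spark. This is only a minimal sketch: the int[] array stands in for MutableAggregationBuffer, the class name UdafLifecycleSketch and the row values are made up for illustration, and the buffer layout {total, female, male} is an assumption chosen to match the example developed later in this post.

```java
import java.util.Collections;
import java.util.List;

public class UdafLifecycleSketch {
    // Assumed buffer layout: {total, female, male}.
    static int[] initialize() {
        return new int[] {0, 0, 0};
    }

    // Fold one input row (femaleCount, maleCount) into a buffer.
    static void update(int[] buffer, int female, int male) {
        buffer[0] += female + male;
        buffer[1] += female;
        buffer[2] += male;
    }

    // Fold the partial result of a second buffer into the first.
    static void merge(int[] buffer, int[] other) {
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] += other[i];
        }
    }

    // Compute the final value from the fully merged buffer.
    static String evaluate(int[] buffer) {
        String dominant = buffer[1] > buffer[2] ? "Female" : "Male";
        return dominant + " " + buffer[0];
    }

    // Runs the lifecycle for one key whose rows are split over two partitions.
    static String aggregate(List<int[]> partition1, List<int[]> partition2) {
        int[] b1 = initialize();
        for (int[] row : partition1) {
            update(b1, row[0], row[1]);
        }
        int[] b2 = initialize();
        for (int[] row : partition2) {
            update(b2, row[0], row[1]);
        }
        merge(b1, b2);
        return evaluate(b1);
    }

    public static void main(String[] args) {
        // Two rows for the same key, assumed to sit in different partitions.
        List<int[]> p1 = Collections.singletonList(new int[] {70, 80});
        List<int[]> p2 = Collections.singletonList(new int[] {170, 80});
        System.out.println(aggregate(p1, p2)); // prints "Female 400"
    }
}
```

Each partition gets its own initialize and update calls, merge combines the partial buffers, and evaluate runs once on the combined buffer.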

Let's see how to write a UDAF that accepts multiple values as input and returns multiple values as output.
My input file is a .txt file with 3 columns: city, female count and male count. We need to compute the total population and the dominant population of each city.
CITIES.TXT
Nashik 40 50
Mumbai 50 60
Pune 70 80
Nashik 40 50
Mumbai 50 60
Pune 170 80
The expected output is shown below.
+--------+----------+-------+
| city   | Dominant | Total |
+--------+----------+-------+
| Mumbai | Male     | 220   |
| Pune   | Female   | 400   |
| Nashik | Male     | 180   |
+--------+----------+-------+
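The expected figures can be double-checked with a small reference computation in plain Java, independent of Spark. This is only a sketch: the class name ExpectedOutputCheck and the helpers totals and describe are hypothetical, and the sample rows are hard-coded instead of being read from the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedOutputCheck {
    // Returns city -> {female total, male total}, in order of first appearance.
    static Map<String, int[]> totals(String[] lines) {
        Map<String, int[]> byCity = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            int[] sums = byCity.computeIfAbsent(parts[0], k -> new int[2]);
            sums[0] += Integer.parseInt(parts[1]);
            sums[1] += Integer.parseInt(parts[2]);
        }
        return byCity;
    }

    // Formats the dominant group and the total population for one city.
    static String describe(int[] sums) {
        String dominant = sums[0] > sums[1] ? "Female" : "Male";
        return dominant + " " + (sums[0] + sums[1]);
    }

    public static void main(String[] args) {
        String[] lines = {
            "Nashik 40 50", "Mumbai 50 60", "Pune 70 80",
            "Nashik 40 50", "Mumbai 50 60", "Pune 170 80"
        };
        // Prints:
        // Nashik Male 180
        // Mumbai Male 220
        // Pune Female 400
        totals(lines).forEach((city, sums) ->
            System.out.println(city + " " + describe(sums)));
    }
}
```

Pune is the only city where the female total (240) exceeds the male total (160), which is why it is the only "Female" row.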
Now let's write a UDAF class that extends UserDefinedAggregateFunction. The comments in the code below explain each step.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkUDAF extends UserDefinedAggregateFunction
{
    private StructType inputSchema;
    private StructType bufferSchema;
    private DataType returnDataType =
        DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType);

    public SparkUDAF()
    {
        // inputSchema: this UDAF accepts 2 inputs, both of type Integer
        List<StructField> inputFields = new ArrayList<StructField>();
        StructField inputStructField1 = DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true);
        inputFields.add(inputStructField1);
        StructField inputStructField2 = DataTypes.createStructField("maleCount", DataTypes.IntegerType, true);
        inputFields.add(inputStructField2);
        inputSchema = DataTypes.createStructType(inputFields);

        // bufferSchema: the intermediate values kept while aggregating
        // (index 0 = totalCount, 1 = femaleCount, 2 = maleCount)
        List<StructField> bufferFields = new ArrayList<StructField>();
        StructField bufferStructField1 = DataTypes.createStructField("totalCount", DataTypes.IntegerType, true);
        bufferFields.add(bufferStructField1);
        StructField bufferStructField2 = DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true);
        bufferFields.add(bufferStructField2);
        StructField bufferStructField3 = DataTypes.createStructField("maleCount", DataTypes.IntegerType, true);
        bufferFields.add(bufferStructField3);
        bufferSchema = DataTypes.createStructType(bufferFields);
    }

    /**
     * This method determines which bufferSchema will be used
     */
    @Override
    public StructType bufferSchema() {
        return bufferSchema;
    }

    /**
     * This method determines the return type of this UDAF
     */
    @Override
    public DataType dataType() {
        return returnDataType;
    }

    /**
     * Returns true iff this function is deterministic, i.e. given the same input, always return the same output.
     */
    @Override
    public boolean deterministic() {
        return true;
    }

    /**
     * This method resets all counters to 0 for a fresh buffer, e.g. when the city name changes
     */
    @Override
    public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, 0);
        buffer.update(1, 0);
        buffer.update(2, 0);
    }

    /**
     * This method adds the counts of one input row to the buffer of its city
     */
    @Override
    public void update(MutableAggregationBuffer buffer, Row input) {
        // input index 0 = femaleCount, 1 = maleCount
        buffer.update(0, buffer.getInt(0) + input.getInt(0) + input.getInt(1));
        // Accumulate the per-gender counts instead of overwriting them,
        // otherwise only the last row of a partition would be remembered.
        buffer.update(1, buffer.getInt(1) + input.getInt(0));
        buffer.update(2, buffer.getInt(2) + input.getInt(1));
    }

    /**
     * This method merges the data of two buffers
     */
    @Override
    public void merge(MutableAggregationBuffer buffer, Row input) {
        buffer.update(0, buffer.getInt(0) + input.getInt(0));
        buffer.update(1, buffer.getInt(1) + input.getInt(1));
        buffer.update(2, buffer.getInt(2) + input.getInt(2));
    }

    /**
     * This method calculates the final value by reading the aggregation buffer
     */
    @Override
    public Object evaluate(Row buffer) {
        // Build the final map from the buffer that is passed in. The Row is
        // read-only here, so the map is returned directly rather than stored.
        Map<String, String> op = new HashMap<String, String>();
        op.put("Total", "" + buffer.getInt(0));
        op.put("dominant", "Male");
        if (buffer.getInt(1) > buffer.getInt(2))
        {
            op.put("dominant", "Female");
        }
        return op;
    }

    /**
     * This method determines the input schema of this UDAF
     */
    @Override
    public StructType inputSchema() {
        return inputSchema;
    }
}

Now let's see how to use this UDAF from our Spark code.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Note: this listing uses the Spark 1.x Java API (DataFrame, SQLContext, registerTempTable).
public class TestDemo {
    public static void main(String[] args)
    {
        // Set up the SparkContext and SQLContext
        SparkConf conf = new SparkConf().setAppName("udaf").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        // Create a Row RDD: one Row (city, female count, male count) per input line
        JavaRDD<String> citiesRdd = sc.textFile("cities.txt");
        JavaRDD<Row> rowRdd = citiesRdd.map(new Function<String, Row>() {
            public Row call(String line) throws Exception {
                StringTokenizer st = new StringTokenizer(line, " ");
                return RowFactory.create(st.nextToken().trim(), Integer.parseInt(st.nextToken().trim()), Integer.parseInt(st.nextToken().trim()));
            }
        });

        // Create the StructType that describes the rows
        List<StructField> inputFields = new ArrayList<StructField>();
        StructField inputStructField = DataTypes.createStructField("city", DataTypes.StringType, true);
        inputFields.add(inputStructField);
        StructField inputStructField2 = DataTypes.createStructField("Female", DataTypes.IntegerType, true);
        inputFields.add(inputStructField2);
        StructField inputStructField3 = DataTypes.createStructField("Male", DataTypes.IntegerType, true);
        inputFields.add(inputStructField3);
        StructType inputSchema = DataTypes.createStructType(inputFields);

        // Create the DataFrame
        DataFrame df = sqlContext.createDataFrame(rowRdd, inputSchema);

        // Register our Spark UDAF under the name "uf"
        SparkUDAF sparkUDAF = new SparkUDAF();
        sqlContext.udf().register("uf", sparkUDAF);

        // Register the DataFrame as a table
        df.registerTempTable("cities");

        // Run the query: the UDAF returns a map per city, which the outer select unpacks
        sqlContext.sql("SELECT city , count['dominant'] as Dominant, count['Total'] as Total from(select city, uf(Female,Male) as count from cities group by (city)) temp").show(false);
    }
}
Source: https://blog.augmentiq.in/2016/08/05/spark-multiple-inputoutput-user-defined-aggregate-function-udaf-using-java/