Sometimes the built-in aggregate functions provided by Spark are not adequate, so Spark lets you register custom user-defined aggregate functions (UDAFs). Before diving into code, let's first look at some of the methods of the class UserDefinedAggregateFunction.

1. inputSchema()

Defines a StructType that represents the input arguments of this aggregate function.

2. bufferSchema()

Defines a StructType that represents the values held in the aggregation buffer. This schema is used to hold the intermediate aggregate values while the function is being evaluated.

3. dataType()

Returns the DataType of the value returned by this aggregate function.

4. initialize(MutableAggregationBuffer buffer)

Invoked whenever a new aggregation buffer is created, i.e. whenever your "key" changes. Use this method to reinitialize your buffer variables.

5. evaluate(Row buffer)

Calculates the final output value by reading the aggregation buffer.

6. update(MutableAggregationBuffer buffer, Row input)

Updates the aggregation buffer; it is invoked every time a new input row arrives for the same key.

7. merge(MutableAggregationBuffer buffer, Row input)

Merges the outputs of two different aggregation buffers for the same key.

The original post illustrates how these methods work in Spark with a diagram, under the assumption that there are 2 aggregation buffers for the task. The same flow is sketched in plain Java below.
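
The following is a minimal, Spark-free Java sketch of that call sequence for a simple sum, assuming two aggregation buffers whose partial results are merged at the end. The class name UdafLifecycleSketch and the int[] buffer are purely illustrative and are not part of the Spark API; only the order of the calls matters here.

//Illustrative only: mimics the UDAF lifecycle (initialize -> update -> merge -> evaluate)
import java.util.Arrays;
import java.util.List;

public class UdafLifecycleSketch {

    //The "aggregation buffer" here is just a single running total
    static int[] initialize() {
        return new int[] { 0 };       //initialize(buffer): reset the buffer for a new group
    }

    static void update(int[] buffer, int input) {
        buffer[0] += input;           //update(buffer, input): fold one input row into the buffer
    }

    static void merge(int[] buffer1, int[] buffer2) {
        buffer1[0] += buffer2[0];     //merge(buffer1, buffer2): combine two partial results
    }

    static int evaluate(int[] buffer) {
        return buffer[0];             //evaluate(buffer): produce the final value for the group
    }

    public static void main(String[] args) {
        //Two partitions of the same group, each with its own buffer
        List<Integer> partition1 = Arrays.asList(1, 2, 3);
        List<Integer> partition2 = Arrays.asList(4, 5);

        int[] buffer1 = initialize();
        for (int row : partition1) update(buffer1, row);

        int[] buffer2 = initialize();
        for (int row : partition2) update(buffer2, row);

        merge(buffer1, buffer2);               //the two partial results meet here
        System.out.println(evaluate(buffer1)); //prints 15
    }
}

In the real UDAF, Spark drives exactly this sequence for every group produced by GROUP BY.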

Let's see how we can write a UDAF that accepts multiple values as input and returns multiple values as output.

The input is a .txt file with 3 columns: city, female count and male count. We need to compute the total population and the dominant population (Male or Female) of each city.

CITIES.TXT

Nashik 40 50
Mumbai 50 60
Pune 70 80
Nashik 40 50
Mumbai 50 60
Pune 170 80

The expected output is as below:

+--------+--------+--------+
| city   |Dominant| Total  |
+--------+--------+--------+
| Mumbai | Male   | 220    |
| Pune   | Female | 400    |
| Nashik | Male   | 180    |
+--------+--------+--------+
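
To see where these numbers come from, take Pune as an example: it appears twice, with (female, male) counts of (70, 80) and (170, 80), so Total = 70 + 80 + 170 + 80 = 400, and since the female total (240) is greater than the male total (160), the dominant population is Female.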

Now let's write a UDAF class that extends the UserDefinedAggregateFunction class. The required comments are provided in the code below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkUDAF extends UserDefinedAggregateFunction
{
private StructType inputSchema;
private StructType bufferSchema;
private DataType returnDataType =
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType);

public SparkUDAF()
{
//inputSchema : This UDAF can accept 2 inputs which are of type Integer
List<StructField> inputFields = new ArrayList<StructField>();
StructField inputStructField1 = DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true);
inputFields.add(inputStructField1);
StructField inputStructField2 = DataTypes.createStructField("maleCount", DataTypes.IntegerType, true);
inputFields.add(inputStructField2);
inputSchema = DataTypes.createStructType(inputFields);

//bufferSchema : This UDAF holds its intermediate results in the buffers mentioned below
List<StructField> bufferFields = new ArrayList<StructField>();
StructField bufferStructField1 = DataTypes.createStructField("totalCount", DataTypes.IntegerType, true);
bufferFields.add(bufferStructField1);
StructField bufferStructField2 = DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true);
bufferFields.add(bufferStructField2);
StructField bufferStructField3 = DataTypes.createStructField("maleCount", DataTypes.IntegerType, true);
bufferFields.add(bufferStructField3);
StructField bufferStructField4 = DataTypes.createStructField("outputMap", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
bufferFields.add(bufferStructField4);
bufferSchema = DataTypes.createStructType(bufferFields);
}

/**
* This method determines which bufferSchema will be used
*/
@Override
public StructType bufferSchema() {
return bufferSchema;
}

/**
* This method determines the return type of this UDAF
*/
@Override
public DataType dataType() {
return returnDataType;
}

/**
* Returns true iff this function is deterministic, i.e. given the same input it always returns the same output.
*/
@Override
public boolean deterministic() {
return true;
}

/**
* This method re-initializes the buffer variables to 0 whenever a new aggregation buffer is created (i.e. for each new city)
*/
@Override
public void initialize(MutableAggregationBuffer buffer) {
buffer.update(0, 0); //totalCount
buffer.update(1, 0); //femaleCount
buffer.update(2, 0); //maleCount
}

/**
* This method is used to increment the count for each city
*/
@Override
public void update(MutableAggregationBuffer buffer, Row input) {
buffer.update(0, buffer.getInt(0) + input.getInt(0) + input.getInt(1)); //running total
buffer.update(1, buffer.getInt(1) + input.getInt(0));                   //running female count
buffer.update(2, buffer.getInt(2) + input.getInt(1));                   //running male count
}

/**
* This method will be used to merge data of two buffers
*/
@Override
public void merge(MutableAggregationBuffer buffer, Row input) {
buffer.update(0, buffer.getInt(0) + input.getInt(0));
buffer.update(1, buffer.getInt(1) + input.getInt(1));
buffer.update(2, buffer.getInt(2) + input.getInt(2));
}

/**
* This method calculates the final value by reading the aggregation buffer
*/
@Override
public Object evaluate(Row buffer) {
//In this method we are preparing a final map that will be returned as output
Map<String,String> op = new HashMap<String,String>();
op.put("Total", "" + buffer.getInt(0));
op.put("dominant", "Male");
if (buffer.getInt(1) > buffer.getInt(2))
{
op.put("dominant", "Female");
}
return op;
}
/**
* This method will determine the input schema of this UDAF
*/
@Override
public StructType inputSchema() {
return inputSchema;
}
}

Now let's see how we can access this UDAF from our Spark code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public class TestDemo {
public static void main (String args[])
{
//Set up sparkContext and SQLContext
SparkConf conf = new SparkConf().setAppName("udaf").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

//create Row RDD
JavaRDD<String> citiesRdd = sc.textFile("cities.txt");
JavaRDD<Row> rowRdd = citiesRdd.map(new Function<String, Row>() {
public Row call(String line) throws Exception {
StringTokenizer st = new StringTokenizer(line, " ");
return RowFactory.create(st.nextToken().trim(),Integer.parseInt(st.nextToken().trim()),Integer.parseInt(st.nextToken().trim()));
}
});

//Create Struct Type
List<StructField> inputFields = new ArrayList<StructField>();
StructField inputStructField = DataTypes.createStructField("city", DataTypes.StringType, true);
inputFields.add(inputStructField);
StructField inputStructField2 = DataTypes.createStructField("Female", DataTypes.IntegerType, true);
inputFields.add(inputStructField2);
StructField inputStructField3 = DataTypes.createStructField("Male", DataTypes.IntegerType, true);
inputFields.add(inputStructField3);
StructType inputSchema = DataTypes.createStructType(inputFields);

//Create Data Frame
DataFrame df = sqlContext.createDataFrame(rowRdd, inputSchema);

//Register our Spark UDAF
SparkUDAF sparkUDAF = new SparkUDAF();
sqlContext.udf().register("uf", sparkUDAF);

//Register dataframe as table
df.registerTempTable("cities");

//Run query
sqlContext.sql("SELECT city, count['dominant'] as Dominant, count['Total'] as Total from (select city, uf(Female, Male) as count from cities group by (city)) temp").show(false);
}
}
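
As a side note (this is not from the original post), the same registered UDAF can also be invoked without building a SQL string, through the DataFrame API. Below is a sketch against the Spark 1.x API used above, to be placed right after the sqlContext.sql(...) call, with the two static imports added at the top of TestDemo; callUDF and col come from org.apache.spark.sql.functions, and the column names are those defined in the schema earlier.

//Static imports to add at the top of TestDemo
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

//Equivalent to the SQL query above, expressed with the DataFrame API
DataFrame result = df.groupBy("city")
    .agg(callUDF("uf", col("Female"), col("Male")).as("count"))
    .selectExpr("city", "count['dominant'] as Dominant", "count['Total'] as Total");
result.show(false);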

Source: https://blog.augmentiq.in/2016/08/05/spark-multiple-inputoutput-user-defined-aggregate-function-udaf-using-java/
