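45. Spark SQL UDF & UDAF

I. UDF

1. UDF

UDF stands for User Defined Function: a function you write yourself and register with Spark SQL so that it can be called by name inside SQL statements. A UDF takes the values of a single row as input and returns a single value.

2. Scala example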

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType

object UDF {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("UDF")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Build mock data
    val names = Array("Leo", "Marry", "Jack", "Tom")
    val namesRDD = sc.parallelize(names, 5)
    val namesRowRDD = namesRDD.map(name => Row(name))
    val structType = StructType(Array(StructField("name", StringType, true)))
    val namesDF = sqlContext.createDataFrame(namesRowRDD, structType)

    // Register a temporary table called "names"
    namesDF.registerTempTable("names")

    // Define and register the custom function
    // Define: write an anonymous function
    // Register: SQLContext.udf.register()
    // UDF name: strLen; body (anonymous function): (str: String) => str.length()
    sqlContext.udf.register("strLen", (str: String) => str.length())

    // Call the custom function from SQL
    sqlContext.sql("select name, strLen(name) from names")
      .collect()
      .foreach(println)
  }
}

3. Java example

package cn.spark.study.sql;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class UDF {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("UDFJava").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        // Build mock data
        List<String> stringList = new ArrayList<String>();
        stringList.add("Leo");
        stringList.add("Marry");
        stringList.add("Jack");
        stringList.add("Tom");
        JavaRDD<String> rdd = sparkContext.parallelize(stringList);

        // Wrap each name in a Row
        JavaRDD<Row> nameRDD = rdd.map(new Function<String, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Row call(String v1) throws Exception {
                return RowFactory.create(v1);
            }
        });

        // Describe the single "name" column and build the DataFrame
        List<StructField> fieldList = new ArrayList<StructField>();
        fieldList.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        StructType structType = DataTypes.createStructType(fieldList);
        DataFrame dataFrame = sqlContext.createDataFrame(nameRDD, structType);

        // Register a temporary table called "name"
        dataFrame.registerTempTable("name");

        // Register the UDF: its name, its implementation, and its return type
        sqlContext.udf().register("strLen", new UDF1<String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(String s) throws Exception {
                return s.length();
            }
        }, DataTypes.IntegerType);

        // Call the custom function from SQL and print each result row
        sqlContext.sql("select name, strLen(name) from name").javaRDD()
            .foreach(new VoidFunction<Row>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void call(Row row) throws Exception {
                    System.out.println(row);
                }
            });
    }
}
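The name column in the example above is declared nullable (the third argument to createStructField is true), so strLen could be handed a null value, in which case s.length() would throw a NullPointerException. Below is a minimal sketch of a null-safe function body, written in plain Java with no Spark dependency so that it can be run on its own. The class name NullSafeStrLen is hypothetical, and returning null for a null input is an assumed design choice, not something the original example does.

```java
public class NullSafeStrLen {

    // Same logic as the strLen UDF body, but returns null instead of
    // throwing when the input is null (an assumed design choice).
    public static Integer strLen(String s) {
        return s == null ? null : s.length();
    }

    public static void main(String[] args) {
        System.out.println(strLen("Leo"));   // 3
        System.out.println(strLen("Marry")); // 5
        System.out.println(strLen(null));    // null
    }
}
```

The same one-line guard could be placed inside the call method of the UDF1 implementation if the real data may contain nulls.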

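II. UDAF

1. Overview

UDAF stands for User Defined Aggregate Function. It was introduced as a new feature in Spark 1.5.x.

A UDF works on the input of a single row and returns one output. A UDAF, by contrast, aggregates over a group of (multiple) input rows and returns one output per group, which makes it more powerful.

Usage:

1. Define a class that extends UserDefinedAggregateFunction and implement the method for each stage of the aggregation.

2. Register the UDAF with Spark, binding it to a name.

3. Call it from SQL statements by that name.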
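The stage methods mentioned in step 1 are initialize, update, merge and evaluate. The following is a minimal sketch of how they cooperate, written in plain Java with no Spark dependency. The class CountLifecycleSketch, the one-element int[] buffer, and the hand-made partitions are all assumptions for illustration; Spark itself decides how rows are split across partitions and when buffers are merged.

```java
import java.util.Arrays;
import java.util.List;

public class CountLifecycleSketch {

    // initialize: start every aggregation buffer at zero
    public static int[] initialize() {
        return new int[] {0};
    }

    // update: fold one input row into a partition-local buffer
    public static void update(int[] buffer, String input) {
        buffer[0] += 1;
    }

    // merge: combine a buffer from another partition into this one
    public static void merge(int[] buffer1, int[] buffer2) {
        buffer1[0] += buffer2[0];
    }

    // evaluate: turn the final buffer into the result value
    public static int evaluate(int[] buffer) {
        return buffer[0];
    }

    // Runs the four stages over the rows of ONE group spread across partitions
    public static int aggregate(List<List<String>> partitions) {
        int[] total = initialize();
        for (List<String> partition : partitions) {
            int[] local = initialize();
            for (String row : partition) {
                update(local, row);
            }
            merge(total, local);
        }
        return evaluate(total);
    }

    public static void main(String[] args) {
        // Three "Tom" rows that happen to live in two partitions
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("Tom", "Tom"),
                Arrays.asList("Tom"));
        System.out.println(aggregate(partitions)); // 3
    }
}
```

Keeping update and merge separate is what lets each partition aggregate locally before the partial results are combined.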

2. Scala example

This example counts how many times each string occurs. First, define a class that extends UserDefinedAggregateFunction:

package cn.spark.study.sql

import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.IntegerType

/**
 * Counts the number of rows in each group.
 */
class StringCount extends UserDefinedAggregateFunction {

  // inputSchema: the type of the input data
  def inputSchema: StructType = {
    StructType(Array(StructField("str", StringType, true)))
  }

  // bufferSchema: the type of the intermediate data handled during aggregation
  def bufferSchema: StructType = {
    StructType(Array(StructField("count", IntegerType, true)))
  }

  // dataType: the type of the function's return value
  def dataType: DataType = {
    IntegerType
  }

  // The same input always produces the same output
  def deterministic: Boolean = {
    true
  }

  // Initialize the aggregation buffer for each group
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0
  }

  // How a group's aggregate value is updated when a new value arrives for that group
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }

  // Spark is distributed, so the rows of one group may be partially aggregated
  // (via update) on different nodes. The partial values that a group has on
  // the individual nodes must then be merged into one.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }

  // Finally, compute the group's final aggregate value from the intermediate buffer
  def evaluate(buffer: Row): Any = {
    buffer.getAs[Int](0)
  }
}

Then register and use it:

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType

object UDAF {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
        .setMaster("local")
        .setAppName("UDAF")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Build mock data
    val names = Array("Leo", "Marry", "Jack", "Tom", "Tom", "Tom", "Leo")
    val namesRDD = sc.parallelize(names, 5)
    val namesRowRDD = namesRDD.map { name => Row(name) }
    val structType = StructType(Array(StructField("name", StringType, true)))
    val namesDF = sqlContext.createDataFrame(namesRowRDD, structType)

    // Register a temporary table called "names"
    namesDF.registerTempTable("names")

    // Register the custom aggregate function
    // Define: the StringCount class shown above
    // Register: SQLContext.udf.register(), binding an instance to the name strCount
    sqlContext.udf.register("strCount", new StringCount)

    // Call the custom aggregate function from SQL
    sqlContext.sql("select name,strCount(name) from names group by name")
        .collect()
        .foreach(println)
  }
}
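For the sample data above, the query reports how often each name occurs. As a quick cross-check that does not need Spark, the same grouping can be sketched in plain Java. The class name ExpectedCounts is hypothetical, and a TreeMap is used only to make the print order deterministic, since the query has no order by clause and its row order should not be relied on.

```java
import java.util.Map;
import java.util.TreeMap;

public class ExpectedCounts {

    // Counts occurrences per name, i.e. what "group by name" with strCount computes
    public static Map<String, Integer> count(String[] names) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] names = {"Leo", "Marry", "Jack", "Tom", "Tom", "Tom", "Leo"};
        System.out.println(count(names)); // {Jack=1, Leo=2, Marry=1, Tom=3}
    }
}
```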
