How to Write a Custom Hive UDF

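Hive lets users write their own user-defined functions (UDFs) and call them in queries. Hive has three kinds of user-defined function:

UDF: operates on a single row and produces a single row.

UDAF: operates on multiple rows and produces a single row.

UDTF: operates on a single row and produces multiple rows (a table) as output.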
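
The three shapes can be illustrated with a plain-Java analogy. This is only a sketch: the class and method names (UdfShapes, length, sum, split) are hypothetical, and nothing here uses the Hive API.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only; none of these names come from the Hive API.
final class UdfShapes {
    // UDF shape: one input row -> one output value.
    static Integer length(String row) {
        return row == null ? null : row.length();
    }

    // UDAF shape: many input rows -> one output value (null rows are skipped).
    static int sum(List<Integer> rows) {
        int total = 0;
        for (Integer row : rows) {
            if (row != null) {
                total += row;
            }
        }
        return total;
    }

    // UDTF shape: one input row -> many output rows.
    static List<String> split(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        System.out.println(length("hive"));              // one row in, one value out
        System.out.println(sum(Arrays.asList(1, 2, 3))); // three rows in, one value out
        System.out.println(split("a,b,c"));              // one row in, three rows out
    }
}
```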

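Building and using a custom UDF involves the following steps:

Step 1: Extend UDF, UDAF, or UDTF and implement the required methods.

For a UDF example, see http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/example/UDFExampleAdd.java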

package org.apache.hadoop.hive.contrib.udf.example;

import org.apache.hadoop.hive.ql.exec.Description;

import org.apache.hadoop.hive.ql.exec.UDF;

/**

 * UDFExampleAdd.

 *

 */

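// A UDF operates on a single row and produces a single row.

// A user class must extend UDF and implement at least one evaluate method. The UDF base class does not declare this method,

// but Hive checks that the user's UDF class has an evaluate method.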

@Description(name = "example_add", value = "_FUNC_(expr) - Example UDAF that returns the sum")

public class UDFExampleAdd extends UDF {

  // Implement the actual logic

  public Integer evaluate(Integer... a) {

    int total = 0;

    for (Integer element : a) {

      if (element != null) {

        total += element;

      }

    }

    return total;

  }

  public Double evaluate(Double... a) {

    double total = 0;

    for (Double element : a) {

      if (element != null) {

        total += element;

      }

    }

    return total;

  }

}
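
Because evaluate is not declared on the UDF base class, Hive has to find it by reflection. The sketch below is a simplified illustration of that kind of lookup, not Hive's actual resolver. EvaluateLookup and AddLikeUdf are hypothetical names, and AddLikeUdf deliberately does not extend Hive's UDF so that the code compiles without Hive on the classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class EvaluateLookup {
    // Hypothetical stand-in for a UDF class with two evaluate overloads.
    static final class AddLikeUdf {
        public Integer evaluate(Integer... a) {
            int total = 0;
            for (Integer element : a) {
                if (element != null) {
                    total += element;
                }
            }
            return total;
        }

        public Double evaluate(Double... a) {
            double total = 0;
            for (Double element : a) {
                if (element != null) {
                    total += element;
                }
            }
            return total;
        }
    }

    // Collects the public methods named "evaluate" declared on the given class.
    static List<Method> findEvaluateMethods(Class<?> udfClass) {
        List<Method> found = new ArrayList<>();
        for (Method m : udfClass.getDeclaredMethods()) {
            if (m.getName().equals("evaluate") && Modifier.isPublic(m.getModifiers())) {
                found.add(m);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Lists each overload; the order returned by reflection is unspecified.
        for (Method m : findEvaluateMethods(AddLikeUdf.class)) {
            System.out.println(m.getReturnType().getSimpleName() + " " + m.getName()
                + "(" + m.getParameterTypes()[0].getSimpleName() + ")");
        }
        // Null arguments are skipped by the summing loop.
        System.out.println(new AddLikeUdf().evaluate(1, null, 2));
    }
}
```

Providing several overloads, as UDFExampleAdd does, lets one function name accept different argument types.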

For a UDAF example, see

http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udaf/example/UDAFExampleAvg.java

package org.apache.hadoop.hive.contrib.udaf.example;

import org.apache.hadoop.hive.ql.exec.Description;

import org.apache.hadoop.hive.ql.exec.UDAF;

import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

/**

 * This is a simple UDAF that calculates average.

 *

 * It should be very easy to follow and can be used as an example for writing

 * new UDAFs.

 *

 * Note that Hive internally uses a different mechanism (called GenericUDAF) to

 * implement built-in aggregation functions, which are harder to program but

 * more efficient.

 *

 */

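// A UDAF takes multiple rows as input and produces a single row.

// A user-defined UDAF must extend UDAF and contain one or more static inner classes that implement UDAFEvaluator.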

@Description(name = "example_avg",

value = "_FUNC_(col) - Example UDAF to compute average")

public final class UDAFExampleAvg extends UDAF {

  /**

   * The internal state of an aggregation for average.

   *

   * Note that this is only needed if the internal state cannot be represented

   * by a primitive.

   *

   * The internal state can also contain fields with types like

   * ArrayList<String> and HashMap<String,Double> if needed.

   */

  public static class UDAFAvgState {

    private long mCount;

    private double mSum;

  }

  /**

   * The actual class for doing the aggregation. Hive will automatically look

   * for all internal classes of the UDAF that implement UDAFEvaluator.

   */

  public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {

    UDAFAvgState state;

    public UDAFExampleAvgEvaluator() {

      super();

      state = new UDAFAvgState();

      init();

    }

    /**

     * Reset the state of the aggregation.

     */

    public void init() {

      state.mSum = 0;

      state.mCount = 0;

    }

    /**

     * Iterate through one row of original data.

     *

     * The number and type of arguments need to be the same as when this UDAF

     * is called from the Hive command line.

     *

     * This function should always return true.

     */

    public boolean iterate(Double o) {

      if (o != null) {

        state.mSum += o;

        state.mCount++;

      }

      return true;

    }

    /**

     * Terminate a partial aggregation and return the state. If the state is a

     * primitive, just return primitive Java classes like Integer or String.

     */

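    // Hive calls this method when it needs a partial aggregation result.

    // It returns an object that encapsulates the current state of the aggregation.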

    public UDAFAvgState terminatePartial() {

      // This is SQL standard - average of zero items should be null.

      return state.mCount == 0 ? null : state;

    }

    /**

     * Merge with a partial aggregation.

     *

     * This function should always have a single argument which has the same

     * type as the return value of terminatePartial().

     */

    // Hive calls this method to merge another partial aggregation into this one.

    public boolean merge(UDAFAvgState o) {

      if (o != null) {

        state.mSum += o.mSum;

        state.mCount += o.mCount;

      }

      return true;

    }

    /**

     * Terminates the aggregation and returns the final result.

     */

    // Hive calls this method when it needs the final aggregation result.

    public Double terminate() {

      // This is SQL standard - average of zero items should be null.

      return state.mCount == 0 ? null : Double.valueOf(state.mSum

          / state.mCount);

    }

  }

  private UDAFExampleAvg() {

    // prevent instantiation

  }

}
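
The order in which the evaluator methods are called is easier to see in a small simulation. The following is a minimal sketch in plain Java, assuming that each split of the input is aggregated separately and the partial states are then merged. AvgLifecycleSketch and its helper methods are hypothetical names; the Evaluator class copies the logic of the example above but does not implement Hive's UDAFEvaluator, so it runs without Hive.

```java
import java.util.Arrays;
import java.util.List;

final class AvgLifecycleSketch {
    // Mirrors UDAFAvgState from the example above.
    static final class State {
        long count;
        double sum;
    }

    // Mirrors the evaluator's methods without depending on Hive.
    static final class Evaluator {
        private final State state = new State();

        void init() {
            state.sum = 0;
            state.count = 0;
        }

        boolean iterate(Double o) {
            if (o != null) {
                state.sum += o;
                state.count++;
            }
            return true;
        }

        State terminatePartial() {
            return state.count == 0 ? null : state;
        }

        boolean merge(State o) {
            if (o != null) {
                state.sum += o.sum;
                state.count += o.count;
            }
            return true;
        }

        Double terminate() {
            return state.count == 0 ? null : Double.valueOf(state.sum / state.count);
        }
    }

    // Aggregates one split of rows: init, iterate per row, then terminatePartial.
    static State aggregateSplit(List<Double> rows) {
        Evaluator evaluator = new Evaluator();
        evaluator.init();
        for (Double row : rows) {
            evaluator.iterate(row);
        }
        return evaluator.terminatePartial();
    }

    // Combines the partial states: init, merge per partial, then terminate.
    static Double average(List<List<Double>> splits) {
        Evaluator finalEvaluator = new Evaluator();
        finalEvaluator.init();
        for (List<Double> split : splits) {
            finalEvaluator.merge(aggregateSplit(split));
        }
        return finalEvaluator.terminate();
    }

    public static void main(String[] args) {
        List<Double> split1 = Arrays.asList(1.0, 2.0, null); // the null row is ignored
        List<Double> split2 = Arrays.asList(3.0, 6.0);
        System.out.println(average(Arrays.asList(split1, split2))); // prints 3.0
    }
}
```

Carrying both the sum and the count in the partial state is what makes the merge correct: averaging the per-split averages would give the wrong answer when splits have different sizes.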

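Step 2: Register the finished function with Hive. There are two ways to do this.

Method 1

(1) Package the compiled classes into a JAR.

(2) Start the Hive shell and register the JAR file with add jar.

(3) Give the class an alias to use in queries.

Example commands (the first registers the JAR; the other two create temporary functions that exist only for the current Hive session):

add jar UDFExample.jar;

create temporary function my_add as 'org.apache.hadoop.hive.contrib.udf.example.UDFExampleAdd';

create temporary function my_avg as 'org.apache.hadoop.hive.contrib.udaf.example.UDAFExampleAvg';

A function registered this way is available only in the current Hive session. To make it permanent, you can register it in the Hive source code, as described in Method 2.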

Method 2

(1) Register the functions in org.apache.hadoop.hive.ql.exec.FunctionRegistry:

registerUDF("my_add", UDFExampleAdd.class, false);

registerUDAF("my_avg", UDAFExampleAvg.class);

(2) Compile and package the Hive source.

(3) Deploy the Hive package and the UDF package. The UDF package only needs to be on Hive's classpath.
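
Conceptually, registering a function binds a name used in queries to an implementing class. The toy registry below sketches that idea in plain Java. It is not Hive's FunctionRegistry: ToyFunctionRegistry, register, lookup, and MyAdd are hypothetical names, and normalizing names to lowercase is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only; this is not Hive's FunctionRegistry.
final class ToyFunctionRegistry {
    private static final Map<String, Class<?>> FUNCTIONS = new HashMap<>();

    // Binds a function name to its implementing class (names are lowercased: an assumption).
    static void register(String name, Class<?> implementation) {
        FUNCTIONS.put(name.toLowerCase(Locale.ROOT), implementation);
    }

    // Returns the class registered under the name, or null if there is none.
    static Class<?> lookup(String name) {
        return FUNCTIONS.get(name.toLowerCase(Locale.ROOT));
    }

    // Hypothetical stand-in for a UDF class.
    static final class MyAdd {
        public Integer evaluate(Integer a, Integer b) {
            return (a == null || b == null) ? null : a + b;
        }
    }

    public static void main(String[] args) throws Exception {
        register("my_add", MyAdd.class);
        // Resolve the name, create an instance, and call evaluate reflectively.
        Class<?> impl = lookup("MY_ADD");
        Object udf = impl.getDeclaredConstructor().newInstance();
        Object result = impl.getMethod("evaluate", Integer.class, Integer.class)
            .invoke(udf, 2, 3);
        System.out.println(result); // prints 5
    }
}
```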
