UDF, UDAF, and UDTF in Hive

1. Basic operations in Hive:

DDL, DML

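2. Functions in Hive

User-Defined Functions: UDF (user-defined function, abbreviated UDF)
UDF: one in, one out, e.g. upper, lower, substring (one record goes in, one record comes out)
UDAF: Aggregation (user-defined aggregate function), many in, one out, e.g. count, max, min, sum ...
UDTF: Table-Generation (user-defined table-generating function), one in, many out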
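
The three categories above differ only in how many rows go in and how many come out. The following is a minimal Python sketch of those shapes, not Hive's own implementation; the function names udf_upper, udaf_count and udtf_explode are made up for illustration.

```python
from typing import Iterable, Iterator, List


def udf_upper(value: str) -> str:
    """UDF shape: one row in, one row out."""
    return value.upper()


def udaf_count(values: Iterable[str]) -> int:
    """UDAF shape: many rows in, one row out."""
    return sum(1 for _ in values)


def udtf_explode(values: List[str]) -> Iterator[str]:
    """UDTF shape: one row in, many rows out."""
    for v in values:
        yield v


rows = ["hello", "world"]
print([udf_upper(r) for r in rows])          # one output value per input row
print(udaf_count(rows))                      # a single value for the whole group
print(list(udtf_explode(["a", "b", "c"])))   # one array row becomes three rows
```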

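3. Examples

show functions lists the functions the system supports.

Function examples: split(), explode()

Exercise: count word occurrences with Hive

explode turns an array into multiple rows of data.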
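
To make the split-then-explode idea concrete, here is a small Python sketch that imitates the two steps on the two sample lines used in this exercise. It only models the behaviour; the Python functions named split and explode here are stand-ins written for this illustration, not Hive's built-ins.

```python
import re
from typing import Iterable, Iterator, List


def split(sentence: str, pattern: str) -> List[str]:
    """Split a string on a regular-expression pattern into an array."""
    return re.split(pattern, sentence)


def explode(arrays: Iterable[List[str]]) -> Iterator[str]:
    """Turn every array into one output row per element."""
    for arr in arrays:
        yield from arr


sentences = ["hello,world,welcome", "hello,welcome"]
arrays = [split(s, ",") for s in sentences]   # one array per input row
words = list(explode(arrays))                 # one row per array element
print(arrays)
print(words)
```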

[hadoop@hadoop000 data]$ vi hive-wc.txt

hello,world,welcome

hello,welcome

hive> create table hive_wc(sentence string);

OK

Time taken: 1.083 seconds

hive> load data local inpath '/home/hadoop/data/hive-wc.txt' into table hive_wc;

Loading data to table default.hive_wc

Table default.hive_wc stats: [numFiles=, totalSize=]

OK

Time taken: 1.539 seconds

hive> select * from hive_wc;

OK

hello,world,welcome

hello,welcome

Time taken: 0.536 seconds, Fetched:  row(s)

hive> select split(sentence,",") from hive_wc;

OK

["hello","world","welcome"]

["hello","welcome"]

[""]

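Time taken: 0.161 seconds, Fetched:  row(s)

Applying explode to the split result, i.e. explode(split(sentence,",")), turns each array element into its own row: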

"hello"

"world"

"welcome"

"hello"

"welcome"

Complete the wordcount in a single SQL statement:

hive> select word, count(*) as c

    > from (select explode(split(sentence,",")) as word from hive_wc) t

    > group by word ;

Query ID = hadoop_20180613094545_920c2e72--47eb-9a9c-5e5a30ebb1ae

Total jobs =

Launching Job  out of

Number of reduce tasks not specified. Estimated from input data size:

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1528851144815_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528851144815_0001/

Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.7.0/bin/hadoop job  -kill job_1528851144815_0001

Hadoop job information for Stage-: number of mappers: ; number of reducers:

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 2.42 sec

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 4.31 sec

MapReduce Total cumulative CPU time:  seconds  msec

Ended Job = job_1528851144815_0001

MapReduce Jobs Launched:

Stage-Stage-: Map:   Reduce:    Cumulative CPU: 4.31 sec   HDFS Read:  HDFS Write:  SUCCESS

Total MapReduce CPU Time Spent:  seconds  msec

OK

hello

welcome

world

Time taken: 26.859 seconds, Fetched:  row(s)
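
The query above is explode(split(...)) in a subquery followed by group by word with a count. As a rough cross-check of that logic outside Hive, the same computation can be sketched in Python; word_count is a hypothetical helper, and the input is just the two lines placed in hive-wc.txt.

```python
from collections import Counter
from typing import Dict, List


def word_count(sentences: List[str], sep: str = ",") -> Dict[str, int]:
    """Split each sentence, flatten the words, and count them per word."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.split(sep))   # explode(split(...)) + count
    return dict(sorted(counts.items()))      # sorted by word for readability


result = word_count(["hello,world,welcome", "hello,welcome"])
for word, c in result.items():
    print(word, c)
```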

4. JSON data

File used: rating.json

Create a table rating_json, load the data, and look at the first ten rows:

hive> create table rating_json(json string);

OK

hive> load data local inpath '/home/hadoop/data/rating.json' into table rating_json;

Loading data to table default.rating_json

Table default.rating_json stats: [numFiles=, totalSize=]

OK

hive> select * from rating_json limit 10;

OK

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

Time taken: 0.195 seconds, Fetched:  row(s)

Process the JSON data with json_tuple, a UDTF introduced in Hive 0.7:

hive> select

    > json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid)

    > from rating_json limit ;

OK

Time taken: 0.189 seconds, Fetched:  row(s)
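
Conceptually, json_tuple takes one JSON string plus a list of keys and returns the value of each key as a separate column. The Python sketch below imitates that behaviour under simplifying assumptions: non-string values are passed through str(), and unparseable input yields None for every key. The sample record is invented for illustration, since the field values are not visible in the listing above.

```python
import json
from typing import Optional, Tuple


def json_tuple(json_str: str, *keys: str) -> Tuple[Optional[str], ...]:
    """Return one value per requested key, or None when it is missing."""
    try:
        obj = json.loads(json_str)
    except ValueError:                       # input is not valid JSON
        return tuple(None for _ in keys)
    if not isinstance(obj, dict):            # only top-level objects have keys
        return tuple(None for _ in keys)
    return tuple(
        None if obj.get(k) is None else str(obj[k]) for k in keys
    )


# Hypothetical record with the same four fields as rating.json.
line = '{"movie":"m1","rate":"5","time":"t1","userid":"u1"}'
movie, rate, time_, userid = json_tuple(line, "movie", "rate", "time", "userid")
print(movie, rate, time_, userid)
```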

5. Top-N within groups using a window function:

[hadoop@hadoop000 data]$ more hive_row_number.txt

,,ruoze,M

,,jepson,M

,,wangwu,F

,,zhaoliu,F

,,tianqi,M

,,wangba,F

[hadoop@hadoop000 data]$

hive> create table hive_rownumber(id int,age int, name string, sex string)

    > row format delimited fields terminated by ',';

OK

Time taken: 0.451 seconds

hive> load data local inpath '/home/hadoop/data/hive_row_number.txt' into table hive_rownumber;

Loading data to table hive3.hive_rownumber

Table hive3.hive_rownumber stats: [numFiles=, totalSize=]

OK

Time taken: 1.381 seconds

hive> select * from hive_rownumber ;

OK

             ruoze   M

             jepson  M

             wangwu  F

             zhaoliu F

             tianqi  M

             wangba  F

Time taken: 0.455 seconds, Fetched:  row(s)

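Requirement: for each sex, return the two rows with the greatest age (a top-N query):

Analysis: order by sorts globally and cannot sort within a group; to sort within each group you need a window function (analytic function).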

select id,age,name,sex

from

(select id,age,name,sex,

row_number() over(partition by sex order by age desc) as rank

from hive_rownumber) t

where rank<=2;

hive> select id,age,name,sex

    > from

    > (select id,age,name,sex,

    > row_number() over(partition by sex order by age desc) as rank

    > from hive_rownumber) t

    > where rank<=2;

Query ID = hadoop_20180614202525_9829dc42-3c37--8b12-89c416589ebc

Total jobs =

Launching Job  out of

Number of reduce tasks not specified. Estimated from input data size:

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1528975858636_0001, Tracking URL = http://hadoop000:/proxy/application_1528975858636_0001/

Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.7.0/bin/hadoop job  -kill job_1528975858636_0001

Hadoop job information for Stage-: number of mappers: ; number of reducers:

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 1.48 sec

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 3.86 sec

MapReduce Total cumulative CPU time:  seconds  msec

Ended Job = job_1528975858636_0001

MapReduce Jobs Launched:

Stage-Stage-: Map:   Reduce:    Cumulative CPU: 3.86 sec   HDFS Read:  HDFS Write:  SUCCESS

Total MapReduce CPU Time Spent:  seconds  msec

OK

             wangba  F

             wangwu  F

             tianqi  M

             jepson  M

Time taken: 29.262 seconds, Fetched:  row(s)
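
What row_number() over(partition by sex order by age desc) does can be sketched in plain Python: group the rows by sex, sort each group by age descending, number the rows from 1, and keep those with rank <= 2. This is only an illustration of the logic; top_n_per_group is a hypothetical helper, and the ages below are made up because the age column is not visible in the listing above.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List


def top_n_per_group(rows: List[Dict], n: int) -> List[Dict]:
    """Number rows within each sex by age descending and keep rank <= n."""
    # Sort by partition key first, then by age descending inside it.
    ordered = sorted(rows, key=lambda r: (r["sex"], -r["age"]))
    result = []
    for _, group in groupby(ordered, key=itemgetter("sex")):
        for rank, row in enumerate(group, start=1):   # row_number()
            if rank <= n:                             # where rank <= n
                result.append({**row, "rank": rank})
    return result


# Names and sexes follow the sample file; the ages are invented.
rows = [
    {"name": "ruoze", "age": 18, "sex": "M"},
    {"name": "jepson", "age": 19, "sex": "M"},
    {"name": "wangwu", "age": 22, "sex": "F"},
    {"name": "zhaoliu", "age": 16, "sex": "F"},
    {"name": "tianqi", "age": 30, "sex": "M"},
    {"name": "wangba", "age": 26, "sex": "F"},
]
top2 = top_n_per_group(rows, 2)
for r in top2:
    print(r["rank"], r["name"], r["age"], r["sex"])
```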
