Transform/Map-Reduce Syntax

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive 2.0 language. e.g. in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRINGcolumns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ....

In windows, use "cmd /c your_script" instead of just "your_script"

Warning

Icon

It is your responsibility to sanitize any STRING columns prior to transformation. If your STRING column contains tabs, an identity transformer will not give you back what you started with! To help with this, see REGEXP_REPLACE and replace the tabs with some other character on their way into the TRANSFORM() call.

Warning

Icon

Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase to occur and typing "MAP" does not force a new map phase!

Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.

clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
 
rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char]
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     |
     SERDE serde_name [WITH SERDEPROPERTIES
                            property_name=property_value,
                            property_name=property_value, ...])
 
outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className
 
query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
 
  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?

SQL Standard Based Authorization Disallows TRANSFORM

The TRANSFORM clause is disallowed when SQL standard based authorization is configured in Hive 0.13.0 and later releases (HIVE-6415).

TRANSFORM Examples

Example #1:

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'reduce_script'
  AS date, count;

Example #2

FROM (
  FROM src
  SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  USING '/bin/cat'
  AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
) tmap
INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue

Schema-less Map-reduce Scripts

If there is no AS clause after USING my_script, Hive assumes that the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying AS key, value because in that case, value will only contain the portion between the first tab and the second tab if there are multiple tabs.

Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.key, map_output.value
  USING 'reduce_script'
  AS date, count;

Typing the output of TRANSFORM

The output fields from a script are typed as strings by default; for example in

SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2

They can be immediately casted with the syntax:

SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)

[HIve - LanguageManual] Transform [没懂]的更多相关文章

  1. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  2. Hive的Transform功能

    Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...

  3. HIVE的transform函数的使用

    Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...

  4. [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization

    Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...

  5. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  6. [HIve - LanguageManual] Sort/Distribute/Cluster/Order By

    Syntax of Order By Syntax of Sort By Difference between Sort By and Order By Setting Types for Sort ...

  7. [Hive - LanguageManual] Import/Export

    LanguageManual ImportExport     Skip to end of metadata   Added by Carl Steinbach, last edited by Le ...

  8. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  9. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

随机推荐

  1. 一个java的DES加解密类转换成C#

    原文:一个java的DES加解密类转换成C# 一个java的des加密解密代码如下: //package com.visionsky.util; import java.security.*; //i ...

  2. 区分int a() 和 int a

    事因 #include <iostream> using namespace std; struct A { A(int) {} A() {} void fun() {}; }; int ...

  3. Java API —— 多线程

    1.多线程概述     1)进程:         正在运行的程序,是系统进行资源分配和调用的独立单位.         每一个进程都有它自己的内存空间和系统资源.     2)线程:         ...

  4. JBPM4 常用表结构

    JBPM4 常用表结构 第一部分:表结构说明 Jbpm4 共有18张表,如下,其中红色的表为经常使用的表   一:资源库与运行时表结构 1.  JBPM4_DEPLOYMENT 流程定义表 2.  J ...

  5. .net通过获取客户端IP地址反查出用户的计算机名

    这个功能看似很少用到,但又非常实用,看似简单,但又在其中存在很多未知因素造成让人悲痛莫名的负能量... 这是公司内部最近在使用的功能,因为是DHCP,所以有时会根据计算机名做一些统计和权限的设置. 也 ...

  6. Tomcat原理 分类: 原理 2015-06-28 19:26 5人阅读 评论(0) 收藏

    Tomcat的模块结构设计的相当好,而且其Web 容器的性能相当出色.JBoss直接就使用了Tomcat的web容器,WebLogic的早期版本也是使用了Tomcat的代码. Web容器的工作过程在下 ...

  7. Android studio在真机上进行调试

    1.在Android Studio中,把app的默认启动目标改为USB device,点击[app]→[app configuration],在[Target Device]选择[USB device ...

  8. JdbcTemplate查询数据 三种callback之间的区别

    JdbcTemplate针对数据查询提供了多个重载的模板方法,你可以根据需要选用不同的模板方法. 如果你的查询很简单,仅仅是传入相应SQL或者相关参数,然后取得一个单一的结果,那么你可以选择如下一组便 ...

  9. 基于XMPP的即时通信系统的建立(六)— 开发环境搭建

    服务器端 新建空工程 使用Eclipse新建名为openfire的空java工程. 导入源代码 这里使用的是openfire的openfire_src_3_10_3.zip源码. 导入后将目录src/ ...

  10. 关于ie6对齐

    先来没有任何对齐时的样子: 1.一种是在父级没有高度的情况下居中. 给每个独立的元素都加上vertical-align:middle; 针对文字可以不加,加与不加都可以居中对齐.但是无法做到绝对的居中 ...