UDFs

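With UDFs (user-defined functions) you can define your own data-processing methods and extend Pig's functionality. In practice, apart from having to be registered/defined before use, UDFs are no different from built-in functions.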

The basic EvalFunc

Take the built-in ABS function as an example:

public class ABS extends EvalFunc<Double>{
    /**
     * java level API
     * @param input expects a single numeric value
     * @return output returns a single numeric value, absolute value of the argument
     */
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        Double d;
        try{
            d = DataType.toDouble(input.get(0));
        } catch (NumberFormatException nfe){
            System.err.println("Failed to process input; error -" + nfe.getMessage());
            return null;
        } catch (Exception e){
            throw new IOException("Caught exception processing input row", e);
        }
        return Math.abs(d);
    }
    ……
    public Schema outputSchema(Schema input) ;
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException;
}

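Every eval function extends the abstract class EvalFunc; the type parameter Double is the return type.
exec method: takes a Tuple as its input parameter, representing one record (row).
outputSchema method: derives the output schema from the input schema.
getArgToFuncMapping: supports overloading for different data types.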
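The contract described above can be sketched without the Pig library. The following is a minimal illustration only: `SketchEvalFunc` and `SketchAbs` are hypothetical stand-ins (a `List<Object>` plays the role of Pig's Tuple), and the extra null check on the first field is an assumption of this sketch rather than something taken from the ABS source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Pig's EvalFunc<T>; NOT the real Pig class.
abstract class SketchEvalFunc<T> {
    // Called once per record; the list stands in for Pig's Tuple.
    public abstract T exec(List<Object> input);
}

// Mirrors the shape of ABS: the type parameter Double is the return type.
final class SketchAbs extends SketchEvalFunc<Double> {
    @Override
    public Double exec(List<Object> input) {
        // Empty or missing input yields null, as in the ABS excerpt.
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        try {
            return Math.abs(Double.parseDouble(input.get(0).toString()));
        } catch (NumberFormatException nfe) {
            // A field that is not numeric is reported and mapped to null
            // instead of failing the whole job.
            System.err.println("Failed to process input; error - " + nfe.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        SketchAbs abs = new SketchAbs();
        System.out.println(abs.exec(Arrays.<Object>asList(-3.5)));
        System.out.println(abs.exec(Arrays.<Object>asList("oops")));
        System.out.println(abs.exec(Collections.<Object>emptyList()));
    }
}
```

Returning null for a bad field, rather than throwing, lets one malformed record pass through as a null value while the rest of the data is still processed.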

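Aggregate functions

An EvalFunc can also implement an aggregate function. The group operator returns one record per group, and each record contains a Bag, so the exec method only has to iterate over the records in that Bag.

Take the COUNT function as an example: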

public Long exec(Tuple input) throws IOException {
    try {
        DataBag bag = (DataBag)input.get(0);
        if(bag==null)
            return null;
        Iterator it = bag.iterator();
        long cnt = 0;
        while (it.hasNext()){
            Tuple t = (Tuple)it.next();
            if (t != null && t.size() > 0 && t.get(0) != null )
                cnt++;
        }
        return cnt;
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing count in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}
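To make the "one record per group, each holding a Bag" idea concrete, here is a small plain-Java sketch. It does not use Pig's classes: `BagCountSketch`, `groupByX` and `count` are hypothetical names, a `List<List<Object>>` stands in for a DataBag of one-field tuples, and the counting rule (skip null or empty tuples and null first fields) is the one shown in the exec method above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BagCountSketch {
    // Group (x, y) records by x; each group's "bag" holds one-field tuples of y.
    static Map<String, List<List<Object>>> groupByX(List<String[]> records) {
        Map<String, List<List<Object>>> groups = new LinkedHashMap<>();
        for (String[] r : records) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>())
                  .add(Collections.<Object>singletonList(r[1]));
        }
        return groups;
    }

    // Same rule as the COUNT excerpt: a tuple counts only if its first field is non-null.
    static Long count(List<List<Object>> bag) {
        if (bag == null) {
            return null;
        }
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && !t.isEmpty() && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"a", null},
                new String[] {"b", "2"},
                new String[] {"a", "3"});
        // One output line per group: the key and the count over that group's bag.
        groupByX(records).forEach((key, bag) -> System.out.println(key + "\t" + count(bag)));
    }
}
```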

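The Algebraic and Accumulator interfaces

As mentioned earlier, aggregate functions that are algebraic can be optimized with a Combiner during Map-Reduce. Intuitively, the processing of an algebraic function can be split into three parts: initial (handles a portion of the input data), intermediate (handles the results of the initial phase) and final (handles the results of the intermediate phase).

For COUNT, for example, the initial phase is a counting operation, while the intermediate and final phases are summing operations. Going one step further, if a function can perform the same operation in all three phases, it is distributive; SUM is an example.

Pig provides the Algebraic interface: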

public interface Algebraic{
    /**
     * Get the initial function.
     * @return A function name of f_init. f_init should be an eval func.
     * The return type of f_init.exec() has to be Tuple
     */
    public String getInitial();

    /**
     * Get the intermediate function.
     * @return A function name of f_intermed. f_intermed should be an eval func.
     * The return type of f_intermed.exec() has to be Tuple
     */
    public String getIntermed();

    /**
     * Get the final function.
     * @return A function name of f_final. f_final should be an eval func parametrized by
     * the same datum as the eval func implementing this interface.
     */
    public String getFinal();
}

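Each of these methods returns the name of a class that implements EvalFunc.

Continuing with COUNT as the example: COUNT implements the Algebraic interface. For the following statements: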

input = load 'data' as (x, y);
grpd = group input by x;
cnt = foreach grpd generate group, COUNT(input);
store cnt into 'result';

Pig rewrites the MR execution plan as follows:

Map
    load, foreach(group, COUNT.Initial)
Combine
    foreach(group, COUNT.Intermediate)
Reduce
    foreach(group, COUNT.Final), store
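The three-phase split can be imitated in plain Java to see why the combiner does not change the result. This is only a sketch under stated assumptions: `AlgebraicCountSketch` and its method names are hypothetical, each inner list stands for the values one map task sees for a single key, and the combiner is modelled as merging adjacent pairs of partial counts. In Pig the phases are separate EvalFunc classes whose names are returned by getInitial, getIntermed and getFinal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlgebraicCountSketch {
    // Initial (map side): count the non-null values in one input split.
    static long initial(List<String> split) {
        long cnt = 0;
        for (String v : split) {
            if (v != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Intermediate (combiner): sum partial counts. It may run zero or more times,
    // so its output must have the same shape as its input.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final (reduce side): sum whatever partial counts are left.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    static long countInPhases(List<List<String>> splits) {
        List<Long> mapOutput = new ArrayList<>();
        for (List<String> split : splits) {
            mapOutput.add(initial(split));
        }
        // Model the combiner as merging adjacent pairs of partial counts.
        List<Long> combined = new ArrayList<>();
        for (int i = 0; i < mapOutput.size(); i += 2) {
            int end = Math.min(i + 2, mapOutput.size());
            combined.add(intermediate(mapOutput.subList(i, end)));
        }
        return finalCount(combined);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", null),
                Arrays.asList("d"));
        System.out.println(countInPhases(splits));
    }
}
```

For a distributive function such as SUM, all three methods would perform the same summing operation; COUNT differs only in that its initial phase counts instead of summing.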

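The Algebraic interface reduces the amount of data transferred by enabling the Combiner optimization, whereas the Accumulator interface is concerned with memory usage. When a UDF implements Accumulator, Pig guarantees that all data with the same key (brought together by the shuffle) is passed to the UDF incrementally, in batches (by default pig.accumulative.batchsize=20000). COUNT implements the Accumulator interface as well: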

/* Accumulator interface implementation */
    private long intermediateCount = 0L;

    @Override
    public void accumulate(Tuple b) throws IOException {
       try {
           DataBag bag = (DataBag)b.get(0);
           Iterator it = bag.iterator();
           while (it.hasNext()){
                Tuple t = (Tuple)it.next();
                if (t != null && t.size() > 0 && t.get(0) != null) {
                    intermediateCount += 1;
                }
            }
       } catch (ExecException ee) {
           throw ee;
       } catch (Exception e) {
           int errCode = 2106;
           String msg = "Error while computing count in " + this.getClass().getSimpleName();
           throw new ExecException(msg, errCode, PigException.BUG, e);
       }
    }

    @Override
    public void cleanup() {
       intermediateCount = 0L;
    }

    /*
     * Called after all data for the current key has been processed.
     */
    @Override
    public Long getValue() {
       return intermediateCount;
    }
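The calling sequence implied by these three methods (accumulate once per batch, then getValue, then cleanup before the next key) can be sketched in plain Java. Everything here is a hypothetical stand-in: `SketchAccumulator` mimics the shape of the interface, a `List<String>` replaces the Tuple-wrapped bag, and `runForKey` is an assumed driver loop, not Pig's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class AccumulatorSketch {
    // Hypothetical stand-in for Pig's Accumulator<T>; same three operations as the excerpt.
    interface SketchAccumulator<T> {
        void accumulate(List<String> batch);
        T getValue();
        void cleanup();
    }

    static final class CountAccumulator implements SketchAccumulator<Long> {
        private long intermediateCount = 0L;

        @Override
        public void accumulate(List<String> batch) {
            for (String v : batch) {
                if (v != null) {
                    intermediateCount += 1;
                }
            }
        }

        @Override
        public Long getValue() {
            return intermediateCount;
        }

        @Override
        public void cleanup() {
            intermediateCount = 0L;
        }
    }

    // Feed one key's values in batches of at most batchSize, so only one batch
    // has to be held at a time; then read the result and reset for the next key.
    static <T> T runForKey(SketchAccumulator<T> acc, List<String> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int i = 0; i < values.size(); i += batchSize) {
            int end = Math.min(i + batchSize, values.size());
            acc.accumulate(values.subList(i, end));
        }
        T result = acc.getValue();
        acc.cleanup();
        return result;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            values.add("v" + i);
        }
        System.out.println(runForKey(new CountAccumulator(), values, 3));
    }
}
```

Because the state lives in a field, forgetting to reset it in cleanup would leak the previous key's count into the next key; the sketch reuses one accumulator instance across keys to make that dependency visible.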

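Passing data between the front end and the back end

Passing data through the UDF's constructor is the simplest approach: the constructor arguments are given when the UDF instance is declared with a define statement. In some cases, however, for example when the data is only produced at run time or cannot be expressed as a String, you have to use UDFContext.

A UDF calls getUDFContext to obtain the UDFContext instance, which is stored in a ThreadLocal.

UDFContext holds the following information:

jconf: the Hadoop Configuration.
clientSysProps: system properties.
HashMap<UDFContextKey,Properties> udfConfs: properties saved by the user, where each UDFContextKey is generated from the UDF class name.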
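The structure just described (a per-thread context holding one Properties object per UDF class) can be sketched in plain Java. `SketchUdfContext` and its methods are hypothetical names that only mimic the shape of UDFContext. The sketch keys the map by class name alone and stays inside a single JVM, so it shows the lookup pattern but not how Pig ships the properties from the front end to the back-end tasks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in that mimics the shape described above; NOT Pig's UDFContext.
final class SketchUdfContext {
    // One context instance per thread, created lazily on first access.
    private static final ThreadLocal<SketchUdfContext> CONTEXT =
            ThreadLocal.withInitial(SketchUdfContext::new);

    // User-saved properties, keyed by UDF class name (assumed key format).
    private final Map<String, Properties> udfConfs = new HashMap<>();

    static SketchUdfContext get() {
        return CONTEXT.get();
    }

    // Each UDF class gets its own Properties object, so UDFs cannot clash on keys.
    Properties getUdfProperties(Class<?> udfClass) {
        return udfConfs.computeIfAbsent(udfClass.getName(), k -> new Properties());
    }

    public static void main(String[] args) {
        // "Front end": store a value computed at run time.
        SketchUdfContext.get().getUdfProperties(String.class).setProperty("threshold", "42");
        // "Back end": the same class name retrieves the same Properties object.
        System.out.println(
                SketchUdfContext.get().getUdfProperties(String.class).getProperty("threshold"));
    }
}
```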

UDF execution flow

[Figure: UDF execution flow]

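Extensibility of the Pig architecture

The third tenet of the Pig philosophy: Pigs Live Anywhere.

In theory, Pig is not restricted to running on the Hadoop framework. There are several implementations and proposals worth consulting:

Pigeon (Pig on Tez): https://github.com/achalsoni81/pigeon
Pig's backend abstraction layer: https://wiki.apache.org/pig/PigAbstractionLayer

So far, running Pig Latin on Galago has been implemented.

http://www.galagosearch.org/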

References

Pig website: http://pig.apache.org/

Pig paper (VLDB 2009): Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Alan Gates, Programming Pig: Dataflow Scripting with Hadoop (September 2011)
