Reading an HDFS file from a custom Pig UDF

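Over the past few days I have been working out how to convert the IP addresses in our logs into a specific province and city.

The plan was to write a Pig UDF for this.

The IP database is the CZ88 (纯真) IP database file qqwry.dat, which can be downloaded from http://www.cz88.net/.

The key difficulty is how to read this file from inside the UDF. The UDF runs inside tasks on the cluster's worker nodes, so a file stored on HDFS is not available there as a local file unless it is shipped to each node. This cost me two days, so I am recording the code here for anyone who runs into the same problem.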

Pig script:

register /usr/local/pig/mypigudf.jar;
define ip2address my.pig.func.IP2Address('/user/anny/qqwry.dat');

a = load '/user/anny/hdfs/logtestdata/ipdata.log' as (ip:chararray);
b = foreach a generate ip,ip2address(ip) as cc:map[chararray];
c = foreach b generate ip,cc#'province' as province,cc#'city' as city,cc#'region' as region;
dump c;
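The script expects the UDF to return a map with the keys province, city and region. The helper that builds that map (IPUtil.splitCountry) is not shown in this post, so the following is only a hypothetical stand-in, assuming the database returns a location string in which the province ends with 省 and the city ends with 市. The class name, the method body and that string format are all assumptions, and the region key is left out because the post does not show how it is derived.

```java
import java.util.HashMap;
import java.util.Map;

public class SplitCountrySketch {
    // Suffix characters, written as Unicode escapes so the file compiles
    // regardless of the source encoding.
    private static final char PROVINCE_SUFFIX = '\u7701';
    private static final char CITY_SUFFIX = '\u5e02';

    // Hypothetical stand-in for IPUtil.splitCountry: splits a location
    // string into "province" and "city" entries. Keys that cannot be
    // determined are simply left out of the map.
    static Map<String, Object> splitCountry(String country) {
        Map<String, Object> result = new HashMap<String, Object>();
        if (country == null) {
            return result;
        }
        String rest = country;
        int p = rest.indexOf(PROVINCE_SUFFIX);
        if (p >= 0) {
            result.put("province", rest.substring(0, p + 1));
            rest = rest.substring(p + 1);
        }
        int c = rest.indexOf(CITY_SUFFIX);
        if (c >= 0) {
            result.put("city", rest.substring(0, c + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // A province followed by a city, spelled with Unicode escapes.
        String sample = "\u5e7f\u4e1c\u7701\u6df1\u5733\u5e02";
        System.out.println(splitCountry(sample));
    }
}
```

In Pig, looking up a key that is absent from the map (for example cc#'region' with this sketch) yields null rather than an error.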

The Pig UDF, written in Java:

package my.pig.func;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import my.pig.func.IPConvertCity.IPSeeker;
import my.pig.func.IPConvertCity.IPUtil;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class IP2Address extends EvalFunc<Map<String, Object>> {

    // HDFS path of qqwry.dat, passed in from the define statement.
    private String lookupFile = "";

    // Handle on the local cached copy; opened on the first call to exec().
    private RandomAccessFile objFile = null;

    public IP2Address(String file) {
        this.lookupFile = file;
    }

    @Override
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;

        Map<String, Object> output = new HashMap<String, Object>();
        String str = (String) input.get(0);
        try {
            if (str.length() == 0)
                return output;

            if (objFile == null) {
                try {
                    // The distributed cache links the file into the task's
                    // working directory under the name given after '#'
                    // in getCacheFiles().
                    objFile = new RandomAccessFile("./qqwry.dat", "r");
                } catch (FileNotFoundException e1) {
                    System.out.println("IP address database file not found: " + lookupFile);
                    return null;
                }
            }

            IPSeeker seeker = new IPSeeker(objFile);
            String country = seeker.getCountry(str);
            output = IPUtil.splitCountry(country);
            return output;
        } catch (Exception e) {
            // A failed lookup yields whatever has been collected so far
            // (normally an empty map) instead of failing the whole job.
            return output;
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.MAP));
    }

    // Asks Pig to ship the HDFS file to every task through the distributed
    // cache. Entry format: <HDFS path>#<local link name>.
    @Override
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add(lookupFile + "#qqwry.dat");
        return list;
    }
}
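The UDF opens the database file only on the first call to exec() and reuses the handle for every later record. That pattern can be tried outside Pig with a small standalone sketch. The class LazyFileSketch and its methods are made up for illustration, and a temporary file stands in for qqwry.dat.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class LazyFileSketch {
    private final String localName;
    private RandomAccessFile file; // stays null until the first read
    private int opens = 0;         // counts how often the file was opened

    LazyFileSketch(String localName) {
        this.localName = localName;
    }

    // Returns the byte at the given offset, opening the file on first use only.
    int byteAt(long offset) throws IOException {
        if (file == null) {
            file = new RandomAccessFile(localName, "r");
            opens++;
        }
        file.seek(offset);
        return file.read();
    }

    int timesOpened() {
        return opens;
    }

    void close() throws IOException {
        if (file != null) {
            file.close();
            file = null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary three-byte file standing in for the real database.
        File tmp = File.createTempFile("lookup", ".dat");
        tmp.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(new byte[] {10, 20, 30});
        }
        LazyFileSketch reader = new LazyFileSketch(tmp.getPath());
        System.out.println(reader.byteAt(2));      // prints 30
        System.out.println(reader.byteAt(0));      // prints 10
        System.out.println(reader.timesOpened());  // prints 1
        reader.close();
    }
}
```

Opening the file once per task rather than once per record matters here because exec() is called for every input row.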

Search for "Distributed Cache" in this page of the Pig docs: http://pig.apache.org/docs/r0.11.0/udf.html

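The example there uses the getCacheFiles() method, which places the file in the distributed cache so that it is accessible on every node in the cluster. The part after '#' in each entry is the name under which the file appears in the task's working directory, which is why exec() opens ./qqwry.dat.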
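To make the entry format concrete, here is a minimal sketch that splits a cache entry such as the one built in getCacheFiles() into its HDFS path and its local link name. It uses only java.net.URI from the JDK; the class CacheEntrySketch and its method names are made up for illustration and are not part of Pig.

```java
import java.net.URI;

public class CacheEntrySketch {
    // The part before '#': where the file lives on HDFS.
    static String hdfsPath(String entry) {
        return URI.create(entry).getPath();
    }

    // The part after '#': the name the task sees in its working directory.
    // Returns null when the entry carries no '#' suffix.
    static String linkName(String entry) {
        return URI.create(entry).getFragment();
    }

    public static void main(String[] args) {
        String entry = "/user/anny/qqwry.dat" + "#qqwry.dat";
        System.out.println(hdfsPath(entry));  // prints /user/anny/qqwry.dat
        System.out.println(linkName(entry));  // prints qqwry.dat
    }
}
```

The link name is what the UDF must pass to RandomAccessFile, so the two strings have to be kept in sync by hand.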

References: http://stackoverflow.com/questions/17514022/access-hdfs-file-from-udf

http://stackoverflow.com/questions/19149839/pig-udf-maxmind-geoip-database-data-file-loading-issue
