Reading an HDFS file from a custom Pig UDF

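Over the past few days I have been working out how to convert the IP addresses in our logs into a specific province and city.

The plan was to write a Pig UDF for this.

The IP database is the CZ88 (纯真) IP database file qqwry.dat, which can be downloaded from http://www.cz88.net/.

The key point is how to read this file from inside the UDF. It cost me two days, so I am recording the code here for anyone who runs into the same problem.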

Pig script:

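-- Register the jar that contains the UDF.
register /usr/local/pig/mypigudf.jar;
-- Pass the HDFS path of the IP database to the UDF constructor.
define ip2address my.pig.func.IP2Address('/user/anny/qqwry.dat');
a = load '/user/anny/hdfs/logtestdata/ipdata.log' as (ip:chararray);
-- The UDF returns a map; pull the individual fields out of it by key.
b = foreach a generate ip,ip2address(ip) as cc:map[chararray];
c = foreach b generate ip,cc#'province' as province,cc#'city' as city,cc#'region' as region;
dump c;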
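The script expects the UDF to return a map with the keys province, city and region. The helper that builds this map, IPUtil.splitCountry, is not shown in this post, so the following is only a hypothetical stand-in that illustrates the shape of the result. It assumes the location string uses the Chinese suffix characters 省 (province) and 市 (city), as in 广东省深圳市; the class name SplitLocationSketch and the splitting rules are assumptions made for illustration, not the real helper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for IPUtil.splitCountry; the real helper is not shown in the post.
final class SplitLocationSketch {

    private static final char PROVINCE_SUFFIX = '\u7701'; // U+7701, "sheng" (province)
    private static final char CITY_SUFFIX = '\u5e02';     // U+5E02, "shi" (city)

    /** Splits a location string into province, city and region entries. */
    static Map<String, Object> splitCountry(String location) {
        Map<String, Object> out = new HashMap<String, Object>();
        if (location == null || location.isEmpty()) {
            return out;
        }
        String rest = location;
        int p = rest.indexOf(PROVINCE_SUFFIX);
        if (p >= 0) {
            // Keep the suffix character as part of the province name.
            out.put("province", rest.substring(0, p + 1));
            rest = rest.substring(p + 1);
        }
        int c = rest.indexOf(CITY_SUFFIX);
        if (c >= 0) {
            out.put("city", rest.substring(0, c + 1));
            rest = rest.substring(c + 1);
        }
        if (!rest.isEmpty()) {
            // Whatever follows the city is treated as the region.
            out.put("region", rest);
        }
        return out;
    }
}
```

Keys that cannot be extracted are simply absent from the map, in which case cc#'city' and the like evaluate to null in the Pig script.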

The Pig UDF, written in Java:

package my.pig.func;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import my.pig.func.IPConvertCity.IPSeeker;
import my.pig.func.IPConvertCity.IPUtil;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Maps an IP address string to a map of location fields (province, city, region).
public class IP2Address extends EvalFunc<Map<String, Object>> {

    // Name of the symlink that the distributed cache creates in the task's working directory.
    private static final String LINK_NAME = "qqwry.dat";

    // HDFS path of qqwry.dat, passed in from the define statement.
    private String lookupFile = "";

    // Handle on the local cached copy; opened on the first call to exec().
    private RandomAccessFile objFile = null;

    public IP2Address(String file) {
        this.lookupFile = file;
    }

    @Override
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        Map<String, Object> output = new HashMap<String, Object>();
        String str = (String) input.get(0);
        try {
            if (str.length() == 0)
                return output;
            if (objFile == null) {
                try {
                    // Open the local symlink, not the HDFS path:
                    // RandomAccessFile can only read from the local file system.
                    objFile = new RandomAccessFile("./" + LINK_NAME, "r");
                } catch (FileNotFoundException e1) {
                    System.out.println("IP database file not found: " + lookupFile);
                    return null;
                }
            }
            IPSeeker seeker = new IPSeeker(objFile);
            String country = seeker.getCountry(str);
            output = IPUtil.splitCountry(country);
            return output;
        } catch (Exception e) {
            // A failed lookup returns whatever has been built so far instead of failing the job.
            return output;
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.MAP));
    }

    // Ships the HDFS file to every task through the distributed cache.
    // The part after '#' is the symlink name created in the task's working directory.
    @Override
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add(lookupFile + "#" + LINK_NAME);
        return list;
    }
}
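The UDF hands the open RandomAccessFile to IPSeeker, whose source is not included here. As a rough illustration of why random access is needed, the sketch below decodes the file header under the commonly described qqwry.dat layout: the first 8 bytes hold two little-endian 32-bit offsets (first and last index record), and each index record is 7 bytes long. The class QqwryHeaderSketch and its method names are hypothetical, and the layout is an assumption that should be checked against the IPSeeker implementation actually in use.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper; assumes the commonly described qqwry.dat header layout.
final class QqwryHeaderSketch {

    // Assumed index record size: 4-byte start IP + 3-byte record offset.
    static final int INDEX_RECORD_LENGTH = 7;

    final long firstIndexOffset;
    final long lastIndexOffset;

    QqwryHeaderSketch(long firstIndexOffset, long lastIndexOffset) {
        this.firstIndexOffset = firstIndexOffset;
        this.lastIndexOffset = lastIndexOffset;
    }

    /** Decodes an unsigned 32-bit little-endian integer starting at buf[off]. */
    static long readUInt32LE(byte[] buf, int off) {
        return (buf[off] & 0xFFL)
                | ((buf[off + 1] & 0xFFL) << 8)
                | ((buf[off + 2] & 0xFFL) << 16)
                | ((buf[off + 3] & 0xFFL) << 24);
    }

    /** Parses the 8-byte header: first index offset, then last index offset. */
    static QqwryHeaderSketch parse(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("header must be at least 8 bytes");
        }
        return new QqwryHeaderSketch(readUInt32LE(header, 0), readUInt32LE(header, 4));
    }

    /** Reads the header from the start of an already opened file. */
    static QqwryHeaderSketch read(RandomAccessFile file) throws IOException {
        byte[] header = new byte[8];
        file.seek(0);
        file.readFully(header);
        return parse(header);
    }

    /** Number of index records, counting both the first and the last one. */
    long recordCount() {
        return (lastIndexOffset - firstIndexOffset) / INDEX_RECORD_LENGTH + 1;
    }
}
```

A lookup would then binary-search the index records between the two offsets, which is why the file has to be seekable and a plain input stream is not enough.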

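For background, search for "Distributed Cache" on this page of the Pig documentation: http://pig.apache.org/docs/r0.11.0/udf.html

The example there overrides getCacheFiles(), which makes the file available on every node in the cluster. Each entry has the form <HDFS path>#<link name>: the file is copied to the task nodes and appears in the task's working directory under the link name, which is why the UDF can open it as a local file.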
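To make the path#link convention concrete, here is a minimal sketch that splits such an entry and opens the linked file from a working directory. CacheSpecSketch and its methods are hypothetical names introduced for illustration; they are not part of Pig or Hadoop, and the sketch assumes the entry always carries an explicit '#' part.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.RandomAccessFile;

// Hypothetical helper illustrating the "<HDFS path>#<link name>" convention.
final class CacheSpecSketch {

    /** Returns the part before '#', i.e. the path of the file on HDFS. */
    static String hdfsPath(String cacheSpec) {
        return cacheSpec.substring(0, separatorIndex(cacheSpec));
    }

    /** Returns the part after '#', i.e. the local symlink name. */
    static String linkName(String cacheSpec) {
        return cacheSpec.substring(separatorIndex(cacheSpec) + 1);
    }

    private static int separatorIndex(String cacheSpec) {
        int hash = cacheSpec.lastIndexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("missing '#<link name>' in " + cacheSpec);
        }
        return hash;
    }

    /** Opens the cached copy read-only, resolving the link name against workDir. */
    static RandomAccessFile openLocal(File workDir, String cacheSpec)
            throws FileNotFoundException {
        File local = new File(workDir, linkName(cacheSpec));
        return new RandomAccessFile(local, "r");
    }
}
```

With the entry /user/anny/qqwry.dat#qqwry.dat, linkName yields qqwry.dat, which matches the ./qqwry.dat path that the UDF opens.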

References: http://stackoverflow.com/questions/17514022/access-hdfs-file-from-udf

http://stackoverflow.com/questions/19149839/pig-udf-maxmind-geoip-database-data-file-loading-issue
