Custom Map/Reduce in Hive: A Python Example

Hive supports custom map and reduce scripts. This article walks through a simple word-count example written in Python (a Java version is covered in a separate article).

Development environment:

python:2.7.5

hive:2.3.0

hadoop:2.8.1

1. The map and reduce scripts

Map script (mapper.py)

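#!/usr/bin/python
import re
import sys

# compile once, outside the loop: split on runs of non-word characters
pattern = re.compile(r'\W+')

while True:
    line = sys.stdin.readline()
    # readline() returns '' only at EOF; a blank line is '\n'
    if not line:
        break
    for word in pattern.split(line.strip()):
        # skip empty tokens produced by leading or trailing separators
        if word:
            # write the tuple (word, 1) to stdout, one per occurrence
            print('%s\t%s' % (word, 1))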
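To see what splitting on `\W+` actually produces, the tokenizing step can be tried on its own. The sketch below is only an illustration: `tokenize` is a hypothetical helper name that does not appear in mapper.py, and the sample sentence is made up. It is written to run under both Python 2.7 and Python 3.

```python
import re

# Same pattern as mapper.py: split on runs of non-word characters.
PATTERN = re.compile(r'\W+')


def tokenize(line):
    """Return the non-empty words of a line (hypothetical helper)."""
    return [word for word in PATTERN.split(line.strip()) if word]


# A trailing separator such as '!' leaves an empty string at the end
# of the raw split result, which is why empty tokens are filtered out.
raw = PATTERN.split('Hello world, bye world!')
words = tokenize('Hello world, bye world!')
print(raw)
print(words)
```

Without the filter, the empty token would be emitted as a word of its own and would end up as a row with an empty key in the output table.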

Reduce script (reducer.py)

#!/usr/bin/python
import sys

# maps words to their counts
word2count = {}

while True:
    line = sys.stdin.readline()
    # readline() returns '' only at EOF; a blank line is '\n'
    if not line:
        break
    # parse the input we got from mapper.py: "word<TAB>count"
    try:
        word, count = line.rstrip('\n').split('\t', 1)
    except ValueError:
        # no tab in the line, skip it
        continue
    # keep only the digits of count (currently a string), then convert to int
    try:
        count = int(''.join(ch for ch in count if ch.isdigit()))
    except ValueError:
        # count held no digits, skip the line
        continue
    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))

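A note on reading input: `for line in sys.stdin` iterates line by line, not byte by byte. In Python 2 it uses a read-ahead buffer, which only matters when each line must be processed as soon as it arrives. An explicit `sys.stdin.readline()` loop works as well, provided that end of input is detected before stripping: `readline()` returns an empty string only at EOF, so calling `strip()` first makes a blank input line look like EOF and ends the loop early. Also, when splitting the mapper output into word and count, remove any non-digit characters from the count part so that the conversion to int does not fail.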
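Both points can be checked without Hive. The following sketch uses `io.StringIO` as a stand-in for `sys.stdin`; `read_lines` and `parse_count` are hypothetical helper names introduced here for illustration and are not part of the scripts above.

```python
import io


def read_lines(stream):
    """Yield lines until EOF; readline() returns '' only at EOF."""
    while True:
        line = stream.readline()
        if not line:  # EOF; a blank line would be '\n', which is truthy
            break
        yield line.rstrip('\n')


def parse_count(text):
    """Drop non-digit characters, then convert to int.

    Raises ValueError when no digits remain.
    """
    return int(''.join(ch for ch in text if ch.isdigit()))


# io.StringIO stands in for sys.stdin; note the blank line in the middle.
fake_stdin = io.StringIO(u'first\n\nthird\n')
lines = list(read_lines(fake_stdin))
print(lines)

# Plain iteration over the stream also yields whole lines, not bytes.
iterated = [line.rstrip('\n') for line in io.StringIO(u'first\n\nthird\n')]
print(iterated)

# Stray whitespace or a carriage return around the number is discarded.
print(parse_count(u' 1\r'))
```

Filtering on digits also discards a minus sign, which is acceptable here only because word counts are never negative.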

2. Writing the Hive HQL

drop table if exists raw_lines;

-- create table raw_lines over the text files in '/user/inputs' (a path on HDFS); each row is one line of input
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as textfile
location '/user/inputs';

drop table if exists word_count;

-- create table word_count, the output table, stored as tab-separated text in '/user/outputs/' (a path on HDFS)
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION '/user/outputs/';

-- add the mapper and reducer scripts as resources; change /home/yanggy to your own local path
add file /home/yanggy/mapper.py;
add file /home/yanggy/reducer.py;

from (
        from raw_lines
        map raw_lines.line
        -- call the mapper here
        using 'mapper.py'
        as word, count
        -- send all rows for a word to the same reducer, sorted by word
        cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
-- call the reducer here
using 'reducer.py'
as word, count;
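Before running the query on a cluster, the same flow can be dry-run locally. `cluster by word` is shorthand for `distribute by word sort by word`, so each reducer receives all rows for a given word consecutively. The sketch below imitates that with `sorted()` and sums each run of equal words with `itertools.groupby` instead of a dictionary. This is a different reducer from reducer.py, shown only as an alternative that relies on the sorted input; `map_lines`, `reduce_pairs`, and the sample lines are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter

PATTERN = re.compile(r'\W+')


def map_lines(lines):
    """Emit (word, 1) for every non-empty token, as mapper.py does."""
    for line in lines:
        for word in PATTERN.split(line.strip()):
            if word:
                yield (word, 1)


def reduce_pairs(sorted_pairs):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


sample = ['hello world', '', 'hello hive']
# sorted() stands in for the shuffle and sort that `cluster by word` requests.
shuffled = sorted(map_lines(sample), key=itemgetter(0))
result = list(reduce_pairs(shuffled))
for word, count in result:
    print('%s\t%s' % (word, count))
```

Because `groupby` only merges adjacent equal keys, this variant holds one word in memory at a time, but it gives wrong totals if the input is not sorted by word.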
