Learning Hadoop Streaming MapReduce jobs: the combiner

The code has been copied to my work computer, under:

/Users/baidu/Documents/Data/Work/Code/Self/hadoop_mr_streaming_jobs

The entry point is the main control script, main.sh, which calls extract.py.

Looking at it again, the code is not very well written. It does include a combiner; for background on combiners, see:

https://blog.csdn.net/u010700335/article/details/72649186
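
To make the idea concrete, here is a minimal sketch of a word-count combiner. It assumes the mapper emits tab-separated "word<TAB>count" lines; the function name combine is hypothetical and is not taken from extract.py. A combiner runs on the map side and sums the counts it sees locally, so fewer records have to be shuffled to the reducers. This only works because addition is associative and commutative: the reducer gets the same totals whether it sums raw 1s or partial sums.

```python
def combine(lines):
    """Sum the counts per word for one mapper's output (hypothetical helper)."""
    totals = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 2:
            continue  # skip malformed records
        word, count = parts
        try:
            value = int(count)
        except ValueError:
            continue  # count was not a number
        totals[word] = totals.get(word, 0) + value
    # emit in the same "word<TAB>count" format the reducer expects
    return ['%s\t%d' % (word, total) for word, total in sorted(totals.items())]


sample = ['a\t1\n', 'b\t1\n', 'a\t1\n']
print(combine(sample))
```

In a real streaming job the same function would be fed sys.stdin and its results printed line by line.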

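Hadoop Streaming scripts are built on pipes: each script reads its input from standard input and writes its output to standard output.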

(5) Python scripts

import sys

for line in sys.stdin:
    ...  # process each input line here
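
Because the stages only talk through pipes, the whole job can be rehearsed locally without a cluster. The sketch below chains a map step, a sort, and a reduce step in one process, roughly what running mapper | sort | reducer in a shell would do. The function names mapper, reducer, and run_local are hypothetical, and the sorted() call only stands in for Hadoop's shuffle and sort phase.

```python
def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word


def reducer(records):
    """Sum the counts per word from "word<TAB>count" records."""
    counts = {}
    for record in records:
        word, count = record.split('\t')
        counts[word] = counts.get(word, 0) + int(count)
    return sorted(counts.items())


def run_local(lines):
    """Chain the stages the way `mapper | sort | reducer` would in a shell."""
    shuffled = sorted(mapper(lines))  # stands in for Hadoop's shuffle and sort
    return reducer(shuffled)


print(run_local(['foo bar foo', 'bar baz']))
# prints [('bar', 2), ('baz', 1), ('foo', 2)]
```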

#!/usr/bin/env python
# mapper.py

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words; split() with no arguments
    # never returns empty strings
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

#---------------------------------------------------------------------------------------------------------

#!/usr/bin/env python
# reducer.py

import sys
from operator import itemgetter

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        continue

    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
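
The reducer above keeps every word in a dictionary until the input ends. Hadoop Streaming hands the reducer its input already sorted by key, so an alternative is to sum each run of identical keys and emit the total as soon as the key changes. The following is a sketch of that approach, not the original author's code: the name reduce_sorted is hypothetical, and it assumes every well-formed record is "word<TAB>integer".

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield "word<TAB>total" for each run of equal keys in sorted input."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    # keep only records that have both a key and a count
    valid = (pair for pair in pairs if len(pair) == 2)
    for word, group in groupby(valid, key=lambda pair: pair[0]):
        # group is consumed lazily, so only one key's running total is held
        total = sum(int(count) for _, count in group)
        yield '%s\t%d' % (word, total)


for record in reduce_sorted(['a\t1\n', 'a\t2\n', 'b\t1\n']):
    print(record)
```

Memory use stays constant regardless of how many distinct words there are, at the cost of depending on sorted input: if the same word appeared in two separate runs, it would be emitted twice.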
