Note: the map output has already been shuffled and sorted by key before it reaches the reducer.

mapper.py

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
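
As a quick sanity check (the sample line is our own, not from the original post), feeding a single input line foo foo quux to mapper.py produces one tab-separated pair per word occurrence:

foo	1
foo	1
quux	1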

reducer.py

#!/usr/bin/env python
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (the extra None check guards against empty input, where no word was ever read)
if current_word == word and current_word is not None:
    print('%s\t%s' % (current_word, current_count))
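
Before running on a cluster, the pair can be sanity-checked locally, because the shuffle step between map and reduce is, for word counting, just a sort by key (on a shell this is the classic cat input.txt | python mapper.py | sort | python reducer.py pipeline). Below is a minimal in-process sketch of the same map -> shuffle/sort -> reduce flow; the file name and sample data are our own:

simulate_pipeline.py

#!/usr/bin/env python
"""A minimal local simulation of map -> shuffle/sort -> reduce."""

from itertools import groupby
from operator import itemgetter

lines = ["hello world", "hello hadoop"]  # stand-in for STDIN

# map phase: emit (word, 1) for every word, as mapper.py does
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle phase: Hadoop sorts map output by key before the reduce
# phase; a plain in-memory sort plays that role here
pairs.sort(key=itemgetter(0))

# reduce phase: sum the counts of consecutive identical keys
for word, group in groupby(pairs, key=itemgetter(0)):
    print('%s\t%d' % (word, sum(count for _, count in group)))

# prints:
# hadoop    1
# hello     2
# world     1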

Improved Mapper and Reducer code: using Python iterators and generators

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
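
A note on the design: read_input is a generator, so the mapper consumes STDIN lazily, one line at a time, instead of materializing the whole input in memory. A minimal sketch of the same pattern (the io.StringIO stand-in for STDIN is our own):

import io

def read_input(file):
    for line in file:
        yield line.split()

# the generator pulls one line at a time from the file object
stream = io.StringIO(u"hello world\nhello hadoop\n")
for words in read_input(stream):
    print(words)  # ['hello', 'world'], then ['hello', 'hadoop']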

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for _, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
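
The key to reading this reducer is that itertools.groupby only groups consecutive equal keys. That is exactly why the shuffle/sort Hadoop performs between map and reduce matters: on unsorted input, groupby would emit one group per run of a key instead of one group per key. A minimal demonstration (the sample data is our own):

from itertools import groupby
from operator import itemgetter

data = [('b', 1), ('a', 1), ('a', 1), ('b', 1)]

# unsorted input: groupby emits a separate group for each run of a key
print([(k, len(list(g))) for k, g in groupby(data, itemgetter(0))])
# -> [('b', 1), ('a', 2), ('b', 1)]

# sorted input (the role of Hadoop's shuffle/sort): one group per key
print([(k, len(list(g))) for k, g in groupby(sorted(data), itemgetter(0))])
# -> [('a', 2), ('b', 2)]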
