Python & MapReduce

使用Python实现Hadoop MapReduce程序

原文请参考：

http://blog.csdn.net/zhaoyl03/article/details/8657031/

下面只是将mapper.py和reducer.py在windows上运行了一遍，没有用Hadoop的环境去测试。

环境准备：

Window 7 – 32
安装GunWin32，使得Linux命令可以在cmd上执行
安装IDLE (Python GUI)，使得Python脚本可以执行
将Python的安装路径添加到windows的环境变量中，使得在cmd窗口中切换到Python脚本所在目录时，通过输入脚本名，可以直接执行Python脚本

我的Python安装在： C:\Python27\python.exe下

测试脚本放在： E:\PythonTest下

windows环境变量中增加：C:\Python27

mapper.py :

#!/usr/bin/env python  

import sys  

# input comes from STDIN (standard input)

for line in sys.stdin:

    # remove leading and trailing whitespace

    line = line.strip()

    # split the line into words

    words = line.split()

    # increase counters

    for word in words:

        # write the results to STDOUT (standard output);

        # what we output here will be the input for the

        # Reduce step, i.e. the input for reducer.py

        #

        # tab-delimited; the trivial word count is 1

        print '%s\t%s' % (word, 1)

reducer.py :

#!/usr/bin/env python  

from operator import itemgetter

import sys  

current_word = None

current_count = 0

word = None  

# input comes from STDIN

for line in sys.stdin:

    # remove leading and trailing whitespace

    line = line.strip()  

    # parse the input we got from mapper.py

    word, count = line.split('\t', 1)  

    # convert count (currently a string) to int

    try:

        count = int(count)

    except ValueError:

        # count was not a number, so silently

        # ignore/discard this line

        continue  

    # this IF-switch only works because Hadoop sorts map output

    # by key (here: word) before it is passed to the reducer

    if current_word == word:

        current_count += count

    else:

        if current_word:

            # write result to STDOUT

            print '%s\t%s' % (current_word, current_count)

        current_count = count

        current_word = word  

# do not forget to output the last word if needed!

if current_word == word:

    print '%s\t%s' % (current_word, current_count)

输出结果：

Python & MapReduce的更多相关文章

Python+MapReduce实现矩阵相乘
算法原理 map阶段在map阶段,需要做的是进行数据准备.把来自矩阵A的元素aij,标识成p条<key, value>的形式,key="i,k",(其中k=1,2,. ...
Writing an Hadoop MapReduce Program in Python
In this tutorial I will describe how to write a simpleMapReduce program for Hadoop in thePython prog ...
使用Python实现Hadoop MapReduce程序
转自:使用Python实现Hadoop MapReduce程序英文原文:Writing an Hadoop MapReduce Program in Python 根据上面两篇文章,下面是我在自己的 ...
用python写MapReduce函数——以WordCount为例
尽管Hadoop框架是用java写的,但是Hadoop程序不限于java,可以用python.C++.ruby等.本例子中直接用python写一个MapReduce实例,而不是用Jython把pyth ...
用Python语言写Hadoop MapReduce程序Writing an Hadoop MapReduce Program in Python
In this tutorial I will describe how to write a simple MapReduce program for Hadoop in the Python pr ...
Hadoop：使用Mrjob框架编写MapReduce
Mrjob简介 Mrjob是一个编写MapReduce任务的开源Python框架,它实际上对Hadoop Streaming的命令行进行了封装,因此接粗不到Hadoop的数据流命令行,使我们可以更轻松 ...
MapReduce原理及其主要实现平台分析
原文:http://www.infotech.ac.cn/article/2012/1003-3513-28-2-60.html MapReduce原理及其主要实现平台分析亢丽芸, 王效岳, 白如江 ...
MapReduce实现PageRank算法（邻接矩阵法）
前言之前写过稀疏图的实现方法,这次写用矩阵存储数据的算法实现,只要会矩阵相乘的话,实现这个就很简单了.如果有不懂的可以先看一下下面两篇随笔. MapReduce实现PageRank算法(稀疏图法) ...
资源list：Github上关于大数据的开源项目、论文等合集
Awesome Big Data A curated list of awesome big data frameworks, resources and other awesomeness. Ins ...

随机推荐

Viewpager图片自动轮播，网络图片加载，图片自动刷新
package com.teffy.viewpager; import java.util.ArrayList; import java.util.concurrent.Executors; impo ...
node.js BootStrap安装
最近想用Bootstrap开发项目,以便使用其丰富的资源: 捯饬了一下nodejs的安装和配置:windows下弄起来还是比较狗屎的,两三天下班时间才弄好: http://xiaoyaojones.b ...
SQL server 中的dbo、guest
dbo database owner 数据库的创建者,创建该对象的用户. guest 顾客能够访问数据库中对象的数据, 要求dbo分配权限给guest, 一般给他查看的权限select 数据库所有者 ...
javascript体系 DOM原理
解释清楚DOM原理并不是一件容易的事,但是任何一个前端工程师,都必须牢牢掌握它. DOM整体架构: 图解: DOM,即针对XML文档的应用程序编程接口(API).通俗一点说,HTML属于XML的一种, ...
SQL : 在SQL Server 2008(Or Express)中如何Open并编辑数据表【转】
来源:http://www.cnblogs.com/wsdj-ITtech/archive/2011/04/28/2031601.html 通常在SQL Server 2005中,我们可以通过SQL ...
无法启动：此实现不是Windows平台FIPS验证的加密算法的一部分
个别同学可能会在启动订票助手.NET的时候发现这个提示: 出现这个问题的原因是订票助手.NET使用了MD5算法,而系统的组策略安全设置导致无法使用此算法.要修正此问题,请按照如下操作(两种方法任选其一 ...
【转】PHP error_reporting() 错误控制函数功能详解
定义和用法: error_reporting() 设置 PHP 的报错级别并返回当前级别. 函数语法: error_reporting(report_level) 如果参数 level 未指定 ...
分布式服务框架：Zookeeper
Zookeeper是一个高性能,分布式的,开源分布式应用协调服务.它提供了简单原始的功能,分布式应用可以基于它实现更高级的服务,比如同步,配置管理,集群管理,名空间.它被设计为易于编程,使用文件系统目 ...
mysql-master-ha
https://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6 http://www.cnblogs.com/gomysql/c ...
Redis配制说明
配置文件参数说明: 1. Redis默认不是以守护进程的方式运行,可以通过该配置项修改,使用yes启用守护进程 daemonize no 2. 当Redis以守护进程方式运行时,Redis默 ...

Python & MapReduce

Python & MapReduce的更多相关文章

随机推荐

热门专题