Writing an Hadoop MapReduce Program in Python
In this tutorial I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.
Motivation
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, Hadoop’s documentation and the most prominent Python example on the Hadoop website could make you think that you must translate your Python code using Jython into a Java jar file. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue of the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop – just have a look at the example in
$HADOOP_HOME/src/examples/python/WordCount.py and you’ll see what I mean.
That said, the ground is now prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
What we want to do
We will write a simple MapReduce program (see also the MapReduce article on Wikipedia) for Hadoop in Python but without using Jython to translate our code to Java jar files.
Our program will mimic WordCount, i.e. it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Prerequisites
You should have an Hadoop cluster up and running because we will get our hands dirty. If you don’t have a cluster yet, my following tutorials might help you build one. The tutorials are tailored to Ubuntu Linux but the information also applies to other Linux/Unix variants.
- Running Hadoop On Ubuntu Linux (Single-Node Cluster) – How to set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS)
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster) – How to set up a distributed, multi-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS)
Python MapReduce Code
The “trick” behind the following Python code is that we will use the Hadoop Streaming API (see also the corresponding wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python’s sys.stdin to read input data and print our own output to sys.stdout. That’s all we need to do because Hadoop Streaming will take care of everything else!
Map step: mapper.py
Save the following code in the file /home/hduser/mapper.py. It will read data from STDIN, split it into words, and output a list of lines mapping words to their (intermediate) counts to STDOUT. The Map script will not compute an (intermediate) sum of a word’s occurrences though. Instead, it will output <word> 1 tuples immediately – even though a specific word might occur multiple times in the input. In our case we let the subsequent Reduce step do the final sum count. Of course, you can change this behavior in your own scripts as you please, but we will keep it like that in this tutorial for didactic reasons. :-)
Make sure the file has execution permission (chmod +x /home/hduser/mapper.py should do the trick) or you will run into problems.
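The original listing did not survive extraction, so here is a minimal sketch of such a mapper, matching the description above (Python 2/3 compatible):

```python
#!/usr/bin/env python
"""mapper.py -- a minimal sketch of the Map step described above."""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    for word in words:
        # write the results to STDOUT (standard output); what we output
        # here will be the input for the Reduce step, i.e. for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
```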
Reduce step: reducer.py
Save the following code in the file /home/hduser/reducer.py. It will read the results of mapper.py from STDIN (so the output format of mapper.py and the expected input format of reducer.py must match), sum the occurrences of each word into a final count, and then output its results to STDOUT.
Make sure the file has execution permission (chmod +x /home/hduser/reducer.py should do the trick) or you will run into problems.
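The original listing is likewise missing; the following sketch implements the Reduce step as described, relying on the fact that Hadoop sorts the map output by key before it reaches the reducer:

```python
#!/usr/bin/env python
"""reducer.py -- a minimal sketch of the Reduce step described above."""

import sys

current_word = None
current_count = 0

# input comes from STDIN (the sorted output of mapper.py)
for line in sys.stdin:
    # parse the input we got from mapper.py
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue

    # this if-switch only works because Hadoop sorts the map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

# don't forget to output the last word!
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
```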
Test your code (cat data | map | sort | reduce)
I recommend testing your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Otherwise your jobs might complete successfully but produce no job result data at all, or not the results you would have expected. If that happens, most likely it was you (or me) who screwed up.
Here are some ideas on how to test the functionality of the Map and Reduce scripts.
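For example (a sketch; it assumes the scripts live in /home/hduser/ and that one of the ebooks from the next section was saved as /tmp/gutenberg/pg20417.txt – adjust the paths to your setup):

```bash
# sanity-check the mapper on its own
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py

# the full local pipeline: map, sort (simulating Hadoop's shuffle), reduce
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py

# the same pipeline over a real ebook
cat /tmp/gutenberg/pg20417.txt | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py | head
```

The second command should print bar 1, foo 3, labs 1 and quux 2, one tab-separated pair per line.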
Running the Python Code on Hadoop
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of your choice, for example /tmp/gutenberg.
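After downloading you can verify that the files are in place (the file names below are just examples of Project Gutenberg’s naming scheme; yours may differ):

```bash
ls -l /tmp/gutenberg
# expect one plain-text file per ebook, e.g. pg20417.txt, pg5000.txt, pg4300.txt
```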
Copy local example data to HDFS
Before we run the actual MapReduce job, we must first copy the files from our local file system to Hadoop’s HDFS.
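A sketch of the commands, using the dfs syntax of the Hadoop versions this tutorial targets (run from $HADOOP_HOME, or with bin/hadoop on your PATH):

```bash
# copy the local example data into HDFS
hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

# verify that the files arrived
hadoop dfs -ls /user/hduser/gutenberg
```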
Run the MapReduce job
Now that everything is prepared, we can finally run our Python MapReduce job on the Hadoop cluster. As I said above, we leverage the Hadoop Streaming API to help us pass data between our Map and Reduce code via STDIN and STDOUT.
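A sketch of the invocation; the path and version of the Streaming jar depend on your Hadoop installation, so adjust the wildcard accordingly:

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -mapper /home/hduser/mapper.py \
    -reducer /home/hduser/reducer.py \
    -input /user/hduser/gutenberg/* \
    -output /user/hduser/gutenberg-output
```

On a multi-node cluster you would typically also pass -file /home/hduser/mapper.py -file /home/hduser/reducer.py so that Streaming ships the scripts to the worker nodes.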
If you want to modify some Hadoop settings on the fly, such as increasing the number of Reduce tasks, you can use the -D option:
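For example, to ask for 16 reducers (a sketch; note that generic options such as -D must precede the Streaming-specific options):

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -D mapred.reduce.tasks=16 \
    -mapper /home/hduser/mapper.py \
    -reducer /home/hduser/reducer.py \
    -input /user/hduser/gutenberg/* \
    -output /user/hduser/gutenberg-output
```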
Note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.
The job will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the results in the HDFS directory /user/hduser/gutenberg-output. In general Hadoop will create one output file per reducer; in our case however it will only create a single file because the input files are very small.
Example output of the previous command in the console is not reproduced here; a successful run prints, among other things, the job’s map and reduce progress and the tracking URL of the JobTracker web interface.
As you can see in the output above, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, open http://localhost:50030/ in a browser and have a look around. Here’s a screenshot of the Hadoop web interface for the job we just ran.
[Screenshot: Hadoop web interface for the job we just ran]
Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:
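For example:

```bash
hadoop dfs -ls /user/hduser/gutenberg-output
# expect a part-00000 file holding the results (one part-NNNNN file per reducer)
```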
You can then inspect the contents of the file with the dfs -cat command:
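For example:

```bash
hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
# prints one "<word><TAB><count>" pair per line
```

You can also pipe the output through head, or copy it back to the local file system with dfs -copyToLocal if the file is large.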
Note that in this specific output above the quote signs (") enclosing the words have not been inserted by Hadoop. They are the result of how our Python code splits words, and in this case it matched the beginning of a quote in the ebook texts.
Just inspect the part-00000 file further to see it for yourself.
Improved Mapper and Reducer code: using Python iterators and generators
The Mapper and Reducer examples above should have given you an idea of how to create your first MapReduce application. The focus was code simplicity and ease of understanding, particularly for beginners of the Python programming language. In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF).
Generally speaking, iterators and generators (functions that create iterators, for example with Python’s yield statement) have the advantage that an element of a sequence is not produced until you actually need it. This can help a lot in terms of computational expensiveness or memory consumption, depending on the task at hand.
Note: the following Map and Reduce scripts will only work “correctly” when run in the Hadoop context, i.e. as Mapper and Reducer in a MapReduce job. This means that the naive local test “cat DATA | ./mapper.py | sort -k1,1 | ./reducer.py” will not work correctly anymore because some functionality is intentionally outsourced to Hadoop.
Precisely, we compute the sum of a word’s occurrences, e.g. ("foo", 4), only if by chance the same word (foo) appears multiple times in succession. In the majority of cases, however, we let Hadoop group the (key, value) pairs between the Map and the Reduce step, because Hadoop is more efficient in this regard than our simple Python scripts.
mapper.py
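The original listing is missing here as well; the following is a sketch of a generator-based mapper in the spirit described above:

```python
#!/usr/bin/env python
"""mapper.py -- a sketch of the generator-based Map step."""

import sys

def read_input(file):
    """Lazily yield the words of each input line."""
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == '__main__':
    main()
```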
reducer.py
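And a matching sketch of the generator-based reducer, using itertools.groupby to consume the runs of identical words that Hadoop’s sort phase delivers:

```python
#!/usr/bin/env python
"""reducer.py -- a sketch of the generator-based Reduce step."""

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    """Lazily yield (word, count) pairs from the mapper's output."""
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (the sorted output of the mappers)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups consecutive (word, count) pairs by word; this fully
    # aggregates only because Hadoop sorts the map output by key first
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for _, count in group)
            print('%s%s%d' % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == '__main__':
    main()
```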
Related Links
From yours truly:
- Running Hadoop On Ubuntu Linux (Single-Node Cluster)
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
Original article: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/