In this tutorial I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.

Motivation

Even though the Hadoop framework is written in Java, programs for Hadoop need not be written in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, Hadoop’s documentation and the most prominent Python example on the Hadoop website could make you think that you must translate your Python code into a Java jar file using Jython. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue with the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop – just have a look at the example in $HADOOP_HOME/src/examples/python/WordCount.py and you will see what I mean.

That said, the ground is now prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.

What we want to do

We will write a simple MapReduce program (see also the MapReduce article on Wikipedia) for Hadoop in Python but without using Jython to translate our code to Java jar files.

Our program will mimic the well-known WordCount example, i.e. it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. For example, the input line "foo foo bar" eventually yields the output lines "bar 1" and "foo 2".

Note: You can also use programming languages other than Python such as Perl or Ruby with the “technique” described in this tutorial.

Prerequisites

You should have an Hadoop cluster up and running because we will get our hands dirty. If you don’t have a cluster yet, my following tutorials might help you build one. The tutorials are tailored to Ubuntu Linux, but the information also applies to other Linux/Unix variants.

Python MapReduce Code

The “trick” behind the following Python code is that we will use the Hadoop Streaming API (see also the corresponding wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python’s sys.stdin to read input data and print our own output to sys.stdout. That’s all we need to do because Hadoop Streaming will take care of everything else!
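To make the streaming contract concrete, here is a minimal, hypothetical "identity" script (not part of the tutorial's actual code) that simply echoes STDIN back to STDOUT. Any executable that behaves like this – read lines from STDIN, write tab-separated key/value lines to STDOUT – can act as a Mapper or Reducer under Hadoop Streaming:

#!/usr/bin/env python
# identity.py -- a minimal sketch of the Hadoop Streaming contract:
# read records from STDIN, write records to STDOUT.
import sys

for line in sys.stdin:
    # pass every record through unchanged
    sys.stdout.write(line)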

Map step: mapper.py

Save the following code in the file /home/hduser/mapper.py. It will read data from STDIN, split it into words, and output a list of lines mapping words to their (intermediate) counts to STDOUT. The Map script will not compute an (intermediate) sum of a word’s occurrences, though. Instead, it will output <word> 1 tuples immediately – even though a specific word might occur multiple times in the input. In our case we let the subsequent Reduce step do the final sum count. Of course, you can change this behavior in your own scripts as you please (a sketch of such a variant follows the mapper code below), but we will keep it like that in this tutorial for didactic reasons. :-)

Make sure the file has execution permission (chmod +x /home/hduser/mapper.py should do the trick) or you will run into problems.

mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
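As mentioned above, you could also pre-aggregate counts inside the Mapper instead of emitting a 1 for every occurrence. The following sketch (a hypothetical variant, not used in the rest of this tutorial) sums counts per word in a dictionary and emits each word only once per Mapper input:

#!/usr/bin/env python
# A hypothetical "in-mapper combining" variant of mapper.py: it sums
# counts locally before emitting, which shrinks the intermediate data
# at the cost of holding the per-word counts in memory.
import sys

counts = {}
for line in sys.stdin:
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1

for word, count in counts.iteritems():
    print '%s\t%s' % (word, count)

This trades memory for less intermediate data; for very large vocabularies, the emit-1 approach plus Hadoop's own sorting and grouping is the safer default.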

Reduce step: reducer.py

Save the following code in the file /home/hduser/reducer.py. It will read the results of mapper.py from STDIN (so the output format of mapper.py and the expected input format of reducer.py must match), sum the occurrences of each word to a final count, and then output its results to STDOUT.

Make sure the file has execution permission (chmod +x /home/hduser/reducer.py should do the trick) or you will run into problems.

reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
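To see the IF-switch in action, here is a worked trace on a small, already-sorted mapper output (constructed for illustration):

sorted input      current_word / current_count    line printed
bar 1             bar / 1                         (nothing yet)
foo 1             foo / 1                         bar 1
foo 1             foo / 2
foo 1             foo / 3
labs 1            labs / 1                        foo 3
quux 1            quux / 1                        labs 1
quux 1            quux / 2
(end of input)                                    quux 2

If the input were not sorted by word, occurrences of the same word would be scattered across the stream and the script would emit several partial counts per word – which is exactly why Hadoop sorts the map output before the Reduce step.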

Test your code (cat data | map | sort | reduce)

I recommend testing your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Otherwise your jobs might complete successfully but produce no job result data at all, or not the results you would have expected. If that happens, most likely it was you (or me) who screwed up.

Here are some ideas on how to test the functionality of the Map and Reduce scripts.

Test mapper.py and reducer.py locally first

# very basic test
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
bar 1
foo 3
labs 1
quux 2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)
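If you would rather script this check than type the pipeline by hand, here is a small hypothetical helper (assuming the script paths used above) that reproduces cat data | map | sort | reduce programmatically:

#!/usr/bin/env python
# test_pipeline.py -- a sketch that mimics `cat data | map | sort | reduce`
# locally; the mapper/reducer paths are the ones used in this tutorial.
import subprocess

def run_pipeline(text,
                 mapper='/home/hduser/mapper.py',
                 reducer='/home/hduser/reducer.py'):
    # Map: text -> "word<TAB>1" lines
    map_proc = subprocess.Popen([mapper], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
    map_out, _ = map_proc.communicate(text)
    # Sort by key, as Hadoop does between the Map and Reduce steps
    sort_proc = subprocess.Popen(['sort', '-k1,1'], stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE)
    sort_out, _ = sort_proc.communicate(map_out)
    # Reduce: sorted "word<TAB>1" lines -> "word<TAB>count" lines
    red_proc = subprocess.Popen([reducer], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
    red_out, _ = red_proc.communicate(sort_out)
    return red_out

if __name__ == '__main__':
    print run_pipeline("foo foo quux labs foo bar quux")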

Running the Python Code on Hadoop

Download example input data

We will use three ebooks from Project Gutenberg for this example:

The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of your choice, for example /tmp/gutenberg.

hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
hduser@ubuntu:~$

Copy local example data to HDFS

Before we run the actual MapReduce job, we must first copy the files from our local file system to Hadoop’s HDFS.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now that everything is prepared, we can finally run our Python MapReduce job on the Hadoop cluster. As I said above, we leverage the Hadoop Streaming API to help us pass data between our Map and Reduce code via STDIN and STDOUT.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the -D option:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -D mapred.reduce.tasks=16 ...
Note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint, but it does accept a user-specified mapred.reduce.tasks and does not manipulate that. In other words, you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

The job will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the results in the HDFS directory /user/hduser/gutenberg-output. In general Hadoop will create one output file per reducer; in our case, however, it will only create a single file because the input files are very small.

Example output of the previous command in the console:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/app/hadoop/tmp/hadoop-unjar54543/]
[] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
[...] INFO streaming.StreamJob: Output: /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$

As you can see in the output above, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, open http://localhost:50030/ in a browser and have a look around. Here’s a screenshot of the Hadoop web interface for the job we just ran.

Figure 1: A screenshot of Hadoop’s JobTracker web interface, showing the details of the MapReduce job we just ran

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 1 items
/user/hduser/gutenberg-output/part-00000 <r 1> 903193 2007-09-21 13:00
hduser@ubuntu:/usr/local/hadoop$

You can then inspect the contents of the file with the dfs -cat command:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 2
"A_ 1
"Absoluti 1
[...]
hduser@ubuntu:/usr/local/hadoop$

Note that in this specific output above the quote signs (") enclosing the words have not been inserted by Hadoop. They are the result of how our Python code splits words: it splits on whitespace only, so a leading quotation mark in the ebook text stays attached to the word. Just inspect the part-00000 file further to see it for yourself.
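If you want cleaner tokens, you can normalize words in the Mapper before emitting them. Here is a hypothetical variant (not used elsewhere in this tutorial) that lowercases the input and extracts runs of word characters with a regular expression, so quote signs and other punctuation no longer appear in the output:

#!/usr/bin/env python
# A hypothetical mapper variant that normalizes tokens: it lowercases
# each line and keeps only runs of word characters (plus apostrophes),
# so punctuation-laden tokens like "(Lo)cra" are split into their
# alphanumeric parts instead of being counted verbatim.
import re
import sys

WORD_RE = re.compile(r"[\w']+")

for line in sys.stdin:
    for word in WORD_RE.findall(line.lower()):
        print '%s\t%s' % (word, 1)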

Improved Mapper and Reducer code: using Python iterators and generators

The Mapper and Reducer examples above should have given you an idea of how to create your first MapReduce application. The focus was code simplicity and ease of understanding, particularly for beginners of the Python programming language. In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF).

Generally speaking, iterators and generators (functions that create iterators, for example with Python’s yield statement) have the advantage that an element of a sequence is not produced until you actually need it. This can help a lot in terms of computational expensiveness or memory consumption depending on the task at hand.
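As a quick illustration of this lazy behavior (a toy example, unrelated to the word count itself), note that a generator computes each value only when the consumer asks for it:

# count_from() returns a generator; no numbers exist until we ask for them
def count_from(n):
    while True:
        yield n
        n += 1

counter = count_from(0)
print counter.next()   # 0 -- computed only now
print counter.next()   # 1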

Note: The following Map and Reduce scripts will only work “correctly” when being run in the Hadoop context, i.e. as Mapper and Reducer in a MapReduce job. This means that running the naive test command “cat DATA | ./mapper.py | sort -k1,1 | ./reducer.py” will not work correctly anymore because some functionality is intentionally outsourced to Hadoop.

More precisely, we compute the sum of a word’s occurrences, e.g. ("foo", 4), only if by chance the same word (foo) appears multiple times in succession. In the majority of cases, however, we let Hadoop group the (key, value) pairs between the Map and the Reduce step, because Hadoop is more efficient in this regard than our simple Python scripts.
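To see what this grouping looks like, here is how itertools.groupby behaves on a small, sorted list of (word, count) pairs (a toy example; note that groupby only merges consecutive equal keys, which is exactly why sorted input matters):

from itertools import groupby
from operator import itemgetter

pairs = [('bar', '1'), ('foo', '1'), ('foo', '1'), ('foo', '1'),
         ('quux', '1'), ('quux', '1')]
for word, group in groupby(pairs, itemgetter(0)):
    print '%s\t%d' % (word, sum(int(count) for _, count in group))
# prints: bar 1, foo 3, quux 2 (one line each)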

mapper.py

mapper.py (improved)

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

reducer.py

reducer.py (improved)

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()


from: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
