Running the Hadoop Example Program WordCount (Running Hadoop Example)

In the last post we installed Hadoop 2.2.0 on Ubuntu. Now we'll see how to launch an example MapReduce job on Hadoop.

In the Hadoop directory (which you should find at /opt/hadoop-2.2.0) you can find a JAR containing some examples; the exact path is $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar.
This JAR contains several example MapReduce programs. We'll launch the WordCount program, which is the "Hello, world" of MapReduce: it simply counts the occurrences of every word in the file given as input.
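
To make the counting concrete, the same idea can be approximated locally with standard Unix tools. This is only a sketch for small files, not how Hadoop runs the job: the function name local_wordcount is made up for illustration, and it assumes tokens are separated by whitespace only, which is how the stock WordCount example splits its input.

```shell
# Local approximation of WordCount for small inputs read on stdin.
# Assumption: a "word" is any run of non-whitespace characters.
local_wordcount() {
    tr -s '[:space:]' '\n' |    # put one token on each line
        sed '/^$/d' |           # drop the empty line left by leading whitespace
        LC_ALL=C sort |         # byte-order sort, so equal tokens are adjacent
        uniq -c |               # prefix each distinct token with its count
        awk '{ printf "%s\t%s\n", $2, $1 }'   # reorder to word<TAB>count
}

# Usage: local_wordcount < some_file.txt
```

Each output line holds a word, a tab, and the number of times that word appeared.
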
To run this example we need to prepare a few things. We assume the HDFS service is running; if we haven't created a user directory yet, we have to do it now (assuming the Hadoop user we're using is mapred):

$ hadoop fs -mkdir -p /user/mapred

When we pass "fs" as the first argument to the hadoop command, we're telling Hadoop to operate on the HDFS filesystem; in this case, we used the -mkdir switch to create a new directory on HDFS (-p also creates any missing parent directories).
Now that our user has a home directory, we can create a directory that we'll use to load the input file for the MapReduce programs:

$ hadoop fs -mkdir inputdir

We can check the result by issuing an "ls" command on HDFS:

$ hadoop fs -ls

Found 1 items

drwxr-xr-x   - mapred mrusers        0 2014-02-11 22:54 inputdir
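
Note that neither command used an absolute path: HDFS resolves a relative path such as inputdir against the current user's home directory, which is why we created /user/mapred first. The rule can be sketched as a small shell function. hdfs_resolve is a hypothetical helper written for illustration (it is not a Hadoop command), and it assumes the default /user/<name> home directory layout.

```shell
# Show which absolute HDFS path a given path refers to for a given user.
# Assumption: home directories live under /user/<name> (the default layout).
hdfs_resolve() {
    user=$1
    path=$2
    case $path in
        /*) printf '%s\n' "$path" ;;                    # already absolute
        *)  printf '/user/%s/%s\n' "$user" "$path" ;;   # relative to the home directory
    esac
}

# Usage: hdfs_resolve mapred inputdir
```

So for the mapred user, `hadoop fs -ls` with no path lists /user/mapred, and inputdir means /user/mapred/inputdir.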

Now we can decide which file to count the words of; in this example, I'll use the text of the novella Flatland by Edwin Abbott, which is freely available for download from Project Gutenberg:

$ wget http://www.gutenberg.org/cache/epub/201/pg201.txt

Now we can put this file onto HDFS, more precisely into the inputdir directory we created a moment ago:

$ hadoop fs -put pg201.txt inputdir

The "-put" switch tells Hadoop to take the file from the machine's local file system and copy it onto HDFS. We can check that the file is really there:

$ hadoop fs -ls inputdir

Found 1 items

drwxr-xr-x   - mapred mrusers        227368 2014-02-11 22:59 inputdir/pg201.txt
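
The fifth column of the listing is the file size in bytes, so a quick sanity check is to compare it with the size of the local copy. A minimal sketch, assuming the listing format shown above; listing_size is an illustrative helper name, not part of Hadoop.

```shell
# Print the size column (5th field) of the listing line whose last field
# is the given path. Reads `hadoop fs -ls` output on stdin.
listing_size() {
    awk -v p="$1" '$NF == p { print $5 }'
}

# Usage (needs a running HDFS):
#   hadoop fs -ls inputdir | listing_size inputdir/pg201.txt
#   wc -c < pg201.txt
```

If the two numbers match, the upload copied the whole file.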

Now we're ready to execute the MapReduce program. The Hadoop tarball comes with a JAR containing the WordCount example; we can launch Hadoop with these parameters:

jar: tells Hadoop we want to execute a MapReduce program contained in a JAR
/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar: this is the absolute path and filename of the JAR
wordcount: tells Hadoop which of the many examples contained in the JAR to run
inputdir: the directory on HDFS in which Hadoop can find the input file(s)
outputdir: the directory on HDFS in which Hadoop must write the result of the program

$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount inputdir outputdir

The output is:

14/02/11 23:16:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

14/02/11 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1

14/02/11 23:16:20 INFO mapreduce.JobSubmitter: number of splits:1

14/02/11 23:16:21 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class

14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name

14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

14/02/11 23:16:21 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

14/02/11 23:16:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392155226604_0001

14/02/11 23:16:22 INFO impl.YarnClientImpl: Submitted application application_1392155226604_0001 to ResourceManager at /0.0.0.0:8032

14/02/11 23:16:23 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1392155226604_0001/

14/02/11 23:16:23 INFO mapreduce.Job: Running job: job_1392155226604_0001

14/02/11 23:16:38 INFO mapreduce.Job: Job job_1392155226604_0001 running in uber mode : false

14/02/11 23:16:38 INFO mapreduce.Job:  map 0% reduce 0%

14/02/11 23:16:47 INFO mapreduce.Job:  map 100% reduce 0%

14/02/11 23:16:57 INFO mapreduce.Job:  map 100% reduce 100%

14/02/11 23:16:58 INFO mapreduce.Job: Job job_1392155226604_0001 completed successfully

14/02/11 23:16:58 INFO mapreduce.Job: Counters: 43

 File System Counters

  FILE: Number of bytes read=121375

  FILE: Number of bytes written=401139

  FILE: Number of read operations=0

  FILE: Number of large read operations=0

  FILE: Number of write operations=0

  HDFS: Number of bytes read=227485

  HDFS: Number of bytes written=88461

  HDFS: Number of read operations=6

  HDFS: Number of large read operations=0

  HDFS: Number of write operations=2

 Job Counters

  Launched map tasks=1

  Launched reduce tasks=1

  Data-local map tasks=1

  Total time spent by all maps in occupied slots (ms)=7693

  Total time spent by all reduces in occupied slots (ms)=7383

 Map-Reduce Framework

  Map input records=4239

  Map output records=37680

  Map output bytes=366902

  Map output materialized bytes=121375

  Input split bytes=117

  Combine input records=37680

  Combine output records=8341

  Reduce input groups=8341

  Reduce shuffle bytes=121375

  Reduce input records=8341

  Reduce output records=8341

  Spilled Records=16682

  Shuffled Maps =1

  Failed Shuffles=0

  Merged Map outputs=1

  GC time elapsed (ms)=150

  CPU time spent (ms)=5490

  Physical memory (bytes) snapshot=399077376

  Virtual memory (bytes) snapshot=1674149888

  Total committed heap usage (bytes)=314048512

 Shuffle Errors

  BAD_ID=0

  CONNECTION=0

  IO_ERROR=0

  WRONG_LENGTH=0

  WRONG_MAP=0

  WRONG_REDUCE=0

 File Input Format Counters

  Bytes Read=227368

 File Output Format Counters

  Bytes Written=88461

The last part of the output is a summary of the execution of the MapReduce program; just before it, we can spot the "Job job_1392155226604_0001 completed successfully" line, which tells us the job ran successfully. As requested, Hadoop wrote the output into outputdir on HDFS; let's see what's inside this directory:

$ hadoop fs -ls outputdir

Found 2 items

-rw-r--r--   1 mapred mrusers          0 2014-02-11 23:16 outputdir/_SUCCESS

-rw-r--r--   1 mapred mrusers      88461 2014-02-11 23:16 outputdir/part-r-00000

The presence of the _SUCCESS file confirms that the job executed successfully; Hadoop wrote the result of the execution to part-r-00000. We can copy the file to the local filesystem of our machine using the "-get" switch:

$ hadoop fs -get outputdir/part-r-00000 .

Now we can see the content of the file (this is a small subset of the whole file):

...

leading 2

leagues 1

leaning 1

leap    1

leaped  1

learn   7

learned 1

least   23

least.  1

leave   3

leaves  3

leaving 2

lecture 1

led     4

left    9

...
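
The part file is sorted by word, which is why the excerpt is an alphabetical slice. To see the most frequent words instead, the file can be re-sorted by its second column. A sketch, assuming the tab-separated word/count format that Hadoop's text output uses by default; top_words is an illustrative name.

```shell
# Print the N most frequent words from WordCount output read on stdin.
# Assumes one "word<TAB>count" pair per line; N defaults to 10.
top_words() {
    n=${1:-10}
    tab=$(printf '\t')
    # Sort numerically on the count column, highest first, then keep N lines.
    LC_ALL=C sort -t "$tab" -k2,2nr | head -n "$n"
}

# Usage: top_words 20 < part-r-00000
```
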

The wordcount program simply counts the occurrences of every word and outputs the totals. Words are split on whitespace only, which is why "least" and "least." are counted separately.
Well, we've successfully run our first MapReduce job on our Hadoop installation!
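
One thing to keep in mind when running the job a second time: Hadoop refuses to start a job whose output directory already exists. A small wrapper can check for that first. This is a sketch, not part of the original walkthrough: run_wordcount and the HADOOP and EXAMPLES_JAR variables are names chosen here, and the default JAR path is the one used in the command above.

```shell
# Both variables are assumptions of this sketch and can be overridden.
HADOOP=${HADOOP:-hadoop}
EXAMPLES_JAR=${EXAMPLES_JAR:-/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar}

# Run the wordcount example, but stop early if the output directory exists.
run_wordcount() {
    input_dir=$1
    output_dir=$2
    # `hadoop fs -test -e PATH` exits with status 0 when PATH exists on HDFS.
    if "$HADOOP" fs -test -e "$output_dir"; then
        echo "output directory '$output_dir' already exists; remove it or choose another" >&2
        return 1
    fi
    "$HADOOP" jar "$EXAMPLES_JAR" wordcount "$input_dir" "$output_dir"
}

# Usage: run_wordcount inputdir outputdir2
```
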

from: http://andreaiacono.blogspot.com/2014/02/running-hadoop-example.html
