Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

How Streaming Works

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
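
To make this concrete, here is a minimal sketch of a streaming mapper written in Python (not part of the original example; the word-count tokenization is an assumption). It reads lines from stdin and writes tab-separated key/value pairs to stdout, which the framework then collects as the mapper output:

#!/usr/bin/env python
# Hypothetical word-count mapper: each output line is "word<TAB>1",
# so the framework takes "word" as the key and "1" as the value.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write(word + "\t" + "1\n")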

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
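
A matching reducer sketch (again an assumption, not the original example): the framework feeds it lines sorted by key, so counts for the same word arrive consecutively and can be summed in one pass:

#!/usr/bin/env python
# Hypothetical reducer for the word-count mapper above: input lines are
# sorted by key, so a running sum per word is enough.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write(current_word + "\t" + str(current_count) + "\n")
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write(current_word + "\t" + str(current_count) + "\n")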

This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.

You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

You can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status be treated as a failure or a success, respectively. By default, streaming tasks exiting with a non-zero status are considered failed tasks.

Streaming Command Options

Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.

Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.

bin/hadoop command [genericOptions] [streamingOptions]

The Hadoop streaming command options are listed here:

-input directoryname or filename (Required): Input location for mapper
-output directoryname (Required): Output location for reducer
-mapper executable or JavaClassName (Required): Mapper executable
-reducer executable or JavaClassName (Required): Reducer executable
-file filename (Optional): Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName (Optional): Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName (Optional): Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName (Optional): Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName (Optional): Combiner executable for map output
-cmdenv name=value (Optional): Pass environment variable to streaming commands
-inputreader (Optional): For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose (Optional): Verbose output
-lazyOutput (Optional): Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks (Optional): Specify the number of reducers
-mapdebug (Optional): Script to call when map task fails
-reducedebug (Optional): Script to call when reduce task fails

Specifying a Java Class as the Mapper/Reducer

You can supply a Java class as the mapper and/or the reducer.

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

You can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status be treated as a failure or a success, respectively. By default, streaming tasks exiting with a non-zero status are considered failed tasks.
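
For example, to treat non-zero exits as successes, the property can be passed as a generic option (a sketch; myScript.sh is a hypothetical mapper, not one of the original examples):

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myScript.sh \
    -reducer /bin/wc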

Packaging Files With Job Submissions

You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as a part of job submission. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py

The above example specifies a user defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of job submission.

In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc.) that may be used by the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt

Specifying Other Plugins for Jobs

Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:

   -inputformat JavaClassName
   -outputformat JavaClassName
   -partitioner JavaClassName
   -combiner streamingCommand or JavaClassName

The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.

The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.

Setting Environment Variables

To set an environment variable in a streaming command use:

   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
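
The variable then appears in the environment of the launched mapper/reducer process. For example, a Python mapper could read the setting shown above like this (a sketch; the prefixing logic is just illustrative):

#!/usr/bin/env python
# Sketch: read the directory passed via -cmdenv from the process environment
# and prefix it to every input line.
import os
import sys

example_dir = os.environ.get("EXAMPLE_DIR", "")
for line in sys.stdin:
    sys.stdout.write(example_dir + "\t" + line)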

Generic Command Options

Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.

Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.

bin/hadoop command [genericOptions] [streamingOptions]

The Hadoop generic command options you can use with streaming are listed here:

-conf configuration_file (Optional): Specify an application configuration file
-D property=value (Optional): Use value for given property
-fs host:port or local (Optional): Specify a namenode
-jt host:port or local (Optional): Specify a job tracker
-files (Optional): Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars (Optional): Specify comma-separated jar files to include in the classpath
-archives (Optional): Specify comma-separated archives to be unarchived on the compute machines

Specifying Configuration Variables with the -D Option

You can specify additional configuration variables by using "-D <property>=<value>".

Specifying Directories

To change the local temp directory use:

   -D dfs.data.dir=/tmp

To specify additional local temp directories use:

   -D mapred.local.dir=/tmp/local
   -D mapred.system.dir=/tmp/system
   -D mapred.temp.dir=/tmp/temp

Note: For more details on jobconf parameters see: mapred-default.html

Specifying Map-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

    -D mapred.reduce.tasks=0

To be backward compatible, Hadoop Streaming also supports the "-reducer NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
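
Putting this together, a complete map-only invocation might look like this (a sketch; myPythonScript.py is the hypothetical mapper from the packaging example above):

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -file myPythonScript.py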

Specifying the Number of Reducers

To specify the number of reducers, for example two, use:

 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc 
 

Customizing How Lines are Split into Key/Value Pairs

As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.

However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n >= 1) character rather than the first character in a line (the default) as the separator between the key and value. For example:

 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer 
 

In the above example, "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).

Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.

Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for Map/Reduce inputs. By default the separator is the tab character.

Working with Large Files and Archives

The -files and -archives options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.

Note: The -files and -archives options are generic options. Be sure to place the generic options before the command options, otherwise the command will fail. For an example, see The -archives Option. Also see Other Supported Options.

Making Files Available to Tasks

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.

-files hdfs://host:fs_port/user/testfile.txt

You can specify a different symlink name for -files using #.

-files hdfs://host:fs_port/user/testfile.txt#testfile

Multiple entries can be specified like this:

-files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
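
Inside a task, the symlink can be opened by its local name like any other file. For example, a Python mapper using the #testfile symlink from above might filter its input against the cached file (a sketch; the filtering logic is just illustrative):

#!/usr/bin/env python
# Sketch: the -files symlink (here named "testfile") appears in the task's
# current working directory and can be opened like a local file.
import sys

lookup = set(line.strip() for line in open("testfile"))
for line in sys.stdin:
    if line.strip() in lookup:
        sys.stdout.write(line)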

Making Archives Available to Tasks

The -archives option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files.

In this example, Hadoop automatically creates a symlink named testfile.jar in the current working directory of tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.

-archives hdfs://host:fs_port/user/testfile.jar

You can specify a different symlink name for -archives using #.

-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir

In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar" is a symlink to the archived directory, which has the files "cache.txt" and "cache2.txt".

 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
                  -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \  
                  -D mapred.map.tasks=1 \
                  -D mapred.reduce.tasks=1 \ 
                  -D mapred.job.name="Experiment" \
                  -input "/user/me/samples/cachefile/input.txt"  \
                  -output "/user/me/samples/cachefile/out" \  
                  -mapper "xargs cat"  \
                  -reducer "cat"  
 
$ ls test_jar/
cache.txt  cache2.txt
 
$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt(in = 30) (out= 29)(deflated 3%)
adding: cache2.txt(in = 37) (out= 35)(deflated 5%)
 
$ hadoop dfs -put cachedir.jar samples/cachefile
 
$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
cachedir.jar/cache.txt
cachedir.jar/cache2.txt
 
$ cat test_jar/cache.txt 
This is just the cache string
 
$ cat test_jar/cache2.txt 
This is just the second cache string
 
$ hadoop dfs -ls /user/me/samples/cachefile/out      
Found 1 items
/user/me/samples/cachefile/out/part-00000  <r 3>   69
 
$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string   
This is just the second cache string

More Usage Examples

Hadoop Partitioner Class

Hadoop has a library class, KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner 

Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in the previous example. The two variables are used by streaming to identify the key/value pair of the mapper.

The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the -D mapred.text.key.partitioner.options=-k1,2 option. Here, -D map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting. A simple illustration is shown here:

Output of map (the keys)

11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

Partition into 3 reducers (the first 2 fields are used as keys for partition)

11.11.4.1
-----------
11.12.1.2
11.12.1.1
-----------
11.14.2.3
11.14.2.2

Sorting within each partition for the reducer (all 4 fields used for sorting)

11.11.4.1
-----------
11.12.1.1
11.12.1.2
-----------
11.14.2.2
11.14.2.3

Hadoop Comparator Class

Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications. This class provides a subset of features provided by the Unix/GNU Sort. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.comparator.options=-k2,2nr \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer 

The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will sort the outputs by the second field of the keys using the -D mapred.text.key.comparator.options=-k2,2nr option. Here, -n specifies that the sorting is numerical sorting and -r specifies that the result should be reversed. A simple illustration is shown below:

Output of map (the keys)

11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

Sorting output for the reducer (where the second field is used for sorting)

11.14.2.3
11.14.2.2
11.12.1.2
11.12.1.1
11.11.4.1

Hadoop Aggregate Package

Hadoop has a library package called Aggregate. Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as "sum", "max", "min" and so on over a sequence of values. Aggregate allows you to define a mapper plugin class that is expected to generate "aggregatable items" for each input key/value pair of the mappers. The combiner/reducer will aggregate those aggregatable items by invoking the appropriate aggregators.

To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py

The Python program myAggregatorForKeyCount.py looks like:

#!/usr/bin/python

import sys

def generateLongCountToken(id):
    # Emit an aggregatable record for the aggregate reducer:
    # "LongValueSum:<key>\t1" asks it to sum 1 for each occurrence of <key>.
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    # Read lines from stdin until EOF; the first tab-separated field is the key.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print generateLongCountToken(fields[0])

if __name__ == "__main__":
    main(sys.argv)

Hadoop Field Selection Class

Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the unix "cut" utility. The map function defined in the class treats each input key/value pair as a list of fields. You can specify the field separator (the default is the tab character). You can select an arbitrary list of fields as the map output key, and an arbitrary list of fields as the map output value. Similarly, the reduce function defined in the class treats each input key/value pair as a list of fields. You can select an arbitrary list of fields as the reduce output key, and an arbitrary list of fields as the reduce output value. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.data.field.separator=. \
    -D map.output.key.value.fields.spec=6,5,1-3:0- \
    -D reduce.output.key.value.fields.spec=0-2:5- \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner 

Theoption "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifieskey/value selection for the map outputs. Key selection spec and value selectionspec are separated by ":". In this case, the map output key willconsist of fields 6, 5, 1, 2, and 3. The map output value will consist of allfields (0- means field 0 and all the subsequent fields).

Theoption "-D reduce.output.key.value.fields.spec=0-2:5-" specifieskey/value selection for the reduce outputs. In this case, the reduce output keywill consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1).The reduce output value will consist of all fields starting from field 5(corresponding to all the original fields).

Frequently Asked Questions

How do I use Hadoop Streaming to run an arbitrary set of (semi) independent tasks?

Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this.

How do I process files, one per map?

As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:

1. Hadoop Streaming and custom mapper script (a sketch of such a script appears after this list):

   - Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.

   - Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory.

2. The existing Hadoop Framework:

   - Add these commands to your main function:

         FileOutputFormat.setCompressOutput(conf, true);
         FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);
         conf.setOutputFormat(NonSplitableTextInputFormat.class);
         conf.setNumReduceTasks(0);

   - Write your map function:

         public void map(WritableComparable key,
                         Writable value,
                         OutputCollector output,
                         Reporter reporter)
             throws IOException {
                 output.collect((Text)value, null);
             }

   - Note that the output filename will not be the same as the original filename.
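
A minimal sketch of the custom mapper script from method 1, in Python (the /user/me/zipped output directory is a hypothetical name, and error handling is omitted):

#!/usr/bin/env python
# Sketch: for each HDFS path read from stdin, copy the file locally, gzip it,
# and put the compressed copy back into a hypothetical output directory.
import os
import subprocess
import sys

OUTPUT_DIR = "/user/me/zipped"  # hypothetical destination

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    name = os.path.basename(hdfs_path)
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, name])
    subprocess.check_call(["gzip", name])
    subprocess.check_call(["hadoop", "fs", "-put", name + ".gz",
                           OUTPUT_DIR + "/" + name + ".gz"])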

How many reducers should I use?

See the Hadoop Wiki for details: Reducer

If I set up an alias in my shell script, will that work after -mapper?

For example, say I do: alias c1='cut -f1'. Will -mapper "c1" work?

Using an alias will not work, but variable substitution is allowed as shown in this example:

$ hadoop dfs -cat samples/student_marks
alice   50
bruce   70
charlie 80
dan     75
 
$ c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.job.name='Experiment' \
    -input /user/me/samples/student_marks \
    -output /user/me/samples/student_out \
    -mapper \"$c2\" -reducer 'cat'
    
$ hadoop dfs -ls samples/student_out
Found 1 items
/user/me/samples/student_out/part-00000    <r 3>   16
 
$ hadoop dfs -cat samples/student_out/part-00000
50
70
75
80

Can I use UNIX pipes?

For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?

Currently this does not work and gives a "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.

What do I do if I get the "No space left on device" error?

For example, when I run a streaming job by distributing large executables (for example, 3.6G) through the -file option, I get a "No space left on device" error.

The jar packaging happens in a directory pointed to by the configuration variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the value to a directory with more space:

-D stream.tmpdir=/export/bigspace/...

How do I specify multiple input directories?

You can specify multiple input directories with multiple '-input' options:

 hadoop jar hadoop-streaming.jar -input '/user/foo/dir1' -input '/user/foo/dir2' 

How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
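
A complete invocation might look like this (a sketch that reuses the myInputDirs/myOutputDir names from the earlier examples):

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc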

How do I provide my own input/output format with streaming?

At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar.

How do I parse XML documents using streaming?

You can use the record reader StreamXmlRecordReader to process XML documents.

hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" ..... (rest of the command)

Anything found between BEGIN_STRING and END_STRING would be treated as one record for map tasks.

How do I update counters in streaming applications?

A streaming process can use the stderr to emit counter information. reporter:counter:<group>,<counter>,<amount> should be sent to stderr to update the counter.
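
For example, a Python mapper could increment a counter while passing its input through unchanged (a sketch; the group and counter names are hypothetical):

#!/usr/bin/env python
# Sketch: emit a counter update on stderr for every line processed.
import sys

for line in sys.stdin:
    sys.stderr.write("reporter:counter:MyGroup,LinesSeen,1\n")
    sys.stdout.write(line)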

How do I update status in streaming applications?

A streaming process can use the stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.

How do I get the JobConf variables in a streaming job's mapper/reducer?

During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.
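
For example, a Python script can read these values from its environment (a sketch; the status message is just illustrative):

#!/usr/bin/env python
# Sketch: JobConf values are exposed as environment variables with dots
# replaced by underscores, e.g. mapred.job.id -> mapred_job_id.
import os
import sys

job_id = os.environ.get("mapred_job_id", "unknown")
sys.stderr.write("reporter:status:processing input for job " + job_id + "\n")
for line in sys.stdin:
    sys.stdout.write(line)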
