Hadoop Streaming Command Details and Q&A
Hadoop Streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
How Streaming Works
In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
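For concreteness, here is a minimal mapper sketch (not taken from the Hadoop documentation) that follows this convention: it reads lines from stdin and writes tab-separated key/value pairs to stdout, in the style of a word count.

#!/usr/bin/env python
# Minimal word-count mapper sketch: emits one "word<TAB>1" pair per token.
# The tab separates the key from the value, matching the default streaming
# convention described above.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write(word + "\t" + "1" + "\n")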
When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
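A matching reducer sketch (again illustrative, not part of the original text): it receives the mapper output lines already sorted by key, splits each line at the first tab, and sums the counts per key.

#!/usr/bin/env python
# Minimal word-count reducer sketch: sums the "1"s emitted by the mapper.
# Input lines arrive sorted by key, so a key change means the previous key
# is complete and its total can be emitted.
import sys

current_key = None
total = 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            sys.stdout.write(current_key + "\t" + str(total) + "\n")
        current_key = key
        total = 0
    total += int(value or 0)
if current_key is not None:
    sys.stdout.write(current_key + "\t" + str(total) + "\n")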
This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
You can specify stream.non.zero.exit.is.failure as true or false to make a streaming task that exits with a non-zero status be considered Failure or Success, respectively. By default, streaming tasks exiting with non-zero status are considered to be failed tasks.
Streaming Command Options
Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.
Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.
bin/hadoop command [genericOptions] [streamingOptions]
The Hadoop streaming command options are listed here:
Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for mapper
-output directoryname | Required | Output location for reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass environment variable to streaming commands
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when map task fails
-reducedebug | Optional | Script to call when reduce task fails
Specifying a Java Class as the Mapper/Reducer
You can supply a Java class as the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
You can specify stream.non.zero.exit.is.failure as true or false to make a streaming task that exits with a non-zero status be considered Failure or Success, respectively. By default, streaming tasks exiting with non-zero status are considered to be failed tasks.
Packaging Files With Job Submissions
You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as a part of the job submission. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of the job submission.
In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc.) that may be used by the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py \
-file myDictionary.txt
Specifying Other Plugins for Jobs
Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:
-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner streamingCommand or JavaClassName
The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.
The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.
Setting Environment Variables
To set an environment variable in a streaming command use:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
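Inside the mapper or reducer, the variable is then visible through the ordinary process environment. A small illustrative sketch (the variable name EXAMPLE_DIR is just the one from the example above; what the script does with it is hypothetical):

#!/usr/bin/env python
# Sketch: read a directory path passed with -cmdenv EXAMPLE_DIR=...
import os
import sys

example_dir = os.environ.get("EXAMPLE_DIR", ".")
for line in sys.stdin:
    # Use example_dir here, e.g. to open lookup files shipped with the job.
    sys.stdout.write(example_dir + "\t" + line)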
Generic Command Options
Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.
Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.
bin/hadoop command [genericOptions] [streamingOptions]
The Hadoop generic command options you can use with streaming are listed here:
Parameter | Optional/Required | Description
-conf configuration_file | Optional | Specify an application configuration file
-D property=value | Optional | Use value for given property
-fs host:port or local | Optional | Specify a namenode
-jt host:port or local | Optional | Specify a job tracker
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars | Optional | Specify comma-separated jar files to include in the classpath
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines
Specifying Configuration Variables with the -D Option
You can specify additional configuration variables by using "-D <property>=<value>".
Specifying Directories
To change the local temp directory use:
-D dfs.data.dir=/tmp
To specify additional local temp directories use:
-D mapred.local.dir=/tmp/local
-D mapred.system.dir=/tmp/system
-D mapred.temp.dir=/tmp/temp
Note: For more details on jobconf parameters see: mapred-default.html
Specifying Map-Only Jobs
Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
-D mapred.reduce.tasks=0
To be backward compatible, Hadoop Streaming also supports the "-reducer NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
Specifying the Number of Reducers
To specify the number of reducers, for example two, use:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.reduce.tasks=2 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
Customizing How Lines are Split into Key/Value Pairs
As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n >= 1) character rather than the first character in a line (the default) as the separator between the key and value. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
In the above example, "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).
Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.
Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for Map/Reduce inputs. By default the separator is the tab character.
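To make the splitting rule concrete, the following Python sketch (illustrative only, not the framework's actual code) mimics how an output line is split into a key and a value given a separator and a number of key fields, including the fallback when the line contains fewer separators than requested.

def split_key_value(line, separator="\t", num_key_fields=1):
    # The key is the prefix up to the num_key_fields-th separator; the rest
    # of the line is the value. If the line has fewer separators, the whole
    # line becomes the key and the value is empty.
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    return (separator.join(parts[:num_key_fields]),
            separator.join(parts[num_key_fields:]))

# Example: with separator "." and 4 key fields
# split_key_value("11.12.1.2.rest.of.line", ".", 4) -> ("11.12.1.2", "rest.of.line")
# split_key_value("11.12.1", ".", 4)                -> ("11.12.1", "")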
Working with Large Files and Archives
The -files and -archives options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.
Note: The -files and -archives options are generic options. Be sure to place the generic options before the command options, otherwise the command will fail. For an example, see The -archives Option. Also see Other Supported Options.
Making Files Available to Tasks
The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.
In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.
-files hdfs://host:fs_port/user/testfile.txt
You can specify a different symlink name for -files using #.
-files hdfs://host:fs_port/user/testfile.txt#testfile
Multiple entries can be specified like this:
-files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
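A mapper can then open a distributed file through its symlink as if it were a local file. A hedged sketch follows: the symlink name testfile.txt is taken from the example above, while the lookup logic itself is hypothetical.

#!/usr/bin/env python
# Sketch: load a lookup file shipped with -files and tag each input line.
import sys

# The symlink created by -files appears in the task's working directory.
with open("testfile.txt") as f:
    lookup = set(word.strip() for word in f)

for line in sys.stdin:
    key = line.rstrip("\n")
    flag = "known" if key in lookup else "unknown"
    sys.stdout.write(key + "\t" + flag + "\n")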
Making Archives Available to Tasks
The -archives option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files.
In this example, Hadoop automatically creates a symlink named testfile.jar in the current working directory of tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.
-archives hdfs://host:fs_port/user/testfile.jar
You can specify a different symlink name for -archives using #.
-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir
In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar" is a symlink to the archived directory, which has the files "cache.txt" and "cache2.txt".
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \
-D mapred.map.tasks=1 \
-D mapred.reduce.tasks=1 \
-D mapred.job.name="Experiment" \
-input "/user/me/samples/cachefile/input.txt" \
-output "/user/me/samples/cachefile/out" \
-mapper "xargs cat" \
-reducer "cat"
$ ls test_jar/
cache.txt cache2.txt
$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt(in = 30) (out= 29)(deflated 3%)
adding: cache2.txt(in = 37) (out= 35)(deflated 5%)
$ hadoop dfs -put cachedir.jar samples/cachefile
$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
cachedir.jar/cache.txt
cachedir.jar/cache2.txt
$ cat test_jar/cache.txt
This is just the cache string
$ cat test_jar/cache2.txt
This is just the second cache string
$ hadoop dfs -ls /user/me/samples/cachefile/out
Found 1 items
/user/me/samples/cachefile/out/part-00000 <r 3> 69
$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string
This is just the second cache string
More Usage Examples
Hadoop Partitioner Class
Hadoop has a library class, KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in the previous example. The two variables are used by streaming to identify the key/value pair of the mapper output.
The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the -D mapred.text.key.partitioner.options=-k1,2 option. Here, -D map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting. A simple illustration is shown here:
Output of map (the keys):
11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2
Partition into 3 reducers (the first 2 fields are used as keys for partition):
11.11.4.1
-----------
11.12.1.2
11.12.1.1
-----------
11.14.2.3
11.14.2.2
Sorting within each partition for the reducer (all 4 fields used for sorting):
11.11.4.1
-----------
11.12.1.1
11.12.1.2
-----------
11.14.2.2
11.14.2.3
Hadoop Comparator Class
Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications. This class provides a subset of features provided by the Unix/GNU Sort. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.comparator.options=-k2,2nr \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will sort the outputs by the second field of the keys using the -D mapred.text.key.comparator.options=-k2,2nr option. Here, -n specifies that the sorting is numerical and -r specifies that the result should be reversed. A simple illustration is shown below:
Output of map (the keys):
11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2
Sorting output for the reducer (the second field is used for sorting):
11.14.2.3
11.14.2.2
11.12.1.2
11.12.1.1
11.11.4.1
Hadoop Aggregate Package
Hadoop has a library package called Aggregate. Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as "sum", "max", "min" and so on over a sequence of values. Aggregate allows you to define a mapper plugin class that is expected to generate "aggregatable items" for each input key/value pair of the mappers. The combiner/reducer will aggregate those aggregatable items by invoking the appropriate aggregators.
To use Aggregate, simply specify "-reducer aggregate":
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper myAggregatorForKeyCount.py \
-reducer aggregate \
-file myAggregatorForKeyCount.py
The Python program myAggregatorForKeyCount.py looks like this:
#!/usr/bin/python
import sys

# Emit an "aggregatable item" for the Aggregate reducer: the LongValueSum
# prefix tells the reducer to sum the long values emitted for this key.
def generateLongCountToken(id):
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    # Read mapper input from stdin line by line and emit one token per line.
    line = sys.stdin.readline()
    while line:
        line = line[:-1]
        fields = line.split("\t")
        print generateLongCountToken(fields[0])
        line = sys.stdin.readline()

if __name__ == "__main__":
    main(sys.argv)
Hadoop Field Selection Class
Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the Unix "cut" utility. The map function defined in the class treats each input key/value pair as a list of fields. You can specify the field separator (the default is the tab character). You can select an arbitrary list of fields as the map output key, and an arbitrary list of fields as the map output value. Similarly, the reduce function defined in the class treats each input key/value pair as a list of fields. You can select an arbitrary list of fields as the reduce output key, and an arbitrary list of fields as the reduce output value. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.data.field.separator=. \
-D map.output.key.value.fields.spec=6,5,1-3:0- \
-D reduce.output.key.value.fields.spec=0-2:5- \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
-reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
The option "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. Key selection spec and value selection spec are separated by ":". In this case, the map output key will consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all fields (0- means field 0 and all the subsequent fields).
The option "-D reduce.output.key.value.fields.spec=0-2:5-" specifies key/value selection for the reduce outputs. In this case, the reduce output key will consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1). The reduce output value will consist of all fields starting from field 5 (corresponding to all the original fields).
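The spec format can be illustrated with a small Python sketch. This is only an illustration of the selection semantics described above, not the FieldSelectionMapReduce implementation.

def select_fields(fields, spec):
    # Apply a key/value spec such as "6,5,1-3:0-" to a list of fields.
    # The part before ":" selects the key fields, the part after selects
    # the value fields. "a-b" means fields a through b, and "a-" means
    # field a and everything after it.
    def expand(part):
        selected = []
        for token in part.split(","):
            if "-" in token:
                start, _, end = token.partition("-")
                if end == "":
                    selected.extend(fields[int(start):])
                else:
                    selected.extend(fields[int(start):int(end) + 1])
            else:
                selected.append(fields[int(token)])
        return selected

    key_spec, _, value_spec = spec.partition(":")
    return expand(key_spec), expand(value_spec)

# Example: fields 0..6 of a record, spec "6,5,1-3:0-"
# key   -> fields 6, 5, 1, 2, 3
# value -> field 0 and everything after it
print(select_fields(list("abcdefg"), "6,5,1-3:0-"))
# (['g', 'f', 'b', 'c', 'd'], ['a', 'b', 'c', 'd', 'e', 'f', 'g'])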
Frequently Asked Questions
How do I use Hadoop Streaming to run an arbitrary set of (semi) independent tasks?
Often you do not need the full power of Map/Reduce, but only need to run multiple instances of the same program, either on different parts of the data, or on the same data but with different parameters. You can use Hadoop Streaming to do this.
How do I process files, one per map?
As an example, consider the problem of zipping (compressing) a set of files across the Hadoop cluster. You can achieve this using either of these methods:
1. Hadoop Streaming and a custom mapper script:
   - Generate a file containing the full HDFS paths of the input files. Each map task would get one file name as input.
   - Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory (a sketch of such a script is shown after this list).
2. The existing Hadoop Framework:
   - Add these commands to your main function:

     FileOutputFormat.setCompressOutput(conf, true);
     FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);
     conf.setOutputFormat(NonSplitableTextInputFormat.class);
     conf.setNumReduceTasks(0);

   - Write your map function:

     public void map(WritableComparable key,
                     Writable value,
                     OutputCollector output,
                     Reporter reporter)
         throws IOException {
       output.collect((Text) value, null);
     }

   - Note that the output filename will not be the same as the original filename.
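For the custom mapper script in method 1, a rough illustrative sketch in Python could look like the following. The output directory is hypothetical, and the sketch assumes the hadoop command is available on the PATH of the task nodes.

#!/usr/bin/env python
# Sketch of a per-file gzip mapper for method 1 above.
# Each input line is the full HDFS path of one file to compress.
import os
import subprocess
import sys

OUTPUT_DIR = "/user/me/zipped"  # hypothetical destination directory

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    local_name = os.path.basename(hdfs_path)
    # Copy the file to the task's local disk, compress it, and put it back.
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
    subprocess.check_call(["gzip", local_name])
    subprocess.check_call(["hadoop", "fs", "-put", local_name + ".gz",
                           OUTPUT_DIR + "/" + local_name + ".gz"])
    # Emit the destination path so the job output records what was written.
    sys.stdout.write(OUTPUT_DIR + "/" + local_name + ".gz" + "\n")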
How many reducers should I use?
See the Hadoop Wiki for details: Reducer
If I set up an alias in my shell script, will that work after -mapper?
For example, say I do: alias c1='cut -f1'. Will -mapper "c1" work?
Using an alias will not work, but variable substitution is allowed, as shown in this example:
$ hadoop dfs -cat samples/student_marks
alice 50
bruce 70
charlie 80
dan 75
$ c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.job.name='Experiment' \
    -input /user/me/samples/student_marks \
    -output /user/me/samples/student_out \
    -mapper \"$c2\" -reducer 'cat'
$ hadoop dfs -ls samples/student_out
Found 1 items
/user/me/samples/student_out/part-00000 <r 3> 16
$ hadoop dfs -cat samples/student_out/part-00000
50
70
75
80
Can I use UNIX pipes?
For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
Currently this does not work and gives a "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.
What do I do if I get the "No space left on device" error?
For example, when I run a streaming job by distributing large executables (for example, 3.6G) through the -file option, I get a "No space left on device" error.
The jar packaging happens in a directory pointed to by the configuration variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the value to a directory with more space:
-D stream.tmpdir=/export/bigspace/...
How do I specify multiple input directories?
You can specify multiple input directories with multiple '-input' options:
hadoop jar hadoop-streaming.jar -input '/user/foo/dir1' -input '/user/foo/dir2'
How do I generate output files with gzip format?
Instead of plain text files, you can generate gzip files as your output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
How do I provide my own input/output format with streaming?
At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default Hadoop streaming jar.
How do I parse XML documents using streaming?
You can use the record reader StreamXmlRecordReader to process XML documents.
hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" ..... (rest of the command)
Anything found between BEGIN_STRING and END_STRING would be treated as one record for map tasks.
How do I update counters in streaming applications?
A streaming process can use stderr to emit counter information. reporter:counter:<group>,<counter>,<amount> should be sent to stderr to update the counter.
How do I update status in streaming applications?
A streaming process can use stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.
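A short sketch showing both forms from a Python streaming task; the group and counter names here are made up for illustration.

#!/usr/bin/env python
# Sketch: update a counter and the task status from a streaming mapper.
import sys

processed = 0
for line in sys.stdin:
    sys.stdout.write(line)  # identity mapper
    processed += 1
    # Counter update: reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write("reporter:counter:MyGroup,LinesProcessed,1\n")

# Status update: reporter:status:<message>
sys.stderr.write("reporter:status:Processed %d lines\n" % processed)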
How do I get the JobConf variables in a streaming job's mapper/reducer?
During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.
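For example, a Python streaming task could read the transformed names from its environment like this (a minimal sketch; the two variables shown are the ones mentioned above):

#!/usr/bin/env python
# Sketch: read JobConf values exposed to streaming tasks as environment
# variables (dots in the property names become underscores).
import os
import sys

job_id = os.environ.get("mapred_job_id", "unknown")   # from mapred.job.id
job_jar = os.environ.get("mapred_jar", "unknown")     # from mapred.jar
sys.stderr.write("reporter:status:running in job %s\n" % job_id)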