Hadoop 2.2 Programming: MRUnit
examples:
Overview
This document explains how to write unit tests for your MapReduce code, so that you can test your mapper and reducer logic on your desktop without setting up a Hadoop environment.
Let's look at some code
For testing your map and reduce logic, we will need 4 blocks of code: Mapper code, Reducer code, Driver code, and finally the Unit Testing code.
Sample Mapper
In our sample Mapper code, we simply split each input line into words and emit <word, 1> for each word found.
package com.kodkast.analytics;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class UnitTestDemoMapper extends MapReduceBase implements Mapper<Object, Text, Text, Text> {

    public static final Logger Log = Logger.getLogger(UnitTestDemoMapper.class.getName());
    private final static Text one = new Text("1");

    public void configure(JobConf conf) {
        // mapper initialization code, if needed
    }

    public void map(Object key, Text value, OutputCollector<Text, Text> collector, Reporter rep) throws IOException {
        // Emit <word, 1> for every word. An IOException from collect() now
        // propagates to the caller instead of being caught and swallowed.
        String input = value.toString();
        String[] words = processInput(input);
        for (int i = 0; i < words.length; i++) {
            Text textInput = new Text(words[i]);
            collector.collect(textInput, one);
        }
    }

    // Splits the input line on single spaces.
    private String[] processInput(String input) {
        String words[] = input.split(" ");
        return words;
    }
}
Sample Reducer
In our sample Reducer code, we simply add up all the counts for a word and emit the final result as <word, totalFrequency> for each word.
package com.kodkast.analytics;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UnitTestDemoReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // sum the counts emitted by the mapper for this word
        int count = 0;
        while (values.hasNext()) {
            String value = values.next().toString();
            count += Integer.parseInt(value);
        }
        String countStr = "" + count;
        output.collect(key, new Text(countStr));
    }
}
Sample Driver
Simple invocation of Mapper and Reducer code.
package com.kodkast.analytics;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class UnitTestDemo {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(UnitTestDemo.class);
        conf.setJobName("unit-test-demo");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(UnitTestDemoMapper.class);
        conf.setReducerClass(UnitTestDemoReducer.class);
        // args[0] is the input path, args[1] the output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Unit Testing Class
This is the new class we add to test our mapper and reducer logic. It uses the MRUnit framework, which is built on top of JUnit.
package com.kodkast.analytics;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mrunit.MapDriver;
import org.apache.hadoop.mrunit.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class UnitTestDemoTest {

    MapDriver<Object, Text, Text, Text> mapDriver;
    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        // create mapper and reducer objects
        UnitTestDemoMapper mapper = new UnitTestDemoMapper();
        UnitTestDemoReducer reducer = new UnitTestDemoReducer();
        // call mapper initialization code
        mapper.configure(new JobConf());
        // create mapdriver and reducedriver objects for unit testing
        mapDriver = new MapDriver<Object, Text, Text, Text>();
        mapDriver.setMapper(mapper);
        reduceDriver = new ReduceDriver<Text, Text, Text, Text>();
        reduceDriver.setReducer(reducer);
    }

    // Declaring IOException keeps the tests compiling on MRUnit versions
    // whose runTest() is declared to throw it.
    @Test
    public void testMapper() throws IOException {
        // prepare mapper input
        String input = "Hadoop is nice and Java is also very nice";
        // test mapper logic: one <word, 1> pair per word, in input order
        mapDriver.withInput(new LongWritable(1), new Text(input));
        mapDriver.withOutput(new Text("Hadoop"), new Text("1"));
        mapDriver.withOutput(new Text("is"), new Text("1"));
        mapDriver.withOutput(new Text("nice"), new Text("1"));
        mapDriver.withOutput(new Text("and"), new Text("1"));
        mapDriver.withOutput(new Text("Java"), new Text("1"));
        mapDriver.withOutput(new Text("is"), new Text("1"));
        mapDriver.withOutput(new Text("also"), new Text("1"));
        mapDriver.withOutput(new Text("very"), new Text("1"));
        mapDriver.withOutput(new Text("nice"), new Text("1"));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() throws IOException {
        // prepare mapper output values
        List<Text> values = new ArrayList<Text>();
        String mapperValues[] = "1,1".split(",");
        for (int i = 0; i <= mapperValues.length - 1; i++) {
            values.add(new Text(mapperValues[i]));
        }
        // test reducer logic
        reduceDriver.withInput(new Text("nice"), values);
        reduceDriver.withOutput(new Text("nice"), new Text("2"));
        reduceDriver.runTest();
    }
}
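To check by hand what the mapper and reducer should produce together for the sample sentence, the map, group-by-key, and reduce steps can be sketched in plain Java. This is only an illustration with no Hadoop or MRUnit classes involved; the class name `WordCountSketch` and its two helper methods are hypothetical, and they only mirror the logic of `UnitTestDemoMapper` and `UnitTestDemoReducer` shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the map -> group-by-key -> reduce flow (hypothetical helper).
final class WordCountSketch {

    // Map phase: emit one <word, "1"> pair per word, splitting on a single space
    // in the same way as UnitTestDemoMapper.
    static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new String[] {word, "1"});
        }
        return pairs;
    }

    // Shuffle and reduce: group the pairs by word and sum the parsed counts,
    // returning the totals as strings like UnitTestDemoReducer does.
    static Map<String, String> reduce(List<String[]> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String[] pair : pairs) {
            totals.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            result.put(entry.getKey(), String.valueOf(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("Hadoop is nice and Java is also very nice")));
    }
}
```

For the sample sentence, "is" and "nice" end up with a total of "2" and every other word with "1", which are the values a combined mapper and reducer test would need to expect.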
- Add Unit tests for testing the Map Reduce logic
The use of this framework is quite straightforward, especially in our business case. So I will just show the unit test code and some comments if necessary but I think it is quite obvious how to use it.
The unit test for the Mapper ‘MapperTest’:
package net.pascalalma.hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

/**
 * Created with IntelliJ IDEA.
 * User: pascal
 */
public class MapperTest {

    MapDriver<Text, Text, Text, Text> mapDriver;

    @Before
    public void setUp() {
        WordMapper mapper = new WordMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new Text("a"), new Text("ein"));
        mapDriver.withInput(new Text("a"), new Text("zwei"));
        mapDriver.withInput(new Text("c"), new Text("drei"));
        mapDriver.withOutput(new Text("a"), new Text("ein"));
        mapDriver.withOutput(new Text("a"), new Text("zwei"));
        mapDriver.withOutput(new Text("c"), new Text("drei"));
        mapDriver.runTest();
    }
}
This test class is actually even simpler than the Mapper implementation itself. You just define the input of the mapper and the expected output, and then let the configured MapDriver run the test. In our case the Mapper doesn’t do anything specific, but you can see how easy it is to set up a test case.
For completeness here is the test class of the Reducer:
package net.pascalalma.hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Created with IntelliJ IDEA.
 * User: pascal
 */
public class ReducerTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        AllTranslationsReducer reducer = new AllTranslationsReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        List<Text> values = new ArrayList<Text>();
        values.add(new Text("ein"));
        values.add(new Text("zwei"));
        reduceDriver.withInput(new Text("a"), values);
        reduceDriver.withOutput(new Text("a"), new Text("|ein|zwei"));
        reduceDriver.runTest();
    }
}
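The source of AllTranslationsReducer is not shown here, so the following is only a guess at the value-building step that would satisfy the expected output "|ein|zwei" in the test above. The class `TranslationJoiner` and its `join` method are hypothetical names, written in plain Java without Hadoop types so the logic can be checked on its own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value-building step; the real AllTranslationsReducer may differ.
final class TranslationJoiner {

    // Prefixes every value with "|" and concatenates them in iteration order.
    static String join(Iterable<String> values) {
        StringBuilder joined = new StringBuilder();
        for (String value : values) {
            joined.append('|').append(value);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("ein", "zwei");
        System.out.println(join(values)); // prints |ein|zwei
    }
}
```

A reducer built this way would convert each incoming Text value to a String, pass the values through `join`, and emit the result wrapped in a new Text.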
Debugging MapReduce Programs With MRUnit
The distributed nature of MapReduce programs makes debugging a challenge. Attaching a debugger to a remote process is cumbersome, and the lack of a single console makes it difficult to inspect what is occurring when several distributed copies of a mapper or reducer are running concurrently. Furthermore, operations that work on small amounts of input (e.g., saving the inputs to a reducer in an array) fail when running at scale, causing out-of-memory exceptions or other unintended effects.
A full discussion of how to debug MapReduce programs is beyond the scope of a single blog post, but I’d like to introduce you to a tool we designed at Cloudera to assist you with MapReduce debugging: MRUnit.
MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.
While this doesn’t solve the problem of distributed debugging, many common bugs in MapReduce programs can be caught and debugged locally. For this purpose, developers often try to use JUnit to test their MapReduce programs. The current state of the art often involves writing a set of tests that each create a JobConf object, which is configured to use a mapper and reducer, and then set to use the LocalJobRunner (via JobConf.set("mapred.job.tracker", "local")). A MapReduce job will then run in a single thread, reading its input from test files stored on the local filesystem and writing its output to another local directory.
This process provides a solid mechanism for end-to-end testing, but has several drawbacks. Developing new tests requires adding test inputs to files that are stored alongside one’s program. Validating correct output also requires filesystem access and parsing of the emitted data files. This involves writing a great deal of test harness code, which itself may contain subtle bugs. Finally, this process is slow. Each test requires several seconds to run. Users often find themselves aggregating several unrelated inputs into a single test (violating a unit testing principle of isolating unrelated tests) or performing less exhaustive testing due to the high barriers to test authorship.
The easiest way to test MapReduce programs is to include as little Hadoop-specific code as possible in one’s application. Parsers can operate on instances of String instead of Text, and mappers should instantiate instances of MySpecificParser to tokenize input data rather than embed parsing code in the body of MyMapper.map(). Your MySpecificParser implementation can then be tested with ordinary JUnit tests. Another class or method could then be used to perform processing on parsed lines.
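As a minimal sketch of this idea, the parsing logic can live in a plain Java class with no Hadoop imports. The class name MySpecificParser comes from the paragraph above; the `tokenize` method and its whitespace-splitting rule are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser with no Hadoop dependency, so ordinary JUnit tests can cover it.
final class MySpecificParser {

    // Splits a line on runs of whitespace and drops empty tokens.
    // (The splitting rule is an assumption for this sketch.)
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        if (line == null) {
            return tokens;
        }
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hadoop is  nice")); // prints [Hadoop, is, nice]
    }
}
```

A mapper would then call `MySpecificParser.tokenize(value.toString())` and wrap each token in a Text before collecting it, which leaves only a few lines of Hadoop-specific code inside map().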
But even with those components separately tested, your map() and reduce() calls should still be tested individually, as the composition of separate classes may cause unintended bugs to surface. MRUnit provides test drivers that accept programmatically specified inputs and outputs, which validate the correct behavior of mappers and reducers in isolation, as well as when composed in a MapReduce job. For instance, the following code checks whether the IdentityMapper emits the same (key, value) pair as output that it receives as input:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestExample {

    private Mapper<Text, Text, Text, Text> mapper;
    private MapDriver<Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        mapper = new IdentityMapper<Text, Text>();
        driver = new MapDriver<Text, Text, Text, Text>(mapper);
    }

    // IOException is declared for MRUnit versions whose runTest() throws it.
    @Test
    public void testIdentityMapper() throws IOException {
        driver.withInput(new Text("foo"), new Text("bar"))
              .withOutput(new Text("foo"), new Text("bar"))
              .runTest();
    }
}
The MapDriver orchestrates the test process, feeding the input ("foo" and "bar") record to the IdentityMapper when its runTest() method is called. It also passes a mock OutputCollector implementation to the mapper. The driver then validates the output received by the OutputCollector against the expected output ("foo" and "bar") record. If the actual and expected outputs mismatch, a JUnit assertion failure is raised, informing the developer of the error. More test drivers exist for testing individual reducers, as well as mapper/reducer compositions.
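The mechanism described above can be sketched in a few lines of plain Java. This is not MRUnit's implementation: `MiniMapDriver` and `SimpleMapper` are hypothetical names, keys and values are plain strings, and the "mock collector" is just a lambda that records every emitted pair so the driver can compare it with the expected pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative stand-in for a map driver; NOT MRUnit's actual implementation.
final class MiniMapDriver {

    // Hypothetical mapper shape: one (key, value) record in, any number of pairs out.
    interface SimpleMapper {
        void map(String key, String value, BiConsumer<String, String> collector);
    }

    private final SimpleMapper mapper;
    private String inputKey;
    private String inputValue;
    private final List<Map.Entry<String, String>> expected = new ArrayList<>();

    MiniMapDriver(SimpleMapper mapper) {
        this.mapper = mapper;
    }

    MiniMapDriver withInput(String key, String value) {
        this.inputKey = key;
        this.inputValue = value;
        return this;
    }

    MiniMapDriver withOutput(String key, String value) {
        expected.add(new SimpleEntry<>(key, value));
        return this;
    }

    // Runs the mapper once with a recording collector, then compares the
    // recorded pairs with the expected pairs, in order.
    void runTest() {
        List<Map.Entry<String, String>> actual = new ArrayList<>();
        mapper.map(inputKey, inputValue, (k, v) -> actual.add(new SimpleEntry<>(k, v)));
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        SimpleMapper identity = (k, v, out) -> out.accept(k, v);
        new MiniMapDriver(identity)
                .withInput("foo", "bar")
                .withOutput("foo", "bar")
                .runTest();
        System.out.println("identity mapper test passed");
    }
}
```

The fluent `withInput(...).withOutput(...).runTest()` chain is kept only to echo the shape of the test above; a real driver also handles Writable types, multiple inputs, and reporting.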
End-to-end tests involving JobConf configuration code, InputFormat and OutputFormat implementations, filesystem access, and larger scale testing are still necessary. But many errors can be quickly identified with small tests involving a single, well-chosen input record, and a suite of regression tests allows correct behavior to be assured in the face of ongoing changes to your data processing pipeline. We hope MRUnit helps your organization test code, find bugs, and improve its use of Hadoop by facilitating faster and more thorough test cycles.
MRUnit is open source and is included in Cloudera’s Distribution for Hadoop. For more information about MRUnit, including where to get it and how to use its API, see the MRUnit documentation page.
How to run MRUnit tests from the command line
Note: you first need to download and build MRUnit, then edit $HADOOP_HOME/libexec/hadoop-config.sh so that it adds $MRUnit_HOME/lib/*.jar, and then run source $HADOOP_HOME/libexec/hadoop-config.sh. After that, run the following commands:
javac -d class/ MaxTemperatureMapper.java MaxTemperatureMapperTest.java
jar -cvf test.jar -C class ./
java -cp test.jar:$CLASSPATH org.junit.runner.JUnitCore MaxTemperatureMapperTest
# or
yarn -cp test.jar:$CLASSPATH org.junit.runner.JUnitCore MaxTemperatureMapperTest