Hadoop Study Notes 3: Developing MapReduce
Quick notes:
Maven is a project management tool; project information is configured through XML.
Maven POM (Project Object Model).
Steps:
1. set up and configure the development environment.
2. write your map and reduce functions and run them in local (standalone) mode from the command line or within your IDE.
3. unit test --> test on a small dataset --> test on the full dataset after deploying to a cluster
--> tuning
1. Configuration API
- Components in Hadoop are configured using Hadoop’s own configuration API.
- org.apache.hadoop.conf package
- Configurations read their properties from resources — XML files with a simple structure for defining name-value pairs.
For example, write a configuration-1.xml like:
<?xml version="1.0"?>
<configuration>
<property>
<name>color</name>
<value>yellow</value>
<description>Color</description>
</property>
<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>
<property>
<name>weight</name>
<value>heavy</value>
<final>true</final>
<description>Weight</description>
</property>
<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
</configuration>
then access it with code like the following:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml"); // resources are added in order; later definitions override earlier ones

assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
Note:
- Type information is not stored in the XML file; instead, properties can be interpreted as a given type when they are read.
- The get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
- When more than one resource is added, they are applied in order, and properties in later resources override those defined earlier.
- However, properties that are marked as final cannot be overridden in later definitions.
- System properties take priority over properties defined in resources:
System.setProperty("size", "14");
- Options specified with -D (as parsed by ToolRunner/GenericOptionsParser) take priority over properties from the configuration files.
For example, -D mapreduce.job.reduces=2 will override the number of reducers set on the cluster or in any client-side configuration files.
% hadoop ConfigurationPrinter -D color=yellow | grep color
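ConfigurationPrinter here is a small utility that dumps every property in its configuration. A minimal sketch of such a tool (a Configuration is Iterable over its entries, and Configured/Tool/ToolRunner are standard Hadoop classes):

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();           // configuration populated by ToolRunner,
    for (Entry<String, String> entry : conf) { // including -conf files and -D options
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ConfigurationPrinter(), args));
  }
}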
2. Set up the development environment
The Maven POM (Project Object Model), an XML file named pom.xml, declares the dependencies needed for building and testing MapReduce programs; a sketch follows the list below.
- hadoop-client dependency, which contains all the Hadoop client-side classes needed to interact with HDFS and MapReduce.
- For running unit tests, we use junit,
- for writing MapReduce tests, we use mrunit.
- The hadoop-minicluster library contains the “mini-” clusters that are useful for testing with Hadoop clusters running in a single JVM.
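A minimal sketch of the dependencies section of such a POM (the version numbers are placeholders; pin hadoop.version to your cluster's release):

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-minicluster</artifactId>
    <version>${hadoop.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>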
Many IDEs can read Maven POMs directly, so you can just point them at the directory containing the pom.xml file and start writing code.
Alternatively, you can use Maven to generate configuration files for your IDE. For example, the following creates Eclipse configuration files so you can import the project into Eclipse:
% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
3. Managing configuration switching
It is common to switch between running the application locally and running it on a cluster.
- have Hadoop configuration files containing the connection settings for each cluster
- we assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml
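For instance, hadoop-localhost.xml might look like this (the property names are standard Hadoop ones; the ResourceManager port shown is the default and may differ in your setup):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
</configuration>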
For example, the following command shows a directory listing on the HDFS server running in pseudo-distributed mode on localhost:
% hadoop fs -conf conf/hadoop-localhost.xml -ls
Found 2 items
drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 input
drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 output
4. MapReduce example:
Mapper: extracts the year and the temperature from an input line
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                          // NCDC year field
    int airTemperature = Integer.parseInt(line.substring(87, 92)); // signed temperature
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
Unit test for the Mapper:
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                                  // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new MaxTemperatureMapper())
        .withInput(new LongWritable(0), value)
        .withOutput(new Text("1950"), new IntWritable(-11))
        .runTest();
  }
}
Reducer: computes the maximum temperature for each year
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue)); // emit (year, max temperature)
  }
}
Unit test for the Reducer:
// In MaxTemperatureReducerTest; needs java.util.Arrays and
// org.apache.hadoop.mrunit.mapreduce.ReduceDriver in addition to the imports above.
@Test
public void returnsMaximumIntegerInValues() throws IOException, InterruptedException {
  new ReduceDriver<Text, IntWritable, Text, IntWritable>()
      .withReducer(new MaxTemperatureReducer())
      .withInput(new Text("1950"),
          Arrays.asList(new IntWritable(10), new IntWritable(5)))
      .withOutput(new Text("1950"), new IntWritable(10))
      .runTest();
}
5. Writing a job driver
Using the Tool interface, it is easy to write a driver to run a MapReduce job; a minimal sketch follows.
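A minimal driver sketch along these lines, wiring up the mapper and reducer above (the book's version also configures a combiner, omitted here; the class lives in package v2 to match the commands below):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = Job.getInstance(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1; // 0 on success
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
  }
}

Because the driver extends Configured and runs through ToolRunner, generic options such as -conf, -fs, and -jt are parsed for free, which is what makes the local/cluster switching below work.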
Then compile and run the driver locally:
% mvn compile
% export HADOOP_CLASSPATH=target/classes/
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
input/ncdc/micro output
or
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
The local job runner uses a single JVM to run a job, so as long as all the classes that your job needs are on its classpath, things will just work.
6. Running on a cluster
A job's classes must be packaged into a job JAR file to send to the cluster; see the sketch below.
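With Maven, packaging and launching on a cluster looks roughly like this (the JAR name hadoop-examples.jar and the input/output paths are assumptions for illustration):

% mvn package -DskipTests
% unset HADOOP_CLASSPATH
% hadoop jar target/hadoop-examples.jar v2.MaxTemperatureDriver \
    -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

Unsetting HADOOP_CLASSPATH applies when the job has no dependencies beyond the job JAR itself; otherwise it would point at stale local classes.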