The Role of the Partitioner and the Combiner in Hadoop 1

1. Understanding the Partitioner

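The intermediate output of the map tasks is divided by key into r partitions, where r is the number of reduce tasks. Hadoop's default implementation is the HashPartitioner class, which takes the key's hash code modulo the number of reducers: (key.hashCode() & Integer.MAX_VALUE) % r. This guarantees that all records with the same key are handled by the same reduce task. When the hash codes are reasonably well distributed, it also spreads the keys roughly evenly across the reducers, so that no reducer is overloaded while others sit idle.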
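
The default hash partitioning described above can be sketched in plain Java. This is only an illustration: the class name HashPartitionSketch and the method partitionFor are made up for this example, and it uses String keys, whose hash codes differ from those of Hadoop's Text, so the reducer numbers printed here will not match what a real job assigns.

```java
public class HashPartitionSketch {

    // Same arithmetic as the default hash partitioning: clear the sign bit so
    // the value is never negative, then take the remainder by the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Hypothetical keys; "long" appears twice to show that equal keys
        // always land on the same reducer.
        String[] keys = {"long", "short", "right", "long"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 3));
        }
    }
}
```

Masking with Integer.MAX_VALUE matters because hashCode() may be negative, and a negative remainder would not be a valid partition number.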

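If Hadoop's built-in partitioning does not suit us, or our business logic requires a particular distribution, we can write our own class that implements the Partitioner interface and put the partitioning algorithm in its getPartition() method. Registering that class with setPartitionerClass() in the main method that submits the job is all that is needed.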

The following is a code example.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

/**
 * Input text, tab-separated:
 * kaka 1 28
 * hua 0 26
 * chao 1
 * tao 1 22
 * mao 0 29 22
 */
// Using a custom Partitioner
public class MyPartitioner {

    // Map function: tags each line as "long", "short" or "right"
    // depending on whether it has more than, fewer than, or exactly 3 fields.
    public static class MyMap extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String[] fields = value.toString().split("\t");
            // Debug output:
            // for (int i = 0; i < fields.length; i++) {
            //     System.out.print(fields[i] + "\t");
            // }
            // System.out.println(fields.length);
            Text word1 = new Text();
            if (fields.length > 3) {
                word1.set("long");
            } else if (fields.length < 3) {
                word1.set("short");
            } else {
                word1.set("right");
            }
            // The whole input line is passed through as the value.
            output.collect(word1, value);
        }
    }

    // Reduce function: writes every value out unchanged under its key.
    public static class MyReduce extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            System.out.println(key); // debug output
            while (values.hasNext()) {
                // Copy the Text itself. Text.getBytes() returns the backing
                // array, which can be longer than the valid data.
                output.collect(key, new Text(values.next()));
            }
        }
    }

    // The Partitioner interface extends JobConfigurable,
    // so there are two methods to override here.
    public static class MyPartitionerPar implements Partitioner<Text, Text> {
        /**
         * getPartition()
         * Input: the <key, value> pair and the number of reducers, numPartitions
         * Output: the number of the reducer the pair is assigned to (result)
         */
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int result = 0;
            System.out.println("numPartitions--" + numPartitions);
            if (key.toString().equals("long")) {
                result = 0 % numPartitions;
            } else if (key.toString().equals("short")) {
                result = 1 % numPartitions;
            } else if (key.toString().equals("right")) {
                result = 2 % numPartitions;
            }
            System.out.println("result--" + result);
            return result;
        }

        @Override
        public void configure(JobConf job) {
            // No configuration needed.
        }
    }

    // Arguments: /home/hadoop/input/PartitionerExample /home/hadoop/output/Partitioner
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyPartitioner.class);
        conf.setJobName("MyPartitioner");
        // Controls the number of reducers: we want 3 partitions, so we set 3 reducers.
        conf.setNumReduceTasks(3);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        // Set the partitioner class.
        conf.setPartitionerClass(MyPartitionerPar.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Set the mapper and reducer classes.
        conf.setMapperClass(MyMap.class);
        conf.setReducerClass(MyReduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

2. Understanding the Combiner

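Before the intermediate results are copied to the reducers, we can also run a Combiner over them on the map side. A Combiner merges the (key, value) pairs in a map task's output that share the same key into a single pair. The process is similar to reduce, and in many cases the Reducer class itself can serve as the Combiner, provided the operation is commutative and associative and its input and output types match the map output types. The Combiner, however, runs as part of the map task, on that task's own output. Because it reduces the number of (key, value) pairs in the intermediate results, it can greatly cut the network traffic incurred when the reducers copy data from the mappers.

Each reduce task has to contact many map task nodes to fetch the intermediate results that fall into the partition it is responsible for. It then runs the reduce function and writes one final result file. With R reduce tasks there are R final result files. These do not need to be merged into one, because they can serve directly as the input of another computation.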
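
To illustrate how a Combiner shrinks the intermediate results, here is a small plain-Java simulation that does not use Hadoop at all. The class CombinerSketch, the method combine, and the word-count style input are assumptions chosen for the example. It sums the values of pairs that share a key, the way a sum-style Combiner would on one map task's output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Merges all pairs with the same key into one pair by summing the values.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Simulated output of a single map task: one (word, 1) pair per word.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "a b a c a b".split(" ")) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        Map<String, Integer> combined = combine(mapOutput);
        // Prints: 6 pairs before, 3 after: {a=3, b=2, c=1}
        System.out.println(mapOutput.size() + " pairs before, "
                + combined.size() + " after: " + combined);
    }
}
```

In this toy input, six pairs leave the mapper but only three would have to cross the network. The saving grows with the number of repeated keys per map task.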

Summary of Combiner usage:

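Where the business logic allows it, a Combiner can greatly speed up a job. Used where it does not fit, it makes the final result wrong. Averaging is the classic case: the mean of per-map means is generally not the overall mean, so the averaging reduce function cannot simply be reused as the Combiner. Hadoop may also run the Combiner zero, one, or several times on a map task's output, so the job must produce the same result in every case.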
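
The averaging pitfall can be checked with a few lines of plain Java. This is a sketch with made-up numbers and hypothetical names (AverageCombinerPitfall, meanOfMeans, meanFromPartials), not Hadoop code. Two simulated map outputs are averaged in two ways: by averaging the per-split means, and by carrying (sum, count) pairs forward and dividing only at the end.

```java
public class AverageCombinerPitfall {

    // Wrong approach: each split is reduced to its mean, then the means are averaged.
    static double meanOfMeans(double[][] splits) {
        double total = 0;
        for (double[] split : splits) {
            double sum = 0;
            for (double x : split) {
                sum += x;
            }
            total += sum / split.length;
        }
        return total / splits.length;
    }

    // Safe approach: each split is reduced to a (sum, count) pair; the pairs are
    // added up and the division happens once, at the very end.
    static double meanFromPartials(double[][] splits) {
        double sum = 0;
        long count = 0;
        for (double[] split : splits) {
            for (double x : split) {
                sum += x;
            }
            count += split.length;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks for the same key.
        double[][] splits = { {1, 2, 3, 4}, {10} };
        System.out.println("mean of means:     " + meanOfMeans(splits));      // 6.25
        System.out.println("from (sum, count): " + meanFromPartials(splits)); // 4.0
    }
}
```

The true mean of the five values is 20 / 5 = 4.0. Averaging the per-split means 2.5 and 10 gives 6.25, because the split sizes are lost. Passing (sum, count) pairs keeps that information.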

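The following Combiner example computes an average correctly by passing partial (sum, count) pairs through the Combiner.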

package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AveragingWithCombiner extends Configured implements Tool {

    public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {
        static enum ClaimsCounters { MISSING, QUOTED };

        // Map method: emits (country, "numClaims,1") for each usable record.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // A negative limit keeps trailing empty fields.
            String[] fields = value.toString().split(",", -1);
            if (fields.length <= 8) {
                return; // skip lines that are too short to hold both fields
            }
            String country = fields[4];
            String numClaims = fields[8];
            // Skip empty values and quoted (non-numeric) values.
            if (numClaims.length() > 0 && !numClaims.startsWith("\"")) {
                context.write(new Text(country), new Text(numClaims + ",1"));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, DoubleWritable> {
        // Reduce method: adds up the partial sums and counts, then divides once.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            int count = 0;
            for (Text value : values) {
                String[] fields = value.toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        // Combine method: merges "sum,count" pairs into one "sum,count" pair.
        // Input and output types both match the map output types.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            int count = 0;
            for (Text value : values) {
                String[] fields = value.toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            context.write(key, new Text(sum + "," + count));
        }
    }

    // run method: creates and runs the job
    public int run(String[] args) throws Exception {
        // Use the configuration that ToolRunner has populated.
        Job job = new Job(getConf(), "AveragingWithCombiner");
        job.setJarByClass(AveragingWithCombiner.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // The map and combine output is (Text, Text); the reduce output is (Text, DoubleWritable).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new AveragingWithCombiner(), args);
        System.exit(res);
    }
}
