How to Use Hadoop's Partitioner

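Hadoop's MapReduce programming model is very flexible: most stages of a job expose APIs that we can override to fit our own special requirements.

The partition function, Partitioner, that I (Sanxian) want to cover today is no exception. Let us first look at what a Partitioner does:

It decides which reducer receives each map output record. The default implementation hashes the record's key so that the data is spread evenly across the reducers for further processing, which avoids hot spots.

The default partition function in Hadoop is HashPartitioner. Its source code is as follows: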

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    // AND the key's hash with Integer.MAX_VALUE to clear the sign bit, so that
    // a negative hashCode cannot produce a negative partition number
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
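To see why the mask matters, here is a minimal plain-Java sketch of the same expression with no Hadoop dependency. The class and method names (`HashPartitionDemo`, `partition`, `naivePartition`) are made up for illustration; only the formula comes from the source above.

```java
// Plain-Java sketch of the HashPartitioner formula; names are hypothetical.
class HashPartitionDemo {

    // Same expression as HashPartitioner.getPartition:
    // clear the sign bit, then take the remainder.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Without the mask: Java's % keeps the sign of the dividend,
    // so a negative hash gives a negative (invalid) partition number.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, so this hash is negative.
        Integer key = -7;
        System.out.println("naive  = " + naivePartition(key, 3)); // prints "naive  = -1"
        System.out.println("masked = " + partition(key, 3));      // prints "masked = 1"
    }
}
```

With the mask, the result always falls in the range 0 to numReduceTasks - 1, which is what the framework requires of a partition number.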

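Most of the time the default partition function is all we need, but some special requirements call for a custom Partitioner. Here is an example:

Partition the following data by string length: strings of length 1 go to one partition, strings of length 2 to another, and strings of length 3 to a third.

The default partition function cannot do this, so we need to write our own Partitioner. First, we need three output partitions, so the number of reduce tasks must be set to 3. Second, the Partitioner must assign partitions according to the key's length rather than its hash code. The core code is as follows: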

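/**
 * Partitioner
 */
public static class PPartition extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text arg0, Text arg1, int arg2) {
        /*
         * Custom partitioning: send strings of different lengths to different reducers.
         *
         * The data only contains strings of 3 distinct lengths, so the number of
         * reducers can be set to 3: one reducer per partition.
         */
        String key = arg0.toString();
        if (key.length() == 1) {
            return 1 % arg2;
        } else if (key.length() == 2) {
            return 2 % arg2;
        } else if (key.length() == 3) {
            return 3 % arg2;
        }
        // Any other length falls back to partition 0
        return 0;
    }
}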
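The effect of a length-based rule can be tried out without a cluster. The following is a rough plain-Java simulation, assuming the rule is "partition number = key length modulo the number of partitions" for lengths 1 to 3 and 0 otherwise. `LengthPartitionDemo`, `distribute`, and the sample keys are hypothetical and are not taken from the original data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of length-based partitioning; names and data are made up.
class LengthPartitionDemo {

    // Assumed rule: lengths 1..3 map to (length % numPartitions), anything else to 0.
    static int getPartition(String key, int numPartitions) {
        int len = key.length();
        if (len >= 1 && len <= 3) {
            return len % numPartitions;
        }
        return 0;
    }

    // Groups keys into one bucket per partition, the way the shuffle
    // would route them to reducers.
    static List<List<String>> distribute(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(getPartition(k, numPartitions)).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "bb", "ccc", "d", "ee"); // made-up sample keys
        List<List<String>> buckets = distribute(keys, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("partition " + i + " -> " + buckets.get(i));
        }
        // partition 0 -> [ccc]
        // partition 1 -> [a, d]
        // partition 2 -> [bb, ee]
    }
}
```

Note that under this rule keys of length 3 land in partition 0, because 3 % 3 is 0. Taking the remainder also keeps the result below the number of reducers if the job is ever configured with fewer than three.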

The complete code is as follows:

package com.partition.test;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author qindongliang
 *
 * Big-data discussion group: 376932160
 */
public class MyTestPartition {

    /**
     * Map task
     */
    public static class PMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is split on ';' into a key and a value
            String[] ss = value.toString().split(";");
            context.write(new Text(ss[0]), new Text(ss[1]));
        }
    }

    /**
     * Partitioner
     */
    public static class PPartition extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text arg0, Text arg1, int arg2) {
            /*
             * Custom partitioning: send strings of different lengths to different reducers.
             *
             * The data only contains strings of 3 distinct lengths, so the number of
             * reducers can be set to 3: one reducer per partition.
             */
            String key = arg0.toString();
            if (key.length() == 1) {
                return 1 % arg2;
            } else if (key.length() == 2) {
                return 2 % arg2;
            } else if (key.length() == 3) {
                return 3 % arg2;
            }
            // Any other length falls back to partition 0
            return 0;
        }
    }

    /**
     * Reduce task
     */
    public static class PReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text arg0, Iterable<Text> arg1, Context arg2)
                throws IOException, InterruptedException {
            String key = arg0.toString();
            System.out.println("key==> " + key);
            // Write every value through unchanged
            for (Text t : arg1) {
                arg2.write(arg0, t);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyTestPartition.class);
        conf.set("mapred.job.tracker", "192.168.75.130:9001");
        conf.setJar("tt.jar");

        /* The job */
        Job job = new Job(conf, "testpartion");
        job.setJarByClass(MyTestPartition.class);
        System.out.println("Mode: " + conf.get("mapred.job.tracker"));

        job.setPartitionerClass(PPartition.class);
        job.setNumReduceTasks(3); // set to 3: one reducer per partition
        job.setMapperClass(PMapper.class);
        job.setReducerClass(PReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Delete the output directory if it already exists
        String path = "hdfs://192.168.75.130:9000/root/outputdb";
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path(path);
        if (fs.exists(p)) {
            fs.delete(p, true);
            System.out.println("Output path exists; deleted!");
        }

        FileInputFormat.setInputPaths(job, "hdfs://192.168.75.130:9000/root/input");
        FileOutputFormat.setOutputPath(job, p);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The run output is as follows:

Output path exists; deleted!
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Total input paths to process : 1
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Snappy native library not loaded
Running job: job_201404101853_0005
map 0% reduce 0%
map 100% reduce 0%
map 100% reduce 11%
map 100% reduce 22%
map 100% reduce 55%
map 100% reduce 100%
Job complete: job_201404101853_0005
Counters: 29
  Job Counters
    Launched reduce tasks=3
    SLOTS_MILLIS_MAPS=7422
    Total time spent by all reduces waiting after reserving slots (ms)=0
    Total time spent by all maps waiting after reserving slots (ms)=0
    Launched map tasks=1
    Data-local map tasks=1
    SLOTS_MILLIS_REDUCES=30036
  File Output Format Counters
    Bytes Written=61
  FileSystemCounters
    FILE_BYTES_READ=93
    HDFS_BYTES_READ=179
    FILE_BYTES_WRITTEN=218396
    HDFS_BYTES_WRITTEN=61
  File Input Format Counters
    Bytes Read=68
  Map-Reduce Framework
    Map output materialized bytes=93
    Map input records=7
    Reduce shuffle bytes=93
    Spilled Records=14
    Map output bytes=61
    Total committed heap usage (bytes)=207491072
    CPU time spent (ms)=2650
    Combine input records=0
    SPLIT_RAW_BYTES=111
    Reduce input records=7
    Reduce input groups=7
    Combine output records=0
    Physical memory (bytes) snapshot=422174720
    Reduce output records=7
    Virtual memory (bytes) snapshot=2935713792
    Map output records=7

After the run there are three result files, one per reducer: part-r-00000, part-r-00001, and part-r-00002.

With that, our custom partitioning strategy splits the data exactly as intended.

Summary, quoting a passage:

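Why partitioning is necessary: how do we use Hadoop to produce a globally sorted file? The simplest way is to use a single partition, but that is extremely inefficient for large files, because one machine has to process all of the output, which throws away the parallelism that MapReduce provides. Instead we can do the following: first, create a series of sorted files; second, concatenate them (similar to merge sort); finally, obtain one globally sorted file. The main idea is to use a partitioner that describes the global ordering of the output. Say we have 1000 values in the range 1 to 10000 and run 10 reduce tasks. If partitioning sends values in 1-1000 to the first reducer, values in 1001-2000 to the second, and so on, then every value assigned to the n-th reducer is greater than every value assigned to the (n-1)-th reducer. Each reducer's output is sorted, so we only need to cat all the output files into one large file, and that file is globally sorted.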
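The range idea from this passage can be sketched in plain Java. This is only an illustration under the assumption that the value range is known in advance and is cut into equal-width ranges; `RangePartitionDemo`, `rangePartition`, and `sortViaPartitions` are hypothetical names, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of range partitioning for a global sort; names are hypothetical.
class RangePartitionDemo {

    // Maps a value in [min, max] to one of numPartitions equal-width ranges.
    static int rangePartition(int value, int min, int max, int numPartitions) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        // Width of each range, rounded up so the last range still covers max.
        long width = ((long) max - min + numPartitions) / numPartitions;
        return (int) (((long) value - min) / width);
    }

    // Simulates the job: route each value to its partition, sort every
    // partition independently, then concatenate the partitions in order.
    static List<Integer> sortViaPartitions(List<Integer> values, int min, int max,
                                           int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int v : values) {
            parts.get(rangePartition(v, min, max, numPartitions)).add(v);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(9500, 3, 1001, 1000, 4200); // made-up values
        System.out.println(sortViaPartitions(values, 1, 10000, 10));
        // prints [3, 1000, 1001, 4200, 9500]
    }
}
```

Because every value in partition n is larger than every value in partition n-1, concatenating the individually sorted partitions gives a globally sorted result without a final merge step.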

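That is the basic idea, but one problem remains: how do we choose the ranges when the data set is large and we do not know its distribution? A fairly simple approach is sampling. With 100 million records, for example, we can draw a sample of, say, 10,000 records and derive the range boundaries from the sample. In Hadoop we can replace the default partitioner with TotalOrderPartitioner and pass it the sampling result to get the partitioning we want. For the sampling itself, Hadoop's InputSampler provides several samplers, including RandomSampler, SplitSampler, and IntervalSampler.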
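The sampling step can be illustrated in plain Java as well. The sketch below is not Hadoop's implementation: it simply assumes that split points are taken at evenly spaced positions of a sorted sample and that a key's partition is found by binary search over those split points. `SampledSplitDemo`, `splitPoints`, `partitionFor`, and the sample values are all hypothetical.

```java
import java.util.Arrays;

// Plain-Java sketch: derive range boundaries from a sample, then look up
// the partition of a key. Names and data are made up for illustration.
class SampledSplitDemo {

    // Picks numPartitions - 1 split points at evenly spaced positions
    // of the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            int index = (int) ((long) i * sorted.length / numPartitions);
            splits[i - 1] = sorted[index];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; keys in [splits[i-1], splits[i])
    // go to partition i; keys at or above the last split go to the last partition.
    // Assumes the split points are distinct.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] sample = {50, 10, 90, 30, 70, 20, 80, 40, 60}; // made-up sample
        int[] splits = splitPoints(sample, 3);
        System.out.println(Arrays.toString(splits)); // prints [40, 70]
        System.out.println(partitionFor(10, splits)); // prints 0
        System.out.println(partitionFor(40, splits)); // prints 1
        System.out.println(partitionFor(100, splits)); // prints 2
    }
}
```

The better the sample reflects the real distribution, the more evenly the ranges load the reducers; a skewed sample still gives a correct global order, just with uneven partition sizes.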

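In this way we can sort large volumes of data on a distributed file system. We can also define the comparison rule ourselves by overriding the compare function of the key's comparator (for example, a WritableComparator subclass), which makes it possible to sort strings or other non-numeric types, and to implement secondary or even multi-level sorting.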
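What a custom comparison rule for secondary sorting looks like can be shown with an ordinary `java.util.Comparator`. This is only a sketch of the ordering logic; in a real job the same logic would typically live in the key's comparator and be registered with `job.setSortComparatorClass`. The `SecondarySortDemo` class, its `Pair` key type, and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of a secondary-sort ordering; names and data are made up.
class SecondarySortDemo {

    // Hypothetical composite key: a string part and a numeric part.
    static final class Pair {
        final String first;
        final int second;

        Pair(String first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return first + ":" + second;
        }
    }

    // Primary order: the string part. Secondary order: the numeric part.
    static final Comparator<Pair> ORDER = new Comparator<Pair>() {
        @Override
        public int compare(Pair a, Pair b) {
            int c = a.first.compareTo(b.first);
            return c != 0 ? c : Integer.compare(a.second, b.second);
        }
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<Pair> sorted(List<Pair> input) {
        List<Pair> copy = new ArrayList<>(input);
        Collections.sort(copy, ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Pair> records = Arrays.asList(
                new Pair("b", 2), new Pair("a", 3), new Pair("a", 1)); // made-up records
        System.out.println(sorted(records)); // prints [a:1, a:3, b:2]
    }
}
```

Adding further fields to the key and further tie-breaking steps to `compare` extends the same pattern to multi-level sorting.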
