Sample file:

100 99
100 98
100 56
100 78
20 100
30 100
20 50
30 50
30 60
20 80

Requirement: group the records by the first number, then sort each group by the second number.
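To make the requirement concrete, here is a minimal plain-Java sketch that produces the expected grouping in memory, without Hadoop. The class name GroupThenSortSketch and the method groupAndSort are made up for illustration and are not part of the original example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupThenSortSketch {
  /** Groups rows by their first number and sorts each group's second numbers ascending. */
  static Map<Integer, List<Integer>> groupAndSort(int[][] rows) {
    // TreeMap keeps the groups ordered by the first number.
    Map<Integer, List<Integer>> groups = new TreeMap<>();
    for (int[] row : rows) {
      groups.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
    }
    // Sort the second numbers inside every group.
    for (List<Integer> seconds : groups.values()) {
      Collections.sort(seconds);
    }
    return groups;
  }

  public static void main(String[] args) {
    int[][] rows = {
      {100, 99}, {100, 98}, {100, 56}, {100, 78}, {20, 100},
      {30, 100}, {20, 50}, {30, 50}, {30, 60}, {20, 80}
    };
    System.out.println(groupAndSort(rows));
  }
}
```

This only works while all the data fits in one JVM. The MapReduce version in this post gets the same ordering from the framework's sort instead of an explicit in-memory sort.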

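Solution:

First, records that share the same first number must go to the same reduce task, which requires a custom Partitioner.

The default HashPartitioner assigns reduce tasks by the hash of the key, but here the key is the custom IntPair type, whose hash depends on both numbers. The partitioner therefore needs special handling: it picks the reduce task from the first value only.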
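The partitioning idea can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Hadoop's Partitioner API: the class FirstPartitionSketch and the method partitionFor are hypothetical names, and the multiplier 127 is borrowed from the FirstPartitioner in the listing further down.

```java
final class FirstPartitionSketch {
  /**
   * Picks a partition index in [0, numPartitions) from the first number only.
   * The second number is accepted but deliberately ignored.
   */
  static int partitionFor(int first, int second, int numPartitions) {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result
    // is never negative even if the multiplication overflows.
    return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    // Both records have first == 100, so they land in the same partition.
    System.out.println(partitionFor(100, 99, 4) == partitionFor(100, 56, 4));
  }
}
```

Because the second number never enters the computation, every record with the same first number is routed to the same reduce task, whatever the number of partitions.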

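Sorting follows the key by default, so it needs no special handling: IntPair compares the first number and then the second.

How is the grouping achieved, then? The goal is that for each first number, all of its second numbers arrive in the reduce value iterator, already in order.

By default, values are gathered into one reduce call only when their keys are equal. The trick used here is to gather the values together even though the keys differ.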

The key line is:

job.setGroupingComparatorClass(FirstGroupingComparator.class);

This call sets what the grouping is based on. Looking further into the source, it delegates to:

conf.setOutputValueGroupingComparator(cls);

Its documentation says:

* <p>For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
  * in a single call to the reduce function if K1 and K2 compare as equal.</p>
  *
  * <p>Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
  * how keys are sorted, this can be used in conjunction to simulate
  * <i>secondary sort on values</i>.</p>

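These settings take effect on the reduce side, during the shuffle phase. There the data copied from the map tasks is merge-sorted, and records whose keys compare as equal under the grouping comparator are aggregated into a single reduce call. That call is made with the key of the first record in the group; the other keys with the same first number do not trigger reduce calls of their own. Because the second number was also placed in the value, it is not lost, and this is how the secondary sort is achieved.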
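The interplay between the sort order and the grouping comparator can be simulated in plain Java. The sketch below is only a model of the behavior described above, not Hadoop's implementation; GroupingSketch and sortAndGroup are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GroupingSketch {
  /** Sorts pairs by (first, second), then starts a new group whenever first changes. */
  static List<List<int[]>> sortAndGroup(int[][] pairs) {
    int[][] sorted = pairs.clone();
    // Sort step: compare the full key, first and then second, like IntPair.compareTo.
    Arrays.sort(sorted,
        Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));

    List<List<int[]>> groups = new ArrayList<>();
    for (int[] pair : sorted) {
      // Grouping step: only first decides whether the pair joins the current group.
      boolean sameGroup = !groups.isEmpty()
          && groups.get(groups.size() - 1).get(0)[0] == pair[0];
      if (!sameGroup) {
        groups.add(new ArrayList<>());
      }
      groups.get(groups.size() - 1).add(pair);
    }
    return groups;
  }

  public static void main(String[] args) {
    int[][] pairs = {{100, 99}, {20, 80}, {100, 56}, {20, 50}};
    for (List<int[]> group : sortAndGroup(pairs)) {
      StringBuilder line = new StringBuilder(group.get(0)[0] + ":");
      for (int[] pair : group) {
        line.append(' ').append(pair[1]);
      }
      System.out.println(line);
    }
  }
}
```

Each inner list corresponds to one reduce call. The values inside it are already ordered by the second number because the sort ran on the full key before the grouping was applied.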

Result:

------------------------------------------------
20    50
20    80
20    100
------------------------------------------------
30    50
30    60
30    100
------------------------------------------------
100    56
100    78
100    98
100    99

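This example shows some of the most central pieces of Hadoop: Writable and RawComparator.

The former shows how Hadoop serializes data, and the latter shows the comparison mechanism behind Hadoop's sorting.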
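The two pieces fit together through the byte layout: IntPair.write stores each int as value - Integer.MIN_VALUE, which flips the sign bit so that comparing the serialized bytes as unsigned values gives the same order as comparing the ints. The plain-Java sketch below illustrates this. ByteComparableSketch, encode and compareUnsigned are hypothetical names, and ByteBuffer stands in for DataOutput.writeInt on the assumption that both write the int big-endian.

```java
import java.nio.ByteBuffer;

final class ByteComparableSketch {
  /** Encodes an int as 4 big-endian bytes after subtracting MIN_VALUE (sign bit flipped). */
  static byte[] encode(int value) {
    return ByteBuffer.allocate(4).putInt(value - Integer.MIN_VALUE).array();
  }

  /** Lexicographic comparison that treats every byte as unsigned (0..255). */
  static int compareUnsigned(byte[] a, byte[] b) {
    for (int i = 0; i < a.length && i < b.length; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x < y ? -1 : 1;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    // With the encoding, byte order agrees with numeric order: -5 sorts before 3.
    System.out.println(compareUnsigned(encode(-5), encode(3)));
    // Without it, the raw two's-complement bytes of -5 would sort after those of 3.
    byte[] rawMinusFive = ByteBuffer.allocate(4).putInt(-5).array();
    byte[] rawThree = ByteBuffer.allocate(4).putInt(3).array();
    System.out.println(compareUnsigned(rawMinusFive, rawThree));
  }
}
```

This is why the comparators in the listing can compare the serialized bytes directly, and why the grouping comparator only needs to look at the first Integer.SIZE / 8 bytes.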

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.examples;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort
 * <i>in-dir</i> <i>out-dir</i>
 */
public class SecondarySort {

  /**
   * Define a pair of integers that are writable.
   * They are serialized in a byte comparable format.
   */
  public static class IntPair
      implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /**
     * Set the left and right values.
     */
    public void set(int left, int right) {
      first = left;
      second = right;
    }

    public int getFirst() {
      return first;
    }

    public int getSecond() {
      return second;
    }

    /**
     * Read the two integers.
     * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE-> -1
     */
    @Override
    public void readFields(DataInput in) throws IOException {
      first = in.readInt() + Integer.MIN_VALUE;
      second = in.readInt() + Integer.MIN_VALUE;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(first - Integer.MIN_VALUE);
      out.writeInt(second - Integer.MIN_VALUE);
    }

    @Override
    public int hashCode() {
      // Multiplying first by a prime (157) keeps swapped pairs such as
      // (1, 2) and (2, 1) from hashing to the same value.
      return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
      if (right instanceof IntPair) {
        IntPair r = (IntPair) right;
        return r.first == first && r.second == second;
      } else {
        return false;
      }
    }

    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }

      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }

    static { // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }

    @Override
    public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
  }

  /**
   * Partition based on the first part of the pair.
   */
  public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value,
                            int numPartitions) {
      // Mask off the sign bit instead of using Math.abs, which stays negative
      // when the product overflows to Integer.MIN_VALUE.
      return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /**
   * Compare only the first part of the pair, so that reduce is called once
   * for each value of the first part.
   */
  public static class FirstGroupingComparator
      implements RawComparator<IntPair> {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return WritableComparator.compareBytes(b1, s1, Integer.SIZE / 8,
                                             b2, s2, Integer.SIZE / 8);
    }

    @Override
    public int compare(IntPair o1, IntPair o2) {
      int l = o1.getFirst();
      int r = o2.getFirst();
      return l == r ? 0 : (l < r ? -1 : 1);
    }
  }

  /**
   * Read two integers from each line and generate a key, value pair
   * as ((left, right), right).
   */
  public static class MapClass
      extends Mapper<LongWritable, Text, IntPair, IntWritable> {

    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue,
                    Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(inValue.toString());
      int left = 0;
      int right = 0;
      if (itr.hasMoreTokens()) {
        left = Integer.parseInt(itr.nextToken());
        if (itr.hasMoreTokens()) {
          right = Integer.parseInt(itr.nextToken());
        }
        key.set(left, right);
        value.set(right);
        context.write(key, value);
      }
    }
  }

  /**
   * A reducer class that writes a separator line for each group, then emits
   * the first number paired with each value in the group.
   */
  public static class Reduce
      extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private static final Text SEPARATOR =
        new Text("------------------------------------------------");
    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      context.write(SEPARATOR, null);
      first.set(Integer.toString(key.getFirst()));
      for (IntWritable value : values) {
        context.write(first, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Hard-coded arguments for a test run; remove this line to use the
    // arguments given on the command line.
    args = "-Dio.sort.mb=10 hdfs://namenode:9000/user/hadoop/test/intpair.txt hdfs://namenode:9000/user/hadoop/secsortout".split(" ");

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: secondarysort <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);
    job.setGroupingComparatorClass(FirstGroupingComparator.class);

    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);

    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // myUtils.myUtils.DeleteFolder is the author's own helper and is not shown
    // in this post; judging by its name, it clears an existing output folder.
    myUtils.myUtils.DeleteFolder(conf, otherArgs[1]);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
