A Brief Analysis of the SecondarySort Sample

Sample input file:

100 99
100 98
100 56
100 78
20 100
30 100
20 50
30 50
30 60
20 80

Requirement: group the records by the first number, and within each group sort by the second number.

Solution:

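First, records with the same first number must go to the same reduce task, which requires a custom Partitioner. The default HashPartitioner picks the reduce task from the hash of the whole key, but our key is the custom type IntPair, whose hash also depends on the second number. So the partitioner needs special handling: it assigns the reduce task from the first value only.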
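The difference between the two partitioning strategies can be sketched in plain Java without a cluster. This is only an illustration: the class name PartitionSketch and its helper methods are hypothetical, the reducer count of 3 is an assumption, and the hash formula is copied from IntPair.hashCode() in the listing further down.

```java
public class PartitionSketch {
    // Same formula as IntPair.hashCode() in the sample source.
    static int pairHash(int first, int second) {
        return first * 157 + second;
    }

    // HashPartitioner's rule: clear the sign bit, then take the remainder.
    static int hashPartition(int first, int second, int numPartitions) {
        return (pairHash(first, second) & Integer.MAX_VALUE) % numPartitions;
    }

    // Partition on the first field only, the idea behind FirstPartitioner.
    static int firstPartition(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int[][] rows = {{100, 99}, {100, 98}, {100, 56}, {100, 78}};
        int reducers = 3; // assumed number of reduce tasks
        for (int[] r : rows) {
            System.out.println(r[0] + " " + r[1]
                + " -> hash partition " + hashPartition(r[0], r[1], reducers)
                + ", first-only partition " + firstPartition(r[0], reducers));
        }
    }
}
```

With the whole-key hash, rows that share the first number 100 can land on different reducers; with the first-only rule they always land on the same one.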

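Sorting needs no special handling: by default the framework sorts by key, and IntPair compares by the first number and then by the second.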

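How, then, is the grouping achieved, so that all second numbers sharing a first number show up in a single reduce value iterator, already in order? By default, values are gathered into one reduce call only when their keys are equal. The trick used here is to have the values gathered together even though their keys differ.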
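The combined effect of "sort by the full key, group by the first number only" can be imitated in memory. The sketch below is not Hadoop code: GroupingSketch and secondarySort are hypothetical names, and a LinkedHashMap stands in for the sequence of reduce calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingSketch {
    // Returns first -> list of seconds, one map entry per simulated reduce call.
    static Map<Integer, List<Integer>> secondarySort(int[][] rows) {
        int[][] sorted = rows.clone();
        // Sort step: order by (first, second), like IntPair.compareTo.
        Arrays.sort(sorted,
            Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]));
        // Grouping step: after sorting, rows with the same first are adjacent,
        // so appending in order keeps each group's seconds sorted.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] r : sorted) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {100, 99}, {100, 98}, {100, 56}, {100, 78}, {20, 100},
            {30, 100}, {20, 50}, {30, 50}, {30, 60}, {20, 80}
        };
        secondarySort(rows).forEach((first, seconds) ->
            System.out.println(first + " " + seconds));
    }
}
```

Each map entry corresponds to one reduce call: the keys inside a group differ in their second number, yet their values arrive together and in order.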

The key line is:

job.setGroupingComparatorClass(FirstGroupingComparator.class);

This call sets the comparator that decides how records are grouped. Looking further into the source, it delegates to:

conf.setOutputValueGroupingComparator(cls);

The Javadoc for that method says:

* For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
* in a single call to the reduce function if K1 and K2 compare as equal.
*
* Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
* how keys are sorted, this can be used in conjunction to simulate
* secondary sort on values.

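These settings take effect on the reduce side. Once the shuffle has copied the map output over and merge-sorted it, adjacent keys are compared with the grouping comparator, and the values of all keys that compare as equal are passed to a single reduce call. That call is made with the first key of the group; the other keys with the same first number do not get calls of their own. Because the map step also put the second number into the value, no second number is lost, and the secondary sort is achieved.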

Result:

------------------------------------------------
20    50
20    80
20    100
------------------------------------------------
30    50
30    60
30    100
------------------------------------------------
100    56
100    78
100    98
100    99

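This sample shows two of the most central pieces of Hadoop: Writable and RawComparator. The former shows how Hadoop serializes data; the latter shows the comparison mechanism behind Hadoop's sorting.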
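Why the serialized form can be compared byte by byte is easy to miss. IntPair.write() subtracts Integer.MIN_VALUE before writing, which flips the sign bit so that unsigned byte order matches numeric order. The sketch below illustrates this under stated assumptions: EncodingSketch and its methods are hypothetical names, ByteBuffer (big-endian by default) stands in for DataOutput.writeInt, and compareBytes is a hand-written unsigned lexicographic comparison rather than Hadoop's own WritableComparator.compareBytes.

```java
import java.nio.ByteBuffer;

public class EncodingSketch {
    // Same transform as IntPair.write(): subtracting MIN_VALUE flips the sign bit.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v - Integer.MIN_VALUE).array();
    }

    // Plain two's-complement bytes, for contrast.
    static byte[] raw(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Encoded: -5 sorts before 3, as it should.
        System.out.println(compareBytes(encode(-5), encode(3)) < 0);
        // Raw: -5 starts with byte 0xFF, so it wrongly sorts after 3.
        System.out.println(compareBytes(raw(-5), raw(3)) > 0);
    }
}
```

This is also why FirstGroupingComparator can compare just the first Integer.SIZE/8 bytes of each serialized key: those four bytes are the encoded first number.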

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.examples;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads the text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort
 *            <i>in-dir</i> <i>out-dir</i>
 */
public class SecondarySort {

  /**
   * Define a pair of integers that are writable.
   * They are serialized in a byte comparable format.
   */
  public static class IntPair
                      implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /**
     * Set the left and right values.
     */
    public void set(int left, int right) {
      first = left;
      second = right;
    }

    public int getFirst() {
      return first;
    }

    public int getSecond() {
      return second;
    }

    /**
     * Read the two integers.
     * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE-> -1
     */
    @Override
    public void readFields(DataInput in) throws IOException {
      first = in.readInt() + Integer.MIN_VALUE;
      second = in.readInt() + Integer.MIN_VALUE;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(first - Integer.MIN_VALUE);
      out.writeInt(second - Integer.MIN_VALUE);
    }

    @Override
    public int hashCode() {
      return first * 157 + second;// why multiply 157?
    }

    @Override
    public boolean equals(Object right) {
      if (right instanceof IntPair) {
        IntPair r = (IntPair) right;
        return r.first == first && r.second == second;
      } else {
        return false;
      }
    }

    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }

      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }

    static {                                        // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }

    @Override
    public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
  }

  /**
   * Partition based on the first part of the pair.
   */
  public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>{
    @Override
    public int getPartition(IntPair key, IntWritable value,
                            int numPartitions) {
      // Clear the sign bit instead of calling Math.abs, which stays negative
      // when the product overflows to Integer.MIN_VALUE.
      return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /**
   * Compare only the first part of the pair, so that reduce is called once
   * for each value of the first part.
   */
  public static class FirstGroupingComparator
                implements RawComparator<IntPair> {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      // Only the first Integer.SIZE/8 bytes (the encoded first int) are compared.
      return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8,
                                             b2, s2, Integer.SIZE/8);
    }

    @Override
    public int compare(IntPair o1, IntPair o2) {
      int l = o1.getFirst();
      int r = o2.getFirst();
      return l == r ? 0 : (l < r ? -1 : 1);
    }
  }

  /**
   * Read two integers from each line and generate a key, value pair
   * as ((left, right), right).
   */
  public static class MapClass
         extends Mapper<LongWritable, Text, IntPair, IntWritable> {
    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue,
                    Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(inValue.toString());
      int left = 0;
      int right = 0;
      if (itr.hasMoreTokens()) {
        left = Integer.parseInt(itr.nextToken());
        if (itr.hasMoreTokens()) {
          right = Integer.parseInt(itr.nextToken());
        }
        key.set(left, right);
        value.set(right);
        context.write(key, value);
      }
    }
  }

  /**
   * A reducer class that writes a separator line for each group and then
   * emits one (first, value) pair per input value.
   */
  public static class Reduce
         extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private static final Text SEPARATOR =
      new Text("------------------------------------------------");
    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      context.write(SEPARATOR, null);
      first.set(Integer.toString(key.getFirst()));
      for(IntWritable value: values) {
        context.write(first, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Hard-coded arguments that override the command line;
    // remove this line to pass <in> and <out> on the command line instead.
    args = "-Dio.sort.mb=10 hdfs://namenode:9000/user/hadoop/test/intpair.txt hdfs://namenode:9000/user/hadoop/secsortout".split(" ");
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: secondarysort <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);
    job.setGroupingComparatorClass(FirstGroupingComparator.class);

    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);

    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    // DeleteFolder comes from the author's own myUtils class, which is not shown here.
    myUtils.myUtils.DeleteFolder(conf, otherArgs[1]);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
