一、Understanding the secondary sort example

1. Requirement analysis (sort by the first field first, then by the second field)

   Unsorted raw data          Sorted result
   a,1                        a,1
   b,1                        a,2
   a,2          [sort]        a,100
   b,6          ===>          b,-3
   c,2                        b,-2
   b,-2                       b,1
   a,100                      b,6
   b,-3                       c,-7
   c,-7                       c,2
2. Analysis of the MapReduce process
1> The data is read in and passed to map().
2> map() filters the records layer by layer until we have the data source we actually want.
3> Custom counters can be added inside the filtering logic.
4> The filtered records are written to the context and enter the shuffle phase.
5> It is fair to say that most of the shuffle happens on the map() side.
6> Inside the shuffle, records first go through the default partitioner (HashPartitioner), whose rule is
   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; by default there is a single reduce task, so reduce
   produces a single output file. From my own output tests, even if you configure a custom partitioner class,
   it is not used as long as the number of reduce tasks is left at its default; the default partitioning still
   applies (see the sketch after this list).
7> After partitioning, the data within each partition is sorted. The sort is by key, and we can define a custom
   key type with its own ordering; for a secondary sort, for example, we can define a composite key that sorts
   by the first field and, when the first fields are equal, by the second field. After the partition sort comes
   the grouping stage, which by default also groups by key; a custom grouping rule implements RawComparator.
8> After grouping comes the merge sort, and then reduce. Partitioning decides which reduce task a record goes to,
   and the number of reduce tasks decides how many files the reduce stage produces; grouping decides which records
   are handed to a single reduce() call together. The way I understand it, grouping is mostly a shuffle-stage
   optimization of how the program runs.
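   For reference, a minimal sketch of the default partitioning rule described in point 6 (this mirrors the
   behavior of Hadoop's built-in HashPartitioner; the class name here is only for illustration):

import org.apache.hadoop.mapreduce.Partitioner;

//Sketch of the default rule: mask the sign bit, then take the remainder by the number of reduce tasks.
public class DefaultLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        //With a single reduce task this is always 0, which is why only one output file appears by default.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}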
3. Analysis of the secondary sort
1> From the data above we can see that we can define a custom data type holding the first and second fields,
   together with a comparator that orders keys by the first field. The custom type needs to implement
   WritableComparable (it could also implement Writable and Comparable separately), whichever is more convenient.
2> Next, the partitioning. This example produces a single sorted file, so no custom partitioner is needed, and
   even if one is written it will not be invoked here (it is left commented out). A custom partitioner extends
   Partitioner (note: extends, not implements) and overrides the partitioning rule.
3> Then the grouping. Grouping is still worth having as an optimization; we design the grouping rule to group by
   the first field of the custom key type. The grouping comparator implements RawComparator.
4> Finally, consider what else could be optimized, based on the volume of the source data, whether fields are
   always present, their lengths and types, whether to use a combiner or a custom compression codec, whether
   values can be negative, and so on (see the snippet after the table below). Since the comparator already orders
   by the second field, there is no need for tricks like adding or subtracting a large constant to handle
   negative values.
   Result walk-through:

   Source      After map()      After shuffle           After reduce()
   a,1         a#1,1            a#1   [1,2,100]         a 1
   b,1         b#1,1            b#-3  [-3,-2,1,6]       a 2
   a,2         a#2,2            c#-7  [-7,2]            a 100
   b,6         b#6,6                                    b -3
   c,2         c#2,2                                    b -2
   b,-2        b#-2,-2                                  b 1
   a,100       a#100,100                                b 6
   b,-3        b#-3,-3                                  c -7
   c,-7        c#-7,-7                                  c 2
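
   As one concrete example of the optimizations mentioned in point 4, map output compression can be enabled in
   the driver. This is only a sketch, assuming the Snappy codec is available on the cluster; the property names
   are the standard Hadoop 2.x ones, and the helper class name is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

//Hypothetical helper: compress the intermediate map output to cut down shuffle traffic.
public class MapOutputCompression {
    public static Configuration withMapOutputCompression(Configuration configuration) {
        configuration.setBoolean("mapreduce.map.output.compress", true);
        configuration.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return configuration;
    }
}

   The returned Configuration would then be passed to Job.getInstance(...) in the driver before the job is submitted.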

二、Secondary sort example code

SSortMr.java ## Main class
============
package com.bigdata_senior.SSortMr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SSortMr {

    //Mapper Class: wraps each line into a composite key (first#second) plus the numeric value
    private static class SSortMapper
        extends Mapper<LongWritable, Text, SecondaryWritable, LongWritable> {

        private SecondaryWritable mapOutKey = new SecondaryWritable();
        private LongWritable mapOutValue = new LongWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");
            mapOutKey.set(strValue[0], Integer.valueOf(strValue[1]));
            mapOutValue.set(Integer.valueOf(strValue[1]));
            context.write(mapOutKey, mapOutValue);
            System.out.println("key-->" + mapOutKey + " value-->" + mapOutValue);
        }
    }

    //Reduce Class: values arrive already sorted by the composite key within each group
    private static class SSortReduce
        extends Reducer<SecondaryWritable, LongWritable, Text, LongWritable> {

        private Text reduceOutKey = new Text();

        @Override
        public void reduce(SecondaryWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            for (LongWritable value : values) {
                reduceOutKey.set(key.getFirst() + "#" + key.getSecond());
                context.write(reduceOutKey, value);
            }
        }
    }

    //Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        //job.setNumReduceTasks(3);

        //input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);

        //output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);

        //mapper
        job.setMapperClass(SSortMapper.class);
        job.setMapOutputKeyClass(SecondaryWritable.class);
        job.setMapOutputValueClass(LongWritable.class);

        //partitioner (not used: only one reduce task, so the default partitioning applies)
        //job.setPartitionerClass(SecondaryPartionerCLass.class);

        //group
        job.setGroupingComparatorClass(SecondaryGroupClass.class);

        //Reduce
        job.setReducerClass(SSortReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        //submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/output13"
        };
        //run job
        int status = new SSortMr().run(args);
        System.exit(status);
    }
}
SecondaryWritable.java ## Custom data type (composite key)
======================
package com.bigdata_senior.SSortMr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class SecondaryWritable implements WritableComparable<SecondaryWritable> {

    private String first;
    private int second;

    public SecondaryWritable() {}

    public SecondaryWritable(String first, int second) {
        this.set(first, second);
    }

    public void set(String first, int second) {
        this.first = first;
        this.second = second;
    }

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.first);
        out.writeInt(this.second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.second = in.readInt();
    }

    //Sort by the first field; when it is equal, fall back to the second field
    @Override
    public int compareTo(SecondaryWritable o) {
        int comp = this.first.compareTo(o.first);
        if (0 != comp) {
            return comp;
        }
        return Integer.valueOf(this.second).compareTo(Integer.valueOf(o.second));
    }

    @Override
    public String toString() {
        return first + "#" + second;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((first == null) ? 0 : first.hashCode());
        result = prime * result + second;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SecondaryWritable other = (SecondaryWritable) obj;
        if (first == null) {
            if (other.first != null)
                return false;
        } else if (!first.equals(other.first))
            return false;
        if (second != other.second)
            return false;
        return true;
    }
}
SecondaryPartionerCLass.java ## Custom partitioner (commented out, not used)
============================
package com.bigdata_senior.SSortMr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondaryPartionerCLass extends Partitioner<SecondaryWritable, LongWritable> {
    //Partition by the first field only, so rows sharing a first field land in the same reduce task
    @Override
    public int getPartition(SecondaryWritable key, LongWritable value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
SecondaryGroupClass.java ## Custom grouping rule
========================
package com.bigdata_senior.SSortMr;

import java.util.Arrays;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class SecondaryGroupClass implements RawComparator<SecondaryWritable> {

    //Group only by the first field of the composite key
    @Override
    public int compare(SecondaryWritable o1, SecondaryWritable o2) {
        System.out.println("o1: " + o1.toString() + " o2: " + o2.toString());
        return o1.getFirst().compareTo(o2.getFirst());
    }

    //Byte-level version: start at each record's offset and skip the trailing 4-byte int (the second field)
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        System.out.println("b1: " + Arrays.toString(b1) + " b2: " + Arrays.toString(b2));
        return WritableComparator.compareBytes(b1, s1, l1 - 4, b2, s2, l2 - 4);
    }
}
Alternatively: ## works for small data sets, but is very resource-hungry on big data
SSortMr2.java
=============
package com.bigdata_senior.SSortMr2;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SSortMr2 {

    //Mapper Class: plain Text key, so all values of one key meet in a single reduce() call
    private static class SSortMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private Text mapOutKey = new Text();
        private LongWritable mapOutValue = new LongWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");
            mapOutKey.set(strValue[0]);
            mapOutValue.set(Integer.valueOf(strValue[1]));
            context.write(mapOutKey, mapOutValue);
            System.out.println("key-->" + mapOutKey + " value-->" + mapOutValue);
        }
    }

    //Reduce Class: buffers all values of a key in memory and sorts them, hence the memory cost on big data
    private static class SSortReduce extends Reducer<Text, LongWritable, Text, Long> {

        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            List<Long> longList = new ArrayList<Long>();
            for (LongWritable value : values) {
                longList.add(value.get());
            }
            Collections.sort(longList);
            for (Long value : longList) {
                System.out.println("key--> " + key + " value--> " + value);
                context.write(key, value);
            }
        }
    }

    //Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());

        //input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);

        //output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);

        //mapper
        job.setMapperClass(SSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        //Reduce
        job.setReducerClass(SSortReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Long.class);

        //submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/sortmr/output22"
        };
        //run job
        int status = new SSortMr2().run(args);
        System.exit(status);
    }
}

三、A brief understanding of the MapReduce join

1. join = combining
2. That is, combining data from two or more data sources into one output.
3. Having learned Hive, the MapReduce join no longer feels like a key technique, because when it is handled in MapReduce:
1> the number of tables that can be joined is limited
2> the processing is tedious and the filtering varied, so it is easy to miss cases
3> it is relatively heavy on resources
4. A MapReduce join roughly works like this: both tables are loaded as input, and since their records end up mixed
   together, a custom data type tags each record with the table it came from; in reduce() the records are then
   separated by tag and written out as the joined result. There are of course other ways to handle a join, for
   example loading one table into a collection in setup() (a map-side join, sketched below).
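
A minimal sketch of that setup()-based variant (a map-side join). The class name CustomerCacheJoinMapper and the
local file customer.txt are hypothetical; in a real job the small table would normally be shipped to each mapper
via the distributed cache rather than read from a hard-coded path:

package com.bigdata_senior.joinMr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//Hypothetical map-side join: the small customer table is loaded once per mapper in setup(),
//and every order record is joined against it directly in map(), so no reduce phase is needed.
public class CustomerCacheJoinMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private Map<Long, String> customers = new HashMap<Long, String>();
    private Text outputValue = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        //Assumed local file "customer.txt" with lines of the form: cid,cname,phone
        BufferedReader reader = new BufferedReader(new FileReader("customer.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            customers.put(Long.valueOf(fields[0]), fields[1] + "," + fields[2]);
        }
        reader.close();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //Order lines: cid,oname,price,date
        String[] fields = value.toString().split(",");
        if (4 != fields.length) {
            return;
        }
        String customerInfo = customers.get(Long.valueOf(fields[0]));
        if (customerInfo == null) {
            return; //no matching customer for this order
        }
        outputValue.set(fields[0] + "," + customerInfo + "," + fields[1] + "," + fields[2] + "," + fields[3]);
        context.write(NullWritable.get(), outputValue);
    }
}

Because the join happens entirely in map(), this variant needs no reduce phase, but it only works when the
smaller table fits in memory.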

四、MapReduce join code example

JoinMr.java ## Main class
===========
package com.bigdata_senior.joinMr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinMr {

    //Mapper Class: tags every record as "customer" (3 fields) or "order" (4 fields), keyed by cid
    private static class WordCountMapper extends
        Mapper<LongWritable, Text, LongWritable, JoinWritable> {

        private LongWritable mapoutputkey = new LongWritable();
        private JoinWritable mapoutputvalue = new JoinWritable();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String lineValue = value.toString();
            String[] strValue = lineValue.split(",");

            int length = strValue.length;
            if (3 != length && 4 != length) {
                return;
            }

            //get cid
            Long cid = Long.valueOf(strValue[0]);
            //get cname
            String cname = strValue[1];

            //set customer
            if (3 == length) {
                String phone = strValue[2];
                mapoutputkey.set(cid);
                mapoutputvalue.set("customer", cname + "," + phone);
            }

            //set order
            if (4 == length) {
                String price = strValue[2];
                String date = strValue[3];
                mapoutputkey.set(cid);
                mapoutputvalue.set("order", cname + "," + price + "," + date);
            }

            context.write(mapoutputkey, mapoutputvalue);
        }
    }

    //Reduce Class: for each cid, pick out the customer record and join it to every order record
    private static class WordCountReduce extends
        Reducer<LongWritable, JoinWritable, NullWritable, Text> {

        private Text outputValue = new Text();

        @Override
        public void reduce(LongWritable key, Iterable<JoinWritable> values, Context context)
                throws IOException, InterruptedException {
            String customerInfo = null;
            List<String> orderList = new ArrayList<String>();
            for (JoinWritable value : values) {
                if ("customer".equals(value.getTag())) {
                    customerInfo = value.getData();
                    System.out.println(customerInfo);
                } else if ("order".equals(value.getTag())) {
                    orderList.add(value.getData());
                }
            }
            for (String order : orderList) {
                outputValue.set(key.get() + "," + customerInfo + "," + order);
                context.write(NullWritable.get(), outputValue);
            }
        }
    }

    //Driver
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());

        //input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);

        //output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);

        //mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(JoinWritable.class);

        //Reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        //submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        args = new String[]{
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/join/input",
            "hdfs://hadoop09-linux-01.ibeifeng.com:8020/user/liuwl/tmp/join/output2"
        };
        //run job
        int status = new JoinMr().run(args);
        System.exit(status);
    }
}
JoinWritable.java ## Custom data type
=================
package com.bigdata_senior.joinMr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class JoinWritable implements Writable {

    //tag marks which table the record came from ("customer" or "order"); data holds the remaining fields
    private String tag;
    private String data;

    public JoinWritable() {}

    public JoinWritable(String tag, String data) {
        this.set(tag, data);
    }

    public void set(String tag, String data) {
        this.setTag(tag);
        this.setData(data);
    }

    public String getTag() {
        return tag;
    }

    public void setTag(String tag) {
        this.tag = tag;
    }

    public String getData() {
        return data;
    }

    public void setData(String data) {
        this.data = data;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.getTag());
        out.writeUTF(this.getData());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.setTag(in.readUTF());
        this.setData(in.readUTF());
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((data == null) ? 0 : data.hashCode());
        result = prime * result + ((tag == null) ? 0 : tag.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        JoinWritable other = (JoinWritable) obj;
        if (data == null) {
            if (other.data != null)
                return false;
        } else if (!data.equals(other.data))
            return false;
        if (tag == null) {
            if (other.tag != null)
                return false;
        } else if (!tag.equals(other.tag))
            return false;
        return true;
    }

    @Override
    public String toString() {
        return tag + "," + data;
    }
}
