Getting Started with MapReduce Functions, Part 3

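I. Chaining multiple MapReduce jobs

1. Requirement

Any moderately complex processing logic usually needs several MapReduce programs run in sequence. Multiple jobs can be chained with the JobControl class provided by the MapReduce framework.

2. Example

The example below has two MapReduce jobs over the Flow data, SumMR and SortMR, with a dependency between them: the output of SumMR is the input of SortMR, so SortMR must not start until SumMR has completed.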

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job jobsum = Job.getInstance(conf);

jobsum.setJarByClass(RunManyJobMR.class);

jobsum.setMapperClass(FlowSumMapper.class);

jobsum.setReducerClass(FlowSumReducer.class);

jobsum.setMapOutputKeyClass(Text.class);

jobsum.setMapOutputValueClass(Flow.class);

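// Caution: a combiner must emit the map output types (Text, Flow). The job output value class
// set below is Text, so FlowSumReducer would fail as a combiner unless it emits (Text, Flow).
// jobsum.setCombinerClass(FlowSumReducer.class);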

jobsum.setOutputKeyClass(Text.class);

jobsum.setOutputValueClass(Text.class);

FileInputFormat.setInputPaths(jobsum, "d:/flow/input");

FileOutputFormat.setOutputPath(jobsum, new Path("d:/flow/output12"));

Job jobsort = Job.getInstance(conf);

jobsort.setJarByClass(RunManyJobMR.class);

jobsort.setMapperClass(FlowSortMapper.class);

jobsort.setReducerClass(FlowSortReducer.class);

jobsort.setMapOutputKeyClass(Flow.class);

jobsort.setMapOutputValueClass(Text.class);

jobsort.setOutputKeyClass(NullWritable.class);

jobsort.setOutputValueClass(Flow.class);

FileInputFormat.setInputPaths(jobsort, "d:/flow/output12");

FileOutputFormat.setOutputPath(jobsort, new Path("d:/flow/sortoutput12"));

ControlledJob sumcj = new ControlledJob(jobsum.getConfiguration());

ControlledJob sortcj = new ControlledJob(jobsort.getConfiguration());

sumcj.setJob(jobsum);

sortcj.setJob(jobsort);

// Declare the dependency: sortcj starts only after sumcj has completed successfully

sortcj.addDependingJob(sumcj);

JobControl jc = new JobControl("flow sum and sort");

jc.addJob(sumcj);

jc.addJob(sortcj);

Thread jobThread = new Thread(jc);

jobThread.start();

while(!jc.allFinished()){

Thread.sleep(500);

}

jc.stop();

}
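
The effect of addDependingJob can be illustrated without Hadoop. The following is only a minimal sketch: JobChainSketch, Step and runOrder are hypothetical names made up for this illustration and are not part of the Hadoop API. It shows that a step is released only once everything it depends on has finished, regardless of the order in which the steps were registered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class JobChainSketch {
    /** Hypothetical stand-in for a ControlledJob: a name plus the steps it waits for. */
    static final class Step {
        final String name;
        final List<Step> dependsOn = new ArrayList<>();

        Step(String name) {
            this.name = name;
        }

        void addDependingJob(Step other) {
            dependsOn.add(other);
        }
    }

    /** Returns step names in an order where each step follows all of its dependencies. */
    static List<String> runOrder(List<Step> steps) {
        List<String> order = new ArrayList<>();
        Set<Step> finished = new HashSet<>();
        while (finished.size() < steps.size()) {
            boolean progressed = false;
            for (Step s : steps) {
                // A step may run only when every step it depends on has finished.
                if (!finished.contains(s) && finished.containsAll(s.dependsOn)) {
                    finished.add(s);
                    order.add(s.name);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic dependency");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Step sum = new Step("sum");
        Step sort = new Step("sort");
        sort.addDependingJob(sum);
        // Registered in the "wrong" order on purpose; the dependency still decides.
        System.out.println(runOrder(Arrays.asList(sort, sum))); // prints [sum, sort]
    }
}
```

In the real driver, JobControl does this bookkeeping itself; the polling loop above only has to wait until allFinished() returns true.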

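II. Implementing a top-N algorithm with a custom GroupComparator

1. Requirement

In a small project that analyzes student scores, there is now a requirement:
for each class, find the student with the highest score among those who sat the exam, and output the class, the name and the average score.

2. Analysis

(1) Use "class and average score" as the key, so that all student records read in the map phase are sorted by class and by score in descending order before being sent to reduce.
(2) On the reduce side, use a GroupingComparator to gather the key-value pairs of the same class into one group; the first record of each group is then the maximum.

3. Implementation

The data looks like this: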

computer    huangxiaoming   85  86  41  75  93  42  85

computer    xuzheng 54  52  86  91  42

computer    huangbo 85  42  96  38

english zhaobenshan 54  52  86  91  42  85  75

english liuyifei    85  41  75  21  85  96  14

algorithm   liuyifei    75  85  62  48  54  96  15

computer    huangjiaju  85  75  86  85  85

english liuyifei    76  95  86  74  68  74  48
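
The sort-then-group idea from the analysis can be tried on the sample rows without Hadoop. This is only a sketch under assumptions: TopOnePerClassSketch, Row, parse and topPerClass are hypothetical names, the rows are split on whitespace, and every value after the name is treated as a score to be averaged. The list sort plays the role of the shuffle sort, and the "class changed" check plays the role of the grouping comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopOnePerClassSketch {
    static final class Row {
        final String clazz;
        final String name;
        final double avg;

        Row(String clazz, String name, double avg) {
            this.clazz = clazz;
            this.name = name;
            this.avg = avg;
        }
    }

    /** Parses "class name score score ..." and averages the scores. */
    static Row parse(String line) {
        String[] f = line.trim().split("\\s+");
        double sum = 0;
        for (int i = 2; i < f.length; i++) {
            sum += Double.parseDouble(f[i]);
        }
        return new Row(f[0], f[1], sum / (f.length - 2));
    }

    /** Returns "class name" of the best student per class, classes in ascending order. */
    static List<String> topPerClass(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parse(line));
        }
        // Sort phase: by class, then by average score in descending order.
        rows.sort((a, b) -> {
            int byClass = a.clazz.compareTo(b.clazz);
            return byClass != 0 ? byClass : Double.compare(b.avg, a.avg);
        });
        // Group phase: the first row of each run of equal classes is the top 1.
        List<String> winners = new ArrayList<>();
        String current = null;
        for (Row r : rows) {
            if (!r.clazz.equals(current)) {
                winners.add(r.clazz + " " + r.name);
                current = r.clazz;
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "computer huangxiaoming 85 86 41 75 93 42 85",
                "computer xuzheng 54 52 86 91 42",
                "computer huangbo 85 42 96 38",
                "english zhaobenshan 54 52 86 91 42 85 75",
                "english liuyifei 85 41 75 21 85 96 14",
                "algorithm liuyifei 75 85 62 48 54 96 15",
                "computer huangjiaju 85 75 86 85 85",
                "english liuyifei 76 95 86 74 68 74 48");
        // prints [algorithm liuyifei, computer huangjiaju, english liuyifei]
        System.out.println(topPerClass(sample));
    }
}
```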

Step 1: combine the grouping field and the sort field into one custom object

package com.ghgj.mr.topn;

import java.io.DataInput;

import java.io.DataOutput;

import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class ClazzScore implements WritableComparable<ClazzScore>{

private String clazz;

private Double score;

public String getClazz() {

return clazz;

}

public void setClazz(String clazz) {

this.clazz = clazz;

}

public Double getScore() {

return score;

}

public void setScore(Double score) {

this.score = score;

}

public ClazzScore(String clazz, Double score) {

super();

this.clazz = clazz;

this.score = score;

}

public ClazzScore() {

super();

// No-arg constructor: Hadoop needs it to create the object before calling readFields

}

@Override

public String toString() {

return clazz + "\t" + score;

}

@Override

public void write(DataOutput out) throws IOException {

out.writeUTF(clazz);

out.writeDouble(score);

}

@Override

public void readFields(DataInput in) throws IOException {

this.clazz = in.readUTF();

this.score = in.readDouble();

}

/**

* Sort order of the key: by class (descending), then by score (descending)

*/

@Override

public int compareTo(ClazzScore cs) {

int it = cs.getClazz().compareTo(this.clazz);

if(it == 0){

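// Double.compare avoids the truncation of (int) (a - b), which treats scores less than 1 apart as equal
return Double.compare(cs.getScore(), this.score);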

}else{

return it;

}

}

}
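
A note on comparing fractional scores such as averages: casting the difference of two doubles to int truncates toward zero, so two scores that differ by less than 1 compare as equal and their relative order after sorting is not guaranteed. The sketch below contrasts the two approaches; ScoreCompareSketch and its method names are made up for this illustration.

```java
class ScoreCompareSketch {
    /** Descending comparison via a cast difference: loses differences smaller than 1. */
    static int truncatingCompare(double mine, double other) {
        return (int) (other - mine);
    }

    /** Descending comparison via Double.compare: keeps every difference. */
    static int safeCompare(double mine, double other) {
        return Double.compare(other, mine);
    }

    public static void main(String[] args) {
        System.out.println(truncatingCompare(83.2, 83.9)); // prints 0: treated as equal
        System.out.println(safeCompare(83.2, 83.9));       // prints 1: 83.9 sorts first
    }
}
```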

Step 2: write the rule that groups the sorted ClazzScore records as they are passed to the ReduceTask

package com.ghgj.mr.topn;

import org.apache.hadoop.io.WritableComparable;

import org.apache.hadoop.io.WritableComparator;

public class ClazzScoreGroupComparator extends WritableComparator{

ClazzScoreGroupComparator(){

super(ClazzScore.class, true);

}

/**

* Decides how the records fed to reduce are grouped: keys with the same class form one group

*/

@Override

public int compare(WritableComparable a, WritableComparable b) {

ClazzScore cs1 = (ClazzScore)a;

ClazzScore cs2 = (ClazzScore)b;

int it = cs1.getClazz().compareTo(cs2.getClazz());

return it;

}

}

Step 3: write the MapReduce program

package com.ghgj.mr.topn;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.DoubleWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**

* Top-N problem: the highest average score in each class

*/

public class ScoreTop1MR {

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf);

job.setJarByClass(ScoreTop1MR.class);

job.setMapperClass(ScoreTop1MRMapper.class);

job.setReducerClass(ScoreTop1MRReducer.class);

job.setOutputKeyClass(ClazzScore.class);

job.setOutputValueClass(DoubleWritable.class);

// Set the grouping rule for the records passed to the reducer

job.setGroupingComparatorClass(ClazzScoreGroupComparator.class);

FileInputFormat.setInputPaths(job, "d:/score_all/input");

Path p = new Path("d:/score_all/output1");

FileSystem fs = FileSystem.newInstance(conf);

if(fs.exists(p)){

fs.delete(p, true);

}

FileOutputFormat.setOutputPath(job, p);

boolean status = job.waitForCompletion(true);

System.exit(status ? 0 : 1);

}

static class ScoreTop1MRMapper extends Mapper<LongWritable, Text, ClazzScore,

DoubleWritable>{

@Override

protected void map(LongWritable key, Text value, Context context) throws IOException,

InterruptedException {

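// splits[0] = class, splits[1] = student name, splits[2..] = individual scores
String[] splits = value.toString().split("\t");

if (splits.length < 3) {
return; // skip malformed lines that carry no scores
}

double sum = 0;
for (int i = 2; i < splits.length; i++) {
sum += Double.parseDouble(splits[i]);
}

// The requirement asks for the average score, not just the first score on the line
double avg = sum / (splits.length - 2);

ClazzScore cs = new ClazzScore(splits[0], avg);

context.write(cs, new DoubleWritable(avg));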

}

}

static class ScoreTop1MRReducer extends Reducer<ClazzScore, DoubleWritable, ClazzScore,

DoubleWritable>{

@Override

protected void reduce(ClazzScore cs, Iterable<DoubleWritable> scores, Context

context) throws IOException, InterruptedException {

// Keys arrive sorted by score in descending order, so the first record of each group is the top 1

context.write(cs, scores.iterator().next());

}

}

}

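III. MapReduce global counters

1. Requirement

In production code you often need a global count of the malformed records encountered during processing. Requirements like this can be met with the global counters provided by the MapReduce framework.

2. Example

The following example uses global counters to count the total number of words and the total number of lines across all files in a directory.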

package com.ghgj.mr.counter;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    enum MyWordCounter{COUNT_LINES,COUNT_WORD}

    public static void main(String[] args) throws Exception {

        // Load the HDFS-related configuration

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        // Locate the job jar from the driver class

        job.setJarByClass(WordCount.class);

        job.setMapperClass(WCMapper.class);

        job.setReducerClass(WCReducer.class);

        // Output types of the reduce task

        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(LongWritable.class);

        // Local paths

        Path inputPath = new Path("d:/wordcount/input");

        Path outputPath = new Path("d:/wordcount/output");

        FileSystem fs = FileSystem.get(conf);

        if(fs.exists(outputPath)){

            fs.delete(outputPath, true);

        }

        FileInputFormat.setInputPaths(job, inputPath);

        FileOutputFormat.setOutputPath(job, outputPath);

        // Finally, submit the job and wait for it to finish

        boolean waitForCompletion = job.waitForCompletion(true);

        System.exit(waitForCompletion?0:1);

    }

    private static class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        @Override

        protected void map(LongWritable key, Text value, Context context)

                throws IOException, InterruptedException {

            // map() runs once per input line, so the line counter is incremented once per call

            context.getCounter(MyWordCounter.COUNT_LINES).increment(1L);

            // Map-side business logic: emit (word, 1) and count every word

            String[] words = value.toString().split(" ");

            for(String word: words){

                context.write(new Text(word), new LongWritable(1));

                context.getCounter(MyWordCounter.COUNT_WORD).increment(1L);

            }

        }

    }

    private static class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{

        @Override

        protected void reduce(Text key, Iterable<LongWritable> values, Context context)

                throws IOException, InterruptedException {

            // Reduce-side business logic: sum the counts for each word

            long sum = 0;

            for(LongWritable v: values){

                sum += v.get();

            }

            context.write(key, new LongWritable(sum));

        }

    }

}
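
What the two counters end up holding can be checked with a small plain-Java sketch. It is only an approximation: CounterSketch and count are hypothetical names, an EnumMap stands in for the framework's counter storage, and empty tokens are skipped, which the job above does not do.

```java
import java.util.EnumMap;
import java.util.Map;

class CounterSketch {
    enum MyWordCounter { COUNT_LINES, COUNT_WORD }

    /** Counts lines and space-separated words, one "map call" per line. */
    static Map<MyWordCounter, Long> count(String... lines) {
        Map<MyWordCounter, Long> counters = new EnumMap<>(MyWordCounter.class);
        for (MyWordCounter c : MyWordCounter.values()) {
            counters.put(c, 0L);
        }
        for (String line : lines) {
            counters.merge(MyWordCounter.COUNT_LINES, 1L, Long::sum);
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    counters.merge(MyWordCounter.COUNT_WORD, 1L, Long::sum);
                }
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // prints {COUNT_LINES=2, COUNT_WORD=5}
        System.out.println(count("hello world", "hello hadoop mapreduce"));
    }
}
```

In the real driver, the totals can be read once waitForCompletion has returned, for example with job.getCounters().findCounter(MyWordCounter.COUNT_LINES).getValue().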

Alternatively, a map-only variant that emits no word counts and only maintains the counters:

package com.ghgj.mr.counter;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CounterWordCount {

enum CouterWordCountC{COUNT_WORDS, COUNT_LINES}

public static void main(String[] args) throws Exception {

// Load the HDFS-related configuration

Configuration conf = new Configuration();

Job job = Job.getInstance(conf);

// Locate the job jar from the driver class

job.setJarByClass(CounterWordCount.class);

job.setMapperClass(WCCounterMapper.class);

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(LongWritable.class);

// Local paths

Path inputPath = new Path("d:/wordcount/input");

FileInputFormat.setInputPaths(job, inputPath);

job.setNumReduceTasks(0);

Path outputPath = new Path("d:/wordcount/output");

FileSystem fs = FileSystem.get(conf);

if(fs.exists(outputPath)){

fs.delete(outputPath, true);

}

FileOutputFormat.setOutputPath(job, outputPath);

// Finally, submit the job and wait for it to finish

boolean waitForCompletion = job.waitForCompletion(true);

System.exit(waitForCompletion?0:1);

}

private static class WCCounterMapper extends Mapper<LongWritable, Text, Text,

LongWritable>{

@Override

protected void map(LongWritable key, Text value, Context context)

throws IOException, InterruptedException {

// Count lines: text input is read line by line by default, so each map call adds 1 to the line count

context.getCounter(CouterWordCountC.COUNT_LINES).increment(1L);

String[] words = value.toString().split(" ");

for(String word: words){

// Count words: add 1 for every word encountered

context.getCounter(CouterWordCountC.COUNT_WORDS).increment(1L);

}

}

}

}
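
The counters are called global because each task increments its own values and the framework adds them up into job-level totals. The sketch below only mimics that effect with threads and is not how Hadoop implements it: GlobalCounterSketch and countInParallel are hypothetical names, each inner list stands in for one input split, and a shared LongAdder plays the role of the aggregated counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

class GlobalCounterSketch {
    /** Returns {total lines, total words} over all splits, one worker thread per split. */
    static long[] countInParallel(List<List<String>> splits) {
        LongAdder lines = new LongAdder();
        LongAdder words = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        for (List<String> split : splits) {
            pool.execute(() -> {
                for (String line : split) {
                    lines.increment();
                    for (String word : line.split(" ")) {
                        if (!word.isEmpty()) {
                            words.increment();
                        }
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { lines.sum(), words.sum() };
    }

    public static void main(String[] args) {
        long[] totals = countInParallel(Arrays.asList(
                Arrays.asList("a b", "c"),
                Arrays.asList("d e f")));
        System.out.println(totals[0] + " lines, " + totals[1] + " words"); // prints 3 lines, 6 words
    }
}
```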

Reference: https://www.cnblogs.com/liuwei6/p/6724070.html
