MapReduce Study Notes: Combiner, Partitioner, and JobHistory

I. Combiner

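In the MapReduce programming model, the Combiner is an important component that sits between the Mapper and the Reducer. Its main purpose is to relieve the performance bottleneck of an MR job.

The combiner is an optimization: because network bandwidth is limited, the amount of data transferred between map and reduce should be kept as small as possible. On the map side, the combiner merges the key-value pairs that share a key and computes a partial result using the same rule as the reducer, so it can be viewed as a special, local Reducer. This only gives correct results when the operation is commutative and associative (a sum or a maximum, for example), because Hadoop may run the combiner zero, one, or several times.
A combiner runs only if the developer sets one in the program, by calling job.setCombinerClass(myCombine.class) with a custom combiner class.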
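Reusing the reduce logic as a combiner is only safe when the result does not depend on how the values are grouped. The following is a minimal plain-Java sketch (no Hadoop dependency; the class name CombinerSafetyDemo and the sample numbers are made up for illustration) that contrasts a sum, which survives local pre-aggregation, with a naive average, which does not.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo class for illustration; not part of Hadoop. */
final class CombinerSafetyDemo {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    /** Sum of the per-map-task partial sums: what a sum combiner plus a sum reducer compute. */
    static long sumWithCombiner(List<List<Long>> perMapTask) {
        long total = 0;
        for (List<Long> values : perMapTask) {
            total += sum(values);
        }
        return total;
    }

    /** Mean of the per-map-task means: the result if an averaging reducer were reused as combiner. */
    static double meanWithNaiveCombiner(List<List<Long>> perMapTask) {
        double total = 0;
        for (List<Long> values : perMapTask) {
            total += mean(values);
        }
        return total / perMapTask.size();
    }

    public static void main(String[] args) {
        // Assumed sample: values for one key, split across two map tasks.
        List<List<Long>> perMapTask = Arrays.asList(
                Arrays.asList(200L, 300L, 100L),
                Arrays.asList(600L));
        List<Long> all = Arrays.asList(200L, 300L, 100L, 600L);

        System.out.println("sum, no combiner:     " + sum(all));
        System.out.println("sum, with combiner:   " + sumWithCombiner(perMapTask));
        System.out.println("mean, no combiner:    " + mean(all));
        System.out.println("mean, naive combiner: " + meanWithNaiveCombiner(perMapTask));
    }
}
```

The two sums agree (1200), while the naive averaging combiner yields 400.0 instead of the true mean 300.0. A common workaround for averages is to have the combiner emit (sum, count) pairs and divide only in the reducer.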

WordCount can reuse its reducer (MyReducer) directly as the combiner:

// Set the map-side combiner

    job.setCombinerClass(MyReducer.class);
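To make the bandwidth saving concrete, here is a small plain-Java sketch of what happens on one map task. It does not use the Hadoop API; the class LocalCombineDemo, its method names, and the sample line are assumptions made for illustration only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one WordCount map task; not Hadoop code. */
final class LocalCombineDemo {

    /** Emulates the map phase: one (word, 1) pair per token. */
    static List<Map.Entry<String, Long>> mapPhase(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    /** Emulates the combiner: sums the values of identical keys locally. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = mapPhase("hello world hello hadoop hello");
        Map<String, Long> combined = combine(mapOutput);
        System.out.println("records without combiner: " + mapOutput.size());
        System.out.println("records with combiner:    " + combined.size());
        System.out.println(combined);
    }
}
```

For this sample line the map task would ship 5 records without a combiner and 3 with one. The saving grows with the number of repeated keys per map task.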

Reference: https://www.tuicool.com/articles/qAzUjav

II. Partitioner

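The Partitioner is another important MR component. Its main functions are:

1) The Partitioner decides which ReduceTask processes each record output by a MapTask.

2) Default implementation: the hash of the key, modulo the number of ReduceTasks, gives the target reducer:

reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

The bitwise AND with Integer.MAX_VALUE clears the sign bit, so the result is always a valid non-negative index.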
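The default formula can be tried out in isolation. The sketch below is plain Java with a hypothetical class name (HashPartitionDemo); it uses String.hashCode() as a stand-in, which is an assumption: Hadoop's Text computes its own hash, so the actual partition numbers in a real job may differ.

```java
/** Hypothetical demo of the hash-modulo rule; not the Hadoop HashPartitioner class itself. */
final class HashPartitionDemo {

    /** Applies (hash & Integer.MAX_VALUE) % numReduceTasks. */
    static int partitionFor(int hashCode, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder below can never be negative.
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String key : new String[] {"xiaomi", "huawei", "iphonex", "iphone7"}) {
            System.out.println(key + " -> reducer " + partitionFor(key.hashCode(), numReduceTasks));
        }

        // Without the mask, a negative hash would produce a negative remainder.
        System.out.println("-7 % 4 without mask: " + (-7 % 4));             // -3
        System.out.println("-7 with mask:        " + partitionFor(-7, 4));  // 1
    }
}
```

The same key always lands on the same reducer, but nothing guarantees that different keys land on different reducers. That is the motivation for a custom Partitioner when a specific layout is required.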

Example:

File contents:

    xiaomi 200
    huawei 500
    xiaomi 300
    huawei 700
    iphonex 100
    iphonex 30
    iphone7 60

The job below routes these records by phone brand to four reducers, each of which sums the values for its brand:

package rdb.com.hadoop01.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Sums values per phone brand and routes each brand to its own reducer
 * through a custom Partitioner.
 *
 * @author rdb
 */
public class PartitionerApp {

    /**
     * Mapper: reads the input file line by line.
     *
     * @author rdb
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            // Receive one line of input
            String line = value.toString();
            // Split on a single space: words[0] is the brand, words[1] is the numeric value
            String[] words = line.split(" ");
            // Emit (brand, value) through the context
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
        }
    }

    /**
     * Reducer: merges the records of each key and sums their values.
     *
     * @author rdb
     */
    public static class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Reducer<Text, LongWritable, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values){
                // Accumulate the values for this key
                sum += value.get();
            }
            // Emit (brand, total) through the context
            context.write(key, new LongWritable(sum));
        }
    }

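    /**
     * Custom partitioner: one fixed partition per known brand, and
     * partition 3 for everything else. Because it returns indexes up to 3
     * regardless of numPartitions, the job must run with at least 4 reduce tasks.
     *
     * @author rdb
     */
    public static class MyPartitioner extends Partitioner<Text, LongWritable>{
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            if(key.toString().equals("xiaomi")){
                return 0;
            }
            if(key.toString().equals("huawei")){
                return 1;
            }
            if(key.toString().equals("iphonex")){
                return 2;
            }
            return 3;
        }
    }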

    /**
     * Driver: bundles all the settings of the MapReduce job.
     *
     * @param args args[0] is the input path, args[1] is the output path
     * @throws Exception if the job cannot be configured or run
     */
    public static void main(String[] args) throws Exception {
        // Create the configuration
        Configuration configuration = new Configuration();
        // Remove the output directory if it already exists
        Path out = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if(fileSystem.exists(out)){
            fileSystem.delete(out, true);
            System.out.println("output exists, but it has been deleted");
        }
        // Create the job
        Job job = Job.getInstance(configuration,"WordCount");
        // Set the class used to locate the job jar
        job.setJarByClass(PartitionerApp.class);
        // Set the input path of the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Map-side settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reduce-side settings
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Combiner class; its logic is the same as the reducer's (disabled here)
        //job.setCombinerClass(MyReduce.class);
        // Set the job's partitioner
        job.setPartitionerClass(MyPartitioner.class);
        // Use 4 reducers, one per partition
        job.setNumReduceTasks(4);
        // Set the output path of the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true)? 0 : 1) ;
    }
}
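Before submitting the job to a cluster, the routing and summing logic can be checked with a dry run in plain Java. The sketch below is an assumption-laden stand-in, not the job itself: PartitionDryRun and its methods are hypothetical names, String replaces Text, and a TreeMap per partition plays the role of each reducer's output file.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical dry run of the brand partitioning; no Hadoop dependency. */
final class PartitionDryRun {

    /** The sample input from the example above. */
    static final String[] SAMPLE = {
        "xiaomi 200", "huawei 500", "xiaomi 300", "huawei 700",
        "iphonex 100", "iphonex 30", "iphone7 60"
    };

    /** Same routing rule as the custom partitioner in the job. */
    static int getPartition(String key) {
        switch (key) {
            case "xiaomi":  return 0;
            case "huawei":  return 1;
            case "iphonex": return 2;
            default:        return 3;
        }
    }

    /** Groups records by partition, then sums the values per key. */
    static Map<Integer, Map<String, Long>> run(String[] lines) {
        Map<Integer, Map<String, Long>> byPartition = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            byPartition
                .computeIfAbsent(getPartition(fields[0]), p -> new TreeMap<>())
                .merge(fields[0], Long.parseLong(fields[1]), Long::sum);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Map<String, Long>> e : run(SAMPLE).entrySet()) {
            System.out.println("partition " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Each of the four partitions ends up with exactly one brand: xiaomi 500, huawei 1200, iphonex 130, and iphone7 60.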

After packaging, run:

    hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.PartitionerApp hdfs://hadoop01:8020/partitioner.txt hdfs://hadoop01:8020/output/partitioner

Result (one output file per reducer):

-rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00000
-rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00001
-rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00002
-rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00003

[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00000

18/05/09 06:36:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

xiaomi  500

[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00001

18/05/09 06:36:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

huawei  1200

[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00002

18/05/09 06:36:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

iphonex 130

[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00003

18/05/09 06:36:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

iphone7 60

III. JobHistory

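JobHistory records the run logs of finished MapReduce jobs. The log data is stored in HDFS directories. The feature is not enabled by default and has to be configured.

1) Configure hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml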

<property>

    <name>mapreduce.jobhistory.address</name>

    <value>hadoop01:10020</value>

    <description>IPC address (host:port) of the MR JobHistory Server</description>

</property>

<property>

    <name>mapreduce.jobhistory.webapp.address</name>

    <value>hadoop01:19888</value>

    <description>Web UI address for viewing finished MapReduce jobs on the history server; the JobHistory service must be running</description>

</property>

<property>

    <name>mapreduce.jobhistory.done-dir</name>

    <value>/history/done</value>

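    <description>Directory where the MR JobHistory Server keeps the history files of finished jobs. /mr-history/done is a commonly documented value; the stock mapred-default.xml default is ${yarn.app.mapreduce.am.staging-dir}/history/done</description>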

</property>

<property>

    <name>mapreduce.jobhistory.intermediate-done-dir</name>

    <value>/history/done_intermediate</value>

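    <description>Directory where MapReduce jobs write their history files. /mr-history/tmp is a commonly documented value; the stock mapred-default.xml default is ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</description>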

</property>

2) After configuring, restart YARN, then start the JobHistory service: hadoop-2.6.0-cdh5.7.0/sbin/mr-jobhistory-daemon.sh start historyserver

[hadoop@hadoop01 sbin]$ jps

24321 JobHistoryServer

24353 Jps

23957 NodeManager

7880 DataNode

8060 SecondaryNameNode

23854 ResourceManager

7791 NameNode

[hadoop@hadoop01 sbin]$

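3) Open http://192.168.44.183:19888/ in a browser.

Run a MapReduce program in the background: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.WordCountApp hdfs://hadoop01:8020/hello.txt hdfs://hadoop01:8020/output/wc

After refreshing the browser, the job that just ran appears in the history list.

Click the logs link of the corresponding MR job on the page to view its detailed logs.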
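The same history data can also be read programmatically, since the History Server exposes a REST API on the web port under /ws/v1/history/mapreduce/jobs. The following is a minimal sketch, assuming the server is reachable at hadoop01:19888 as configured above; the class JobHistoryRestDemo and its method names are hypothetical, and the JSON response is printed as-is without parsing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch for the JobHistory REST endpoint. */
final class JobHistoryRestDemo {

    /** Builds the URI that lists finished MapReduce jobs. */
    static URI jobsEndpoint(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/ws/v1/history/mapreduce/jobs");
    }

    /** Fetches the endpoint and returns the raw response body. */
    static String fetch(URI endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.toURL().openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Host defaults to the one used in this article; pass another as args[0].
        String host = args.length > 0 ? args[0] : "hadoop01";
        System.out.println(fetch(jobsEndpoint(host, 19888)));
    }
}
```

Running main requires a live JobHistory server; without one the connection attempt fails with an IOException.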

