Hadoop学习之路(6)MapReduce自定义分区实现

MapReduce自带的分区器是HashPartitioner

原理：先对map输出的key求hash值，再模上reduce task个数，根据结果，决定此输出kv对，被匹配的reduce任务取走。

自定义分分区需要继承Partitioner，复写getpariton()方法

自定义分区类：

注意：map的输出是<K,V>键值对

其中int partitionIndex = dict.get(text.toString())，partitionIndex是获取K的值

附：被计算的的文本

Dear Dear Bear Bear River Car Dear Dear  Bear Rive

Dear Dear Bear Bear River Car Dear Dear  Bear Rive

需要在main函数中设置，指定自定义分区类

自定义分区类：

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {

    public static HashMap<String, Integer> dict = new HashMap<String, Integer>();

    //Text代表着map阶段输出的key,IntWritable代表着输出的值

    static{

        dict.put("Dear", 0);

        dict.put("Bear", 1);

        dict.put("River", 2);

        dict.put("Car", 3);

    }

    public int getPartition(Text text, IntWritable intWritable, int i) {

        //

        int partitionIndex = dict.get(text.toString());

        return partitionIndex;

    }

}

注意：map的输出结果是键值对<K,V>,int partitionIndex = dict.get(text.toString());中的partitionIndex是map输出键值对中的键的值，也就是K的值。

Maper类：

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)

            throws IOException, InterruptedException {

        String[] words = value.toString().split("\t");

        for (String word : words) {

            // 每个单词出现１次，作为中间结果输出

            context.write(new Text(word), new IntWritable(1));

        }

    }

}

Reducer类：

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)

            throws IOException, InterruptedException {

        String[] words = value.toString().split("\t");

        for (String word : words) {

            // 每个单词出现１次，作为中间结果输出

            context.write(new Text(word), new IntWritable(1));

        }

    }

}

main函数：

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountMain {

    public static void main(String[] args) throws IOException,

            ClassNotFoundException, InterruptedException {

        if (args.length != 2 || args == null) {

            System.out.println("please input Path!");

            System.exit(0);

        }

        Configuration configuration = new Configuration();

        configuration.set("mapreduce.job.jar","/home/bruce/project/kkbhdp01/target/com.kaikeba.hadoop-1.0-SNAPSHOT.jar");

        Job job = Job.getInstance(configuration, WordCountMain.class.getSimpleName());

        // 打jar包

        job.setJarByClass(WordCountMain.class);

        // 通过job设置输入/输出格式

        //job.setInputFormatClass(TextInputFormat.class);

        //job.setOutputFormatClass(TextOutputFormat.class);

        // 设置输入/输出路径

        FileInputFormat.setInputPaths(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 设置处理Map/Reduce阶段的类

        job.setMapperClass(WordCountMap.class);

        //map combine

        //job.setCombinerClass(WordCountReduce.class);

        job.setReducerClass(WordCountReduce.class);

        //如果map、reduce的输出的kv对类型一致，直接设置reduce的输出的kv对就行；如果不一样，需要分别设置map, reduce的输出的kv类型

        //job.setMapOutputKeyClass(.class)

        // 设置最终输出key/value的类型m

        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(IntWritable.class);

        job.setPartitionerClass(CustomPartitioner.class);

        job.setNumReduceTasks(4);

        // 提交作业

        job.waitForCompletion(true);

    }

}

main函数参数设置：

Hadoop学习之路(6)MapReduce自定义分区实现的更多相关文章

Hadoop学习之路(7)MapReduce自定义排序
本文测试文本: tom 20 8000 nancy 22 8000 ketty 22 9000 stone 19 10000 green 19 11000 white 39 29000 socrate ...
Hadoop学习之路(5)Mapreduce程序完成wordcount
程序使用的测试文本数据: Dear River Dear River Bear Spark Car Dear Car Bear Car Dear Car River Car Spark Spark D ...
阿里封神谈hadoop学习之路
阿里封神谈hadoop学习之路封神 2016-04-14 16:03:51 浏览3283 评论3 发表于: 阿里云E-MapReduce >> 开源大数据周刊 hadoop 学生 s ...
《Hadoop学习之路》学习实践
(实践机器:blog-bench) 本文用作博文<Hadoop学习之路>实践过程中遇到的问题记录. 本文所学习的博文为博主“扎心了,老铁” 博文记录.参考链接https://www.cnb ...
Hadoop学习之路（十三）MapReduce的初识
MapReduce是什么首先让我们来重温一下 hadoop 的四大组件: HDFS:分布式存储系统 MapReduce:分布式计算系统 YARN:hadoop 的资源调度系统 Common:以上三大 ...
Hadoop mapreduce自定义分区HashPartitioner
本文发表于本人博客. 在上一篇文章我写了个简单的WordCount程序,也大致了解了下关于mapreduce运行原来,其中说到还可以自定义分区.排序.分组这些,那今天我就接上一次的代码继续完善实现自定 ...
Hadoop 学习之路（三）—— 分布式计算框架 MapReduce
一.MapReduce概述 Hadoop MapReduce是一个分布式计算框架,用于编写批处理应用程序.编写好的程序可以提交到Hadoop集群上用于并行处理大规模的数据集. MapReduce作业通 ...
【Hadoop】MapReduce自定义分区Partition输出各运营商的手机号码
MapReduce和自定义Partition MobileDriver主类 package Partition; import org.apache.hadoop.io.NullWritable; i ...
Hadoop学习之路（二十）MapReduce求TopN
前言在Hadoop中,排序是MapReduce的灵魂,MapTask和ReduceTask均会对数据按Key排序,这个操作是MR框架的默认行为,不管你的业务逻辑上是否需要这一操作. 技术点 MapR ...

随机推荐

Why Oracle VIP can not be switched to original node ?
Oracle RAC is an share everything database architecture. The article is how to check out why virtual ...
LAMP: 分布式 HTTP 2.4.25 + PHP 5.4.13 + MySQL 5.5.28 分离部署
目录 A. 环境说明:B. 效果截图:C. HTTP编译安装D. MySQL二进制安装E. PHP源码编译安装F. PHP连接HTTPG. PHP支持扩展功能xcacheH. PHP连接MySQLI. ...
Qt 中 this->size() this->rect() event->size() 三者差异
测试代码: void OsgEarthGraphicsView::resizeEvent(QResizeEvent* event) { //if (scene()) //{ // scene()-&g ...
centos系统重装python或yum 报There was a problem importing one of the Python modules required to run yum. The error leading to this problem was:错误
sudo vim /usr/bin/yum #修个python所在的路径,例如 #/usr/local/bin/python2.6 或 /usr/local/bin/python2.7要原本你的系统原 ...
2000_narrowband to wideband conversion of speech using GMM based transformation
论文地址:基于GMM的语音窄带到宽带转换博客作者:凌逆战博客地址:https://www.cnblogs.com/LXP-Never/p/12151027.html 摘要在不改变现有通信网络的情 ...
1，Python爬虫环境的安装
前言很早以前就听说了Python爬虫,但是一直没有去了解:想着先要把一个方面的知识学好再去了解其他新兴的技术. 但是现在项目有需求,要到网上爬取一些信息,然后做数据分析.所以便从零开始学习Pytho ...
SpringMVC版本报错解决办法
报错代码: <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http:// ...
Django 博客单元测试：测试评论应用
作者:HelloGitHub-追梦人物文中所涉及的示例代码,已同步更新到 HelloGitHub-Team 仓库评论应用的测试和博客应用测试的套路是一样的. 先来建立测试文件的目录结构.首先在 c ...
linux文件系统相关命令(df/du/fsck/dumpe2fs)
一.文件系统查看命令df 格式 df [选项] [挂载点] 选项名称作用 -a 显示所有的文件系统信息,包括特殊文件系统,如/proc,/sysfs -h 使用习惯单位显示容量,如KB,MB或GB ...
Caliburn.Micro框架之Action Convertions
首先新建一个项目,名称叫Caliburn.Micro.ActionConvertions 然后删掉MainWindow.xaml 然后去app.xaml删掉StartupUri这行代码其次,安装Ca ...

Hadoop学习之路(6)MapReduce自定义分区实现

Hadoop学习之路(6)MapReduce自定义分区实现的更多相关文章

随机推荐

热门专题