Map Reduce Application(Partitioninig/Binning)
Map Reduce Application(Partitioninig/Group data by a defined key)
Assuming we want to group data by the year(2008 to 2016) of their [last access date time]. For each year, we use a reducer to collect them and output the data in this group/partition(year of the last access datetime). So, we want the MR to partition our key by year. We will lean what's the default partitioner and see how to set custom partitioner.
The default partitioner:
public int getPartition(K key, V value,
int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Custom Partitioner:
job.setPartitionerClass(CustomPartitioner.class)
With blew partitioner, the data of different year of [last access date time] will be assigned to different / unique partition. The num of reduce tasks is 9.
public static class CustomPartitioner extends Partitioner<Text, Text>{
@Override
public int getPartition(Text key, Text value, int numReduceTasks){
if(numReduceTasks == 0){
return 0;
}
return key-2008
}
Binning pattern
The text/comments/answer/question....contains the specific words will be written into the corresponding files from mapper.
See below picture to understand the binning pattern. It is easier than partitioning as it does not have partition/sorting/shuffling and reducer(job.setNumReduceTasks(0)). The outputs from mappers compose the final outputs.
MultipleOutputs.addNamedOutput(job,"namedoutput",TextOutputFormat.class, NullWritable.class, Text.class)

In the mapper setup function, create the MultipleOutputs intance by calling its constructor
MultipleOutputs(TaskInputOutputContext<?,?,KEYOUT,VALUEOUT> context)
Creates and initializes multiple outputs support, it should be instantiated in the Mapper/Reducer setup method.
@Override
protected void setup(Context context){
maltipleOutputs = new MultipleOurputs(context);
}
Write your logic in the mapper function and output the result. "$tag/$tag-tag" means folder $pag will be created and $tag-tag is the prefix of the files(to distinguish the different mappers with suffix).
See doc for MultipleOutputs:https://hadoop.apache.org/docs/r3.0.1/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
if(tag.equalsIgnoreCase("pig"){
multipleOutputs.write("namedoutput",key,value,"pig/pig-tag");
}
if(tag.equalsIgnoreCase("hive"){
multipleOutputs.write("namedoutput",key,value,"hive/hive-tag");
}
.....
Map Reduce Application(Partitioninig/Binning)的更多相关文章
- Map Reduce Application(Join)
We are going to explain how join works in MR , we will focus on reduce side join and map side join. ...
- Map Reduce Application(Top 10 IDs base on their value)
Top 10 IDs base on their value First , we need to set the reduce to 1. For each map task, it is not ...
- mapreduce: 揭秘InputFormat--掌控Map Reduce任务执行的利器
随着越来越多的公司采用Hadoop,它所处理的问题类型也变得愈发多元化.随着Hadoop适用场景数量的不断膨胀,控制好怎样执行以及何处执行map任务显得至关重要.实现这种控制的方法之一就是自定义Inp ...
- MapReduce剖析笔记之三:Job的Map/Reduce Task初始化
上一节分析了Job由JobClient提交到JobTracker的流程,利用RPC机制,JobTracker接收到Job ID和Job所在HDFS的目录,够早了JobInProgress对象,丢入队列 ...
- python--函数式编程 (高阶函数(map , reduce ,filter,sorted),匿名函数(lambda))
1.1函数式编程 面向过程编程:我们通过把大段代码拆成函数,通过一层一层的函数,可以把复杂的任务分解成简单的任务,这种一步一步的分解可以称之为面向过程的程序设计.函数就是面向过程的程序设计的基本单元. ...
- 记一次MongoDB Map&Reduce入门操作
需求说明 用Map&Reduce计算几个班级中,每个班级10岁和20岁之间学生的数量: 需求分析 学生表的字段: db.students.insert({classid:1, age:14, ...
- filter,map,reduce,lambda(python3)
1.filter filter(function,sequence) 对sequence中的item依次执行function(item),将执行的结果为True(符合函数判断)的item组成一个lis ...
- map reduce
作者:Coldwings链接:https://www.zhihu.com/question/29936822/answer/48586327来源:知乎著作权归作者所有,转载请联系作者获得授权. 简单的 ...
- python基础——map/reduce
python基础——map/reduce Python内建了map()和reduce()函数. 如果你读过Google的那篇大名鼎鼎的论文“MapReduce: Simplified Data Pro ...
随机推荐
- 通用输入输出端口 - GPIO
一.概述 GPlO ( General Purpose I/0 Ports )意思为通用输入/输出端口, 通俗地说, 就是一些引脚.在芯片手册中I/O端口一般是分组的,比如有的芯片分为 A-J 共 9 ...
- JavaScript小练习3-用循环使三个DIV变色
题目 初始为黑色,点击后为红色,再次点击为黑色,以后每次点击一次变色. 分析 简单的onclick使用. button的居中可以在外套一个p元素,body中让p居中即可. 三个DIV块的居中,使用ma ...
- centos7.3上编译安装percona5.7.18
一,删除操作系统自带mariadb yum remove mariadb 二,下载需要的安装包 percona-toolkit-3.0.3-1.el7.x86_64.rpm boost_1_59_0. ...
- 帝国cms全文搜索 增加自定义字段搜索
帝国cms全站搜索功能只能调出固定的几个字段,如果想搜索其他字段的值,这时我们应该怎么办呢?开拓族网站有这个需求,所以研究了一下帝国的全站搜索,后来发现在/e/sch/index.php中可以直接对数 ...
- day 29 socketserver ftp功能的简单讲解
1.上传下载的简单示例 server: import socket import struct import json server =socket.socket() server.bind((' ...
- linux服务器项目部署【完整版】
之前总玩v8虚拟机,最近看到腾讯云学生套餐很实惠就租了个linux服务器搭一个项目,做下这个项目部署全记录,即为了方便以后查看,同时也分享下自己的经验,不足之处还请多多指教,废话不多说,直接开始!!! ...
- ACM1002:A + B Problem II
Problem Description I have a very simple problem for you. Given two integers A and B, your job is to ...
- Python写网络后台脚本
Python写网络后台脚本. 首先安装Python3.6.5,在centos中自带的Python是2.6版本的,现在早就出现了3.6版本了况且2和3 之间的差距还是比较大的,所以我选择更新一下Pyth ...
- PAT-A Java实现
1001 A+B Format (20) 输入:两个数a,b,-1000000 <= a, b <= 1000000 输出:a+b,并以每3个用逗号隔开的形式展示. 思路一: 1)计算出a ...
- javascript之input字符串不为空
今天我们来讲如何判断这个java中字符串输入是否为空 ------------------------当只有一个input的时候,我们来进行个判断这个值是否为空-------------------- ...