Map Reduce Application(Partitioninig/Group data by a defined key)

Assuming we want to group data by the year(2008 to 2016) of their [last access date time]. For each year, we use a reducer to collect them and output the data in this group/partition(year of the last access datetime). So, we want the MR to partition our key by year. We will lean what's the default partitioner and see how to set custom partitioner.

The default partitioner:

 public int getPartition(K key, V value,
int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

Custom Partitioner: 

job.setPartitionerClass(CustomPartitioner.class)    

With blew partitioner, the data of different year of [last access date time] will be assigned to different / unique partition. The num of reduce tasks is 9.

 public static class CustomPartitioner extends Partitioner<Text, Text>{
@Override
public int getPartition(Text key, Text value, int numReduceTasks){
if(numReduceTasks == 0){
return 0;
}
return key-2008
} 

Binning pattern

The text/comments/answer/question....contains the specific words will be written into the corresponding files from mapper.

See below picture to understand the binning pattern. It is easier than partitioning as it does not have partition/sorting/shuffling and reducer(job.setNumReduceTasks(0)). The outputs from mappers compose the final outputs.

MultipleOutputs.addNamedOutput(job,"namedoutput",TextOutputFormat.class, NullWritable.class, Text.class)

In the mapper setup function, create the MultipleOutputs intance by calling its constructor

MultipleOutputs(TaskInputOutputContext<?,?,KEYOUT,VALUEOUT> context)
Creates and initializes multiple outputs support, it should be instantiated in the Mapper/Reducer setup method.
 @Override
protected void setup(Context context){
maltipleOutputs = new MultipleOurputs(context);
}

Write your logic in the mapper function and output the result. "$tag/$tag-tag" means folder $pag will be created and $tag-tag is the prefix of the files(to distinguish the different mappers with suffix).

See doc for MultipleOutputs:https://hadoop.apache.org/docs/r3.0.1/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

 if(tag.equalsIgnoreCase("pig"){
multipleOutputs.write("namedoutput",key,value,"pig/pig-tag");
} if(tag.equalsIgnoreCase("hive"){
multipleOutputs.write("namedoutput",key,value,"hive/hive-tag");
}
.....

                                                                                               

Map Reduce Application(Partitioninig/Binning)的更多相关文章

  1. Map Reduce Application(Join)

    We are going to explain how join works in MR , we will focus on reduce side join and map side join. ...

  2. Map Reduce Application(Top 10 IDs base on their value)

    Top 10 IDs base on their value First , we need to set the reduce to 1. For each map task, it is not ...

  3. mapreduce: 揭秘InputFormat--掌控Map Reduce任务执行的利器

    随着越来越多的公司采用Hadoop,它所处理的问题类型也变得愈发多元化.随着Hadoop适用场景数量的不断膨胀,控制好怎样执行以及何处执行map任务显得至关重要.实现这种控制的方法之一就是自定义Inp ...

  4. MapReduce剖析笔记之三:Job的Map/Reduce Task初始化

    上一节分析了Job由JobClient提交到JobTracker的流程,利用RPC机制,JobTracker接收到Job ID和Job所在HDFS的目录,够早了JobInProgress对象,丢入队列 ...

  5. python--函数式编程 (高阶函数(map , reduce ,filter,sorted),匿名函数(lambda))

    1.1函数式编程 面向过程编程:我们通过把大段代码拆成函数,通过一层一层的函数,可以把复杂的任务分解成简单的任务,这种一步一步的分解可以称之为面向过程的程序设计.函数就是面向过程的程序设计的基本单元. ...

  6. 记一次MongoDB Map&Reduce入门操作

    需求说明 用Map&Reduce计算几个班级中,每个班级10岁和20岁之间学生的数量: 需求分析 学生表的字段: db.students.insert({classid:1, age:14, ...

  7. filter,map,reduce,lambda(python3)

    1.filter filter(function,sequence) 对sequence中的item依次执行function(item),将执行的结果为True(符合函数判断)的item组成一个lis ...

  8. map reduce

    作者:Coldwings链接:https://www.zhihu.com/question/29936822/answer/48586327来源:知乎著作权归作者所有,转载请联系作者获得授权. 简单的 ...

  9. python基础——map/reduce

    python基础——map/reduce Python内建了map()和reduce()函数. 如果你读过Google的那篇大名鼎鼎的论文“MapReduce: Simplified Data Pro ...

随机推荐

  1. Inconsistant light map between PC and Mobile under Unity3D

    Author: http://www.cnblogs.com/open-coder/p/3898159.html The light mapping effects between PC and Mo ...

  2. 微信小程序调用api接口

    请求的第三方微信url大概有3种 1)$url = "https://api.weixin.qq.com/sns/oauth2/access_token?appid=$appid&s ...

  3. Vue learning experience

    一.内置指令[v-ref] Official-document-expression: 父组件在子组件上注册的索引,便于直接访问.不需要表达式,必须提供参数ID,可以通过父组件的$ref对象访问子组件 ...

  4. kafka初步学习

    消息系统 什么是消息系统? 消息系统负责将数据从一个应用程序传输到另一个应用程序,因此应用程序可以专注于数据,但不担心如何共享它.分布式消息传递给予可靠消息队列的概念.消息在客户端应用程序和消息传递系 ...

  5. Discuz被挂马 快照被劫持跳转该如何处理 如何修复discuz漏洞

    Discuz 3.4是目前discuz论坛的最新版本,也是继X3.2.X3.3来,最稳定的社区论坛系统.目前官方已经停止对老版本的补丁更新与升级,直接在X3.4上更新了,最近我们SINE安全在对其安全 ...

  6. ACM1004:Let the Balloon Rise

    Problem Description Contest time again! How excited it is to see balloons floating around. But to te ...

  7. python--模块之time,datetime时间模块

    time: 表示时间的三种方式:时间戳.格式化的时间字符串.元组时间戳是计算机能够识别的时间:时间字符串是我们能够看懂的时间:元组是用来操作时间: 导入时间模块import time 1,时间戳(ti ...

  8. 北京Uber优步司机奖励政策(12月27日)

    滴快车单单2.5倍,注册地址:http://www.udache.com/ 如何注册Uber司机(全国版最新最详细注册流程)/月入2万/不用抢单:http://www.cnblogs.com/mfry ...

  9. GDAL2.1.1库在Ubuntu14.04下编译时遇到的问题处理方法

    不用作任何调整,直接在Linux下编译GDAL2.1.1源码的步骤是: $ ./configure $ make $ make install 非常简单,这样也能正常生成gdal动态库.静态库,如果想 ...

  10. 离线安装Sharepoint工具

    1. 首先安装操作系统,Windows Server 2008 R2,可以是企业版,也可以是数据中心版.然后再安装上SP1. 2. 在"服务管理"里面,添加角色,安装IIS.    ...