Map Reduce Application (Partitioning / Grouping data by a defined key)

Suppose we want to group records by the year (2008 to 2016) of their [last access date time]. For each year, one reducer collects the records and writes out that group/partition (the year of the last access datetime). In other words, we want MapReduce to partition the keys by year. We will look at what the default partitioner does and then see how to set a custom partitioner.

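The default partitioner (HashPartitioner):

    public int getPartition(K key, V value, int numReduceTasks) {
        // Clear the sign bit so the result is never negative, then spread keys over the reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }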
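To see why the default partitioner does not give one predictable partition per year, the same arithmetic can be tried outside Hadoop. This is a minimal sketch in plain Java: the class and method names are made up for illustration, and String.hashCode() stands in for the hash of the real key type, which is an assumption (a Hadoop Text key hashes its bytes differently).

```java
class HashPartitionSketch {
    // Same arithmetic as the default partitioner: mask the sign bit, then take the remainder.
    static int partitionFor(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 9;
        for (int year = 2008; year <= 2016; year++) {
            String key = Integer.toString(year);
            // Assumption: String.hashCode() is used here only as a stand-in hash.
            System.out.println(key + " -> partition " + partitionFor(key.hashCode(), reducers));
        }
        // Math.abs(Integer.MIN_VALUE) is still negative, so the mask is the safe choice.
        System.out.println("MIN_VALUE -> partition " + partitionFor(Integer.MIN_VALUE, reducers));
    }
}
```

The partition a year lands in depends only on its hash, so two years may share a reducer and the partition number says nothing about the year. That is the motivation for a custom partitioner.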

Custom partitioner:

    job.setPartitionerClass(CustomPartitioner.class);
    job.setNumReduceTasks(9);

With the partitioner below, records from different years of [last access date time] are assigned to different, unique partitions. The number of reduce tasks is 9, one per year from 2008 to 2016.

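    public static class CustomPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            // Guard for a job configured without reducers.
            if (numReduceTasks == 0) {
                return 0;
            }
            // The key holds the year as text, e.g. "2008"; years 2008..2016 map to partitions 0..8.
            return Integer.parseInt(key.toString()) - 2008;
        }
    }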
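The year-to-partition mapping can be checked without a cluster. The following is a sketch in plain Java, not the Hadoop class itself: the names YearPartitionSketch, FIRST_YEAR and partitionForYear are hypothetical, and the range check is an addition that the partitioner above does not have.

```java
class YearPartitionSketch {
    static final int FIRST_YEAR = 2008; // first year of the range used in this post

    // Maps a year key such as "2008" to a partition index, mirroring key - 2008.
    static int partitionForYear(String yearKey, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        int index = Integer.parseInt(yearKey.trim()) - FIRST_YEAR;
        // Added safeguard (assumption): reject years that do not fit the configured reducers.
        if (index < 0 || index >= numReduceTasks) {
            throw new IllegalArgumentException("Year out of range: " + yearKey);
        }
        return index;
    }

    public static void main(String[] args) {
        for (int year = 2008; year <= 2016; year++) {
            System.out.println(year + " -> partition " + partitionForYear(Integer.toString(year), 9));
        }
    }
}
```

With 9 reduce tasks each year gets its own partition, from 0 for 2008 up to 8 for 2016. A partition index outside the range of configured reducers would make the real job fail, which is why the sketch checks the range explicitly.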

Binning pattern

Records (text, comments, answers, questions, ...) that contain specific words are written by the mapper into the corresponding files.

See the picture below to understand the binning pattern. It is simpler than partitioning because it is a map-only job: there is no partitioning, sorting, or shuffling and no reducer (job.setNumReduceTasks(0)). The outputs of the mappers make up the final output.

    MultipleOutputs.addNamedOutput(job, "namedoutput", TextOutputFormat.class, NullWritable.class, Text.class);

In the mapper's setup method, create the MultipleOutputs instance by calling its constructor:

MultipleOutputs(TaskInputOutputContext<?,?,KEYOUT,VALUEOUT> context)
Creates and initializes multiple outputs support, it should be instantiated in the Mapper/Reducer setup method.
    @Override
    protected void setup(Context context) {
        // Remember to call multipleOutputs.close() in cleanup().
        multipleOutputs = new MultipleOutputs<>(context);
    }

Write your logic in the map function and emit the result. The base output path "$tag/$tag-tag" means that a folder $tag is created and $tag-tag is the prefix of the files in it (a suffix distinguishes the files written by different mappers).

See the documentation for MultipleOutputs: https://hadoop.apache.org/docs/r3.0.1/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
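The "$tag/$tag-tag" naming convention can be made concrete with a small plain-Java sketch. It assumes the usual suffix that Hadoop's file output formats append for map tasks (-m- followed by a five-digit task number); the class and method names are hypothetical and exist only for this illustration.

```java
class BinPathSketch {
    // Builds the base output path "$tag/$tag-tag" used in this post.
    static String baseOutputPath(String tag) {
        return tag + "/" + tag + "-tag";
    }

    // Assumption: map-task output files get a "-m-NNNNN" suffix after the base path.
    static String mapOutputFile(String tag, int mapTaskNumber) {
        return String.format("%s-m-%05d", baseOutputPath(tag), mapTaskNumber);
    }

    public static void main(String[] args) {
        System.out.println(mapOutputFile("pig", 0));
        System.out.println(mapOutputFile("hive", 12));
    }
}
```

Under that assumption, the first mapper writing "pig" records would produce pig/pig-tag-m-00000 below the job's output directory, and each further mapper adds its own numbered file in the same folder.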

    if (tag.equalsIgnoreCase("pig")) {
        multipleOutputs.write("namedoutput", key, value, "pig/pig-tag");
    }
    if (tag.equalsIgnoreCase("hive")) {
        multipleOutputs.write("namedoutput", key, value, "hive/hive-tag");
    }
    .....
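The routing decision made in the mapper can be simulated in plain Java to check which records end up in which bin. This is only a sketch: it keeps the bins in an in-memory map instead of writing files, the tag list is limited to the two tags shown above, and all names (BinningSketch, KNOWN_TAGS, binFor, bin) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BinningSketch {
    // Assumption: only the two tags shown in the post are binned here.
    static final List<String> KNOWN_TAGS = Arrays.asList("pig", "hive");

    // Returns the base output path for a tag, or null when the record belongs to no bin.
    static String binFor(String tag) {
        for (String known : KNOWN_TAGS) {
            if (known.equalsIgnoreCase(tag)) {
                return known + "/" + known + "-tag";
            }
        }
        return null;
    }

    // Groups {tag, text} records by bin path; records without a matching bin are dropped.
    static Map<String, List<String>> bin(String[][] records) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String[] record : records) {
            String path = binFor(record[0]);
            if (path != null) {
                bins.computeIfAbsent(path, k -> new ArrayList<>()).add(record[1]);
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Pig", "How do I load data?"},
            {"hive", "Partitioned tables"},
            {"spark", "Not binned in this sketch"},
        };
        System.out.println(bin(records));
    }
}
```

Because each record is routed on its own, no shuffle or reducer is needed, which matches the map-only setup described above.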

                                                                                               
