MapReduce(3): Partitioner, Combiner and Shuffling

Partitioner:

Partitioning and combining take place between the Map and Reduce phases. The partitioner groups together the map output that should go to the same reducer, based on the key. The number of partitions is equal to the number of reducers, so the partitioner divides the data according to the number of reducers, and the data in a single partition is processed by a single reducer. HashPartitioner is the default partitioner in Hadoop.

A partitioner partitions the key-value pairs of the intermediate map output. It assigns each pair to a partition using a user-defined condition, which works like a hash function on the key. The total number of partitions is the same as the number of reducer tasks for the job. Within each mapper, records with the same key go into the same partition. Because every mapper applies the same function, all records with a given key end up at the same reducer.

Partitioning runs locally, on the machine that executes the map task.
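
To make the hash-style partitioning concrete, here is a minimal sketch in Python rather than Hadoop's Java API. The function names `partition` and `partition_map_output` are hypothetical, and `zlib.crc32` is only a stand-in for the key's `hashCode()`. Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch mirrors that shape.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Return the partition index for a key (hypothetical helper)."""
    # zlib.crc32 is an assumed stand-in for the key's hashCode(). It is stable
    # across runs, unlike Python's built-in hash() for strings.
    key_hash = zlib.crc32(key.encode("utf-8"))
    # Mask to a non-negative value, then take it modulo the reducer count.
    return (key_hash & 0x7FFFFFFF) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Split one mapper's (key, value) pairs into per-reducer partitions."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return dict(partitions)


# Sample intermediate output of a single map task (made-up data).
map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
result = partition_map_output(map_output, num_reducers=3)
```

Both `("apple", 1)` records land in the same partition, and every partition index falls in the range 0 to 2, one per reducer.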

Combiner:

A combiner is a 'mini-reducer' (semi-reducer): it does part of the reducer's work on the map side, before the data is transferred to the reducers. Because less data has to be sent, it can reduce network congestion.
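
As an illustration, here is a minimal word-count style sketch of a combiner, again in Python and not Hadoop's API. The function name `combine` and the sample data are assumptions. It pre-sums the values per key on the map side, which is only safe for operations such as sum or max where partial results can be merged in any order.

```python
from collections import Counter


def combine(map_output):
    """Locally sum the values for each key (hypothetical combiner)."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    # Sort by key so the result is deterministic.
    return sorted(totals.items())


# Output of one map task before combining (made-up data).
map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1)]
combined = combine(map_output)
# combined == [("apple", 3), ("banana", 1)]
```

Four records shrink to two before anything crosses the network, and the reducer still receives the same totals.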

Shuffle:

When a map task finishes, the shuffle step has the master arrange for its output files to be copied to the reducer machines. The final output of a map task can contain multiple partitions, and these partitions must go to different reduce tasks. Shuffling is basically the transfer of map output partitions to the corresponding reduce tasks. Each map task notifies the application master when it completes, and the application master notifies the corresponding reducers to copy the map output to their machines. Because shuffling can start before all map tasks have finished, it saves time and lets the job complete sooner.
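
The data movement described above can be sketched as a small in-memory simulation. This is not how Hadoop implements the copy. The function name `shuffle`, the dictionary layout, and the sample data are all assumptions made for illustration. Each reducer collects its own partition from every mapper and groups the values by key.

```python
from collections import defaultdict


def shuffle(mapper_outputs, num_reducers):
    """Build each reducer's input from all mappers' partitions (hypothetical)."""
    reducer_inputs = []
    for reducer_id in range(num_reducers):
        grouped = defaultdict(list)
        # "Copy" this reducer's partition from every mapper, in mapper order.
        for partitions in mapper_outputs:
            for key, value in partitions.get(reducer_id, []):
                grouped[key].append(value)
        # Order the keys so the reducer sees them sorted.
        reducer_inputs.append(dict(sorted(grouped.items())))
    return reducer_inputs


# Two map tasks, each with output already split into partitions 0 and 1.
mapper_outputs = [
    {0: [("apple", 2)], 1: [("banana", 1)]},
    {0: [("apple", 1)], 1: [("cherry", 4)]},
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
# reducer_inputs[0] == {"apple": [2, 1]}
# reducer_inputs[1] == {"banana": [1], "cherry": [4]}
```

Reducer 0 ends up with every value for "apple" from both mappers, which is what allows it to compute the final result for that key on its own.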

References:

https://www.cnblogs.com/hadoop-dev/p/5910459.html

https://blog.csdn.net/bitcarmanlee/article/details/60137837

http://geekdirt.com/blog/map-reduce-in-detail/

Using a hash function to map intermediate (K, V) pairs:

https://en.wikipedia.org/wiki/Hash_function

https://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm

https://data-flair.training/blogs/hadoop-partitioner-tutorial/
