MapReduce的洗牌(Shuffle)

Shuffle过程：数据从map端传输到reduce端的过程~

Map端

每个map有一个环形内存缓冲区，用于存储任务的输出。默认大小100MB（io.sort.mb属性），一旦达到阀值0.8（io.sort.spill.percent）,一个后台线程把内容写到(spill)磁盘的指定目录（mapred.local.dir）下的新建的一个溢出写文件。
写磁盘前，要partition,sort。如果有combiner，combine排序后数据。
等最后记录写完，合并全部溢出写文件为一个分区且排序的文件。

hadoop1中的是resourcemanager，在hadoop2中applicationmaster会通过reduce task从map task拷贝文件

Reduce端

Reducer通过Http方式得到输出文件的分区。 ( 上图为3个reduce任务，每一个分区产生一个reduce任务,分区后的数据通过shuffle，由reduce主动fetch数据，通过网络copy到reduce端)
TaskTracker为分区文件运行Reduce任务。复制阶段把Map输出复制到Reducer的内存或磁盘。一个Map任务完成，Reduce就开始复制输出。
排序阶段合并map输出。然后走Reduce阶段。

优化点：

内存缓冲器越小的时，往磁盘写的几率会增加。磁盘上会产生更多小文件的合并。数据的排序发生在内存中，如果缓冲区越大，也就是往磁盘写入的更少。
Spill到指定目录，如果把指定目录建立在固定硬盘上速度会加快。
数据传输的时候网络也是可优化的，可增加网络带宽。

源码导读：

注释：

/**
* Reduces a set of intermediate values which share a key to a smaller set of Reduce减少汇总了一些中间值的集合，共享一个key给一些较小值得集合
* values.
*

<p><code>Reducer</code> has 3 primary phases:</p>
* <ol>
* <li>
*
* <h4 id="Shuffle">Shuffle</h4>
*
* <p>The <code>Reducer</code> copies the sorted output from each //复制每个的排序输出，核心是拷贝
* {@link Mapper} using HTTP across the network.</p> //在整个网络上使用HTTP，网络传输的过程就是shuffle的过程
* </li>
*
* <li>
* <h4 id="Sort">Sort</h4>
*
* <p>The framework merge sorts <code>Reducer</code> inputs by
* <code>key</code>s
* (since different <code>Mapper</code>s may have output the same key).</p>
*
* <p>The shuffle and sort phases occur simultaneously i.e. while outputs are
* being fetched they are merged.</p>
*
* <h5 id="SecondarySort">SecondarySort</h5>
*
* <p>To achieve a secondary sort on the values returned by the value
* iterator, the application should extend the key with the secondary
* key and define a grouping comparator. The keys will be sorted using the
* entire key, but will be grouped using the grouping comparator to decide
* which keys and values are sent in the same call to reduce.The grouping
* comparator is specified via
* {@link Job#setGroupingComparatorClass(Class)}. The sort order is
* controlled by
* {@link Job#setSortComparatorClass(Class)}.</p>

* For example, say that you want to find duplicate web pages and tag them
* all with the url of the "best" known example. You would set up the job
* like:
* <ul>
* <li>Map Input Key: url</li>
* <li>Map Input Value: document</li>
* <li>Map Output Key: document checksum, url pagerank</li>
* <li>Map Output Value: url</li>
* <li>Partitioner: by checksum</li>
* <li>OutputKeyComparator: by checksum and then decreasing pagerank</li>
* <li>OutputValueGroupingComparator: by checksum</li>
* </ul>
* </li>
*
* <li>
* <h4 id="Reduce">Reduce</h4>
*
* <p>In this phase the
* {@link #reduce(Object, Iterable, Context)}
* method is called for each <code><key, (collection of values)></code> in
* the sorted inputs.</p>
* <p>The output of the reduce task is typically written to a
* {@link RecordWriter} via
* {@link Context#write(Object, Object)}.</p>
* </li>
* </ol>
*
* <p>The output of the <code>Reducer</code> is <b>not re-sorted</b>.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
* Key,IntWritable> {
* private IntWritable result = new IntWritable();
*
* public void reduce(Key key, Iterable<IntWritable> values,
* Context context) throws IOException, InterruptedException {
* int sum = 0;
* for (IntWritable val : values) {
* sum += val.get();
* }
* result.set(sum);
* context.write(key, result);
* }
* }
* </pre></blockquote></p>
*
* @see Mapper
* @see Partitioner
*/

End！

MapReduce的洗牌(Shuffle)的更多相关文章

洗牌Shuffle'm Up POJ-3087 模拟
题目链接:Shuffle'm Up 题目大意模拟纸牌的洗牌过程,已知两个牌数相等的牌堆.求解经过多少次洗牌的过程,使牌的顺序与目标顺序相同. 思路直接模拟,主要是字符串的操作.问题是,如何判断出不 ...
erlang 洗牌 shuffle
很简单的一个场景:一副扑克(54张)的乱序洗牌 shuffle_list(List) -> [X || {_, X} <- lists:sort([{random:uniform(), N ...
Fisher–Yates shuffle 洗牌(shuffle)算法
今天在敲undersore的源码,数组里面有一个shuffle,把数组随机打乱. _.shuffle = function(obj) { var set = isArrayLike(obj) ? ob ...
【转】Algorithms -离散概率值（discrete）和重置、洗牌（shuffle）算法及代码
离散概率值(discrete) 和重置\洗牌(shuffle) 算法及代码本文地址: http://blog.csdn.net/caroline_wendy/article/details/1 ...
用C语言实现的扑克牌洗牌程序
一副牌:54张从0开始排序: 0-12表示黑桃 A 1,2,3,... 10,J,Q,K 13-25表示红桃 A 1,2,3,... 10,J,Q,K 26-38表示草花 A 1,2,3,... ...
[大牛翻译系列]Hadoop（13）MapReduce 性能调优：优化洗牌（shuffle）和排序阶段
6.4.3 优化洗牌(shuffle)和排序阶段洗牌和排序阶段都很耗费资源.洗牌需要在map和reduce任务之间传输数据,会导致过大的网络消耗.排序和合并操作的消耗也是很显著的.这一节将介绍一系列 ...
[LeetCode] Shuffle an Array 数组洗牌
Shuffle a set of numbers without duplicates. Example: // Init an array with set 1, 2, and 3. int[] n ...
[转]完美洗牌（Perfect Shuffle）问题
[转]原博文地址:https://github.com/julycoding/The-Art-Of-Programming-By-July/blob/master/ebook/zh/02.09.md ...
[CareerCup] 18.2 Shuffle Cards 洗牌
18.2 Write a method to shuffle a deck of cards. It must be a perfect shuffle—in other words, each of ...

随机推荐

android5.1移植记录
应用能够配置Android系统的各种设置,这些设置的默认值都是由frameworks中的SettingsProvider从数据库中读取的frameworks/base/packages/Setting ...
Android图片处理（Matrix,ColorMatrix） - 转载
Android图片处理(Matrix,ColorMatrix) 转载自:http://www.cnblogs.com/leon19870907/articles/1978065.html 在编程中有时 ...
Unity3d OnApplicationPause与OnApplicationFocus
在手机游戏当中,会碰到“强制暂停”,如:锁屏.接电话或短信之类的.如果“强制暂停”时间过长,网络游戏有时得重新登录等事件. 而Unity3d,Android Plugins中的UnityPlayer. ...
lua中的字符串操作(模式匹配)
(一). 模式匹配函数在string库中功能最强大的函数是:string.find(字符串查找)string.gsub(全局字符串替换)string.gfind(全局字符串查找)string.gmat ...
关于GDI+的一些使用基础设置
一.新建一个MFC的单文档工程,例如工程名字叫GDIPLUSTEST1. 二.在工程的stdafx.h头文件中添加入 #include "gdiplus.h" using name ...
javaBean的理解总结
javaBean简单理解:javaBean在MVC设计模型中是model,又称模型层,在一般的程序中,我们称它为数据层,就是用来设置数据的属性和一些行为,然后我会提供获取属性和设置属性的get/set ...
MDU某产品OMCI模块代码质量现状分析
说明本文参考MDU系列某产品OMCI模块现有代码,提取若干实例以说明目前的代码质量,亦可作为甄别不良代码的参考. 本文旨在就事论事,而非否定前人(没有前人的努力也难有后人的进步).希望以史为鉴,不破 ...
【盘古分词】Lucene.Net 盘古分词实现公众号智能自动回复
盘古分词是一个基于 .net framework 的中英文分词组件.主要功能中文未登录词识别盘古分词可以对一些不在字典中的未登录词自动识别词频优先盘古分词可以根据词频来解决分词的歧义问题多元 ...
10分钟10行代码开发APP(delphi 应用案例)
总结一下用到的知识(开发环境安装配置不计算在内): 第六章使用不同风格的按钮: 第十七章让布局适应不同大小与方向的窗体: 第二十五章使用 dbExpress访问 InterBase ToGo ...
关于使用Delphi XE10 进行android开发的一些总结
RAD,可以快速开发出来,但是问题较多最好别用说实话做出来的app 太!大!了! 十分的特别的占内存! FireMonkey 真心太大了... 太占内存了开发一般应用还可 ...

MapReduce的洗牌(Shuffle)

MapReduce的洗牌(Shuffle)的更多相关文章

随机推荐

热门专题