2014 MapReduce

function map(String name, String document):

  // name: document name

  // document: document contents

  for each word w in document:

    emit (w, )

function reduce(String word, Iterator partialCounts):

  // word: a word

  // partialCounts: a list of aggregated partial counts

  sum =

  for each pc in partialCounts:

    sum += pc

  emit (word, sum)

The prototypical MapReduce example counts the appearance of each word in a set of documents:^[14]

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.

 SELECT age, AVG(contacts)

    FROM social.person

GROUP BY age

ORDER BY age

function Map is

    input: integer K1 between  and , representing a batch of  million social.person records

    for each social.person record in the K1 batch do

        let Y be the person's age

        let N be the number of contacts the person has

        produce one output record (Y,(N,))

    repeat

end function

function Reduce is

    input: age (in years) Y

    for each input record (Y,(N,C)) do

        Accumulate in S the sum of N*C

        Accumulate in Cnew the sum of C

    repeat

    let A be S/Cnew

    produce one output record (Y,(A,Cnew))

end function

-- map output #: age, quantity of contacts

,

,

,

-- map output #: age, quantity of contacts

,

,

-- map output #: age, quantity of contacts

,

-- reduce step #: age, average of contacts

,

(9*3+9*2+10*1)/(3+2+1)

(9*5+10*1)/(5+1)

imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer

https://zh.wikipedia.org/wiki/MapReduce

^ "我们的灵感来自lisp和其他函数式编程语言中的古老的映射和归纳操作." -"MapReduce：大规模集群上的简单数据处理方式"

MapReduce是Google提出的一个软件架构，用于大规模数据集（大于1TB）的并行运算。概念“Map（映射）”和“Reduce（归纳）”，及他们的主要思想，都是从函数式编程语言借来的，还有从矢量编程语言借来的特性。^[1]

当前的软件实现是指定一个Map（映射）函数，用来把一组键值对映射成一组新的键值对，指定并发的Reduce（归纳）函数，用来保证所有映射的键值对中的每一个共享相同的键组。

映射和归纳

简单来說，一个映射函数就是对一些独立元素组成的概念上的列表（例如，一个测试成绩的列表）的每一个元素进行指定的操作（比如，有人发现所有学生的成绩都被高估了一分，他可以定义一个“减一”的映射函数，用来修正这个错误。）。事实上，每个元素都是被独立操作的，而原始列表没有被更改，因为这里创建了一个新的列表来保存新的答案。这就是说，Map操作是可以高度并行的，这对高性能要求的应用以及并行计算领域的需求非常有用。

而归纳操作指的是对一个列表的元素进行适当的合并（继续看前面的例子，如果有人想知道班级的平均分该怎么做？他可以定义一个归纳函数，通过让列表中的奇數（odd）或偶數（even）元素跟自己的相邻的元素相加的方式把列表减半，如此递归运算直到列表只剩下一个元素，然后用这个元素除以人数，就得到了平均分）。虽然他不如映射函数那么并行，但是因为归纳总是有一个简单的答案，大规模的运算相对独立，所以归纳函数在高度并行环境下也很有用。

分布和可靠性

MapReduce通过把对数据集的大规模操作分发给网络上的每个节点实现可靠性；每个节点会周期性的把完成的工作和状态的更新报告回来。如果一个节点保持沉默超过一个预设的时间间隔，主节点（类同Google檔案系統中的主服务器）记录下这个节点状态为死亡，并把分配给这个节点的数据发到别的节点。每个操作使用命名文件的不可分割操作以确保不会发生并行线程间的冲突；当文件被改名的时候，系统可能会把他们复制到任务名以外的另一个名字上去。（避免副作用）。

归纳操作工作方式很类似，但是由于归纳操作在并行能力较差，主节点会尽量把归纳操作调度在一个节点上，或者离需要操作的数据尽可能近的节点上了；这个特性可以满足Google的需求，因为他们有足够的带宽，他们的内部网络没有那么多的机器。

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.^[1]^[2]

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis.^[3] It is inspired by the map and reduce functions commonly used in functional programming,^[4] although their purpose in the MapReduce framework is not the same as in their original forms.^[5] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's^[6] reduce^[7] and scatter^[8] operations), but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.^[9] The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.^[10]

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary Big Data processing model,^[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.^[12]

2014 MapReduce的更多相关文章

Hadoop MapReduce执行过程详解（带hadoop例子）
https://my.oschina.net/itblog/blog/275294 摘要: 本文通过一个例子,详细介绍Hadoop 的 MapReduce过程. 分析MapReduce执行过程 Map ...
PageRank算法简介及Map-Reduce实现
PageRank对网页排名的算法,曾是Google发家致富的法宝.以前虽然有实验过,但理解还是不透彻,这几天又看了一下,这里总结一下PageRank算法的基本原理. 一.什么是pagerank Pag ...
Window7中Eclipse运行MapReduce程序报错的问题
按照文档:http://www.micmiu.com/bigdata/hadoop/hadoop2x-eclipse-mapreduce-demo/安装配置好Eclipse后,运行WordCount程 ...
Hadoop学习之Mapreduce执行过程详解
一.MapReduce执行过程 MapReduce运行时,首先通过Map读取HDFS中的数据,然后经过拆分,将每个文件中的每行数据分拆成键值对,最后输出作为Reduce的输入,大体执行流程如下图所示: ...
Strata 2014 上的 AzureCAT 粉笔会谈
本周,AzureCAT 团队非常高兴在 Strata 会议上首次集体亮相.对于那些对 AzureCAT 团队不太熟悉的人来说,我们是 Microsoft 云与企业部门一个核心的国际性团队,由大约 ...
HADOOP之MAPREDUCE程序应用二
摘要:MapReduce程序进行单词计数. 关键词:MapReduce程序单词计数数据源:人工构造英文文档file1.txt,file2.txt. file1.txt 内容 Hello Ha ...
Hadoop之MapReduce程序应用三
摘要:MapReduce程序进行数据去重. 关键词:MapReduce 数据去重数据源:人工构造日志数据集log-file1.txt和log-file2.txt. log-file1.txt内容 ...
Mapreduce参数调节
http://blog.javachen.com/2014/06/24/tuning-in-mapreduce/ 本文主要记录Hadoop 2.x版本中MapReduce参数调优,不涉及Yarn的调优 ...
Hadoop MapReduce开发最佳实践（上篇）
body{ font-family: "Microsoft YaHei UI","Microsoft YaHei",SimSun,"Segoe UI& ...

随机推荐

openerp-server.conf 中配置 dbfilter 参数无效的解决办法
来自:http://shine-it.net/index.php/topic,14517.html 以前就发现过这个问题, 今天重新在群里同大家讨论了一下. 有时候可能我们希望用户不从登陆界面的账套选 ...
CentOS6.5下docker的安装及遇到的问题和简单使用（已实践）
转载自 CentOS6下docker的安装和使用 Docker是一个开源的应用容器引擎,可以轻松的为任何应用创建一个轻量级的.可移植的.自给自足的容器.利用Linux的LXC.AUFS. Go语言.c ...
win10下iis绑定局域网ip无效的解决方案
win7不会出现此问题 win10会 win8未测试问题描述 <binding protocol="http" bindingInformation="*:808 ...
AngularJs学习笔记（4）——自定义指令
对指令的第一印象:它是一个自定义标签! 先来看一个简单的指令: <!doctype html> <html ng-app="myApp"> <head ...
有关IM即时通讯原理
在网上搜索了一些资料,谈谈自己对IM即时通讯的理解 IM全称为Instant Messaging,即时通讯,如qq那种的. 现在有两个用户UserA, UserB, 俩人是一个IM通讯软件的好友,Us ...
Android Volley框架的几种post提交请求方式
首先简单描述一下Google的Android开发团队在2013年推出的一个网络通信框架Volley.它的设计目标是进行数据量不大,但通信频繁的网络操作,而对于大数据量的网络操作,比如下载文件等,Vol ...
android推送方式
本文介绍在Android中实现推送方式的基础知识及相关解决方案.推送功能在手机开发中应用的场景是越来起来了,不说别的,就我们手机上的新闻客户端就时不j时的推送过来新的消息,很方便的阅读最新的新闻信息. ...
Cocos2D-X2.2.3学习笔记10(几何图形)
我们这节来学习几何图形,即怎样使用Cocos2d-x绘制各种图形.已经贝塞尔曲线我们查看CCNode中有个draw函数,我们须要将绘制的代码所有写在这个函数里面.写在init函数里是画不出线来的, ...
NHibernate Transformers.AliasToEntityMap 返回Hashtable
string query = "select a.CustomerName as CustomerName, b.ProductName as ProductName from Custom ...
spring 第一篇（1-1）：让java开发变得更简单(上)
1.释放POJOS能量传统开发中是如何束缚POJOS呢,如果你开发过java很长时间,那你一定有接触过EJB的开发.那时候开发一个小小的功能都要扩展框架的类或者实现其接口.所以你很容易在早期的Str ...