Google三驾马车之二:MapReduce
第一次接触mr还是在入门mit6.824的lab1,最近重新读了一遍原始论文,又有了一些新的想法,简单做一些记录。
作为Google分布式系统的重要组成,本篇文章核心在于map/reduce操作带来的抽象并行化,给出接口之后,编写应用程序的程序员就不需要对底层的机制做过多的处理。而在本质上,mr只是实现了一组分布式的并行框架,而实际依赖的底层分布式infrastructure还是GFS。
MapReduce: Simplified Data Processing on Large Clusters
Programming Model
K/V pairs
original task --map--> intermedicate K/V pairs --shuffle--> --reduce--> result
shuffle: to generate the same key list
the result can be multi-set, the merge work is finished by user function
Here is a word cound app from mit 6.824(golang)
// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
func Map(filename string, contents string) []mr.KeyValue {
// function to detect word separators.
ff := func(r rune) bool { return !unicode.IsLetter(r) }
// split contents into an array of words.
words := strings.FieldsFunc(contents, ff)
kva := []mr.KeyValue{}
for _, w := range words {
kv := mr.KeyValue{w, "1"}
kva = append(kva, kv)
}
return kva
}
// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
func Reduce(key string, values []string) string {
// return the number of occurrences of this word.
return strconv.Itoa(len(values))
}
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(v2)
working flow
- split input file
- master(coordinator) allocate map-task
- do map, generate inter k/v pair
- write inter k/v pair in R partition
- sort on master, reduce: RPC read inter-file from map machine
- final output to GFS
- return
After successful completion, the output of the mapreduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
In practical grogramming, the atomic operation is important(regardless of C++ or Go or...)
fault tolerant
heartbeat: master <---> slave
Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
reduce re-execute if has not read finish from a map-machine(RPC would fail)
Task Granularity
M, R >> machine number -> load balance
common: M > R, to decrease final file number
Backup Tasks
solve straggler: When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
Refinements
- Partitioning function: pre-define the number of output file: use hash
- pre-sort
- combiner, eg. in wc map-task, append the same key-value here
- input/output: different file type(read by line or offset), database and memory are also useful.
- side-effects
- error in code: skipping bad records(optional)
- sequential on local machine(help to debug)
- display the task status(command or gui) -> data collect and analyse
- counter(in lib) for sth.
Discussion
in some cases: we can also store the inter-file in the global file system,
thus we dont need re-execute the map-task if machine shutdown,
take the reduce RPC as GFS reading.
the bindwidth can be the essential bottleneck, p2p network can decrease the master's I/O pressure
MapReduce: open source version: hadoop(yahoo/apache)
middle step: shuffle, one key run (not) once????? in reduce
so we need combiner?
- use combiner: reduce one key once
- dont use combiner: reduce one key map partition times
but where combiner running?
map-local-disk?: local combine
master?: dont do any logic calculating work
before reduce?: shuffle
shuffle & combine could be bottleneck
task failure: restart tasks
node failure: restart tasks on new node: re-run all finished task for lose inter-file
Google三驾马车之二:MapReduce的更多相关文章
- 分布式系统漫谈一 —— Google三驾马车: GFS,mapreduce,Bigtable
分布式系统学习必读文章!!!! 原文:http://blog.sina.com.cn/s/blog_4ed630e801000bi3.html 分布式系统漫谈一 —— Google三驾马车: GFS, ...
- [MapReduce] Google三驾马车:GFS、MapReduce和Bigtable
声明:此文转载自博客开发团队的博客,尊重原创工作.该文适合学分布式系统之前,作为背景介绍来读. 谈到分布式系统,就不得不提Google的三驾马车:Google FS[1],MapReduce[2],B ...
- Google三驾马车:GFS、MapReduce和Bigtable
谈到分布式系统,就不得不提Google的三驾马车:Google fs[1],Mapreduce[2],Bigtable[3]. 虽然Google没有公布这三个产品的源码,但是他发布了这三个产品的详细设 ...
- Google三驾马车
Google旧三驾马车: GFS,mapreduce,Bigtable http://blog.sina.com.cn/s/blog_4ed630e801000bi3.html Google新三驾马车 ...
- 【技术与商业案例解读笔记】095:Google大数据三驾马车笔记
1.谷歌三驾马车地位 [关键词]开启时代,指明方向 聊起大数据,我们通常言必称谷歌,谷歌有“三驾马车”:谷歌文件系统(GFS).MapReduce和BigTable.谷歌的“三驾马车”开启了大数据时 ...
- Childlife旗下三驾马车
Childlife旗下,尤其以 “提高免疫力”为口号的“三驾马车”:第一防御液.VC.紫雏菊,是相当热门的海淘产品.据说这是一系列“成分天然.有效治愈感冒提升免疫力.由美国著名儿科医生研发”的药物.
- Ubuntu 安装 k8s 三驾马车 kubelet kubeadm kubectl
Ubuntu 版本是 18.04 ,用的是阿里云服务器,记录一下自己实际安装过程的操作步骤. 安装 docker 安装所需的软件 apt-get update apt-get install -y a ...
- Qt 学习笔记 - 第三章 - Qt的三驾马车之一 - 串口编程 + 程序打包成Windows软件
Qt 学习笔记全系列传送门: Qt 学习笔记 - 第一章 - 快速开始.信号与槽 Qt 学习笔记 - 第二章 - 添加图片.布局.界面切换 [本章]Qt 学习笔记 - 第三章 - Qt的三驾马车之一 ...
- 更强、更稳、更高效:解读 etcd 技术升级的三驾马车
点击下载<不一样的 双11 技术:阿里巴巴经济体云原生实践> 本文节选自<不一样的 双11 技术:阿里巴巴经济体云原生实践>一书,点击上方图片即可下载! 作者 | 陈星宇(宇慕 ...
- java大数据最全课程学习笔记(6)--MapReduce精通(二)--MapReduce框架原理
目前CSDN,博客园,简书同步发表中,更多精彩欢迎访问我的gitee pages 目录 MapReduce精通(二) MapReduce框架原理 MapReduce工作流程 InputFormat数据 ...
随机推荐
- P4913【黄】
这题好像可以用线段树什么的高级做法来做,但我感觉我这个简单做法不管是时间还是空间都和那些复杂的做法差不了太多.重点是很优雅,思路非常简单,而且代码很短,用OOP思想写成的代码可读性极高,不用注释估计都 ...
- wireshark 报文颜色
在使用wireshark抓包分析的过程中,默认会对不同的包进行着色,截图如下: 对不同的颜色有了解,可快速的过滤包或分析请求. 菜单栏选择视图-->着色规则,即可看到不同颜色代表的含义: 大致可 ...
- centos7 firewalld配置常用规则
限制ssh服务只允许某个ip ### 允许某个ip(调整前,务必添加定时任务`29 17 * * * systemctl stop firewalld`) firewall-cmd --permane ...
- APB Slave Design
APB Slave Design module apb_slave #( REG1_ADDR = 8'h00, REG2_ADDR = 8'h04, REG3_ADDR = 8'h08 ) ( // ...
- file-loader返回object Module 路径的问题
新版本的 file-loader生成使用ES模块语法的JS模块,所以它加载的文件,不再返回路径,而是返回一个对象,通过对象.default属性,可以取得路径 所以第一种方法,可以修改路径 <i ...
- Matplotlib.pyplot.scatter 散点图绘制
Matplotlib.pyplot.plot 绘图 matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, no ...
- [转帖]Oracle的审计
AUDIT_TRAIL 初始化参数AUDIT_TRAIL用于控制数据库审计,默认值为none. 参数类型: String 默认值: none 允许动态修改: 否 基本参数: 否 语法: AUDIT_T ...
- Oracle Preinstall 调优参数的学习
Oracle Preinstall 调优参数的学习 背景 学习是一个痛苦并快乐的过程. 之前自己手工安装过很多套Oracle数据库,也总结过很多 但是很多都是比较皮毛的. 最近遇到了一些问题. 才发现 ...
- [转帖]【mmap】深度分析mmap:是什么 为什么 怎么用 性能总结
https://blog.csdn.net/bandaoyu/article/details/106750990 目录 有什么用? 1.文件映射 2.分配内存(匿名文件映射) mmap基础概念 mma ...
- [转帖]Centos7 nginx访问日志文件割接
一.yum安装nginx 二.各文件路径( /etc/nginx/nginx.conf) 1.访问日志路径:access_log /var/log/nginx/access.log main; 2.p ...