Spark vs. Hadoop: Resilient Distributed Datasets
Hadoop makes iteration expensive: each iteration launches a complete MapReduce job.
Spark's primary goal is to avoid excessive network and disk I/O overhead during computation.
Resilient Distributed Datasets
http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf
Presented by Henggang Cui (15-799b talk)
Why not MapReduce
• Provides fault tolerance, but:
• Hard to reuse intermediate results across multiple computations
– the only way to share data across jobs is stable storage (e.g., HDFS)
• Hard to support interactive ad-hoc queries
Why not Other In-Memory Storage
• Example: Piccolo
– applies fine-grained updates to shared state
• Efficient, but:
• Hard to provide fault tolerance
– needs replication or checkpointing
Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– a read-only, partitioned collection of records
– can only be built through coarse-grained deterministic transformations, applied to either:
• data in stable storage, or
• other RDDs
• Express computation by defining RDDs
Fault Recovery
• Efficient fault recovery using lineage
– log one operation to apply to many elements (the lineage)
– recompute lost partitions on failure
Example
// Define an RDD backed by a log file in HDFS (path elided in the original)
val lines = spark.textFile("hdfs://...")
// Transformations are lazy: each filter only records lineage
val errors = lines.filter(_.startsWith("ERROR"))
val hdfs_errors = errors.filter(_.contains("HDFS"))
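Nothing runs until an action is invoked. A short, hedged continuation of the example (persist() and count() are standard Spark RDD methods):

// Keep the filtered RDD in memory for reuse across later queries
errors.persist()
// count() is an action: it forces evaluation of the lineage above
val numErrors = errors.count()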
Advantages of the RDD Model
• Efficient fault recovery
– fine-grained and low-overhead, using lineage
• Immutable nature can mitigate stragglers
– backup copies of slow tasks can be run safely
• Graceful degradation when RAM is not enough
Spark
• Implementation of the RDD abstraction
– Scala interface
• Two components
– Driver
– Workers
• Driver
– defines and invokes actions on RDDs
– tracks the RDDs’ lineage
• Workers
– store RDD partitions
– perform RDD transformations
[Figure: Spark Runtime]
Supported RDD Operations
• Transformations (see the example after this list)
– map(f: T => U)
– filter(f: T => Bool)
– join()
– ... (and lots of others)
• Actions
– count()
– save()
– ... (and lots of others)
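A minimal sketch of the two kinds of operations, reusing the spark handle from the earlier example (treated here as a SparkContext; all the calls below are standard RDD operations):

val nums = spark.parallelize(1 to 1000)   // base RDD from a local collection
val squares = nums.map(x => x * x)        // transformation: lazy, just records lineage
val evens = squares.filter(_ % 2 == 0)    // transformation: still nothing computed
val n = evens.count()                     // action: triggers the actual computation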
Representing RDDs
• A graph-based representation for RDDs
• Pieces of information for each RDD (sketched below)
– a set of partitions
– a set of dependencies on parent RDDs
– a function for computing it from its parents
– metadata about its partitioning scheme and data placement
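The paper summarizes these pieces as a common interface that every RDD implements. Below is a simplified Scala sketch of that interface; the names follow the paper's description, not Spark's exact internal API, and Partition, Dependency, and Partitioner are placeholder types here:

// Simplified sketch of the common per-RDD interface from the paper
trait Partition
trait Dependency
trait Partitioner

trait RDD[T] {
  def partitions: Seq[Partition]                     // the set of partitions
  def dependencies: Seq[Dependency]                  // dependencies on parent RDDs
  def compute(p: Partition): Iterator[T]             // compute a partition from its parents
  def partitioner: Option[Partitioner]               // metadata: the partitioning scheme
  def preferredLocations(p: Partition): Seq[String]  // metadata: data placement hints
}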
RDD Dependencies
• Narrow dependencies
– each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
– multiple child partitions may depend on one parent partition
RDD Dependencies
[Figure: examples of narrow dependencies (e.g., map, filter, union) vs. wide dependencies (e.g., groupByKey)]
RDD Dependencies
• Narrow dependencies
– allow pipelined execution on one cluster node
– easy fault recovery (see the sketch after this list)
• Wide dependencies
– require data from all parent partitions to be available and shuffled across the nodes
– a single failed node might cause a complete re-execution
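For instance, a hedged sketch using standard RDD operations (mapValues keeps each child partition dependent on a single parent partition, while groupByKey must shuffle; spark is again the assumed SparkContext handle):

val pairs = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val doubled = pairs.mapValues(_ * 2)  // narrow: each child partition reads one parent partition
val grouped = doubled.groupByKey()    // wide: child partitions read from many parent partitions (shuffle)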
Job Scheduling
• To execute an action on an RDD
– the scheduler derives stages from the RDD's lineage graph
– each stage contains as many pipelined transformations with narrow dependencies as possible
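In Spark the resulting stage structure can be inspected with the RDD's toDebugString method (a real RDD method; grouped refers to the hypothetical RDD from the dependency sketch above):

// Prints the lineage graph; shuffle boundaries separate the stages
println(grouped.toDebugString)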
Job Scheduling
[Figure: a lineage graph divided into stages at wide-dependency (shuffle) boundaries]
Memory Management
• Three options for persistent RDDs (sketched below)
– in-memory storage as deserialized Java objects
– in-memory storage as serialized data
– on-disk storage
• LRU eviction policy at the level of RDDs
– when there's not enough memory, evict a partition from the least recently accessed RDD
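In Spark's API these options correspond to storage levels passed to persist(); a minimal sketch using the real StorageLevel constants, reusing the lines RDD from the earlier example:

import org.apache.spark.storage.StorageLevel

// An RDD gets exactly one storage level; the commented lines show the alternatives
lines.persist(StorageLevel.MEMORY_ONLY)       // deserialized Java objects in memory
// lines.persist(StorageLevel.MEMORY_ONLY_SER) // serialized data in memory (more compact)
// lines.persist(StorageLevel.DISK_ONLY)       // on-disk storage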
Checkpointing
• Checkpoint RDDs to prevent long lineage chains during fault recovery
• Simpler to checkpoint than general shared memory
– the read-only nature of RDDs lets checkpoints be written in the background
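In Spark this is requested explicitly through two real APIs, SparkContext.setCheckpointDir and RDD.checkpoint; a minimal sketch, reusing the errors RDD and the elided HDFS path from the earlier example:

spark.setCheckpointDir("hdfs://...")  // directory for checkpoint files (path elided, as above)
errors.checkpoint()                   // mark the RDD; its lineage is truncated once materialized
errors.count()                        // the next action writes the checkpoint and computes the count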
Discussions
Checkpointing or Versioning?
• Frequent checkpointing, or keep all versions of the ranks?