Spark vs. Hadoop: Resilient Distributed Datasets
Hadoop is expensive for iterative workloads: each iteration launches a complete MapReduce job.
Spark's primary goal is to avoid excessive network and disk I/O during computation.
http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf
Resilient Distributed Datasets
Presented by Henggang Cui
15799b Talk
Why not MapReduce
• Provides fault tolerance, but:
• Hard to reuse intermediate results across multiple computations
– requires stable storage to share data across jobs
• Hard to support interactive ad-hoc queries
Why not Other In-Memory Storage
• Example: Piccolo
– applies fine-grained updates to shared state
• Efficient, but:
• Hard to provide fault tolerance
– needs replication or checkpointing
Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– read-only, partitioned collection of records
– can only be built through coarse-grained deterministic transformations
• from data in stable storage
• from transformations on other RDDs
• Computation is expressed by defining RDDs
Fault Recovery
• Efficient fault recovery using lineage
– log one operation to apply to many elements (the lineage)
– recompute lost partitions on failure
Example
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
hdfs_errors = errors.filter(_.contains("HDFS"))
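These three lines only build lineage; nothing executes until an action is invoked. A minimal continuation of the slide's example (a sketch; `spark` is the SparkContext handle used above, and `persist`, `count`, and `take` are standard RDD calls):

errors.persist()                  // keep the filtered lines in memory for reuse
val total = hdfs_errors.count()   // action: triggers the actual computation
val sample = errors.take(10)      // action: fetch a few error lines to the driver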
Advantages of the RDD Model
• Efficient fault recovery
– fine-grained and low-overhead using lineage
• Immutable nature mitigates stragglers
– backup copies of slow tasks can run safely
• Graceful degradation when RAM is not enough
Spark
• Implementation of the RDD abstraction
– Scala interface
• Two components
– Driver
– Workers
• Driver
– defines and invokes actions on RDDs
– tracks the RDDs’ lineage
• Workers
– store RDD partitions
– perform RDD transformations
(Figure: Spark runtime, showing the driver and the workers.)
Supported RDD Operations
• Transformations
– map (f: T->U)
– filter (f: T->Bool)
– join()
– ... (and lots of others)
• Actions
– count()
– save()
– ... (and lots of others)
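A short sketch of how these compose (hypothetical data; `spark` is again the SparkContext, and the slide's `save()` corresponds to RDD methods such as `saveAsTextFile` in the actual API):

val nums   = spark.parallelize(Seq(1, 2, 3, 4))            // RDD from a local collection
val pairs  = nums.filter(_ % 2 == 0).map(n => (n, n * n))  // transformations: lazy, build lineage
val names  = spark.parallelize(Seq((2, "two"), (4, "four")))
val joined = pairs.join(names)                             // pair-RDD transformation
println(joined.count())                                    // action: forces evaluation (prints 2)
joined.saveAsTextFile("hdfs://.../out")                    // action: write results to storage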
Representing RDDs
• A graph-based representation for RDDs
• Pieces of information for each RDD
– a set of partitions
– a set of dependencies on parent RDDs
– a function for computing it from its parents
– metadata about its partitioning scheme and data placement
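The paper captures these pieces with a small common interface that every RDD implements. Roughly, in Scala (a sketch; the names loosely follow the paper's internal interface, and the placeholder traits are only for illustration):

trait Partition; trait Dependency; trait Partitioner  // placeholders for illustration

trait RDD[T] {
  def partitions: Seq[Partition]                      // the set of partitions
  def dependencies: Seq[Dependency]                   // dependencies on parent RDDs
  def compute(p: Partition,                           // derive a partition's records
              parents: Seq[Iterator[_]]): Iterator[T] //   from its parents' records
  def partitioner: Option[Partitioner]                // partitioning-scheme metadata, if any
  def preferredLocations(p: Partition): Seq[String]   // data-placement hints
}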
RDD Dependencies
• Narrow dependencies
– each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
– multiple child partitions may depend on the same parent partition
(Figure: examples of narrow and wide dependencies.)
RDD Dependencies
• Narrow dependencies
– allow for pipelined execution on one cluster node
– easy fault recovery
• Wide dependencies
– require data from all parent partitions to be available and to be shuffled across the nodes
– a single failed node might cause a complete re-execution
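In code, the dependency kind follows from the operation and the partitioning (a sketch over a hypothetical pair RDD `pairs`):

val mapped  = pairs.mapValues(_ + 1)  // narrow: each child partition reads one parent partition
val grouped = pairs.groupByKey()      // wide: a child partition may read every parent partition
val joined  = grouped.join(grouped)   // narrow here: both inputs share the same partitioner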
Job Scheduling
• To execute an action on an RDD
– the scheduler derives stages from the RDD's lineage graph
– each stage contains as many pipelined transformations with narrow dependencies as possible (sketched below)
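For example, in the word-count sketch below, flatMap and map pipeline into one stage, while reduceByKey introduces a shuffle and hence a stage boundary:

val counts = spark.textFile("hdfs://.../docs")
  .flatMap(_.split(" "))         // narrow: pipelined within the same stage
  .map(word => (word, 1))        // narrow: still the same stage
  .reduceByKey(_ + _)            // wide: the shuffle starts a new stage
counts.collect()                 // action: the scheduler builds and runs both stages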
(Figure: an example lineage graph partitioned into stages.)
Memory Management
• Three options for persistent RDDs
– in-memory storage as deserialized Java objects
– in-memory storage as serialized data
– on-disk storage
• LRU eviction policy at the level of RDDs
– when there is not enough memory, evict a partition from the least recently accessed RDD
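In the Spark API, these options correspond to storage levels (a sketch; pick one level per RDD, since Spark rejects changing it once set):

import org.apache.spark.storage.StorageLevel

errors.persist(StorageLevel.MEMORY_ONLY)        // deserialized Java objects: fastest access
// errors.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: slower, more compact
// errors.persist(StorageLevel.DISK_ONLY)       // on disk: for RDDs too large for RAM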
Checkpointing
• Checkpoint RDDs to prevent long lineage chains during fault recovery
• Simpler to checkpoint than shared memory
– Read-only nature of RDDs
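A minimal sketch of checkpointing with the Spark API (`spark` is the SparkContext; the small pair RDD stands in for a long-lineage result such as PageRank's ranks):

spark.setCheckpointDir("hdfs://.../checkpoints")           // stable storage for checkpoint files
val ranks = spark.parallelize(Seq(("a", 1.0), ("b", 0.5))) // stand-in for a long-lineage RDD
ranks.checkpoint()                                         // truncate lineage: save to stable storage
ranks.count()                                              // checkpoint materializes on the first action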
Discussions
Checkpointing or Versioning?
• Frequent checkpointing, or keep all versions of the ranks?