Hadoop: iteration is expensive, because every iteration launches a complete MapReduce job.

Spark: the primary goal is to avoid excessive network and disk I/O overhead during computation.

Resilient Distributed Datasets

http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf

Resilient Distributed Datasets
Presented by Henggang Cui
15799b Talk
Why not MapReduce
• Provide fault-tolerance, but:
• Hard to reuse intermediate results across multiple computations
– stable storage for sharing data across jobs
• Hard to support interactive ad-hoc queries
Why not Other In-Memory Storage
• Examples: Piccolo
– Apply fine-grained updates to shared states
• Efficient, but:
• Hard to provide fault-tolerance
– need replication or checkpointing
Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– read-only, partitioned collection of records
– can only be built through coarse-grained deterministic transformations of either:
• data in stable storage
• other RDDs
• Express computation by
– defining RDDs
Fault Recovery
• Efficient fault recovery using lineage
– log one operation to apply to many elements (lineage)
– recompute lost partitions on failure
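The lineage idea above can be sketched in plain Scala without Spark. This is a toy model under stated assumptions, not Spark's implementation: `ToyRdd`, `Source`, `Filtered`, and `Cache` are hypothetical names, partitions are in-memory vectors, and a node failure is simulated by dropping a cached partition.

```scala
// Toy model of lineage-based recovery (hypothetical classes, not Spark's).
object LineageSketch {
  // Every dataset knows how to (re)compute any one of its partitions.
  sealed trait ToyRdd[T] {
    def numPartitions: Int
    def compute(partition: Int): Vector[T]
  }

  // Stands in for data in stable storage: it can always be read again.
  final case class Source[T](data: Vector[Vector[T]]) extends ToyRdd[T] {
    def numPartitions: Int = data.length
    def compute(partition: Int): Vector[T] = data(partition)
  }

  // One lineage entry: a single coarse-grained operation plus its parent.
  final case class Filtered[T](parent: ToyRdd[T], p: T => Boolean) extends ToyRdd[T] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[T] = parent.compute(partition).filter(p)
  }

  // In-memory copies of computed partitions; entries can be lost on failure.
  final class Cache[T](rdd: ToyRdd[T]) {
    private val store = scala.collection.mutable.Map.empty[Int, Vector[T]]
    var recomputations = 0
    // Serve from memory if present, otherwise recompute through the lineage.
    def get(partition: Int): Vector[T] =
      store.getOrElseUpdate(partition, { recomputations += 1; rdd.compute(partition) })
    // Simulate losing one partition; the others are untouched.
    def lose(partition: Int): Unit = store.remove(partition)
  }

  val lines = Source(Vector(
    Vector("ERROR HDFS down", "INFO ok"),
    Vector("ERROR disk full", "ERROR HDFS timeout")))
  val errors = Filtered[String](lines, _.startsWith("ERROR"))
  val hdfsErrors = Filtered[String](errors, _.contains("HDFS"))
}
```

Only the lost partition is recomputed, and no data is replicated: the recorded operations are enough to rebuild it from the source.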
Example
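// Base RDD backed by a file in HDFS (path elided in the slides)
val lines = spark.textFile("hdfs://...")
// Keep only the lines that report an error
val errors = lines.filter(_.startsWith("ERROR"))
// Narrow further to the errors that mention HDFS
val hdfs_errors = errors.filter(_.contains("HDFS"))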
Advantages of the RDD Model
• Efficient fault recovery
– fine-grained and low-overhead using lineage
• Immutability makes stragglers easy to mitigate
– slow tasks can be re-run as backup tasks
• Graceful degradation when there is not enough RAM
Spark
• Implementation of the RDD abstraction
– Scala interface
• Two components
– Driver
– Workers
Spark Runtime
• Driver
– defines and invokes actions on RDDs
– tracks the RDDs’ lineage
• Workers
– store RDD partitions
– perform RDD transformations
Supported RDD Operations
• Transformations
– map(f: T => U)
– filter(f: T => Bool)
– join()
– ... (and lots of others)
• Actions
– count()
– save()
– ... (and lots of others)
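The split between transformations and actions can be illustrated with a small plain-Scala model. It is a sketch only: `ToyRdd`, `fromSeq`, and the `elementsTouched` counter are hypothetical names introduced here, and only `map`, `filter`, `count`, and an extra `collect` are modelled.

```scala
// Toy model: transformations only record work, actions actually run it.
object LazySketch {
  var elementsTouched = 0 // how many source records have been read so far

  final class ToyRdd[T](compute: () => Iterator[T]) {
    // Transformations: return a new dataset and do no work yet.
    def map[U](f: T => U): ToyRdd[U] = new ToyRdd(() => compute().map(f))
    def filter(p: T => Boolean): ToyRdd[T] = new ToyRdd(() => compute().filter(p))
    // Actions: force evaluation and return a value to the caller.
    def count(): Long = compute().size.toLong
    def collect(): Vector[T] = compute().toVector
  }

  // Wrap a local sequence and count every record that is really read.
  def fromSeq[T](data: Seq[T]): ToyRdd[T] =
    new ToyRdd(() => data.iterator.map { x => elementsTouched += 1; x })
}
```

In this model, chaining `map` and `filter` reads nothing from the source; the first `count()` or `collect()` is what pulls the records through the whole chain.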
Representing RDDs
• A graph-based representation for RDDs
• Pieces of information for each RDD
– a set of partitions
– a set of dependencies on parent RDDs
– a function for computing it from its parents
– metadata about its partitioning scheme and data placement
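The four pieces of information listed above map naturally onto a small interface. The sketch below is an illustration in plain Scala; `RddNode`, `FileRdd`, `MappedRdd`, and the string-valued partitioner and host names are assumptions made for the example, not Spark's actual types.

```scala
// Toy interface carrying the four pieces of information kept for each RDD.
object RepresentationSketch {
  sealed trait Dependency { def parent: RddNode[Any] }
  final case class Narrow(parent: RddNode[Any]) extends Dependency
  final case class Wide(parent: RddNode[Any]) extends Dependency

  trait RddNode[+T] {
    def partitions: Seq[Int]                    // a set of partitions
    def dependencies: Seq[Dependency]           // dependencies on parent RDDs
    def compute(partition: Int): Iterator[T]    // how to derive it from its parents
    def partitioner: Option[String] = None      // metadata: partitioning scheme
    def preferredLocations(partition: Int): Seq[String] = Nil // metadata: placement
  }

  // A source dataset: one partition per (host, records) block, no parents.
  final class FileRdd(blocks: Vector[(String, Vector[String])]) extends RddNode[String] {
    def partitions: Seq[Int] = blocks.indices
    def dependencies: Seq[Dependency] = Nil
    def compute(partition: Int): Iterator[String] = blocks(partition)._2.iterator
    override def preferredLocations(partition: Int): Seq[String] =
      Seq(blocks(partition)._1)
  }

  // The result of map: same partitions as the parent, one narrow dependency.
  final class MappedRdd[T, U](parent: RddNode[T], f: T => U) extends RddNode[U] {
    def partitions: Seq[Int] = parent.partitions
    def dependencies: Seq[Dependency] = Seq(Narrow(parent))
    def compute(partition: Int): Iterator[U] = parent.compute(partition).map(f)
    override def preferredLocations(partition: Int): Seq[String] =
      parent.preferredLocations(partition)
  }
}
```

Linking each node to its parents through `dependencies` is what turns a chain of transformations into the graph that the scheduler and the recovery logic walk.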
RDD Dependencies
• Narrow dependencies
– each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
– multiple child partitions may depend on a single parent partition
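The definition can be checked mechanically. In the sketch below, a dependency is described by a hypothetical map from each child partition to the parent partitions it reads; `isNarrow`, `oneToOne`, and `shuffle` are illustrative helpers, not part of any Spark API.

```scala
// Classify a dependency from the "which child reads which parent" relation.
object DependencySketch {
  // reads(c) = the parent partitions that child partition c needs.
  // Narrow means no parent partition is used by more than one child partition.
  def isNarrow(reads: Map[Int, Set[Int]]): Boolean = {
    val usesPerParent = reads.values.toSeq.flatten.groupBy(identity)
    usesPerParent.values.forall(_.size <= 1)
  }

  // map/filter style: child partition i reads only parent partition i.
  def oneToOne(n: Int): Map[Int, Set[Int]] =
    (0 until n).map(i => i -> Set(i)).toMap

  // Shuffle style: every child partition reads every parent partition.
  def shuffle(parents: Int, children: Int): Map[Int, Set[Int]] =
    (0 until children).map(c => c -> (0 until parents).toSet).toMap
}
```

Note that a child partition reading several parent partitions is still narrow, as long as each of those parents feeds only that one child.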
RDD Dependencies
• Narrow dependencies
– allow for pipelined execution on one cluster node
– easy fault recovery
• Wide dependencies
– require data from all parent partitions to be available and to be shuffled across the nodes
– a single failed node might cause a complete re-execution
Job Scheduling
• To execute an action on an RDD
– the scheduler derives the stages from the RDD’s lineage graph
– each stage contains as many pipelined transformations with narrow dependencies as possible
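A simplified version of this stage-cutting rule fits in a few lines. The sketch assumes the lineage is a linear chain of operations (a real lineage is a graph), and `Op` and `stages` are hypothetical names.

```scala
// Cut a linear lineage into stages at every wide dependency.
object StageSketch {
  // wide = true means the operation needs a shuffle of its parent's output.
  final case class Op(name: String, wide: Boolean)

  def stages(lineage: List[Op]): List[List[String]] = {
    // Build stages newest-first, then restore source-to-result order.
    val reversed = lineage.foldLeft(List.empty[List[String]]) { (acc, op) =>
      acc match {
        // Narrow dependency: pipeline the operation into the current stage.
        case current :: done if !op.wide => (op.name :: current) :: done
        // Wide dependency (or the very first operation): start a new stage.
        case _ => List(op.name) :: acc
      }
    }
    reversed.reverse.map(_.reverse)
  }
}
```

For a chain such as `textFile`, `map`, `filter`, `reduceByKey`, `mapValues`, where only `reduceByKey` is wide, this yields two stages, with the boundary placed at the shuffle.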
Memory Management
• Three options for persistent RDDs
– in-memory storage as deserialized Java objects
– in-memory storage as serialized data
– on-disk storage
• LRU eviction policy at the level of RDDs
– when there’s not enough memory, evict a partition from the least recently accessed RDD
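RDD-level LRU eviction can be sketched as follows. This is a toy model with assumptions: `PartitionCache` is a hypothetical class, capacity is counted in partitions rather than bytes, and recency is tracked with a logical clock.

```scala
import scala.collection.mutable

// Toy partition cache with LRU eviction at the level of whole RDDs.
object LruSketch {
  final class PartitionCache(capacity: Int) {
    require(capacity > 0, "capacity must be positive")
    private val stored = mutable.LinkedHashSet.empty[(Int, Int)] // (rddId, partition)
    private val lastAccess = mutable.Map.empty[Int, Long]        // rddId -> logical time
    private var clock = 0L

    // Record that an RDD was just used.
    def touch(rddId: Int): Unit = { clock += 1; lastAccess(rddId) = clock }

    // Cache a partition, evicting one first if the cache is full.
    def put(rddId: Int, partition: Int): Unit = {
      touch(rddId)
      val key = (rddId, partition)
      if (!stored.contains(key)) {
        if (stored.size >= capacity) evictOne()
        stored += key
      }
    }

    // Drop one partition belonging to the least recently accessed RDD.
    private def evictOne(): Unit = {
      val victimRdd = stored.map(_._1).minBy(lastAccess)
      stored.find(_._1 == victimRdd).foreach(stored -= _)
    }

    def contents: Set[(Int, Int)] = stored.toSet
  }
}
```

Recency is tracked per RDD rather than per partition, so touching any partition of an RDD protects all of that RDD's cached partitions from being the next victim.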
Checkpointing
• Checkpoint RDDs to prevent long lineage chains during fault recovery
• Simpler to checkpoint than shared memory
– Read-only nature of RDDs
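The effect of checkpointing on lineage length can be shown with a small model of an iterative job. It is a sketch under assumptions: `Node`, `Stable`, `Mapped`, `checkpoint`, and `iterate` are hypothetical names, and "stable storage" is just an in-memory copy.

```scala
// Toy model: checkpointing replaces a long lineage chain with stable data.
object CheckpointSketch {
  sealed trait Node {
    def depth: Int               // lineage links to replay after a failure
    def compute(): Vector[Int]
  }
  final case class Stable(data: Vector[Int]) extends Node {
    def depth: Int = 0
    def compute(): Vector[Int] = data
  }
  final case class Mapped(parent: Node, f: Int => Int) extends Node {
    def depth: Int = parent.depth + 1
    def compute(): Vector[Int] = parent.compute().map(f)
  }

  // Materialize the data once; since it is read-only, a plain copy is enough.
  def checkpoint(node: Node): Node = Stable(node.compute())

  // An iterative job: each iteration adds one link to the lineage chain.
  def iterate(start: Node, steps: Int, checkpointEvery: Int): Node =
    (1 to steps).foldLeft(start) { (rdd, i) =>
      val next = Mapped(rdd, _ + 1)
      if (i % checkpointEvery == 0) checkpoint(next) else next
    }
}
```

With ten iterations and a checkpoint every four, recovery replays at most two operations instead of ten, while the computed values are unchanged.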
Discussions
Checkpointing or Versioning?
• Frequent checkpointing, or keep all versions of ranks?
