spark hadoop 对比 Resilient Distributed Datasets

hadoop 迭代消耗大每次迭代启动一个完整的MapReduce作业

spark 首要目标就是避免运算时过多的网络和磁盘IO开销

Resilient Distributed Datasets

http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf

Resilient Distributed Datasets
Presented by Henggang Cui
15799b Talk
1
Why not MapReduce
• Provide fault-tolerance, but:
• Hard to reuse intermediate results across
multiple computations
– stable storage for sharing data across jobs
• Hard to support interactive ad-hoc queries
2
Why not Other In-Memory Storage
• Examples: Piccolo
– Apply fine-grained updates to shared states
• Efficient, but:
• Hard to provide fault-tolerance
– need replication or checkpointing
3
Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– read-only, partitioned collection of records
– can only be built through coarse‐grained
deterministic transformations
• data in stable storage
• transformations from other RDDs.
• Express computation by
– defining RDDs
4
Fault Recovery
• Efficient fault recovery using lineage
– log one operation to apply to many elements
(lineage)
– recompute lost partitions on failure
5
Example
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
hdfs_errors = errors.filter(_.contains(“HDFS"))
6
Advantages of the RDD Model
• Efficient fault recovery
– fine-grained and low-overhead using lineage
• Immutable nature can mitigate stragglers
– backup tasks to mitigate stragglers
• Graceful degradation when RAM is not
enough
7
Spark
• Implementation of the RDD abstraction
– Scala interface
• Two components
– Driver
– Workers
8
• Driver
– defines and invokes actions on RDDs
– tracks the RDDs’ lineage
• Workers
– store RDD partitions
– perform RDD
transformations
Spark Runtime
9
Supported RDD Operations
• Transformations
– map (f: T->U)
– filter (f: T->Bool)
– join()
– ... (and lots of others)
• Actions
– count()
– save()
– ... (and lots of others)
10
Representing RDDs
• A graph-based representation for RDDs
• Pieces of information for each RDD
– a set of partitions
– a set of dependencies on parent RDDs
– a function for computing it from its parents
– metadata about its partitioning scheme and data
placement
11
RDD Dependencies
• Narrow dependencies
– each partition of the parent RDD is used by at
most one partition of the child RDD
• Wide dependencies
– multiple child partitions may depend on it
12
RDD Dependencies
13
RDD Dependencies
• Narrow dependencies
– allow for pipelined execution on one cluster node
– easy fault recovery
• Wide dependencies
– require data from all parent partitions to be
available and to be shuffled across the nodes
– a single failed node might cause a complete reexecution.
14
Job Scheduling
• To execute an action on an RDD
– scheduler decide the stages from the RDD’s
lineage graph
– each stage contains as many pipelined
transformations with narrow dependencies as
possible
15
Job Scheduling
16
Memory Management
• Three options for persistent RDDs
– in-memory storage as deserialized Java objects
– in-memory storage as serialized data
– on-disk storage
• LRU eviction policy at the level of RDDs
– when there’s not enough memory, evict a
partition from the least recently accessed RDD
17
Checkpointing
• Checkpoint RDDs to prevent long lineage
chains during fault recovery
• Simpler to checkpoint than shared memory
– Read-only nature of RDDs
18
Discussions
19
Checkpointing or Versioning?
20
• Frequent checkpointing, or
Keep all versions of ranks?

spark hadoop 对比 Resilient Distributed Datasets的更多相关文章

Apache Spark 2.2.0 中文文档 - Spark RDD（Resilient Distributed Datasets）论文 | ApacheCN
Spark RDD(Resilient Distributed Datasets)论文概要 1: 介绍 2: Resilient Distributed Datasets(RDDs) 2.1 RDD ...
Apache Spark RDD（Resilient Distributed Datasets）论文
Spark RDD(Resilient Distributed Datasets)论文概要 1: 介绍 2: Resilient Distributed Datasets(RDDs) 2.1 RDD ...
Apache Spark 2.2.0 中文文档 - Spark RDD（Resilient Distributed Datasets）
Spark RDD(Resilient Distributed Datasets)论文概要 1: 介绍 2: Resilient Distributed Datasets(RDDs) 2.1 RDD ...
Spark的核心RDD（Resilient Distributed Datasets弹性分布式数据集）
Spark的核心RDD (Resilient Distributed Datasets弹性分布式数据集) 原文链接:http://www.cnblogs.com/yjd_hycf_space/p/7 ...
spark 笔记 2： Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf ucb关于spark的论文,对spark中核心组件RDD最原始.本质的理解, ...
RDD内存迭代原理(Resilient Distributed Datasets)---弹性分布式数据集
Spark的核心RDD Resilient Distributed Datasets(弹性分布式数据集) Spark运行原理与RDD理论 Spark与MapReduce对比,MapReduce的计 ...
Scala当中什么是RDD（Resilient Distributed Datasets）弹性分布式数据集
RDD(Resilient Distributed Datasets)弹性分布式数据集.你不好理解的话,可以把RDD就可以看成是一个简单的"动态数组"(比如ArrayList),对 ...
【Spark】RDD(Resilient Distributed Dataset)究竟是什么？
目录基本概念官方文档概述含义 RDD出现的原因五大属性以单词统计为例,一张图熟悉RDD当中的五大属性解构图 RDD弹性 RDD特点分区只读依赖缓存 checkpoint 基本概念 ...
大数据 --> Spark与Hadoop对比
Spark与Hadoop对比什么是Spark Spark是UC Berkeley AMP lab所开源的类Hadoop MapReduce的通用的并行计算框架,Spark基于map reduce算法 ...

随机推荐

第3节 mapreduce高级：8、9、自定义分区实现分组求取top1
自定义GroupingComparator求取topN GroupingComparator是mapreduce当中reduce端的一个功能组件,主要的作用是决定哪些数据作为一组,调用一次reduce ...
第2节 mapreduce深入学习：12、reducetask运行机制（多看几遍）
ReduceTask的运行的整个过程背下来1.启动线程到mapTask那里去拷贝数据,拉取属于每一个reducetask自己内部的数据2.数据的合并,拉取过来的数据进行合并,合并的过程,有可能在内存 ...
HashMap、ConcurrentHashMap以及HashTable(面试向)
---->HashMap 在java1.7中,hashmap的数据结构是基于数组+链表的结构,即我们比较熟悉的Entry数组,其包含的(key-value)键值对的形式.在多线程环境下,Hash ...
HTML5地理定位-Geolocation API
HTML5提供了一组Geolocation API,来自navigator定位对象的子对象,获取用户的地理位置信息Geolocation API使用方法:1.判断是否支持 navigator.geol ...
第二十二节：scrapy爬虫识别验证码（一）类库安装
一.安装tesserocr 1.首先下载tesseract:https://digi.bib.uni-mannheim.de/tesseract/ ,我下载的是tesseract-ocr-setup- ...
MT4系统自带指标代码
MT4系统自带指标代码 ~ Accelerator Oscillator 震荡加速指标: double iAC() ~ Accumulation/Distribut ...
LeetCode（47）Permutations II
题目 Given a collection of numbers that might contain duplicates, return all possible unique permutati ...
Leetcode 150.逆波兰表达式求值
逆波兰表达式求值根据逆波兰表示法,求表达式的值. 有效的运算符包括 +, -, *, / .每个运算对象可以是整数,也可以是另一个逆波兰表达式. 说明: 整数除法只保留整数部分. 给定逆波兰表达式总 ...
Spring Boot - how to configure port
https://stackoverflow.com/questions/21083170/spring-boot-how-to-configure-port
Ubuntu 16.04 GNOME在桌面左侧添加启动器（Launcher）
安装Dash to Dock: git clone https://github.com/micheleg/dash-to-dock.git cd dash-to-dock/ make make in ...

spark hadoop 对比 Resilient Distributed Datasets

spark hadoop 对比 Resilient Distributed Datasets的更多相关文章

随机推荐

热门专题