Abstract

Classical eviction strategies are not aware of recovery cost, which can degrade system performance. A cost-aware eviction strategy can reduce the total recovery cost.
LCS (Least Cost Strategy) obtains the dependency information between cached data by analyzing the application, and calculates the recovery cost at run time. By predicting how many times cached data will be reused and using that count to weight the recovery cost, LCS always evicts the data that leads to the minimum recovery cost in the future.

Introduction

Current eviction strategies:
- FIFO: focuses on creation time.
- LRU: focuses on access history to achieve a better hit ratio.
Many eviction algorithms take the access history and cost of cache items into consideration. In Spark, however, the execution logic of the upcoming phase is known, so access history does not help the eviction strategy.
LCS has three steps:
1. Get the dependencies of RDDs by analyzing the application, and predict how many times cached partitions will be reused.
2. Collect information during partition creation, and predict the recovery cost.
3. Maintain the eviction order using these two pieces of information, and evict the partition that incurs the least cost when memory is insufficient.
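Step 1 can be illustrated with a small sketch. It assumes the application's stages are already known as a mapping from stage id to the cached RDDs each stage reads, and it counts reuse per RDD rather than per partition for brevity. The function name `predict_reuse` and the input layout are hypothetical and are not taken from the paper or from Spark.

```python
from collections import Counter

def predict_reuse(stages, cached, current_stage):
    """Count how many stages after current_stage read each cached RDD.

    stages: dict of stage id -> set of RDD names read by that stage (assumed input).
    cached: set of RDD names that are marked for caching.
    """
    counts = Counter({rdd: 0 for rdd in cached})
    for stage_id, used in stages.items():
        if stage_id > current_stage:
            for rdd in used:
                if rdd in cached:
                    counts[rdd] += 1
    return dict(counts)

# Hypothetical application: RDD "A" is read by every stage, "B" only by stage 1.
stages = {0: {"A"}, 1: {"A", "B"}, 2: {"A"}, 3: {"A"}}
reuse = predict_reuse(stages, cached={"A", "B"}, current_stage=0)
print(reuse)
```

With stage 0 running, "A" is still needed by three later stages and "B" by one, so evicting "A" would be paid for more often.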

Design and Implementation

Overall Architecture

Three necessary steps:
1. Analyzer, on the driver node, analyzes the application using the DAG structures provided by DAGScheduler.
2. Collector, on each executor node, records information about each cache partition during its creation.
3. Eviction Decision provides an efficient eviction strategy that evicts the optimal set of cache partitions when the remaining memory for cache storage is insufficient, and decides whether to remove each one from MemoryStore or serialize it to DiskStore.

Analyzer

DAG start points:
- DFS: files on it can be read directly from local or remote disk.
- ShuffledRDD: can be generated by fetching remote shuffle data.

These start points define the longest running path of a task: when all the cache RDDs are missing, the task needs to run from the start points. (Otherwise, by referring to the dependencies between RDDs, it only needs to run the part of the path that starts from a cache RDD.)
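A minimal sketch of this idea follows, assuming the lineage is given as a mapping from each RDD to its parents and that a start point is an RDD with no parents. The names `recompute_path`, `parents`, and `in_memory` are hypothetical. The walk goes from the missing RDD back toward the start points and stops at any ancestor that is still cached.

```python
def recompute_path(target, parents, in_memory):
    """Return the RDDs that must be recomputed to rebuild `target`.

    parents: dict of RDD name -> list of parent RDD names (assumed lineage).
    in_memory: set of RDD names that are currently cached.
    """
    needed, stack, seen = [], [target], set()
    while stack:
        rdd = stack.pop()
        if rdd in seen:
            continue
        seen.add(rdd)
        needed.append(rdd)
        for parent in parents.get(rdd, []):
            if parent not in in_memory:  # a cached ancestor cuts the path short
                stack.append(parent)
    return needed

# Hypothetical linear lineage: A (start point) -> B -> C -> D
parents = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}
full_path = recompute_path("D", parents, in_memory=set())
short_path = recompute_path("D", parents, in_memory={"B"})
print(full_path, short_path)
```

When nothing is cached the task runs the whole path from the start point; with "B" cached only "D" and "C" are rebuilt.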

The aim of Analyzer is to classify cache RDDs and analyze the dependency information between them before each stage runs.
Analyzer runs only on the driver node and transfers its result to executors when the driver schedules tasks to them.
RDDs that need to be unpersisted are pre-registered. By checking whether each one is used in each stage, we put it in the RemovableRDDs list of the last stage that uses it. A removable partition can be evicted directly, so it does not waste memory.
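The RemovableRDDs bookkeeping can be sketched as below, under the assumption that each stage's RDD usage is available as a plain mapping. The function `removable_by_stage` and its inputs are hypothetical illustrations, not Spark or paper APIs.

```python
def removable_by_stage(stage_usage, to_unpersist):
    """Map each stage id to the pre-registered RDDs whose last use is in that stage.

    stage_usage: dict of stage id -> list of RDD names used by the stage (assumed).
    to_unpersist: set of RDD names pre-registered for unpersisting.
    """
    last_use = {}
    for stage_id in sorted(stage_usage):
        for rdd in stage_usage[stage_id]:
            if rdd in to_unpersist:
                last_use[rdd] = stage_id  # later stages overwrite earlier ones
    removable = {}
    for rdd, stage_id in last_use.items():
        removable.setdefault(stage_id, []).append(rdd)
    return {stage_id: sorted(rdds) for stage_id, rdds in removable.items()}

# Hypothetical usage: "A" is last used in stage 1, "B" in stage 2; "C" is not registered.
stage_usage = {0: ["A"], 1: ["A", "B"], 2: ["B", "C"]}
removable = removable_by_stage(stage_usage, to_unpersist={"A", "B"})
print(removable)
```

Once stage 1 finishes, partitions of "A" can be dropped without any recovery cost, and likewise for "B" after stage 2.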
Cache RDDs of a stage are classified into:
- currently running cache RDDs (targetCacheRDDs)
- RDDs that participate in the current stage (relatedCacheRDDs)
- other cache RDDs

Collector

Collector collects information about each cache partition while tasks run.
Information that needs to be observed:
- Create cost: the time spent creating the partition, called Ccreate.
- Eviction cost: the time it takes to evict a partition from memory, called Ceviction. (If the partition is serialized to disk, the eviction cost is the time spent on serializing and writing to disk, denoted as Cser. If it is removed directly, the eviction cost is 0.)
- Recovery cost: the time it takes to restore partition data that is not found in memory, called Crecovery. If the partition was serialized to disk, the recovery cost is the time spent reading from disk and deserializing, denoted as Cdeser. Otherwise, the partition is recomputed from lineage information, and the cost is represented as Crecompute.
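The two case distinctions above can be written down directly. This is a sketch only: the class `PartitionCosts` and the helper names are hypothetical, times are assumed to be in seconds, and Crecompute is taken as a given input because these notes do not say how it is derived from Ccreate.

```python
from dataclasses import dataclass

@dataclass
class PartitionCosts:
    c_recompute: float  # Crecompute: time to rebuild the partition from lineage
    c_ser: float        # Cser: time to serialize and write the partition to disk
    c_deser: float      # Cdeser: time to read from disk and deserialize

def eviction_cost(costs, to_disk):
    """Ceviction: Cser when the partition is spilled to disk, 0 when it is dropped."""
    return costs.c_ser if to_disk else 0.0

def recovery_cost(costs, on_disk):
    """Crecovery: Cdeser when a disk copy exists, otherwise Crecompute."""
    return costs.c_deser if on_disk else costs.c_recompute

# Hypothetical measurements for one partition.
costs = PartitionCosts(c_recompute=4.0, c_ser=0.5, c_deser=0.3)
print(eviction_cost(costs, to_disk=True), recovery_cost(costs, on_disk=True))
print(eviction_cost(costs, to_disk=False), recovery_cost(costs, on_disk=False))
```

Spilling costs something up front but makes each later recovery cheap; dropping is free now but every later miss pays the full recompute time.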

Eviction Decision

Using the information provided by Collector, each cache partition has a WCPM value:
WCPM = min(CPM * reus, SPM + DPM * reus)
CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size
SPM refers to serialization, DPM refers to deserialization, and reus refers to reusability.
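As a worked illustration of the two formulas, the sketch below computes WCPM for a few partitions and picks the one with the smallest value as the eviction victim. The function names, the tie-breaking rule (prefer removal when both options cost the same), and all numbers are assumptions made for this example. The notes do not state when CPMrenew is applied, so `cpm_renew` only evaluates the formula.

```python
def wcpm(cpm, spm, dpm, reus):
    """Return (WCPM, action) where WCPM = min(CPM * reus, SPM + DPM * reus)."""
    recompute = cpm * reus      # drop now, recompute on every reuse
    spill = spm + dpm * reus    # serialize once, deserialize on every reuse
    if recompute <= spill:
        return recompute, "remove"
    return spill, "serialize"

def cpm_renew(cpm, size, cpm_ancestor, size_ancestor):
    """CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size."""
    return (cpm_ancestor * size_ancestor + cpm * size) / size

def pick_victim(partitions):
    """Choose the partition with the lowest WCPM."""
    return min(partitions, key=lambda name: wcpm(**partitions[name])[0])

# Hypothetical partitions and measurements.
partitions = {
    "p1": dict(cpm=2.0, spm=1.0, dpm=0.5, reus=3),
    "p2": dict(cpm=0.5, spm=1.0, dpm=0.5, reus=2),
    "p3": dict(cpm=4.0, spm=0.5, dpm=0.25, reus=0),
}
victim = pick_victim(partitions)
print(victim, wcpm(**partitions["p1"]), cpm_renew(2.0, 4.0, 1.0, 2.0))
```

Here p3 is never reused, so its WCPM is 0 and it is evicted first by plain removal; p1 is expensive to recompute and reused three times, so serializing it (2.5) beats dropping it (6.0).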

Evaluation

Evaluation Environment and Method

PR, CC, KMeans algorithms...
LCS is compared with LRU and FIFO.


[Paper] LCS: An Efficient Data Eviction Strategy for Spark