Abstract

  • Classical strategies are not aware of recovery cost, which can degrade system performance.   -->  A cost-aware eviction strategy can reduce the total recovery cost.
  • A strategy named LCS (Least Cost Strategy) -->  obtains the dependency information between cached data by analyzing the application, and calculates the recovery cost at run time. By predicting how many times cached data will be reused and using that count to weight the recovery cost, LCS always evicts the data that leads to the minimum recovery cost in the future.

Introduction

  • Current eviction strategies:

    • FIFO: focuses on creation time.
    • LRU: focuses on access history to achieve a better hit ratio.
  • Many eviction algorithms consider both the access history and the cost of cached items. In Spark, however, the execution logic of the upcoming phases is known in advance, so access history does not help the eviction strategy.
  • LCS has three steps:
    1. Obtain the dependencies between RDDs by analyzing the application, and predict how many times each cache partition will be reused.
    2. Collect information during partition creation, and predict the recovery cost.
    3. Maintain the eviction order using these two pieces of information, and evict the partition that incurs the least cost when memory is insufficient.

Design and Implementation

Overall Architecture

  • Three necessary steps:
    1. Analyzer, on the driver node, analyzes the application using the DAG structures provided by DAGScheduler.
    2. Collector, on each executor node, records information about each cache partition during its creation.
    3. Eviction Decision evicts the optimal set of cache partitions when the remaining memory for cache storage is insufficient, and decides whether to remove each partition from MemoryStore or serialize it to DiskStore.

Analyzer

  • DAG start points:

    • DFS: files on it can be read directly from local or remote disk.
    • ShuffledRDD: can be generated by fetching remote shuffle data.

  These start points define the longest running path of a task: when all cached RDDs are missing, the task must run from the start points. (When a cached RDD is available, the task only needs to run the part of the path that follows it, as determined by the dependencies between RDDs.)

  • The aim of Analyzer is to classify cache RDDs and analyze the dependency information between them before each stage runs.
  • Analyzer runs only on the driver node and transfers its results to executors when the driver schedules tasks to them.
  • RDDs that need to be unpersisted are pre-registered. By checking whether each one is used in each stage, Analyzer puts it in the RemovableRDDs list of the last stage that uses it. Removable partitions can be evicted directly, so they do not waste memory.
  • The cache RDDs of a stage are classified into:
    • currently running cache RDDs (targetCacheRDDs)
    • RDDs that participate in the current stage (relatedCacheRDDs)
    • other cache RDDs
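
  The removable-RDD bookkeeping and reuse prediction described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each stage is reduced to the set of cached RDD names it reads, and the function names `last_use_stage` and `remaining_reuse` are hypothetical.

```python
from typing import Dict, List, Set


def last_use_stage(stages: List[Set[str]], unpersist: Set[str]) -> Dict[int, List[str]]:
    """Map stage index -> RDDs that become removable once that stage finishes.

    `stages[i]` is the (assumed) set of cached RDD names read by stage i.
    `unpersist` is the set of pre-registered RDDs that will be unpersisted.
    """
    removable: Dict[int, List[str]] = {}
    for rdd in sorted(unpersist):
        # Indices of all stages that reference this RDD.
        users = [i for i, used in enumerate(stages) if rdd in used]
        if users:
            # Attach the RDD to the last stage that uses it.
            removable.setdefault(users[-1], []).append(rdd)
    return removable


def remaining_reuse(stages: List[Set[str]], current: int, rdd: str) -> int:
    """Count how many stages after `current` still read `rdd`."""
    return sum(1 for used in stages[current + 1:] if rdd in used)


# Hypothetical application: each set lists the cached RDDs a stage reads.
stages = [{"A"}, {"A", "B"}, {"B"}, {"B", "C"}]
removable = last_use_stage(stages, unpersist={"A", "B"})
reuse_b = remaining_reuse(stages, 1, "B")
```

  In this toy run, "A" is attached to stage 1 and "B" to stage 3, and "B" is still read by two stages after stage 1.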

Collector

  • Collector collects information about each cache partition while tasks run.
  • Information that needs to be observed:
    • Create cost: the time spent creating the partition, called C_create.
    • Eviction cost: the time spent evicting a partition from memory, called C_eviction. If the partition is serialized to disk, the eviction cost is the time spent serializing it and writing it to disk, denoted C_ser. If it is removed directly, the eviction cost is 0.
    • Recovery cost: the time spent when partition data is not found in memory, called C_recovery. If the partition was serialized to disk, the recovery cost is the time spent reading it from disk and deserializing it, denoted C_deser. Otherwise, the partition is recomputed from lineage information, and the cost is denoted C_recompute.
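
  As a rough sketch of how these quantities relate, the two cases (serialized to disk versus removed directly) can be written as below. The `PartitionCosts` class and the numeric values are hypothetical; only the case analysis comes from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class PartitionCosts:
    """Per-partition timings recorded by a collector (hypothetical container)."""
    c_create: float     # time spent creating the partition
    c_ser: float        # time to serialize and write to disk
    c_deser: float      # time to read from disk and deserialize
    c_recompute: float  # time to recompute from lineage


def eviction_cost(p: PartitionCosts, to_disk: bool) -> float:
    # Serializing to disk costs C_ser; removing directly costs nothing.
    return p.c_ser if to_disk else 0.0


def recovery_cost(p: PartitionCosts, to_disk: bool) -> float:
    # A partition on disk is deserialized; otherwise it is recomputed.
    return p.c_deser if to_disk else p.c_recompute


# Made-up timings, in seconds, for illustration only.
p = PartitionCosts(c_create=4.0, c_ser=1.0, c_deser=0.5, c_recompute=4.0)
spill = (eviction_cost(p, True), recovery_cost(p, True))
drop = (eviction_cost(p, False), recovery_cost(p, False))
```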

Eviction Decision

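  • Using the information provided by Collector, each cache partition is assigned a WCPM value:
    WCPM = min(CPM * reus, SPM + DPM * reus)
    CPM_renew = (CPM_ancestor * size_ancestor + CPM * size) / size
    SPM refers to serialization, DPM refers to deserialization, and reus refers to reusability (the predicted number of reuses).
    The first term of the min corresponds to removing the partition and recomputing it on every reuse; the second corresponds to serializing it once and deserializing it on every reuse.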
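
  A minimal sketch of how the WCPM formula could drive an eviction choice is shown below. It assumes the partition with the smallest WCPM is evicted first (consistent with "evicts the partition that incurs the least cost"), and that the cheaper branch of the min decides between removing and serializing. The function names, the tuple layout, and all numbers are hypothetical; the notes do not say when the renewal formula is applied, so `cpm_renew` only evaluates it.

```python
from typing import Dict, Tuple


def wcpm(cpm: float, spm: float, dpm: float, reus: int) -> Tuple[float, str]:
    """Return (WCPM, action) where action is the cheaper of the two branches."""
    recompute = cpm * reus        # remove now, recompute on every reuse
    spill = spm + dpm * reus      # serialize once, deserialize on every reuse
    if recompute <= spill:
        return recompute, "remove"
    return spill, "serialize"


def cpm_renew(cpm: float, size: float, cpm_anc: float, size_anc: float) -> float:
    """Evaluate (CPM_ancestor * size_ancestor + CPM * size) / size."""
    return (cpm_anc * size_anc + cpm * size) / size


def pick_victim(parts: Dict[str, Tuple[float, float, float, int]]) -> Tuple[str, str]:
    """Pick the partition with the smallest WCPM and the action to apply."""
    name = min(parts, key=lambda n: wcpm(*parts[n])[0])
    return name, wcpm(*parts[name])[1]


# Hypothetical partitions: name -> (CPM, SPM, DPM, reus).
parts = {
    "p1": (2.0, 1.0, 0.5, 3),   # expensive to recompute, so spilling is cheaper
    "p2": (0.25, 1.0, 0.5, 3),  # cheap to recompute
    "p3": (5.0, 1.0, 0.5, 0),   # never reused again, so WCPM is 0
}
victim = pick_victim(parts)
```

  With these numbers, p3 is chosen and removed directly because it will not be reused, while p1 would be serialized rather than removed if it had to go.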

Evaluation

Evaluation Environment and Method

  • PR, CC, KMeans algorithms...
  • LCS is compared with LRU and FIFO.
