[Paper] Selection and replacement algorithm for memory performance improvement in Spark
Summary
Spark lacks a good mechanism for choosing which RDDs should have their partitions cached in limited memory. --> The paper proposes a selection algorithm by which Spark automatically chooses the RDDs to cache according to how many times each RDD is used. --> This speeds up iterative computations.
Spark uses the least recently used (LRU) replacement algorithm to evict RDDs, which considers only how recently the RDDs were used. --> The paper proposes a weight replacement (WR) algorithm, which jointly considers each partition's computation cost, number of uses, and size.
Preliminary Information
Cache mechanism in Spark
- When RDD partitions have been cached in memory during an iterative computation, an operation that needs those partitions gets them through CacheManager.
- All CacheManager operations, including reading and caching, depend mainly on the BlockManager API. BlockManager decides whether partitions are obtained from memory or from disk.
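The lookup path described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Spark's implementation: `ToyBlockStore`, `get_or_compute`, and the block id strings are hypothetical names invented for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyBlockStore:
    """Hypothetical stand-in for a block store with a memory tier and a disk tier."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[int]] = {}
        self.disk: Dict[str, List[int]] = {}

    def get(self, block_id: str) -> Optional[Tuple[str, List[int]]]:
        # Memory is checked first, then disk; None means the block is not stored.
        if block_id in self.memory:
            return "memory", self.memory[block_id]
        if block_id in self.disk:
            return "disk", self.disk[block_id]
        return None


def get_or_compute(
    store: ToyBlockStore, block_id: str, compute: Callable[[], List[int]]
) -> Tuple[str, List[int]]:
    """Serve a partition from the store if present; otherwise compute and cache it."""
    found = store.get(block_id)
    if found is not None:
        return found
    data = compute()
    store.memory[block_id] = data  # cache in memory for later iterations
    return "computed", data
```

The first call for a block id computes the partition; later calls are served from memory, which is what makes caching pay off in iterative jobs.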
Scheduling model
- The LRU algorithm considers only whether partitions were recently used and ignores their computation cost and size.
- The number of uses of each partition can be known from the DAG before tasks are performed.
- Let Nij be the number of uses of the j-th partition of RDDi.
- Let Sij be the size of the j-th partition of RDDi.
- The computation time is also an important factor. --> The starting time STij and finishing time FTij of each partition of RDDi roughly express its execution and communication time.
- Consider the computation cost of a partition as Costij = FTij - STij.
- After that, we set up a scheduling model and obtain the weight of Pij (the j-th partition of RDDi), which can be expressed as:
where k is the correction parameter, and it is set to a constant.
- Finally, we assume that there are h partitions in RDDi, so the weight of RDDi is:
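The weight formulas themselves are not reproduced in these notes. The sketch below shows one plausible form that is consistent with the three factors named above; it is an assumption, not a transcription of the paper's equations. It assumes the partition weight grows with cost and number of uses and shrinks with size, i.e. k * Costij * Nij / Sij, and that the RDD weight is the sum of the weights of its h partitions. The names `Partition`, `partition_weight`, and `rdd_weight` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    uses: int      # N_ij: number of uses, known from the DAG
    size: float    # S_ij: partition size
    start: float   # ST_ij: starting time
    finish: float  # FT_ij: finishing time


def cost(p: Partition) -> float:
    """Cost_ij = FT_ij - ST_ij, as defined in the notes."""
    return p.finish - p.start


def partition_weight(p: Partition, k: float = 1.0) -> float:
    # ASSUMED form: k * Cost_ij * N_ij / S_ij (not confirmed by these notes).
    return k * cost(p) * p.uses / p.size


def rdd_weight(partitions: List[Partition], k: float = 1.0) -> float:
    # ASSUMED form: sum of the weights of the h partitions of the RDD.
    return sum(partition_weight(p, k) for p in partitions)
```

Under this assumed form, a partition that is expensive to recompute and reused often gets a high weight, while a large partition that occupies a lot of memory gets a lower one.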
Proposed Algorithm
Selection algorithm
- For a given DAG, we can get the number of uses of each RDD, denoted NRDDi.
- The pseudocode:
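The paper's pseudocode is not reproduced in these notes. A minimal Python sketch of the idea follows, under stated assumptions: the DAG is given as a mapping from each RDD to the parent RDDs it is computed from, the number of uses of an RDD is the number of times it appears as a parent, and an RDD is selected for caching when it is used at least `min_uses` times. The threshold of 2 and the function names are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, List


def count_rdd_uses(dag: Dict[str, List[str]]) -> Counter:
    """Count how many times each RDD is used as an input (N_RDDi)."""
    uses: Counter = Counter()
    for parents in dag.values():
        uses.update(parents)
    return uses


def select_rdds_to_cache(dag: Dict[str, List[str]], min_uses: int = 2) -> List[str]:
    """Return the RDDs used at least min_uses times (threshold is an assumption)."""
    uses = count_rdd_uses(dag)
    return sorted(rdd for rdd, n in uses.items() if n >= min_uses)


# Two PageRank-like iterations: "links" feeds every iteration.
example_dag = {
    "links": [],
    "ranks0": ["links"],
    "contribs1": ["links", "ranks0"],
    "ranks1": ["contribs1"],
    "contribs2": ["links", "ranks1"],
}
selected = select_rdds_to_cache(example_dag)
```

In this example only `links` is reused across iterations, so it is the only RDD selected for caching.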
Replacement algorithm
- The paper uses the weight of a partition to evaluate its importance.
- When many partitions are cached in memory, the QuickSort algorithm sorts them by weight.
- The pseudocode:
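The paper's pseudocode is not reproduced in these notes. The following Python sketch shows one way a weight-based eviction step could look, assuming that the lowest-weight partitions are evicted first until enough memory is free. The function name and the `(weight, size)` representation are hypothetical, and Python's built-in `sorted` stands in for the QuickSort mentioned above.

```python
from typing import Dict, List, Tuple


def evict_by_weight(
    cached: Dict[str, Tuple[float, float]], needed: float, free: float
) -> List[str]:
    """Pick partitions to evict, lowest weight first.

    cached maps a block id to (weight, size). Eviction stops as soon as
    free memory reaches `needed`. If evicting everything is still not
    enough, all block ids are returned and the caller must handle that.
    """
    victims: List[str] = []
    # Sort by weight ascending so the least valuable partitions go first.
    for block_id, (_weight, size) in sorted(cached.items(), key=lambda kv: kv[1][0]):
        if free >= needed:
            break
        victims.append(block_id)
        free += size
    return victims
```

Unlike LRU, the order of eviction here depends only on the weights, so an expensive, frequently reused partition survives even if it has not been touched recently.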
Experiments
- Five servers and six virtual machines; each VM has a 100 GB disk and a 2.5 GHz CPU and runs the Ubuntu 12.04 operating system. Memory is varied and set to 1 GB, 2 GB, or 4 GB depending on the condition.
- Hadoop 2.10.4 and Spark-1.1.0.
- Ganglia is used to observe memory usage.
- The PageRank algorithm is used for the experiments because it is iterative.