Summary

Spark does not have a good mechanism to select reasonable RDDs to cache their partitions in limited memory. --> Propose a novel selection algorithm, by which Spark can automatically select the RDDs to cache their partitions in memory according to the number of use for RDDs. --> speeds up iterative computations.
Spark use least recently used (LRU) replacement algorithm to evict RDDs, which only consider the usage of the RDDs. --> a novel replacement algorithm called weight replacement (WR) algorithm, which takes comprehensive consideration of the partitions computation cost, the number of use for partitions, and the sizes of the partitions.

Preliminary Information

Cache mechanism in Spark

When RDD partitions have been cached in memory during the iterative computation, an operation which needs the partitions will get them by CacheManager.
All operations including reading or caching in CacheManager mainly depend on the API of BlockManager. BlockManager decides whether partitions are obtained from memory or disks.

Scheduling model

The LRU algorithm only considers whether those partitions are recently used while ignores the partitions computation cost and the sizes of the partitions.
The number of use for partitions can be known from the DAG before tasks are performed.
Let Nij be the number of use of j-th partition of RDDi.
Let Sij be the size of j-th partition or RDDi.
The computation time is also an important part. --> Each partition of RDDi starting time STij and finishing time FTij can roughly express its execution and communication time.
Consider the computation cost of partition as Costj = FTij - STij.
After that, we set up a scheduling model and obtain the weight of Pij, which can be expressed as:

where k is the correction parameter, and it's set to a constant.
Finally, we assume that there are h partitions in RDDi, so the weight of RDDi is:

Proposed Algorithm

Selection algorithm

For a given DAG graph,we can get the num of uses for each RDD, expressioned as NRDDi.
The pseudocode:

Replacement algorithm

In this paper, we use weight of partition to evaluate the importance of the partitions.
When many partitions are cached in memory, we use QuickSort algorithm to sort the partitions according to the value of the partitions.
The pseudocode:

Experiments

five servers, six virtual machines, each vm has 100G disk, 2.5GHZ and runs Ubuntu 12.04 operation system while memory is variable, and we set it as 1G, 2G, or 4G in different conditions.
Hadoop 2.10.4 and Spark-1.1.0.
use ganglia to observe the memory usage.
use pageRank algorithm to do expirement, it's iterative.

[Paper] Selection and replacement algorithm for memory performance improvement in Spark的更多相关文章

Partitioned Replacement for Cache Memory
In a particular embodiment, a circuit device includes a translation look-aside buffer (TLB) configur ...
Flash-aware Page Replacement Algorithm
1.Abstract:(1)字体太乱,单词中有空格(2) FAPRA此名词第一出现时应有“ FAPRA(Flash-aware Page Replacement Algorithm)”说明. 2.in ...
Inside Amazon's Kafkaesque "Performance Improvement Plans"
Amazon CEO and brilliant prick Jeff Bezos seems to have lost his magic touch lately. Investors, empl ...
Hive-Container killed by YARN for exceeding memory limits. 9.2 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task times, most recen ...
Spring Boot Memory Performance
The Performance Zone is brought to you in partnership with New Relic. Quickly learn how to use Docke ...
计算机系统结构总结_Memory Hierarchy and Memory Performance
Textbook: <计算机组成与设计——硬件/软件接口> HI <计算机体系结构——量化研究方法> QR 这是youtube上一个非常好的memory syst ...
PatentTips - Control register access virtualization performance improvement
BACKGROUND OF THE INVENTION A conventional virtual-machine monitor (VMM) typically runs on a compute ...
SQL Performance Improvement Techniques（转）
原文地址:http://www.codeproject.com/Tips/1023621/SQL-Performance-Improvement-Techniques This article pro ...
Ceilometer Polling Performance Improvement
Ceilometer的数据采集agent会定期对nova/keystone/neutron/cinder等服务调用其API的获取信息,默认是20秒一次, # Polling interval for ...

随机推荐

p1468 Party Lamps
就是模拟.同一个开关按2下相当于没按,那么,如果一共按0下,就是没按,按1下就是4个开关的1个,按2下可能相当于实际按了0下或按2下,按3下实际按了1下或3下,之后如果是奇数,相当于按1或3下,偶数相 ...
sgu 139 Help Needed!
题意:16数码是否有解? 先计算展开成一维后逆序对.如果0在最后一行,那么逆序偶时有解.4*4时(n为偶)0的位置上升一行,逆序对+3或-1(奇偶性变化).(n为奇时+2或+0,不变) #includ ...
canvas学习之柱状图
项目地址:http://pan.baidu.com/s/1nvhWrwP 因为最近项目中使用到了图表,而且个人一直希望研究canvas,所以最近几天花时间对canvas好好研究了一下,并写了一个dem ...
docker 基本操作
# 常用命令 docker run 镜像 docker images 查看所有镜像 docke ps 查看运行中的容器 docker ps -a 列出所有容器 docker st ...
vuex之单向数据流
单向数据流 State State 用来存状态.在根实例中注册了store 后,用 this.$store.state 来访问. Getters Getters 从 state 上派生出来的状态.可以 ...
linux Boot目录满了之后的解决方法
boot目录为什么会满? Linux默认分区时,boot分区就200多M,按理说也不小,足够了(实际也就几十M),但是内核经常性的升级,而且自己又不自动卸载,于是该目录下旧的内核文件越积越多,最后就满 ...
Beta阶段——第2篇 Scrum 冲刺博客
Beta阶段--第2篇 Scrum 冲刺博客标签:软件工程一.站立式会议照片二.每个人的工作 (有work item 的ID) 昨日已完成的工作人员工作林羽晴完成https安全连接的问题 ...
常用加密算法简单整理以及spring securiy使用bcrypt加密
一.哈希加密 1.md5加密 Message Digest Algorithm MD5(中文名为消息摘要算法第五版) https://baike.baidu.com/item/MD5/212708?f ...
docker实战系列之搭建rabbitmq
1.搜索镜像[注:因为我这里采用的是阿里云镜像加速器,所以我直接在阿里云中搜索相关镜像路径],点击"详情"查看公网拉取路径 2.拉取镜像 docker pull registry. ...
Eclipse错误：The superclass "javax.servlet.http.HttpServlet" was not found on the Java Build Path
该报错是由于缺少servlet-api.jar造成的,将servlet-api.jar复制到项目下的WEB-INF/lib目录下即可 servlet-api.jar在tomcat的lib目录下有,可以 ...

[Paper] Selection and replacement algorithm for memory performance improvement in Spark

Summary

Preliminary Information

Cache mechanism in Spark

Scheduling model

Proposed Algorithm

Selection algorithm

Replacement algorithm

Experiments

[Paper] Selection and replacement algorithm for memory performance improvement in Spark的更多相关文章

随机推荐

热门专题