Summary

  • Spark does not have a good mechanism to select reasonable RDDs to cache their partitions in limited memory.  --> Propose a novel selection algorithm, by which Spark can automatically select the RDDs to cache their partitions in memory according to the number of use for RDDs.   --> speeds up iterative computations.

  • Spark use least recently used (LRU) replacement algorithm to evict RDDs, which only consider the usage of the RDDs.  --> a novel replacement algorithm called weight replacement (WR) algorithm, which takes comprehensive consideration of the partitions computation cost, the number of use for partitions, and the sizes of the partitions.

Preliminary Information

Cache mechanism in Spark

  • When RDD partitions have been cached in memory during the iterative computation, an operation which needs the partitions will get them by CacheManager.
  • All operations including reading or caching in CacheManager mainly depend on the API of BlockManager. BlockManager decides whether partitions are obtained from memory or disks.

Scheduling model

  • The LRU algorithm only considers whether those partitions are recently used while ignores the partitions computation cost and the sizes of the partitions.
  • The number of use for partitions can be known from the DAG before tasks are performed.
    Let Nij be the number of use of j-th partition of RDDi.
    Let Sij be the size of j-th partition or RDDi.
  • The computation time is also an important part. --> Each partition of RDDi starting time STij and finishing time FTij can roughly express its execution and communication time.
    Consider the computation cost of partition as Costj = FTij - STij.
  • After that, we set up a scheduling model and obtain the weight of Pij, which can be expressed as:

    where k is the correction parameter, and it's set to a constant.
  • Finally, we assume that there are h partitions in RDDi, so the weight of RDDi is:

Proposed Algorithm

Selection algorithm

  • For a given DAG graph,we can get the num of uses for each RDD, expressioned as NRDDi.
  • The pseudocode:

Replacement algorithm

  • In this paper, we use weight of partition to evaluate the importance of the partitions.
  • When many partitions are cached in memory, we use QuickSort algorithm to sort the partitions according to the value of the partitions.
  • The pseudocode:

Experiments

  • five servers, six virtual machines, each vm has 100G disk, 2.5GHZ and runs Ubuntu 12.04 operation system while memory is variable, and we set it as 1G, 2G, or 4G in different conditions.
  • Hadoop 2.10.4 and Spark-1.1.0.
  • use ganglia to observe the memory usage.
  • use pageRank algorithm to do expirement, it's iterative.

[Paper] Selection and replacement algorithm for memory performance improvement in Spark的更多相关文章

  1. Partitioned Replacement for Cache Memory

    In a particular embodiment, a circuit device includes a translation look-aside buffer (TLB) configur ...

  2. Flash-aware Page Replacement Algorithm

    1.Abstract:(1)字体太乱,单词中有空格(2) FAPRA此名词第一出现时应有“ FAPRA(Flash-aware Page Replacement Algorithm)”说明. 2.in ...

  3. Inside Amazon's Kafkaesque "Performance Improvement Plans"

    Amazon CEO and brilliant prick Jeff Bezos seems to have lost his magic touch lately. Investors, empl ...

  4. Hive-Container killed by YARN for exceeding memory limits. 9.2 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task times, most recen ...

  5. Spring Boot Memory Performance

    The Performance Zone is brought to you in partnership with New Relic. Quickly learn how to use Docke ...

  6. 计算机系统结构总结_Memory Hierarchy and Memory Performance

    Textbook: <计算机组成与设计——硬件/软件接口>  HI <计算机体系结构——量化研究方法>       QR 这是youtube上一个非常好的memory syst ...

  7. PatentTips - Control register access virtualization performance improvement

    BACKGROUND OF THE INVENTION A conventional virtual-machine monitor (VMM) typically runs on a compute ...

  8. SQL Performance Improvement Techniques(转)

    原文地址:http://www.codeproject.com/Tips/1023621/SQL-Performance-Improvement-Techniques This article pro ...

  9. Ceilometer Polling Performance Improvement

    Ceilometer的数据采集agent会定期对nova/keystone/neutron/cinder等服务调用其API的获取信息,默认是20秒一次, # Polling interval for ...

随机推荐

  1. caffe 动态库 Release X64

    Release X64平台 createdll.h#ifndef CREARDLL_H_#define CREARDLL_H_ extern "C" _declspec(dllex ...

  2. Django多表查询练习题

    #一 model表:from django.db import models # Create your models here. class Teacher(models.Model): tid=m ...

  3. stock 基本操作

    追涨停   量比 大于5      0%-2%   个股    2点卖     37分钟买   板块5 -8 只涨停    板块分向标   追踪短期个股的涨跌现象    明白市场大级别趋势     主 ...

  4. Huffman Coding

    哈夫曼树 霍夫曼编码是一种无前缀编码.解码时不会混淆.其主要应用在数据压缩,加密解密等场合. 1. 由给定结点构造哈夫曼树 (1)先从小到大排序(nlogn) (2)先用最小的两个点构造一个节点,父节 ...

  5. hdu 1542 Atlantis (线段树扫描线)

    大意: 求矩形面积并. 枚举$x$坐标, 线段树维护$[y_1,y_2]$内的边是否被覆盖, 线段树维护边时需要将每条边挂在左端点上. #include <iostream> #inclu ...

  6. 【微信公众号开发】【8】网页授权获取用户基本信息(OAuth 2.0)

    前言: 1,在微信公众号请求用户网页授权之前,开发者需要先到公众平台官网中的“开发 - 接口权限 - 网页服务 - 网页帐号 - 网页授权获取用户基本信息”的配置选项中,修改授权回调域名. 请注意,这 ...

  7. php打印

    function preview() { bdhtml = window.document.body.innerHTML; sprnstr = "<!--startprint--> ...

  8. windows中安装liunx虚拟机

    我感觉这个写的很好,所以本内容是转他人的内容. http://www.csdn.net/article/2015-06-05/2824882/1

  9. Hadoop 2.7.3 完全分布式维护-部署篇

    测试环境如下  IP       host JDK linux hadop role 172.16.101.55 sht-sgmhadoopnn-01 1.8.0_111 CentOS release ...

  10. springboot程序构建一个docker镜像(十一)

    准备工作 环境: linux环境或mac,不要用windows jdk 8 maven 3.0 docker 对docker一无所知的看docker教程. 创建一个springboot工程 引入web ...