Summary

  • Spark lacks a good mechanism for choosing which RDDs to cache when memory is limited.  --> The paper proposes a selection algorithm by which Spark automatically selects the RDDs whose partitions are cached in memory, based on how many times each RDD is used.  --> This speeds up iterative computations.

  • Spark uses the least recently used (LRU) replacement algorithm to evict RDDs, which considers only how recently the RDDs were used.  --> The paper proposes a weight replacement (WR) algorithm, which jointly considers each partition's computation cost, its number of uses, and its size.

Preliminary Information

Cache mechanism in Spark

  • Once RDD partitions have been cached in memory during an iterative computation, any operation that needs those partitions retrieves them through CacheManager.
  • All CacheManager operations, including reading and caching, rely mainly on the BlockManager API. BlockManager decides whether partitions are read from memory or from disk.
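
The lookup path described above can be sketched as a toy model. This is not Spark's code: `ToyBlockStore` and `get_or_compute` are hypothetical names standing in for BlockManager and the CacheManager lookup, and plain dictionaries stand in for memory and disk.

```python
from typing import Callable, Dict, List, Tuple

BlockId = Tuple[int, int]  # hypothetical key: (rdd_id, partition_index)


class ToyBlockStore:
    """Stand-in for BlockManager: looks in memory first, then on disk."""

    def __init__(self) -> None:
        self.memory: Dict[BlockId, List[int]] = {}
        self.disk: Dict[BlockId, List[int]] = {}

    def get(self, block_id: BlockId):
        if block_id in self.memory:
            return self.memory[block_id], "memory"
        if block_id in self.disk:
            return self.disk[block_id], "disk"
        return None, "miss"

    def put(self, block_id: BlockId, data: List[int]) -> None:
        self.memory[block_id] = data


def get_or_compute(store: ToyBlockStore, block_id: BlockId,
                   compute: Callable[[], List[int]]):
    """Stand-in for the CacheManager path: reuse a cached partition or compute and cache it."""
    data, source = store.get(block_id)
    if source == "miss":
        data = compute()
        store.put(block_id, data)
        source = "computed"
    return data, source
```

The first call for a block computes and caches it; later calls for the same block are served from the store without recomputation.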

Scheduling model

  • The LRU algorithm considers only whether partitions were recently used and ignores their computation cost and size.
  • The number of uses of each partition can be read from the DAG before any tasks run.
    Let Nij be the number of uses of the j-th partition of RDDi.
    Let Sij be the size of the j-th partition of RDDi.
  • Computation time also matters. --> For each partition of RDDi, the starting time STij and finishing time FTij roughly capture its execution and communication time.
    Take the computation cost of the partition to be Costij = FTij - STij.
  • From these quantities we set up a scheduling model and obtain the weight of Pij (the j-th partition of RDDi), which can be expressed as:

    where k is a correction parameter, set to a constant.
  • Finally, we assume that there are h partitions in RDDi, so the weight of RDDi is:
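
The formula images are missing from these notes. One form consistent with the definitions above, given here as an assumption rather than as the paper's exact expression, lets the weight grow with computation cost and number of uses and shrink with size, and takes the RDD weight as the sum over its h partitions:

```latex
W_{P_{ij}} = k \cdot \frac{Cost_{ij} \cdot N_{ij}}{S_{ij}},
\qquad
W_{RDD_i} = \sum_{j=1}^{h} W_{P_{ij}}
```

Under this assumed form, with k = 1, a partition that took 10 s to compute, is used 3 times, and occupies 100 MB would get a weight of 10 × 3 / 100 = 0.3.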

Proposed Algorithm

Selection algorithm

  • For a given DAG, we can obtain the number of uses of each RDD, denoted NRDDi.
  • The pseudocode:
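
The pseudocode image is missing from these notes, so what follows is only a minimal Python sketch of the idea. It assumes the DAG is given as a mapping from each RDD to its parent RDDs, that NRDDi is the number of times RDDi is read as a parent, and that an RDD is worth caching when it is used at least twice. The threshold and the names `count_uses` and `select_rdds_to_cache` are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def count_uses(dag: Dict[str, List[str]]) -> Counter:
    """Count how many times each RDD is read as a parent in the DAG."""
    uses: Counter = Counter()
    for parents in dag.values():
        uses.update(parents)
    return uses


def select_rdds_to_cache(dag: Dict[str, List[str]], min_uses: int = 2) -> List[str]:
    """Return the RDDs used at least `min_uses` times (assumed caching criterion)."""
    uses = count_uses(dag)
    return sorted(rdd for rdd, n in uses.items() if n >= min_uses)


# Hypothetical PageRank-like lineage: `links` is reused in every iteration.
dag = {
    "links": [],
    "ranks0": ["links"],
    "contribs1": ["links", "ranks0"],
    "ranks1": ["contribs1"],
    "contribs2": ["links", "ranks1"],
    "ranks2": ["contribs2"],
}
print(select_rdds_to_cache(dag))  # prints ['links']
```

In this toy lineage only `links` is read more than once, so it is the only RDD selected for caching.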

Replacement algorithm

  • The paper uses the weight of each partition to evaluate its importance.
  • When many partitions are cached in memory, the QuickSort algorithm is used to sort them by weight.
  • The pseudocode:
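
The pseudocode image is missing here as well. A minimal Python sketch of weight-based eviction follows, under stated assumptions: the weight is taken as k * cost * uses / size (an assumed form, not quoted from the paper), Python's built-in sort stands in for QuickSort, and an incoming partition only displaces cached partitions with a lower weight. `Partition`, `weight`, and `make_room` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    cost: float  # FT - ST, in seconds
    uses: int    # number of uses read from the DAG
    size: float  # in MB


def weight(p: Partition, k: float = 1.0) -> float:
    """Assumed weight: higher cost and more uses raise it, larger size lowers it."""
    return k * p.cost * p.uses / p.size


def make_room(cached: List[Partition], incoming: Partition,
              capacity: float) -> Optional[List[Partition]]:
    """Return the partitions to evict so `incoming` fits, or None if it should not be cached."""
    free = capacity - sum(p.size for p in cached)
    victims: List[Partition] = []
    for p in sorted(cached, key=weight):  # lowest weight first
        if free >= incoming.size:
            break
        if weight(p) >= weight(incoming):
            return None  # remaining partitions are all at least as valuable
        victims.append(p)
        free += p.size
    return victims if free >= incoming.size else None


cached = [
    Partition("a", cost=10, uses=3, size=100),
    Partition("b", cost=2, uses=1, size=100),
    Partition("c", cost=5, uses=2, size=50),
]
incoming = Partition("d", cost=8, uses=2, size=100)
print([p.name for p in make_room(cached, incoming, capacity=300)])  # prints ['b']
```

Unlike LRU, this sketch never evicts a partition that is more valuable than the one being admitted, so a cheap, rarely used partition cannot push out an expensive, frequently reused one.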

Experiments

  • Five servers, six virtual machines; each VM has a 100 GB disk and a 2.5 GHz CPU and runs the Ubuntu 12.04 operating system. Memory is variable and is set to 1 GB, 2 GB, or 4 GB depending on the experiment.
  • Hadoop 2.10.4 and Spark 1.1.0.
  • Ganglia is used to observe memory usage.
  • PageRank, an iterative algorithm, is used as the experimental workload.

Paper: Selection and replacement algorithm for memory performance improvement in Spark
