Selected excerpts from "The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis"

We then evaluate and characterize the Hadoop framework using HiBench, in terms of speed (i.e., job running time), throughput (i.e., the number of tasks completed per minute), HDFS bandwidth, system resource (e.g., CPU, memory and I/O) utilizations, and data access patterns.

Key points: speed, throughput, HDFS bandwidth, system resource utilization, data access patterns

the last one is an enhanced version of the DFSIO benchmark that we have extended to evaluate the aggregated bandwidth delivered by HDFS.

Key point: evaluate the aggregated bandwidth delivered by HDFS

As shown in Fig. 1, the aggregated throughput curve has a warm-up period and a cool-down period where map tasks are launching up and shutting down respectively. Between these two periods, there is a steady period where the aggregated throughput values are stable across different time slots. Therefore, the Enhanced DFSIO workload computes the aggregated HDFS throughput by averaging the throughput value of each time slot in the steady period. In Enhanced DFSIO, when the number of concurrent map tasks at a time slot is above a specified percentage (e.g., 50% is used in our benchmarking) of the total map task slots in the Hadoop cluster, the slot is considered to be in the steady period.

Key points: warm-up period, cool-down period, steady period; the aggregated HDFS throughput is computed by averaging the throughput value of each time slot in the steady period
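
The steady-period rule quoted above can be sketched in a few lines of Python. This is only an illustration of the averaging idea, not the Enhanced DFSIO source: the function name `steady_period_throughput`, the `(concurrent_maps, throughput)` tuple layout, and the sample numbers are all assumptions made for the example.

```python
def steady_period_throughput(slots, total_map_slots, threshold=0.5):
    """Average the throughput of the time slots that fall in the steady period.

    slots: list of (concurrent_maps, throughput_mb_s) tuples, one per time slot
           (hypothetical layout chosen for this sketch).
    total_map_slots: total number of map task slots in the cluster.
    threshold: fraction of total_map_slots that must be exceeded for a slot
               to count as steady (the excerpt uses 50%).
    """
    steady = [tp for maps, tp in slots if maps > threshold * total_map_slots]
    if not steady:
        raise ValueError("no time slot reached the steady-period threshold")
    return sum(steady) / len(steady)


# Made-up samples: warm-up, four steady slots, cool-down.
samples = [(2, 40.0), (6, 110.0), (8, 120.0), (8, 124.0), (7, 116.0), (3, 50.0)]
aggregated = steady_period_throughput(samples, total_map_slots=10)
```

With 10 map slots and a 50% threshold, only the four middle samples count, so the warm-up and cool-down slots do not drag the average down.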

In essence, the TeraSort workload is similar to Sort and therefore is I/O bound in nature. However, we have compressed its shuffle data (i.e., map output) in the experiment so as to minimize the disk and network I/O during shuffle, as shown in Table III. Consequently, TeraSort has very high CPU utilization and moderate disk I/O during the map stage and shuffle phases, and moderate CPU utilization and heavy disk I/O during the reduce phases, as shown in Fig. 4.

Key points: map stage, shuffle phases, reduce phases, high, moderate, CPU utilization, disk I/O

The best performance (total running time) of Hadoop workloads is usually obtained by accurately estimating the size of the map output, shuffle data and reduce input data, and properly allocating memory buffers to prevent multiple spilling (to disk) of those data.

Key points: estimating the size of the map output, shuffle data and reduce input data; allocating memory buffers
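
To make the link between buffer size and spilling concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which a map task spills each time its sort buffer fills to the spill threshold; it ignores record metadata overhead and compression, so the numbers are indicative only. The defaults mirror Hadoop's `io.sort.mb` (100 MB) and `io.sort.spill.percent` (0.80); the function name `estimated_spills` is hypothetical.

```python
import math


def estimated_spills(map_output_mb, sort_buffer_mb=100, spill_percent=0.80):
    """Estimate how many times one map task spills its output to disk.

    Simplified model (an assumption of this sketch): a spill is triggered
    whenever the buffer holds sort_buffer_mb * spill_percent of data, and the
    final flush at the end of the map task is counted as one spill.
    """
    usable_mb = sort_buffer_mb * spill_percent
    return max(1, math.ceil(map_output_mb / usable_mb))


# A 200 MB map output with the default 100 MB buffer spills several times;
# a 256 MB buffer lets the same output go to disk in a single spill.
spills_default = estimated_spills(200)
spills_tuned = estimated_spills(200, sort_buffer_mb=256)
```

Under this model, sizing the buffer so that the estimated map output fits below the spill threshold is what avoids the repeated spill-and-merge passes the excerpt refers to.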
