HiBench Growth Notes (7): Reading "The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis"
Selected excerpts from "The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis"
We then evaluate and characterize the Hadoop framework using HiBench, in terms of speed (i.e., job running time), throughput (i.e., the number of tasks completed per minute), HDFS bandwidth, system resource (e.g., CPU, memory and I/O) utilizations, and data access patterns.
Key points: speed (job running time), throughput (tasks completed per minute), HDFS bandwidth, system resource utilization, data access patterns
the last one is an enhanced version of the DFSIO benchmark that we have extended to evaluate the aggregated bandwidth delivered by HDFS.
Key points: Enhanced DFSIO evaluates the aggregated bandwidth delivered by HDFS

As shown in Fig. 1, the aggregated throughput curve has a warm-up period and a cool-down period where map tasks are launching up and shutting down respectively. Between these two periods, there is a steady period where the aggregated throughput values are stable across different time slots. Therefore, the Enhanced DFSIO workload computes the aggregated HDFS throughput by averaging the throughput value of each time slot in the steady period. In Enhanced DFSIO, when the number of concurrent map tasks at a time slot is above a specified percentage (e.g., 50% is used in our benchmarking) of the total map task slots in the Hadoop cluster, the slot is considered to be in the steady period.
Key points: warm-up period, cool-down period, steady period; the aggregated HDFS throughput is computed by averaging the throughput value of each time slot in the steady period
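
The steady-period averaging described above can be sketched in a few lines of Python. This is only an illustration of the rule quoted from the paper, not the actual Enhanced DFSIO code: the function name, the argument names, and the sample numbers are all made up for this note, and "above a specified percentage" is read here as strictly greater than.

```python
from typing import Sequence


def steady_state_throughput(
    throughput_per_slot: Sequence[float],
    concurrent_maps_per_slot: Sequence[int],
    total_map_slots: int,
    threshold: float = 0.5,
) -> float:
    """Average the per-slot throughput over the steady period only.

    A time slot counts as steady when its number of concurrent map tasks
    is above `threshold` times the cluster's total map task slots
    (the paper uses 50% in its benchmarking).
    """
    if len(throughput_per_slot) != len(concurrent_maps_per_slot):
        raise ValueError("both sequences must cover the same time slots")

    min_maps = threshold * total_map_slots
    steady = [
        t
        for t, maps in zip(throughput_per_slot, concurrent_maps_per_slot)
        if maps > min_maps
    ]
    if not steady:
        raise ValueError("no time slot reached the steady period")
    return sum(steady) / len(steady)


# Hypothetical sample: 7 time slots on a cluster with 16 map task slots.
throughput = [10.0, 40.0, 80.0, 82.0, 78.0, 35.0, 5.0]  # e.g. MB/s per slot
maps = [2, 6, 16, 16, 15, 7, 1]                         # concurrent map tasks
result = steady_state_throughput(throughput, maps, total_map_slots=16)
print(result)  # 80.0: only the three middle slots exceed 8 concurrent maps
```

The warm-up slots (2 and 6 maps) and the cool-down slots (7 and 1 maps) are dropped, so the ramp at either end does not drag the average down.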

In essence, the TeraSort workload is similar to Sort and therefore is I/O bound in nature. However, we have compressed its shuffle data (i.e., map output) in the experiment so as to minimize the disk and network I/O during shuffle, as shown in Table III. Consequently, TeraSort have very high CPU utilization and moderate disk I/O during the map stage and shuffle phases, and moderate CPU utilization and heavy disk I/O during the reduce phases, as shown in Fig. 4.
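Key points: with compressed shuffle data, TeraSort shows very high CPU utilization and moderate disk I/O during the map stage and shuffle phases, and moderate CPU utilization and heavy disk I/O during the reduce phases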
The best performance (total running time) of Hadoop workloads is usually obtained by accurately estimating the size of the map output, shuffle data and reduce input data, and properly allocating memory buffers to prevent multiple spilling (to disk) of those data.
Key points: estimate the sizes of the map output, shuffle data, and reduce input data, then allocate memory buffers accordingly to avoid spilling that data to disk multiple times
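
To make the buffer-sizing point concrete, here is a rough back-of-the-envelope sketch. It is my own simplification, not a formula from the paper: it assumes a map task spills once each time the usable part of its sort buffer fills up, and it ignores per-record metadata overhead and compression. The defaults mirror the Hadoop 2.x properties `mapreduce.task.io.sort.mb` (100) and `mapreduce.map.sort.spill.percent` (0.80); the function name and sample sizes are hypothetical.

```python
import math


def estimated_spills(
    map_output_mb: float,
    sort_buffer_mb: float = 100.0,  # mapreduce.task.io.sort.mb
    spill_percent: float = 0.80,    # mapreduce.map.sort.spill.percent
) -> int:
    """Roughly estimate how many times one map task spills to disk.

    Simplified model: every time the usable share of the sort buffer
    fills, its contents are spilled once. Metadata overhead is ignored.
    """
    if map_output_mb < 0 or sort_buffer_mb <= 0 or not 0 < spill_percent <= 1:
        raise ValueError("invalid size or spill percentage")
    if map_output_mb == 0:
        return 0
    usable_mb = sort_buffer_mb * spill_percent
    return math.ceil(map_output_mb / usable_mb)


# Hypothetical map task producing 200 MB of output.
print(estimated_spills(200))                      # 3 spills with the default buffer
print(estimated_spills(200, sort_buffer_mb=256))  # 1 spill with a larger buffer
```

Under this model, a buffer large enough to hold the whole estimated map output leaves a single spill, which is the situation the quoted passage aims for.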