HiBench成长笔记——(7) 阅读《The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis》
《The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis》内容精选
We then evaluate and characterize the Hadoop framework using HiBench, in terms of speed (i.e., job running time), throughput (i.e., the number of tasks completed per minute), HDFS bandwidth, system resource (e.g., CPU, memory and I/O) utilizations, and data access patterns.
关键内容:speed、 throughput、HDFS bandwidth、 system resource、data access patterns
the last one is an enhanced version of the DFSIO benchmark that we have extended to evaluate the aggregated bandwidth delivered by HDFS.
关键内容:evaluate the aggregated bandwidth delivered by HDFS

As shown in Fig. 1, the aggregated throughput curve has a warm-up period and a cool-down period where map tasks are launching up and shutting down respectively. Between these two periods, there is a steady period where the aggregated throughput values are stable across different time slots. Therefore, the Enhanced DFSIO workload computes the aggregated HDFS throughput by averaging the throughput value of each time slot in the steady period. In Enhanced DFSIO, when the number of concurrent map tasks at a time slot is above a specified percentage (e.g., 50% is used in our benchmarking) of the total map task slots in the Hadoop cluster, the slot is considered to be in the steady period.
关键内容:warm-up period、cool-down period、steady period、computes the aggregated HDFS throughput by averaging the throughput value of each time slot in the steady period

In essence, the TeraSort workload is similar to Sort and therefore is I/O bound in nature. However, we have compressed its shuffle data (i.e., map output) in the experiment so as to minimize the disk and network I/O during shuffle, as shown in Table III. Consequently, TeraSort have very high CPU utilization and moderate disk I/O during the map stage and shuffle phases, and moderate CPU utilization and heavy disk I/O during the reduce phases, as shown in Fig. 4.
关键内容:map stage、shuffle phases、reduce phases、high、moderate、CPU utilization、disk I/O
The best performance (total running time) of Hadoop workloads is usually obtained by accurately estimating the size of the map output, shuffle data and reduce input data, and properly allocating memory buffers to prevent multiple spilling (to disk) of those data.
关键内容:estimating the size of the map output、 shuffle data and reduce input data、allocating memory buffers
HiBench成长笔记——(7) 阅读《The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis》的更多相关文章
- HiBench成长笔记——(2) CentOS部署安装HiBench
安装Scala 使用spark-shell命令进入shell模式,查看spark版本和Scala版本: 下载Scala2.10.5 wget https://downloads.lightbend.c ...
- HiBench成长笔记——(3) HiBench测试Spark
很多内容之前的博客已经提过,这里不再赘述,详细内容参照本系列前面的博客:https://www.cnblogs.com/ratels/p/10970905.html 创建并修改配置文件conf/spa ...
- HiBench成长笔记——(1) HiBench概述
测试分类 HiBench共计19个测试方向,可大致分为6个测试类别:分别是micro,ml(机器学习),sql,graph,websearch和streaming. 2.1 micro Benchma ...
- HiBench成长笔记——(5) HiBench-Spark-SQL-Scan源码分析
run.sh #!/bin/bash # Licensed to the Apache Software Foundation (ASF) under one or more # contributo ...
- HiBench成长笔记——(4) HiBench测试Spark SQL
很多内容之前的博客已经提过,这里不再赘述,详细内容参照本系列前面的博客:https://www.cnblogs.com/ratels/p/10970905.html 和 https://www.cnb ...
- HiBench成长笔记——(11) 分析源码run.sh
#!/bin/bash # Licensed to the Apache Software Foundation (ASF) under one or more # contributor licen ...
- HiBench成长笔记——(10) 分析源码execute_with_log.py
#!/usr/bin/env python2 # Licensed to the Apache Software Foundation (ASF) under one or more # contri ...
- HiBench成长笔记——(9) 分析源码monitor.py
monitor.py 是主监控程序,将监控数据写入日志,并统计监控数据生成HTML统计展示页面: #!/usr/bin/env python2 # Licensed to the Apache Sof ...
- HiBench成长笔记——(8) 分析源码workload_functions.sh
workload_functions.sh 是测试程序的入口,粘连了监控程序 monitor.py 和 主运行程序: #!/bin/bash # Licensed to the Apache Soft ...
随机推荐
- python中 yield 的用法 (简单、清晰)
首先我要吐槽一下,看程序的过程中遇见了yield这个关键字,然后百度的时候,发现没有一个能简单的让我懂的,讲起来真TM的都是头头是道,什么参数,什么传递的,还口口声声说自己的教程是最简单的,最浅显易懂 ...
- 2.9 logistic回归中的梯度下降法(非常重要,一定要重点理解)
怎么样计算偏导数来实现logistic回归的梯度下降法 它的核心关键点是其中的几个重要公式用来实现logistic回归的梯度下降法 接下来开始学习logistic回归的梯度下降法 logistic回归 ...
- PAT A1034 Head Of Gang
用并查集分割团伙,判断输出~ #include<bits/stdc++.h> using namespace std; ; },weight[maxn]; unordered_map< ...
- git pull解决冲突
git报错:Please commit your changes or stash them before you merge. 解决:1.不需要保留本地修改的话,直接将有冲突的文件还原再pull:g ...
- 孤荷凌寒自学python第103天认识区块链017
[主要内容] 今天继续分析从github上获取的开源代码怎么实现简单区块链的入门知识,共用时间25分钟. (此外整理作笔记花费了约34分钟) 详细学习过程见文末学习过程屏幕录像. 今天所作的工作是进一 ...
- Markdown 语法使用
Markdown是一种可以使用普通文本编辑器编写的标记语言,通过简单的标记语法,它可以使普通文本内容具有一定的格式.Markdown的语法简洁明了.学习容易,而且功能比纯文本更强,被广泛的应用在博客写 ...
- 「luogu4366」最短路
「luogu4366」最短路 传送门 直接连边显然不行,考虑优化. 根据异或的结合律和交换律等优秀性质,我们每次只让当前点向只有一位之别的另一个点连边,然后就直接跑最短路. 注意点数会很多,所以用配对 ...
- Python连载60-Tkinter布局、按钮以及属性详解
一.Tkinter 1.组件的大致使用步骤 (1)创建总面板 (2)创建面板上的各种组件: i.指定组件的父组件,即依附关系:ii.利用相应的属性对组件进行设置:iii.给组件安排布局. (3)同步 ...
- luffy 那点事
1 虚拟环境创建 2 后台:Django项目创建 3 后台配置 4 数据库配置 5 user模块User表 6 前台 7 前台配置 8 前端主页 9 后端主页模块设计 10 xadmin 后台管理 1 ...
- display:flex下子元素宽度无效
在子元素上设置: width:60px; flex-shrink:0;