Two traps in iostat: %util and svctm
iostat, from the excellent sysstat suite of utilities, is the go-to tool for evaluating IO performance on Linux. It's obvious why that's the case: sysstat is very useful, solid, and widely installed. System administrators could do a lot worse than taking a look at iostat -x. There are, however, some serious caveats lurking in iostat's output, two of which are greatly magnified on newer machines with solid state drives.
To explain what's wrong, let me compare two lines of iostat output:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
sdd 0.00 0.00 13823.00 0.00 55292.00 0.00 8.00
avgqu-sz await r_await w_await svctm %util
0.78 0.06 0.06 0.00 0.06 78.40
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
sdd 0.00 0.00 72914.67 0.00 291658.67 0.00 8.00
avgqu-sz await r_await w_await svctm %util
15.27 0.21 0.21 0.00 0.01 100.00
Both of these lines are from the same device (a Samsung 840 EVO SSD), and both are from read-only 4kB random loads. What differs here is the level of parallelism: in the first load the mean queue depth is only 0.78, and in the second it's 15.27. Same pattern, more threads.
The first problem we run into with this output is the svctm field, widely taken to be the average amount of time an operation takes. The iostat man page describes it as:
The average service time (in milliseconds) for I/O requests that were issued to the device.
and goes on to say:
The average service time (svctm field) value is meaningless, as I/O statistics are now calculated at block level, and we don't know when the disk driver starts to process a request.
The reasons the man page states for this field being meaningless are true, as are the warnings in the sysstat code. The calculation behind svctm is fundamentally broken, and doesn't really have a clear meaning. Inside iostat, svctm in an interval is calculated as (time the device was doing some work) / (number of IOs), that is, the amount of time we were doing work divided by the amount of work done. Going back to our two workloads, we can compare their service times:
svctm
0.06
0.01
Taken literally, this means the device was responding in 60µs when under little load, and 10µs when under a lot of load. That seems unlikely, and indeed the load generator fio tells us it's not true. So what's going on?
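Before answering that, it helps to see that the two svctm values can be reproduced from the other columns alone. The following is a minimal sketch of the "busy time / number of IOs" arithmetic described above, not iostat's actual source; the function name svctm_ms is made up for illustration, and the inputs are the %util, r/s and w/s values from the two samples.

```python
def svctm_ms(util_percent, reads_per_s, writes_per_s):
    """Approximate svctm: busy milliseconds per second / IOs per second.

    Hypothetical helper for illustration; it mirrors the formula described
    in the text, not the exact code inside iostat.
    """
    busy_ms_per_s = util_percent / 100.0 * 1000.0
    ios_per_s = reads_per_s + writes_per_s
    if ios_per_s == 0:
        return 0.0
    return busy_ms_per_s / ios_per_s


# Values taken from the two iostat samples above.
low = svctm_ms(78.40, 13823.00, 0.00)
high = svctm_ms(100.00, 72914.67, 0.00)
print(f"{low:.2f} {high:.2f}")  # prints "0.06 0.01"
```

Nothing in this calculation knows how many requests were in flight at once, which is why the result shrinks as the load grows.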

Magnetic hard drives are serial beings. They have a few tiny heads, ganged together, that move over a spinning platter to a single location where they do some IO. Once the IO is done, and no sooner, they move on. Over the years, they've gathered some shiny capabilities like NCQ and TCQ that make them appear parallel (mostly to allow reordering), but they're still the same old horse-and-carriage sequential devices they've always been. Modern hard drives expose some level of concurrency, but no true parallelism. SSDs, like the Samsung 840 EVO in this test, are different. SSDs can and do perform operations in parallel. In fact, the only way to achieve their peak performance is to offer them parallel work to do.
While SSDs vary in the details of their internal construction, most have the ability to access multiple flash packages (groups of chips) at a time. This is a big deal for SSD performance. Individual flash chips actually don't have great bandwidth, so the ability to group the performance of many chips together is essential. The chips are completely independent, and because the controller doesn't need to block on requests to the chip, the drive is truly doing multiple things at once. Without the single electromechanical head as a bottleneck, even single SSDs can have a fairly large amount of internal parallelism. This diagram from Agrawal et al. shows the high-level architecture:

If Jane does one thing at a time, and doing ten things takes Jane 20 minutes, each thing has taken Jane an average of two minutes. The mean time between asking Jane to do something and Jane completing it is two minutes. Alice, like Jane, can do ten things in twenty minutes, but she works on multiple things in parallel. Looking only at Alice's throughput (the number of things she gets done in a period of time), what can we say about Alice's latency (the amount of time it takes her from start to finish for a task)? Very little. We know it's no more than 20 minutes. If she's busy the whole time, we know it's 2 or more minutes. That's it.
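The missing piece is concurrency. Little's Law relates the three quantities: mean latency equals the mean number of things in flight divided by throughput. The sketch below applies it to Jane and Alice, and then to the two iostat samples, under the assumption that avgqu-sz is a reasonable estimate of the mean number of requests in flight. The function name mean_latency is hypothetical.

```python
def mean_latency(mean_in_flight, throughput):
    """Little's Law: mean latency = mean items in flight / throughput."""
    return mean_in_flight / throughput


# Jane and Alice both complete 10 tasks in 20 minutes.
tasks_per_minute = 10 / 20

jane_minutes = mean_latency(1, tasks_per_minute)   # strictly one at a time
alice_minutes = mean_latency(4, tasks_per_minute)  # if she averages 4 in flight
print(jane_minutes, alice_minutes)  # prints "2.0 8.0"

# The same relation applied to the iostat samples (avgqu-sz and r/s),
# converted from seconds to milliseconds.
low_ms = mean_latency(0.78, 13823.00) * 1000.0
high_ms = mean_latency(15.27, 72914.67) * 1000.0
print(f"{low_ms:.2f} {high_ms:.2f}")  # prints "0.06 0.21"
```

The two results line up with the await column (0.06 and 0.21), not with svctm, which is what you would expect from a figure that accounts for queue depth.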
Let's go back to that iostat output:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
sdd 0.00 0.00 13823.00 0.00 55292.00 0.00 8.00
avgqu-sz await r_await w_await svctm %util
0.78 0.06 0.06 0.00 0.06 78.40
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
sdd 0.00 0.00 72914.67 0.00 291658.67 0.00 8.00
avgqu-sz await r_await w_await svctm %util
15.27 0.21 0.21 0.00 0.01 100.00
What's going on with %util, then? The first line is telling us that the drive is 78.4% utilized at 13823 reads per second. The second line is telling us that the drive is 100% utilized at 72914 reads per second. If it takes 14 thousand to fill it to 78.4%, wouldn't we expect it to only be able to do 18 thousand in total? How is it doing 73 thousand?
The problem here is the same - parallelism. When iostat says %util, it means "Percentage of CPU time during which I/O requests were issued to the device". The percentage of the time the drive was doing at least one thing. If it's doing 16 things at the same time, that doesn't change. Once again, this calculation works just fine for magnetic drives (and Jane), which do only one thing at a time. The amount of time they spend doing one thing is a great indication of how busy they really are. SSDs (and RAIDs, and Alice), on the other hand, can do multiple things at once. If you can do multiple things in parallel, the percentage of time you're doing at least one thing isn't a great predictor of your performance potential. The iostat man page does provide a warning:
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
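To make the effect concrete, here is a small sketch that computes what a "time busy with at least one request" measurement sees for a few made-up request timelines. It is an illustration of the reasoning above, not iostat's implementation; busy_time and iostat_view are hypothetical names, and the timelines are invented (times in milliseconds, over a 10 ms window).

```python
def busy_time(intervals):
    """Total time covered by at least one (start, end) interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this request: close the previous busy period.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps (or touches) the current busy period: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def iostat_view(intervals, window):
    """Return (%util, svctm, true mean latency) for one sampling window."""
    busy = busy_time(intervals)
    util = 100.0 * busy / window
    svctm = busy / len(intervals)
    true_latency = sum(end - start for start, end in intervals) / len(intervals)
    return util, svctm, true_latency


serial = [(0, 2), (2, 4), (4, 6), (6, 8)]  # one request at a time
parallel_4 = [(0, 8)] * 4                  # four requests overlapping
parallel_16 = [(0, 8)] * 16                # sixteen requests overlapping

print(iostat_view(serial, 10))       # (80.0, 2.0, 2.0)
print(iostat_view(parallel_4, 10))   # (80.0, 2.0, 8.0)
print(iostat_view(parallel_16, 10))  # (80.0, 0.5, 8.0)
```

All three timelines report 80% utilization. For the serial device, svctm matches the real latency. For the parallel ones it does not, and quadrupling the work done moves %util not at all while svctm falls.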
As a measure of general IO busyness %util is fairly handy, but as an indication of how much the system is doing compared to what it can do, it's terrible. Iostat's svctm has even fewer redeeming strengths. It's just extremely misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and their use should be treated with extreme care.
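%util is calculated as: time during which IO requests were being processed / the iostat sampling interval. It represents the percentage of time during which IO requests were being processed on the device. SSDs process IO requests in parallel, so a %util above 90 does not mean that only 10% of the drive's capacity is left.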
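svctm is calculated as: time the device was busy / number of IOs completed during the iostat interval. svctm originally meant the time an IO request spends being serviced by the device. However, the data source, /proc/diskstats, is now collected at the block layer, so the true queueing time and service time of an IO cannot be obtained.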
Original article: http://brooker.co.za/blog/2014/07/04/iostat-pct.html