[转帖]CPU Utilization is Wrong
CPU Utilization is Wrong
09 May 2017
The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used everywhere, by everyone. In every performance monitoring product. In top(1).
What you may think 90% CPU utilization means:
What it might really mean:
Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you're mostly stalled, but don't know it.
What does this mean for you? Understanding how much your CPUs are stalled can direct performance tuning efforts between reducing code or reducing memory I/O. Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.
What really is CPU Utilization?
The metric we call CPU utilization is really "non-idle time": the time the CPU was not running the idle thread. Your operating system kernel (whatever it is) usually tracks this during context switch. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized that entire time.
This metric is as old as time sharing systems. The Apollo Lunar Module guidance computer (a pioneering time sharing system) called its idle thread the "DUMMY JOB", and engineers tracked cycles running it vs real tasks as a important computer utilization metric. (I wrote about this before.)
So what's wrong with this?
Nowadays, CPUs have become much faster than main memory, and waiting on memory dominates what is still called "CPU utilization". When you see high %CPU in top(1), you might think of the processor as being the bottleneck – the CPU package under the heat sink and fan – when it's really those banks of DRAM.
This has been getting worse. For a long time processor manufacturers were scaling their clockspeed quicker than DRAM was scaling its access latency (the "CPU DRAM gap"). That levelled out around 2005 with 3 GHz processors, and since then processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all putting more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches, and faster memory busses and interconnects. But we're still usually stalled.
How to tell what the CPUs are really doing
By using Performance Monitoring Counters (PMCs): hardware counters that can be read using Linux perf, and other tools. For example, measuring the entire system for 10 seconds:
# perf stat -a -- sleep 10 Performance counter stats for 'system wide': 641398.723351 task-clock (msec) # 64.116 CPUs utilized (100.00%)
379,651 context-switches # 0.592 K/sec (100.00%)
51,546 cpu-migrations # 0.080 K/sec (100.00%)
13,423,039 page-faults # 0.021 M/sec
1,433,972,173,374 cycles # 2.236 GHz (75.02%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,118,336,816,068 instructions # 0.78 insns per cycle (75.01%)
249,644,142,804 branches # 389.218 M/sec (75.01%)
7,791,449,769 branch-misses # 3.12% of all branches (75.01%) 10.003794539 seconds time elapsed
The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions we were completed for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as 4-wide, referring to the instruction fetch/decode path. Which means, the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system, means the CPUs are running at 19.5% their top speed. Newer Intel processors may move to 5-wide.
There are hundreds more PMCs you can use to dig further: measuring stalled cycles directly by different types.
In the cloud
If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about The PMCs of EC2: Measuring IPC, showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.
Interpretation and actionable items
If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
What performance monitoring products should tell you
Every performance tool should show IPC along with %CPU. Or break down %CPU into instruction-retired cycles vs stalled cycles, eg, %INS and %STL.
As for top(1), there is tiptop(1) for Linux, which shows IPC by process:
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND
3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java
1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet
900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo
Other reasons CPU utilization is misleading
It's not just memory stall cycles that makes CPU utilization misleading. Other factors include:
- Temperature trips stalling the processor.
- Turboboost varying the clockrate.
- The kernel varying the clock rate with speed step.
- The problem with averages: 80% utilized over 1 minute, hiding bursts of 100%.
- Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.
Update: is CPU utilization actually wrong?
There have been hundreds of comments on this post, here (below) and elsewhere (1, 2). Thanks to everyone for taking the time and the interest in this topic. To summarize my responses: I'm not talking about iowait at all (that's disk I/O), and there are actionable items if you know you are memory bound (see above).
But is CPU utilization actually wrong, or just deeply misleading? I think many people interpret high %CPU to mean that the processing unit is the bottleneck, which is wrong (as I said earlier). At that point you don't yet know, and it is often something external. Is the metric technically correct? If the CPU stall cycles can't be used by anything else, aren't they are therefore "utilized waiting" (which sounds like an oxymoron)? In some cases, yes, you could say that %CPU as an OS-level metric is technically correct, but deeply misleading. With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as utilized that are in fact available. That's wrong. In this post I wanted to focus on the interpretation problem and suggested solutions, but yes, there are technical problems with this metric as well.
You might just say that utilization as a metric was already broken, as Adrian Cockcroft discussed previously.
Conclusion
CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. Perhaps %CPU should be renamed to %CYC, short for cycles. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. I covered IPC in my previous post, including an introduction to the Performance Monitoring Counters (PMCs) needed to measure it.
Performance monitoring products that show %CPU – which is all of them – should also show PMC metrics to explain what that means, and not mislead the end user. For example, they can show %CPU with IPC, and/or instruction-retired cycles vs stalled cycles. Armed with these metrics, developers and operators can choose how to better tune their applications and systems.
[转帖]CPU Utilization is Wrong的更多相关文章
- How do I Find Out Linux CPU Utilization?
From:http://www.cyberciti.biz/tips/how-do-i-find-out-linux-cpu-utilization.html Whenever a Linux sys ...
- 压力测试衡量CPU的三个指标:CPU Utilization、Load Average和Context Switch Rate
分类: 4.软件设计/架构/测试 2010-01-12 19:58 34241人阅读 评论(4) 收藏 举报 测试loadrunnerlinux服务器firebugthread 上篇讲如何用LoadR ...
- Zabbix CPU utilization监控参数
工作中查看Zabbix linux 监控项的时候对linux 监控的cpu使用的各个参数没怎么明白,特意查看了下资料 Zabbix linux模板下的CPU utilization是自带的监控Linu ...
- 【每日一摩斯】-Troubleshooting: High CPU Utilization (164768.1) - 系列6
如果问题是一个正运行的缓慢的查询SQL,那么就应该对该查询进行调优,避免它耗费过高的CPU资源.如果它做了许多的hash连接和全表扫描,那么就应该添加索引以提高效率. 下面的文章可以帮助判断查询的问题 ...
- 【每日一摩斯】-Troubleshooting: High CPU Utilization (164768.1) - 系列5
Oracle(用户)进程 以下这些操作都是需要消耗大量CPU资源的:解析大型查询,存储过程编译或执行,空间管理和排序. 下面这几篇文章可以帮助采集关于使用高CPU资源的进程的更多信息: Note:35 ...
- 【每日一摩斯】-Troubleshooting: High CPU Utilization (164768.1) - 系列4
Jobs (CJQ0, Jn, SNPn) Job进程运行用户定义的以及系统定义的类似于batch的任务.检查Job进程占用大量CPU资源的方法,就像检查用户进程一样. 可以根据以下视图检查Job进程 ...
- [转帖]CPU Cache 机制以及 Cache miss
CPU Cache 机制以及 Cache miss https://www.cnblogs.com/jokerjason/p/10711022.html CPU体系结构之cache小结 1.What ...
- [转帖]CPU 的缓存
缓存这个词想必大家都听过,其实缓存的意义很广泛:电脑整机最大的缓存可以体现为内存条.显卡上的显存就是显卡芯片所需要用到的缓存.硬盘上也有相对应的缓存.CPU有着最快的缓存(L1.L2.L3缓存等),缓 ...
- [转帖]CPU时间片
CPU时间片 https://www.cnblogs.com/xingzc/p/6077214.html CPU的时间片 CPU的利用率好CPU的 load average 是不一样的 Conntex ...
- [转帖]震惊,用了这么多年的 CPU 利用率,其实是错的
震惊,用了这么多年的 CPU 利用率,其实是错的 2018年12月22日 08:43:09 Linuxer_ 阅读数:50 https://blog.csdn.net/juS3Ve/article/d ...
随机推荐
- MySQL篇:第二章_初识MySQL
初始MySQL MySQL的背景 1.前身属于瑞典的一家公司,MySQL AB 2.08年被sun公司收购 3.09年sun被oracle收购 MySQL的优点 1.开源.免费.成本低 2.性能高.移 ...
- 云图说|什么是可信智能计算服务TICS
阅识风云是华为云信息大咖,擅长将复杂信息多元化呈现,其出品的一张图(云图说).深入浅出的博文(云小课)或短视频(云视厅)总有一款能让您快速上手华为云.更多精彩内容请单击此处. 本文分享自华为云社区&l ...
- 详解GaussDB(DWS)用户监控原理及应用
摘要:本文将聚焦于用户监控的原理及应用进行介绍. 本文分享自华为云社区<GaussDB(DWS)监控工具指南(二)用户级监控>,作者:幕后小黑爪 . 前言 资源监控是整个运维乃至整个产品生 ...
- 数据库技术丨GaussDB(DWS)数据同步状态查看方法
摘要:针对数据同步状态查看方法,GaussDB(DWS)提供了丰富的系统函数.视图.工具等可以直观地对同步进度进行跟踪,尤其是为方便定位人员使用,gs_ctl工具已集合了大部分相关系统函数的调用,可做 ...
- 常见的6种MySQL约束
摘要:一篇文章带你彻底了解MySQL各种约束 MySQL约束 <1> 概念 是一种限制,它是对表的行和列的数据做出约束,确保表中数据的完整性和唯一性. <2> 使用场景 创建表 ...
- 云图说|将源端MongoDB业务搬迁至华为云DDS的几种方式
摘要:华为云文档数据库服务DDS能帮您在业务需要时,快速便捷的搬迁源端MongoDB业务上云. 如果您因业务调整或需要使用华为云文档数据库DDS特性功能时,可以通过数据迁移功能将原有MongoDB数据 ...
- 想做DBA,多租户管理你一定要知道这些
摘要:多租户为满足客户混合负载处理需求而生,通过提供两层用户机制,分层资源隔离,满足客户对计算和存储资源的自主控制需求. 本文分享自华为云社区<关于GaussDB(DWS)多租户管理,这些你一定 ...
- 云小课|细数那些VMware虚拟机的恢复招式
阅识风云是华为云信息大咖,擅长将复杂信息多元化呈现,其出品的一张图(云图说).深入浅出的博文(云小课)或短视频(云视厅)总有一款能让您快速上手华为云.更多精彩内容请单击此处. 摘要:当遭遇误操作.病毒 ...
- 云原生时代,政企混合云场景IT监控和诊断的难点和应对之道
摘要:正是因为政企IT架构云化的云原生架构,相比之前的单体烟囱式架构,在监控诊断方面有着更多的难点和挑战,这也在业界催生出大量相关的标准和工具. 本文分享自华为云社区<[华为云Stack][大架 ...
- 处理开发者账号到期导致APP下架的方处理开发者账号到期导致APP下架的方法
开发人员账号到期时,应采取以下步骤处理APP被下架问题: 登录开发者账号. 点击右上角的"账户",选择"续费". 输入信用卡信息,确保使用支持Visa的银行 ...