[Repost] C2C - False Sharing Detection in Linux Perf
https://joemario.github.io/blog/2016/09/01/c2c-blog/
Do you run your application in a NUMA environment? Is it multi-threaded? Is it multi-process with shared memory? If so, is your performance impacted by false sharing?
Now there’s a way to easily find out. We’re posting patches for a new feature to the Linux perf tool, called “c2c” for cache-2-cache.
We at Red Hat have been running the development prototype of c2c on lots of big Linux applications and it’s uncovered many hot false sharing cachelines.
I’ve been playing with this tool quite a bit. It is pretty cool. Let me share a little about what it is and how to use it.
At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples for a cacheline came from, and which cpus were involved.
Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record”, then generate a report output with “perf c2c report”.
Before covering the output data, here is a “how to” for the flags to use when calling “perf c2c”:
c2c usage flags
Then here’s an output file from a recent “perf c2c” run I did:
c2c output file
And, if you want to play with it yourself, here’s a simple source file to generate lots of false sharing.
False sharing .c src file
First I’ll go over the output file to highlight the interesting fields.
This first table in the output file gives a high level summary of all the load and store samples collected. It is interesting to see where your program’s load instructions got their data.
Notice the term “HITM”, which stands for a load that hit in a modified cacheline. That’s the key indicator that false sharing has occurred. Remote HITMs, meaning hits across numa nodes, are the most expensive - especially when there are lots of readers and writers.
1 =================================================
2 Trace Event Information
3 =================================================
4 Total records : 329219 << Total loads and stores sampled.
5 Locked Load/Store Operations : 14654
6 Load Operations : 69679 << Total loads
7 Loads - uncacheable : 0
8 Loads - IO : 0
9 Loads - Miss : 3972
10 Loads - no mapping : 0
11 Load Fill Buffer Hit : 11958
12 Load L1D hit : 17235 << loads that hit in the L1 cache.
13 Load L2D hit : 21
14 Load LLC hit : 14219 << loads that hit in the last level cache (LLC).
15 Load Local HITM : 3402 << loads that hit in a modified cache on the same numa node (local HITM).
16 Load Remote HITM : 12757 << loads that hit in a modified cache on a remote numa node (remote HITM).
17 Load Remote HIT : 5295
18 Load Local DRAM : 976 << loads that hit in the local node's main memory.
19 Load Remote DRAM : 3246 << loads that hit in a remote node's main memory.
20 Load MESI State Exclusive : 4222
21 Load MESI State Shared : 0
22 Load LLC Misses : 22274 << loads not found in any local node caches.
23 LLC Misses to Local DRAM : 4.4% << % hitting in local node's main memory.
24 LLC Misses to Remote DRAM : 14.6% << % hitting in a remote node's main memory.
25 LLC Misses to Remote cache (HIT) : 23.8% << % hitting in a clean cache in a remote node.
26 LLC Misses to Remote cache (HITM) : 57.3% << % hitting in remote modified cache. (most expensive - false sharing)
27 Store Operations : 259539 << store instruction sample count
28 Store - uncacheable : 0
29 Store - no mapping : 11
30 Store L1D Hit : 256696 << stores that got L1 cache when requested.
31 Store L1D Miss : 2832 << stores that couldn't get the L1 cache when requested (L1 miss).
32 No Page Map Rejects : 2376
33 Unable to parse data source : 1
The second table (below) in the output file gives a brief one-line summary of the hottest cachelines where false sharing was detected. It’s sorted by which cacheline had the most remote HITMs (or local HITMs if you select that sort option). It gives a nice high-level sense of the load and store activity for each cacheline.
I look to see if a cacheline has a high number of “Rmt LLC Load Hitm’s”. If so, it’s time to dig further.
54 =================================================
55 Shared Data Cache Line Table
56 =================================================
57 #
58 # Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
59 # Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
60 # ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........
61 #
62 0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468 727 2657 13747 40400 5355 16154 0 2875 529
63 1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65 200 3749 12128 5096 108 0 2056 652
64 2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1 1 15 99 25 50 0 6 1
65 3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7 20 156 50 59 0 27 4
66 4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10 25 62 11 1 0 24 7
Next is the Pareto table, which shows lots of valuable information about each contended cacheline. This is the most important table in the output. I only show three cachelines here to keep this blog simple. Here’s what’s in it.
* Lines 71 and 72 are the column headers for what’s happening in each cacheline.
* Line 76 shows the HITM and store activity for each cacheline - first with counts for the load
and store activity, followed by the cacheline’s virtual data address.
* Then there’s the data address column. Each row underneath line 76 represents the offset
into the cacheline where those accesses occurred.
* The next column shows the pid, and/or the thread id (tid) if you selected that for the output.
* Following is the instruction pointer code address.
* Next are three columns showing the average load latencies. I always look here for long
latency averages, which are a sign of how painful the contention on that cacheline was.
* The “cpu cnt” column shows how many different cpus the samples came from.
* Then there’s the function name, binary object name, source file and line number.
* The last column shows for each node, the specific cpus that samples came from.
67 =================================================
68 Shared Cache Line Distribution Pareto
69 =================================================
70 #
71 # ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- cpu Shared
72 # Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl hitm load cnt Symbol Object Source:Line Node{cpu list}
73 # ..... ....... ....... ....... ....... .................. ....... .................. ........ ........ ........ ........ ................... .................... ........................... ....
74 #
75 -------------------------------------------------------------
76 0 9834 2269 109036 468 0x602180
77 -------------------------------------------------------------
78 65.51% 55.88% 75.20% 0.00% 0x0 14604 0x400b4f 27161 26039 26017 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:144 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
79 0.41% 0.35% 0.00% 0.00% 0x0 14604 0x400b56 18088 12601 26671 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
80 0.00% 0.00% 24.80% 100.00% 0x0 14604 0x400b61 0 0 0 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
81 7.50% 9.92% 0.00% 0.00% 0x20 14604 0x400ba7 2470 1729 1897 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:154 1{122} 2{144}
82 17.61% 20.89% 0.00% 0.00% 0x28 14604 0x400bc1 2294 1575 1649 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:158 2{53} 3{170}
83 8.97% 12.96% 0.00% 0.00% 0x30 14604 0x400bdb 2325 1897 1828 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:162 0{96} 3{171}
84 -------------------------------------------------------------
85 1 2832 1119 0 0 0x602100
86 -------------------------------------------------------------
87 29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964 1230 1788 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:155 1{122} 2{144}
88 43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274 1566 1793 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:159 2{53} 3{170}
89 27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045 1247 2011 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:163 0{96} 3{171}
90 -------------------------------------------------------------
91 2 12 3 161 0 0xffff883ffb6a7e80
92 -------------------------------------------------------------
93 58.33% 100.00% 0.00% 0.00% 0x0 14604 0xffffffff810cf16d 1380 941 1229 9 [k] task_tick_fair [kernel.kallsyms] atomic64_64.h:21 0{0,4,96} 1{25,120,122} 2{53} 3{170-171}
94 16.67% 0.00% 98.76% 0.00% 0x0 14604 0xffffffff810c9379 1794 0 625 13 [k] update_cfs_rq_blocked_load [kernel.kallsyms] atomic64_64.h:45 0{1,4,96} 1{25,120,122} 2{48,53-54,144} 3{169-171}
95 16.67% 0.00% 0.00% 0.00% 0x0 14604 0xffffffff810ce098 1382 0 867 12 [k] update_cfs_shares [kernel.kallsyms] atomic64_64.h:21 0{1,4,96} 1{25,120,122} 2{53-54,144} 3{169-171}
96 8.33% 0.00% 0.00% 0.00% 0x8 14604 0xffffffff810cf18c 2560 0 679 8 [k] task_tick_fair [kernel.kallsyms] atomic.h:26 0{4,96} 1{24-25,120,122} 2{54} 3{170}
97 0.00% 0.00% 1.24% 0.00% 0x8 14604 0xffffffff810cf14f 0 0 0 2 [k] task_tick_fair [kernel.kallsyms] atomic.h:50 2{48,53}
How I often use “perf c2c”
Here are the flags I most commonly use.
perf c2c record -F 60000 -a --all-user sleep 5
perf c2c record -F 60000 -a --all-user sleep 3 // or to sample for a shorter time.
perf c2c record -F 60000 -a --all-kernel sleep 3 // or to only gather kernel samples.
perf c2c record -F 60000 -a --all-user --ldlat 50 sleep 3 // or to collect only loads >= 50 cycles of load latency (30 is the ldlat default).
To generate report files, you can use the graphical tui report or send the output to stdout:
perf c2c report -NN -c pid,iaddr // to use the tui interactive report
perf c2c report -NN -c pid,iaddr --stdio // or to send the output to stdout
perf c2c report -NN -d lcl -c pid,iaddr --stdio // or to sort on local hitms
By default, symbol names are truncated to a fixed width - for readability.
You can use the "--full-symbols" flag to get full symbol names in the output.
For example:
perf c2c report -NN -c pid,iaddr --full-symbols --stdio
Finding the callers to these cachelines:
Sometimes it’s valuable to know who the callers are. Here is how to get call graph information.
I never generate call graph info initially, because it emits so much data that it becomes very difficult to see if and where a false sharing problem exists. I find the problem first without call graphs, then if needed I’ll rerun with call graphs.
perf c2c record --call-graph dwarf,8192 -F 60000 -a --all-user sleep 5
perf c2c report -NN -g --call-graph -c pid,iaddr --stdio
Does bumping perf’s sample rate help?
I’ll sometimes bump the perf sample rate with “-F 60000” or “-F 80000”.
There’s no requirement to do so, but it is a good way to get a richer sample collection in a shorter period of time. If you do, it’s helpful to bump the kernel’s perf sample rate up with the following two echo commands. (see dmesg for “perf interrupt took too long …” sample lowering entries).
echo 500 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
<then do your "perf c2c record" here>
echo 50 > /proc/sys/kernel/perf_cpu_time_max_percent
What to do when perf drowns in excessive samples:
When running on larger systems (e.g. 4, 8 or 16 socket systems), there can be so many samples that the perf tool can consume lots of cpu time and the perf.data file size grows significantly.
Some tips to help that include:
- Bump the ldlat from the default of 30 to 50. This frees perf to skip the faster, non-interesting loads.
- Lower the sample rate.
- Shorten the sleep time during the “perf record” window. For example, from “sleep 5” to “sleep 3”.
What I’ve learned by using C2C on numerous applications:
It’s common to look at any performance tool output and ask ‘what does all this data mean?’.
Here are some things I’ve learned. Hopefully they’re of help.
* I tend to run “perf c2c” for 3, 5, or 10 seconds. Running it any longer may take you
from seeing concurrent false sharing to seeing cacheline accesses which are
disjoint in time.
* If you’re not interested in kernel samples, you’ll get better samples in your program by
specifying --all-user. Conversely, specifying --all-kernel is useful when focusing on the
kernel.
* On busy systems with high cpu counts, like >148 cpus, setting --ldlat to a higher value
(like 50 or even 70) may enable perf to generate richer C2C samples.
* Look at the Trace Event table at the top, specifically the “LLC Misses to Remote cache HITM”
number. If it’s not close to zero, then there’s likely worthwhile false sharing to pursue resolving.
* Most of the time the top one, two, or three cachelines in the Shared Cache Line Distribution
Pareto table are the ones to focus on.
* However, sometimes you’ll see the same code from multiple threads causing “less hot”
contention across multiple cachelines for different data addresses. Even though each of
those lines is less hot individually, fixing them is often a win because the benefit is
spread across many cachelines. This can also happen with different processes executing
the same code and accessing shared memory.
* In the Shared Cache Line Distribution Pareto table, if you see long average load latencies,
it’s often a giveaway that false sharing contention is heavy and is hurting performance.
* Then looking to see what nodes and cpus the samples for those accesses are coming from
can often be a valuable guide to numa-pinning your processes or memory.
For processes using shared memory, it is possible for them to use different virtual addresses,
all pointing to (and contending with) the same shared memory location. They will show
up in the Pareto table as different cachelines, but in fact they are the same cacheline.
These can be tricky to spot. I usually uncover these by first looking to see that shared memory is being used, and then looking for similar patterns in the information provided
for each cacheline.
Last, the Shared Cache Line Distribution Pareto table can also provide great insight into any
ill-aligned hot data.
For example:
* It’s easy to spot heavily modified variables that need to be placed into their own cachelines.
This will enable them to be less contended (and run faster), and it will help accesses to
the other variables that shared their cacheline to not be slowed down.
* It’s easy to spot hot locks or mutexes that are unaligned and spill into multiple cachelines.
* It’s easy to spot “read mostly” variables which can be grouped together into their own
cachelines.
The raw samples can be helpful.
I’ve often found it valuable to take a peek at the raw instruction samples contained in the perf.data file (the one generated by the “perf c2c record”). You can get those raw samples using “perf script”. See man perf-script. The output may be cryptic, but you can sort on the load weight (5th column) to see which loads suffered the most from false sharing contention and took the longest to execute.
The c2c functionality is available in the upstream perf tool as of the Linux 4.2 kernel.
Lastly, this was a collective effort.
Although Don Zickus, Dick Fowles and Joe Mario worked together to get this implemented, we got lots of early help from Arnaldo Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
Additionally Jiri has been heavily involved recently integrating the c2c functionality into perf.
A big thanks to all of you for helping to pull this together!
Posted by Joe Mario, Sep 1st, 2016, 2:54 pm. Tags: cacheline false sharing, linux, perf c2c, rhel