https://joemario.github.io/blog/2016/09/01/c2c-blog/

Do you run your application in a NUMA environment? Is it multi-threaded? Is it multi-process with shared memory? If so, is your performance impacted by false sharing?

Now there’s a way to easily find out. We’re posting patches for a new feature to the Linux perf tool, called “c2c” for cache-2-cache.
We at Red Hat have been running the development prototype of c2c on lots of big Linux applications and it’s uncovered many hot false sharing cachelines.

I’ve been playing with this tool quite a bit. It is pretty cool. Let me share a little about what it is and how to use it.

At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples for each cacheline came from, and which cpus were involved.

Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record”, then generate a report output with “perf c2c report”.

Before covering the output data, here is a “how to” for the flags to use when calling “perf c2c”:
c2c usage flags

Then here’s an output file from a recent “perf c2c” run I did:
c2c output file

And, if you want to play with it yourself, here’s a simple source file to generate lots of false sharing.
False sharing .c src file
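In case the link isn’t handy, here is a rough sketch in the same spirit (my own hypothetical program, not the author’s false_sharing_example.c): a few threads repeatedly write adjacent fields of one 64-byte struct while another thread keeps reading a field on the same line, which is enough to light up the HITM counters.

    /* Hypothetical false-sharing generator - not the author's false_sharing_example.c.
     * Three writer threads increment adjacent fields of one 64-byte struct while a
     * reader thread keeps loading another field on the same cacheline.  Run it under
     * "perf c2c record" and the struct should show up as a hot contended cacheline.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    struct shared {
        uint64_t hot0;    /* offset 0x00: read by the reader thread   */
        uint64_t hot1;    /* offset 0x08: written by writer thread 1  */
        uint64_t hot2;    /* offset 0x10: written by writer thread 2  */
        uint64_t hot3;    /* offset 0x18: written by writer thread 3  */
    } __attribute__((aligned(64)));

    static struct shared line;     /* one cacheline shared by all threads */
    static volatile int stop;

    static void *writer(void *arg)
    {
        volatile uint64_t *p = arg;
        while (!stop)
            (*p)++;                /* stores keep the line in Modified state */
        return NULL;
    }

    static void *reader(void *arg)
    {
        volatile uint64_t *p = arg;
        uint64_t sum = 0;
        while (!stop)
            sum += *p;             /* loads hit the modified line: HITMs */
        return (void *)(uintptr_t)sum;
    }

    int main(void)
    {
        pthread_t t[4];
        pthread_create(&t[0], NULL, reader, &line.hot0);
        pthread_create(&t[1], NULL, writer, &line.hot1);
        pthread_create(&t[2], NULL, writer, &line.hot2);
        pthread_create(&t[3], NULL, writer, &line.hot3);

        sleep(10);                 /* long enough to sample with perf c2c */
        stop = 1;
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Build it with something like “gcc -O2 -pthread”, run it under “perf c2c record”, and pin the threads to cpus on different numa nodes (e.g. with taskset) if you want remote HITMs rather than local ones.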

First I’ll go over the output file to highlight the interesting fields.

This first table in the output file gives a high-level summary of all the load and store samples collected. It is interesting to see where your program’s load instructions got their data.
Notice the term “HITM”, which stands for a load that hit in a modified cacheline. That’s the key indicator that false sharing has occurred. Remote HITMs, meaning hits in a modified cacheline on another numa node, are the most expensive - especially when there are lots of readers and writers.

 1  =================================================
2 Trace Event Information
3 =================================================
4 Total records : 329219 << Total loads and stores sampled.
5 Locked Load/Store Operations : 14654
6 Load Operations : 69679 << Total loads
7 Loads - uncacheable : 0
8 Loads - IO : 0
9 Loads - Miss : 3972
10 Loads - no mapping : 0
11 Load Fill Buffer Hit : 11958
12 Load L1D hit : 17235 << loads that hit in the L1 cache.
13 Load L2D hit : 21
14 Load LLC hit : 14219 << loads that hit in the last level cache (LLC).
15 Load Local HITM : 3402 << loads that hit in a modified cache on the same numa node (local HITM).
16 Load Remote HITM : 12757 << loads that hit in a modified cache on a remote numa node (remote HITM).
17 Load Remote HIT : 5295
18 Load Local DRAM : 976 << loads that hit in the local node's main memory.
19 Load Remote DRAM : 3246 << loads that hit in a remote node's main memory.
20 Load MESI State Exclusive : 4222
21 Load MESI State Shared : 0
22 Load LLC Misses : 22274 << loads not found in any local node caches.
23 LLC Misses to Local DRAM : 4.4% << % hitting in local node's main memory.
24 LLC Misses to Remote DRAM : 14.6% << % hitting in a remote node's main memory.
25 LLC Misses to Remote cache (HIT) : 23.8% << % hitting in a clean cache in a remote node.
26 LLC Misses to Remote cache (HITM) : 57.3% << % hitting in remote modified cache. (most expensive - false sharing)
27 Store Operations : 259539 << store instruction sample count
28 Store - uncacheable : 0
29 Store - no mapping : 11
30 Store L1D Hit : 256696 << stores that got L1 cache when requested.
31 Store L1D Miss : 2832 << stores that couldn't get the L1 cache when requested (L1 miss).
32 No Page Map Rejects : 2376
33 Unable to parse data source : 1

The second table (below) in the output file gives a brief one-line summary of the hottest cachelines where false sharing was detected. It’s sorted by which line had the most remote HITMs (or local HITMs if you select that sort option). It gives a nice high-level sense of the load and store activity for each cacheline.
I look to see if a cacheline has a high number of “Rmt LLC Load Hitm’s”. If so, it’s time to dig further.

54  =================================================
55 Shared Data Cache Line Table
56 =================================================
57 #
58 # Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
59 # Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
60 # ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........
61 #
62 0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468 727 2657 13747 40400 5355 16154 0 2875 529
63 1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65 200 3749 12128 5096 108 0 2056 652
64 2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1 1 15 99 25 50 0 6 1
65 3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7 20 156 50 59 0 27 4
66 4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10 25 62 11 1 0 24 7

Next is the Pareto table, which shows lots of valuable information about each contended cacheline. This is the most important table in the output. I only show three cachelines here to keep this blog simple. Here’s what’s in it.

* Lines 71 and 72 are the column headers for what’s happening in each cacheline.
* Line 76 shows the HITM and store activity for each cacheline - first with counts for load
   and store activity, followed by the cacheline’s virtual data address.
* Then there’s the data address column. Line 76 shows the virtual address of the cacheline,
   and each row underneath it represents the offset into the cacheline where those accesses
   occurred (see the offsetof sketch after the table below).
* The next column shows the pid, and/or the thread id (tid) if you selected that for the output.
* Following that is the instruction pointer code address.
* Next are three columns showing the average load latencies. I always look here for long
   latency averages, which are a sign of how painful the contention on that cacheline was.
* The “cpu cnt” column shows how many different cpus the samples came from.
* Then there’s the function name, binary object name, source file and line number.
* The last column shows, for each node, the specific cpus the samples came from.

67  =================================================
68 Shared Cache Line Distribution Pareto
69 =================================================
70 #
71 # ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- cpu Shared
72 # Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl hitm load cnt Symbol Object Source:Line Node{cpu list}
73 # ..... ....... ....... ....... ....... .................. ....... .................. ........ ........ ........ ........ ................... .................... ........................... ....
74 #
75 -------------------------------------------------------------
76 0 9834 2269 109036 468 0x602180
77 -------------------------------------------------------------
78 65.51% 55.88% 75.20% 0.00% 0x0 14604 0x400b4f 27161 26039 26017 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:144 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
79 0.41% 0.35% 0.00% 0.00% 0x0 14604 0x400b56 18088 12601 26671 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
80 0.00% 0.00% 24.80% 100.00% 0x0 14604 0x400b61 0 0 0 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169}
81 7.50% 9.92% 0.00% 0.00% 0x20 14604 0x400ba7 2470 1729 1897 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:154 1{122} 2{144}
82 17.61% 20.89% 0.00% 0.00% 0x28 14604 0x400bc1 2294 1575 1649 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:158 2{53} 3{170}
83 8.97% 12.96% 0.00% 0.00% 0x30 14604 0x400bdb 2325 1897 1828 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:162 0{96} 3{171}
84 -------------------------------------------------------------
85 1 2832 1119 0 0 0x602100
86 -------------------------------------------------------------
87 29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964 1230 1788 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:155 1{122} 2{144}
88 43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274 1566 1793 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:159 2{53} 3{170}
89 27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045 1247 2011 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:163 0{96} 3{171}
90 -------------------------------------------------------------
91 2 12 3 161 0 0xffff883ffb6a7e80
92 -------------------------------------------------------------
93 58.33% 100.00% 0.00% 0.00% 0x0 14604 0xffffffff810cf16d 1380 941 1229 9 [k] task_tick_fair [kernel.kallsyms] atomic64_64.h:21 0{0,4,96} 1{25,120,122} 2{53} 3{170-171}
94 16.67% 0.00% 98.76% 0.00% 0x0 14604 0xffffffff810c9379 1794 0 625 13 [k] update_cfs_rq_blocked_load [kernel.kallsyms] atomic64_64.h:45 0{1,4,96} 1{25,120,122} 2{48,53-54,144} 3{169-171}
95 16.67% 0.00% 0.00% 0.00% 0x0 14604 0xffffffff810ce098 1382 0 867 12 [k] update_cfs_shares [kernel.kallsyms] atomic64_64.h:21 0{1,4,96} 1{25,120,122} 2{53-54,144} 3{169-171}
96 8.33% 0.00% 0.00% 0.00% 0x8 14604 0xffffffff810cf18c 2560 0 679 8 [k] task_tick_fair [kernel.kallsyms] atomic.h:26 0{4,96} 1{24-25,120,122} 2{54} 3{170}
97 0.00% 0.00% 1.24% 0.00% 0x8 14604 0xffffffff810cf14f 0 0 0 2 [k] task_tick_fair [kernel.kallsyms] atomic.h:50 2{48,53}
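To connect the “Offset” column above (0x0, 0x20, 0x28, 0x30) back to your source, it can help to print the member offsets of your own hot structures and compare. A tiny sketch, with a made-up struct rather than the one from the example program:

    /* Print where each member of a (made-up) hot struct lands within its
     * cacheline, to match against the "Offset" column of the Pareto table.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct hot_data {
        uint64_t reader_val;   /* expected at offset 0x00 */
        char     pad[24];
        uint64_t writer_a;     /* expected at offset 0x20 */
        uint64_t writer_b;     /* expected at offset 0x28 */
        uint64_t writer_c;     /* expected at offset 0x30 */
    };

    int main(void)
    {
        printf("reader_val at 0x%zx\n", offsetof(struct hot_data, reader_val));
        printf("writer_a   at 0x%zx\n", offsetof(struct hot_data, writer_a));
        printf("writer_b   at 0x%zx\n", offsetof(struct hot_data, writer_b));
        printf("writer_c   at 0x%zx\n", offsetof(struct hot_data, writer_c));
        return 0;
    }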

How I often use “perf c2c”

Here are the flags I most commonly use.

   perf c2c record -F 60000 -a --all-user sleep 5
perf c2c record -F 60000 -a --all-user sleep 3 // or to sample for a shorter time.
perf c2c record -F 60000 -a --all-kernel sleep 3 // or to only gather kernel samples.
perf c2c record -F 60000 -a -u --ldlat 50 sleep 3 // or to collect only loads >= 50 cycles of load latency (30 is the ldlat default).

To generate report files, you can use the graphical tui report or send the output to stdout:

 perf c2c report -NN -c pid,iaddr                  // to use the tui interactive report
 perf c2c report -NN -c pid,iaddr --stdio          // or to send the output to stdout
 perf c2c report -NN -d lcl -c pid,iaddr --stdio   // or to sort on local hitms

By default, symbol names are truncated to a fixed width for readability.
You can use the “--full-symbols” flag to get full symbol names in the output.
For example:

 perf c2c report -NN -c pid,iaddr --full-symbols --stdio

Finding the callers to these cachelines:

Sometimes it’s valuable to know who the callers are. Here is how to get call graph information.
I never generate call graph info initially, because it emits so much data that it becomes very difficult to see if and where a false sharing problem exists. I find the problem first without call graphs, then if needed I’ll rerun with call graphs.

perf c2c record --call-graph dwarf,8192 -F 60000 -a --all-user sleep 5
perf c2c report -NN -g --call-graph -c pid,iaddr --stdio

Does bumping perf’s sample rate help?

I’ll sometimes bump the perf sample rate with “-F 60000” or “-F 80000”.
There’s no requirement to do so, but it is a good way to get a richer sample collection in a shorter period of time. If you do, it’s helpful to raise the kernel’s perf sample rate limits with the following two echo commands. (Check dmesg for “perf interrupt took too long …” entries, which indicate the kernel has lowered the sample rate.)

 echo    500 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
<then do your "perf c2c record" here>
echo 50 > /proc/sys/kernel/perf_cpu_time_max_percent

What to do when perf drowns in excessive samples:

When running on larger systems (e.g. 4, 8 or 16 socket systems), there can be so many samples that the perf tool can consume lots of cpu time and the perf.data file size grows significantly.
Some tips to help that include:
- Bump the ldlat from the default of 30 to 50. This frees perf to skip the faster, less interesting loads.
- Lower the sample rate.
- Shorten the sleep time during the “perf record” window. For example, from “sleep 5” to “sleep 3”.

What I’ve learned by using C2C on numerous applications:

It’s common to look at any performance tool output and ask ‘what does all this data mean?’.
Here are some things I’ve learned. Hopefully they’re of help.

* I tend to run “perf c2c” for 3, 5, or 10 seconds. Running it any longer may take you
   from seeing concurrent false sharing to seeing cacheline accesses which are
   disjoint in time.
* If you’re not interested in kernel samples, you’ll get better samples in your program by
   specifying --all-user. Conversely, specifying --all-kernel is useful when focusing on the
   kernel.
* On busy systems with high cpu counts, like >148 cpus, setting --ldlat to a higher value
   (like 50 or even 70) may enable perf to generate richer C2C samples.
* Look at the Trace Event table at the top, specifically the “LLC Misses to Remote cache (HITM)”
   number. If it’s not close to zero, then there’s likely false sharing worth pursuing.
* Most of the time the top one, two, or three cachelines in the Shared Cache Line Distribution
   Pareto table are the ones to focus on.
* However, sometimes you’ll see the same code from multiple threads causing “less hot”
   contention, spread across multiple cachelines for different data addresses.
   Even though each of those lines is less hot individually, fixing them is often a
   win because the benefit is spread across many cachelines. This can also happen with
   different processes executing the same code accessing shared memory.
* In the Shared Cache Line Distribution Pareto table, long average load latencies are
   often a giveaway that false sharing contention is heavy and is hurting performance.
* Then looking at which nodes and cpus the samples for those accesses come from
   can be a valuable guide to numa-pinning your processes or memory (a small pinning
   sketch follows this list).
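As a hedged illustration of that last point (hypothetical cpu and node numbers; in practice you would plug in whatever the c2c report showed, or simply use numactl/taskset from the command line), here is roughly what pinning a thread and its hot data to one node looks like in code:

    /* Hedged sketch: pin the current thread to a cpu on one node and place its
     * hot allocation on that same node.  Cpu/node 0 are placeholders; use the
     * nodes and cpus the c2c report pointed at.  Link with -lnuma.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Pin this thread to cpu 0. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* Allocate the contended data on node 0 so its loads stay local. */
        size_t sz = 4096;
        void *hot = numa_alloc_onnode(sz, 0);
        if (!hot)
            return 1;

        /* ... run the workload against "hot" ... */

        numa_free(hot, sz);
        return 0;
    }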
   
For processes using shared memory, it is possible for them to use different virtual addresses,
all pointing to (and contending with) the same shared memory location. They will show
up in the Pareto table as different cachelines, but in fact they are the same cacheline.
These can be tricky to spot. I usually uncover these by first looking to see that shared memory is being used, and then looking for similar patterns in the information provided
for each cacheline.
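Here is a hedged sketch of how that aliasing arises (hypothetical shared-memory name, just for illustration): mapping the same shared-memory object twice, or from two different processes, yields different virtual addresses that all land on the same physical cacheline, and perf c2c reports the contention under each virtual address separately.

    /* Hedged illustration (hypothetical shm name): two mappings of the same
     * shared-memory object give two different virtual addresses backed by the
     * same physical cacheline.  perf c2c keys on the virtual data address, so
     * contention on one line can show up split across what look like several
     * cachelines in the Pareto table.  Link with -lrt on older glibc.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = shm_open("/c2c_alias_demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 4096) != 0)
            return 1;

        /* Same object mapped twice: distinct virtual addresses. */
        long *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        long *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a == MAP_FAILED || b == MAP_FAILED)
            return 1;

        a[0] = 42;   /* a write through one mapping ...                    */
        printf("a=%p  b=%p  b[0]=%ld\n", (void *)a, (void *)b, b[0]);
                     /* ... is visible (and contended) through the other.  */

        shm_unlink("/c2c_alias_demo");
        return 0;
    }

Two separate processes attaching the same segment will typically map it at different addresses as well, which is the multi-process case described above.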

Last, the Shared Cache Line Distribution Pareto table can also provide great insight into any
ill-aligned hot data.
For example:
    * It’s easy to spot heavily modified variables that need to be placed into their own cachelines.
       This reduces contention on them (so they run faster), and it keeps accesses to the other
       variables that previously shared their cacheline from being slowed down (see the padding
       sketch after this list).
    * It’s easy to spot hot locks or mutexes that are unaligned and spill into multiple cachelines.
    * It’s easy to spot “read mostly” variables which can be grouped together into their own
       cachelines.
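As a generic before-and-after sketch of the first bullet (not a change to any real code), padding or aligning the heavily written field so it owns a full cacheline is usually all it takes:

    /* Generic before/after sketch: give a heavily written counter its own
     * 64-byte cacheline so it stops invalidating the read-mostly fields
     * that used to share its line.  (GCC/Clang attribute syntax shown.)
     */
    #include <stddef.h>

    /* Before: every increment of "hits" invalidates the line for all readers. */
    struct stats_shared {
        unsigned long hits;           /* written constantly */
        unsigned long limit;          /* read-mostly        */
        unsigned long config_flags;   /* read-mostly        */
    };

    /* After: the hot counter owns a full cacheline; the read-mostly fields
     * are grouped together on the next one. */
    struct stats_padded {
        unsigned long hits __attribute__((aligned(64)));
        char pad[64 - sizeof(unsigned long)];   /* keep neighbours off this line */
        unsigned long limit;
        unsigned long config_flags;
    };

    _Static_assert(offsetof(struct stats_padded, limit) >= 64,
                   "read-mostly fields must not share the hot cacheline");

C11’s alignas(64) from <stdalign.h> is the portable spelling of the same idea.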

The raw samples can be helpful.

I’ve often found it valuable to take a peek at the raw instruction samples contained in the perf.data file (the one generated by the “perf c2c record”). You can get those raw samples using “perf script”. See man perf-script. The output may be cryptic, but you can sort on the load weight (5th column) to see which loads suffered the most from false sharing contention and took the longest to execute.

The c2c functionality is available in upstream perf as of the Linux 4.10 kernel.

Lastly, this was a collective effort.

Although Don Zickus, Dick Fowles and Joe Mario worked together to get this implemented, we got lots of early help from Arnaldo Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
Additionally, Jiri has recently been heavily involved in integrating the c2c functionality into perf.
A big thanks to all of you for helping to pull this together!

Posted by Joe Mario, Sep 1st, 2016, 2:54 pm. Tags: cacheline false sharing, linux, perf c2c, rhel
