ThreadCachedInt
folly/ThreadCachedInt.h
High-performance atomic increment using thread caching.
folly/ThreadCachedInt.h introduces an integer class designed for high-performance increments from multiple threads simultaneously without loss of precision. It has two read modes: readFast gives a potentially stale value with a single load, while readFull gives the exact value but is much slower, as discussed below.
Performance
Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.h for more comprehensive benchmarks.
readFast is as fast as a single load.
readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so it is significantly slower than readFast.
Usage
Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.
ThreadCachedInt<int64_t> val;
EXPECT_EQ(0, val.readFast());
++val;                        // increment in thread local counter only
EXPECT_EQ(0, val.readFast()); // increment has not been flushed
EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
val.set(2);
EXPECT_EQ(2, val.readFast());
EXPECT_EQ(2, val.readFull());
Implementation
folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast via a simple load, but will not count any of the updates that haven't been flushed.
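The caching scheme described above can be sketched in standard C++. This is a minimal illustration under stated assumptions, not folly's implementation: the names GlobalCounterSketch, LocalCacheSketch, and countWithThreads are hypothetical, each thread owns its cache explicitly instead of going through folly::ThreadLocal, and there is no readFull or set.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical names throughout; a simplified illustration, not folly's code.
class GlobalCounterSketch {
 public:
  // Analogous to readFast: a single load that ignores unflushed local deltas.
  int64_t readFast() const { return target_.load(std::memory_order_relaxed); }

  // Called only when a local cache flushes.
  void add(int64_t delta) { target_.fetch_add(delta, std::memory_order_relaxed); }

 private:
  std::atomic<int64_t> target_{0};
};

// One instance per thread; an instance is never shared between threads.
class LocalCacheSketch {
 public:
  LocalCacheSketch(GlobalCounterSketch& parent, int64_t cacheSize)
      : parent_(parent), cacheSize_(cacheSize) {
    assert(cacheSize >= 0);
  }

  // Flush on destruction so counts from short-lived threads are not lost.
  ~LocalCacheSketch() { flush(); }

  void increment(int64_t inc) {
    local_ += inc;  // plain, non-atomic add on thread-private state
    if (local_ > cacheSize_) {
      flush();  // the local counter passed the cache size
    }
  }

  void flush() {
    parent_.add(local_);
    local_ = 0;
  }

 private:
  GlobalCounterSketch& parent_;
  int64_t cacheSize_;
  int64_t local_ = 0;
};

// Runs numThreads workers, each incrementing through its own local cache,
// and returns the global value after every worker has exited and flushed.
int64_t countWithThreads(int numThreads, int incrementsPerThread, int64_t cacheSize) {
  GlobalCounterSketch global;
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&global, incrementsPerThread, cacheSize] {
      LocalCacheSketch cache(global, cacheSize);
      for (int i = 0; i < incrementsPerThread; ++i) {
        cache.increment(1);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return global.readFast();
}
```

In this sketch most increments touch only thread-private memory, and the shared atomic is updated roughly once per cacheSize increments, which is where the reduction in contention comes from.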
In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, consider breaking up the global mutex by introducing additional Tag template parameters.
set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.
Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn't destroyed in another thread already.
Alternate Implementations
There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array of std::atomic<int64_t> values and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.
This sounds great, but in order to get the contention low enough to match the performance of ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt in low-contention situations, which is hopefully the common case since it's designed for high-write, low-read access patterns. In high-contention environments, the performance of readFull is about the same as that of ThreadCachedInt.
Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.
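For comparison, the sharded approach can be sketched as follows. This is an illustrative reconstruction from the description above, not the code in folly/test/ThreadCachedIntTest.cpp: the class name ShardedCounterSketch, the helper shardedCountWithThreads, and the use of std::hash on the thread id to pick a shard are all assumptions.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a sharded atomic counter; not folly's test code.
template <std::size_t kShards>
class ShardedCounterSketch {
 public:
  // Hash the calling thread onto one shard and do an atomic add there.
  void increment(int64_t inc) {
    const std::size_t idx =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    shards_[idx].fetch_add(inc, std::memory_order_relaxed);
  }

  // Lock-free, but must visit every shard, so cost grows with kShards.
  int64_t readFull() const {
    int64_t sum = 0;
    for (const auto& shard : shards_) {
      sum += shard.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  std::array<std::atomic<int64_t>, kShards> shards_{};  // zero-initialized
};

// Runs numThreads workers against one 64-shard counter and returns the sum
// after all workers have joined.
int64_t shardedCountWithThreads(int numThreads, int incrementsPerThread) {
  assert(numThreads >= 0 && incrementsPerThread >= 0);
  ShardedCounterSketch<64> counter;
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&counter, incrementsPerThread] {
      for (int i = 0; i < incrementsPerThread; ++i) {
        counter.increment(1);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return counter.readFull();
}
```

The shard count is the tuning knob discussed above: more shards lower the chance that two threads collide on one atomic, at the cost of more memory and a longer readFull loop.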