ThreadCachedInt
folly/ThreadCachedInt.h
High-performance atomic increment using thread caching.
folly/ThreadCachedInt.h introduces an integer class designed for high-performance increments from multiple threads simultaneously without loss of precision. It has two read modes: readFast gives a potentially stale value with one load, and readFull gives the exact value but is much slower, as discussed below.
Performance
Increment performance is up to 10x greater than std::atomic_fetch_add in high-contention environments. See folly/test/ThreadCachedIntTest.cpp for more comprehensive benchmarks.
readFast is as fast as a single load. readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so it is significantly slower than readFast.
Usage
Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.
ThreadCachedInt<int64_t> val;
EXPECT_EQ(0, val.readFast());
++val; // increment in thread local counter only
EXPECT_EQ(0, val.readFast()); // increment has not been flushed
EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
val.set(2);
EXPECT_EQ(2, val.readFast());
EXPECT_EQ(2, val.readFull());
Implementation
folly::ThreadCachedInt uses folly::ThreadLocal to store thread-specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that readFast reads via a simple load, so readFast does not count any of the updates that haven't been flushed.
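The flush-on-threshold idea can be sketched with the standard library alone. This is a toy model under stated assumptions, not folly's code: GlobalCounter, LocalCache, and the rule "flush once the pending amount exceeds cacheSize" are hypothetical names and simplifications chosen for illustration, and each LocalCache is assumed to be owned by exactly one thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical toy model (not folly's implementation): one shared counter
// plus a flush threshold that plays the role of the cache size.
struct GlobalCounter {
  std::atomic<std::int64_t> value{0};
  const std::int64_t cacheSize;

  explicit GlobalCounter(std::int64_t cacheSizeIn) : cacheSize(cacheSizeIn) {
    assert(cacheSizeIn >= 0);
  }

  // Analogous to readFast: a single load that misses unflushed updates.
  std::int64_t readFast() const {
    return value.load(std::memory_order_relaxed);
  }
};

// Assumed to be owned by exactly one thread, so `pending` is a plain integer.
struct LocalCache {
  std::int64_t pending = 0;

  void increment(GlobalCounter& global, std::int64_t delta = 1) {
    pending += delta;
    if (pending > global.cacheSize) {
      // Only this flush touches shared memory.
      global.value.fetch_add(pending, std::memory_order_relaxed);
      pending = 0;
    }
  }
};
```

With a threshold of 3, three increments stay in the local cache and the fourth pushes all four to the shared counter, which is why a fast read can lag behind the true total.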
In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, consider breaking up the global mutex by introducing additional Tag template parameters.
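A rough sketch of the exact-read path, again using only the standard library. ExactReadSketch, Slot, and registerThread are hypothetical names, and the mutex here belongs to a single object, whereas the text above describes one global mutex per integer type. The per-thread slot is atomic in this model because the exact read inspects it from another thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>

// Hypothetical model of a mutex-protected list of per-thread counters.
class ExactReadSketch {
 public:
  struct Slot {
    std::atomic<std::int64_t> pending{0};
  };

  // Meant to be called once per thread on first use; growing the list
  // takes the mutex, mirroring the first-use cost described above.
  Slot* registerThread() {
    std::lock_guard<std::mutex> lock(mutex_);
    slots_.emplace_back();
    return &slots_.back();  // std::list keeps element addresses stable
  }

  // Move a slot's pending amount into the shared counter.
  void flush(Slot* slot) {
    assert(slot != nullptr);
    global_.fetch_add(slot->pending.exchange(0, std::memory_order_relaxed),
                      std::memory_order_relaxed);
  }

  std::int64_t readFast() const {
    return global_.load(std::memory_order_relaxed);
  }

  // Exact read: lock, then add every slot to the shared counter's value.
  std::int64_t readFull() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::int64_t sum = global_.load(std::memory_order_relaxed);
    for (const Slot& slot : slots_) {
      sum += slot.pending.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  std::mutex mutex_;
  std::list<Slot> slots_;
  std::atomic<std::int64_t> global_{0};
};
```

The cost of readFull in this model grows with the number of registered slots, while readFast remains a single load.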
set simply sets the global counter value and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.
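The reset marking can be traced with a deliberately single-threaded model. ResetModel and its members are hypothetical names, the slots stand in for thread local counters, and all synchronization is omitted so that only the bookkeeping is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model of the set/reset bookkeeping.
struct ResetModel {
  struct Slot {
    std::int64_t pending = 0;
    bool needsReset = false;
  };

  std::int64_t global = 0;
  std::vector<Slot> slots;  // one slot per simulated thread

  explicit ResetModel(std::size_t threadCount) : slots(threadCount) {}

  // Overwrite the global value and mark every slot as stale.
  void set(std::int64_t newValue) {
    global = newValue;
    for (Slot& slot : slots) {
      slot.needsReset = true;
    }
  }

  // A marked slot is zeroed and unmarked before the increment is applied.
  void increment(std::size_t i, std::int64_t delta = 1) {
    assert(i < slots.size());
    Slot& slot = slots[i];
    if (slot.needsReset) {
      slot.pending = 0;
      slot.needsReset = false;
    }
    slot.pending += delta;
  }

  // Marked slots are skipped, so their stale counts never leak in.
  std::int64_t readFull() const {
    std::int64_t sum = global;
    for (const Slot& slot : slots) {
      if (!slot.needsReset) {
        sum += slot.pending;
      }
    }
    return sum;
  }
};
```

In this model set never has to touch the counts themselves: a stale slot is ignored by reads until its owner next increments it.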
Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn't destroyed in another thread already.
Alternate Implementations
There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array of std::atomic<int64_t>s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.
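The sharding approach might look roughly like the following. This is not the ShardedAtomicInt from the test file: ShardedCounterSketch is a hypothetical name, the shard count is a runtime parameter chosen for illustration, and hashing std::thread::id is just one assumed way to spread threads across shards.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a sharded counter: threads hash to one of several
// atomics, and an exact read sums them all without taking a lock.
class ShardedCounterSketch {
 public:
  explicit ShardedCounterSketch(std::size_t shardCount) : shards_(shardCount) {
    assert(shardCount > 0);
    for (auto& shard : shards_) {
      shard.store(0, std::memory_order_relaxed);
    }
  }

  // Each thread lands on the shard selected by the hash of its id.
  void increment(std::int64_t delta = 1) {
    const std::size_t index =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) %
        shards_.size();
    shards_[index].fetch_add(delta, std::memory_order_relaxed);
  }

  // Lock-free, but the cost grows linearly with the number of shards.
  std::int64_t readFull() const {
    std::int64_t sum = 0;
    for (const auto& shard : shards_) {
      sum += shard.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  std::vector<std::atomic<std::int64_t>> shards_;
};
```

More shards lower the chance that two threads hit the same atomic, at the price of more memory and a longer readFull loop.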
This sounds great, but in order to get the contention low enough to match the performance of ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt in low-contention situations, which is hopefully the common case since it's designed for high-write, low-read access patterns. Performance of readFull is about the same as ThreadCachedInt in high-contention environments.
Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower-contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and the performance of readFull.