folly/ThreadCachedInt.h

High-performance atomic increment using thread caching.

folly/ThreadCachedInt.h introduces an integer class designed for high-performance increments from multiple threads simultaneously without loss of precision. It has two read modes: readFast gives a potentially stale value with one load, and readFull gives the exact value but is much slower, as discussed below.

Performance


Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.cpp for more comprehensive benchmarks.

readFast is as fast as a single load.

readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so it is significantly slower than readFast.

Usage


Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.

    ThreadCachedInt<int64_t> val;
    EXPECT_EQ(0, val.readFast());
    ++val;                        // increment in thread local counter only
    EXPECT_EQ(0, val.readFast()); // increment has not been flushed
    EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
    val.set(2);
    EXPECT_EQ(2, val.readFast());
    EXPECT_EQ(2, val.readFull());

Implementation


folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. readFast reads this global counter with a simple load, so it does not count any updates that have not been flushed yet.
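The caching scheme described above can be illustrated with a minimal sketch built only on the standard library. This is not folly's code: the class name CachedCounterSketch, the constant kCacheSize, and the restriction to a single process-wide counter are assumptions made to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not folly's implementation: a single process-wide
// counter whose increments are buffered in a per-thread cache.
class CachedCounterSketch {
 public:
  // Assumed cache size; the real class takes it as a constructor argument.
  static constexpr int64_t kCacheSize = 1000;

  // Add to the calling thread's cache and flush once it passes kCacheSize.
  static void increment(int64_t delta) {
    int64_t& local = cache();
    local += delta;
    if (local > kCacheSize) {
      flush();
    }
  }

  // Move the calling thread's cached value into the shared counter.
  static void flush() {
    int64_t& local = cache();
    global().fetch_add(local, std::memory_order_relaxed);
    local = 0;
  }

  // One load of the shared counter; unflushed increments are not visible.
  static int64_t readFast() {
    return global().load(std::memory_order_relaxed);
  }

 private:
  static std::atomic<int64_t>& global() {
    static std::atomic<int64_t> g{0};
    return g;
  }
  static int64_t& cache() {
    thread_local int64_t c = 0;
    return c;
  }
};
```

In this sketch, increments touch only thread-private memory until the cache passes the threshold, so threads rarely contend on the shared atomic.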

In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, consider breaking up the global mutex by introducing additional Tag template parameters.
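The exact-read path can be sketched in the same spirit. The example below is an assumption-laden stand-in for what folly::ThreadLocal provides: the names Registry, LocalCounter, and incrementFromThreads are hypothetical, and the cache-size flush is left out so that the mutex-protected list is the only moving part.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

namespace sketch {

struct LocalCounter;

// Hypothetical registry standing in for folly::ThreadLocal's bookkeeping.
struct Registry {
  std::mutex mutex;                          // the "global mutex"
  std::unordered_set<LocalCounter*> locals;  // live per-thread counters
  std::atomic<int64_t> global{0};            // counts from exited threads
};

inline Registry& registry() {
  static Registry r;
  return r;
}

struct LocalCounter {
  std::atomic<int64_t> value{0};
  // First use from a new thread takes the mutex to join the list.
  LocalCounter() {
    Registry& r = registry();
    std::lock_guard<std::mutex> guard(r.mutex);
    r.locals.insert(this);
  }
  // On thread exit, hand the count to the global counter and leave the list.
  ~LocalCounter() {
    Registry& r = registry();
    std::lock_guard<std::mutex> guard(r.mutex);
    r.global.fetch_add(value.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
    r.locals.erase(this);
  }
};

// Only the owning thread writes its counter, so a relaxed load/store suffices.
inline void increment(int64_t delta) {
  thread_local LocalCounter local;
  local.value.store(local.value.load(std::memory_order_relaxed) + delta,
                    std::memory_order_relaxed);
}

// Exact read: lock, then add every live per-thread counter to the global one.
inline int64_t readFull() {
  Registry& r = registry();
  std::lock_guard<std::mutex> guard(r.mutex);
  int64_t sum = r.global.load(std::memory_order_relaxed);
  for (LocalCounter* c : r.locals) {
    sum += c->value.load(std::memory_order_relaxed);
  }
  return sum;
}

// Helper for trying the sketch: each thread adds 1 perThread times.
inline void incrementFromThreads(int numThreads, int perThread) {
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([perThread] {
      for (int i = 0; i < perThread; ++i) {
        increment(1);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
}

}  // namespace sketch
```

Because every exact read and every thread registration goes through one mutex, many counters sharing that mutex can contend with each other, which is the motivation for splitting it up with Tag parameters.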

set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.
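The reset marking can be modelled without any threads at all. The following single-threaded sketch uses explicit slots in place of thread local instances; the names Slot and SetModel are hypothetical and the model ignores synchronization entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot models one thread local counter (hypothetical simplification).
struct Slot {
  int64_t value = 0;
  bool reset = false;  // true means "stale since the last set()"
};

class SetModel {
 public:
  explicit SetModel(std::size_t numSlots) : slots_(numSlots) {}

  // A marked slot is zeroed and unmarked before the increment is applied.
  void increment(std::size_t slot, int64_t delta) {
    Slot& s = slots_[slot];
    if (s.reset) {
      s.value = 0;
      s.reset = false;
    }
    s.value += delta;
  }

  // set() overwrites the global value and marks every slot for reset.
  void set(int64_t v) {
    global_ = v;
    for (Slot& s : slots_) {
      s.reset = true;
    }
  }

  int64_t readFast() const { return global_; }

  // Slots marked for reset are skipped when accumulating.
  int64_t readFull() const {
    int64_t sum = global_;
    for (const Slot& s : slots_) {
      if (!s.reset) {
        sum += s.value;
      }
    }
    return sum;
  }

 private:
  int64_t global_ = 0;
  std::vector<Slot> slots_;
};
```

Marking instead of zeroing means set never has to write another thread's counter value; each slot is cleaned up lazily by its owner on its next increment.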

Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn't destroyed in another thread already.

Alternate Implementations


There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array of std::atomic<int64_t>s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.
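For comparison, the sharded approach can be sketched as follows. This is an illustration of the idea rather than the ShardedAtomicInt from the test file: the class name ShardedCounterSketch and the use of std::hash on the thread id to pick a shard are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a sharded counter. Padding each shard to its own
// cache line is omitted here for brevity.
class ShardedCounterSketch {
 public:
  // Each atomic is value-initialized to zero by the vector constructor.
  explicit ShardedCounterSketch(std::size_t numShards) : shards_(numShards) {}

  // Hash the calling thread onto a shard and do an atomic add there.
  void increment(int64_t delta) {
    std::size_t i = std::hash<std::thread::id>{}(std::this_thread::get_id()) %
        shards_.size();
    shards_[i].fetch_add(delta, std::memory_order_relaxed);
  }

  // Lock-free exact read: its cost grows with the number of shards.
  int64_t readFull() const {
    int64_t sum = 0;
    for (const std::atomic<int64_t>& s : shards_) {
      sum += s.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  std::vector<std::atomic<int64_t>> shards_;
};
```

The shard count is the tuning knob: more shards lower the chance that two threads hit the same atomic, at the price of more memory and a longer readFull loop.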

This sounds great, but to get contention low enough to match the performance of ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt's readFull in low contention situations. Low contention is hopefully the common case, since these counters are designed for high-write, low-read access patterns. In high contention environments, readFull performs about the same in both implementations.

Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.
