folly/ThreadCachedInt.h

High-performance atomic increment using thread caching.

folly/ThreadCachedInt.h introduces a integer class designed for high performance increments from multiple threads simultaneously without loss of precision. It has two read modes, readFast gives a potentially stale value with one load, and readFull gives the exact value, but is much slower, as discussed below.

Performance


Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.h for more comprehensive benchmarks.

readFast is as fast as a single load.

readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so is significantly slower than readFast.

Usage


Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.

    ThreadCachedInt<int64_t> val;
EXPECT_EQ(, val.readFast());
++val; // increment in thread local counter only
EXPECT_EQ(, val.readFast()); // increment has not been flushed
EXPECT_EQ(, val.readFull()); // accumulates all thread local counters
val.set();
EXPECT_EQ(, val.readFast());
EXPECT_EQ(, val.readFull());

Implementation


folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast via a simple load, but will not count any of the updates that haven't been flushed.

In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, considering breaking up the global mutex by introducing additional Tag template parameters.

set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.

Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn't destroyed in another thread already.

Alternate Implementations


There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array ofstd::atomic<int64_t>'s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.

This sounds great, but in order to get the contention low enough to get similar performance as ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-freereadFull has to sum up all 2048 ints, which ends up being a about 50x slower than ThreadCachedInt in low contention situations, which is hopefully the common case since it's designed for high-write, low read access patterns. Performance of readFull is about the same speed as ThreadCachedInt in high contention environments.

Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.

ThreadCachedInt的更多相关文章

  1. folly学习心得(转)

    原文地址:  https://www.cnblogs.com/Leo_wl/archive/2012/06/27/2566346.html   阅读目录 学习代码库的一般步骤 folly库的学习心得 ...

  2. Folly: Facebook Open-source Library Readme.md 和 Overview.md(感觉包含的东西并不多,还是Boost更有用)

    folly/ For a high level overview see the README Components Below is a list of (some) Folly component ...

随机推荐

  1. python 获取列表的键值对

    nums = [, , , , ] for num_index, num_val in enumerate(nums): print(num_index, num_val)

  2. 在多节点上运行分布式Intel Caffe

    一般有2种并行模式:数据并行(Data parallelism)和模型并行(model parallelism). 在模型并行化( model parallelism )方法里,分布式系统中的不同机器 ...

  3. express 调优的一个过程和心得,不错的文章

    Netflix的软件工程师Yunong Xiao最近在公司的技术博客上写了一篇文章,分析了他所在的团队在将Netflix网站UI转移到Node.js上时遇到的延迟问题.在文章中他描述了找到问题根本原因 ...

  4. 多目标跟踪方法 NOMT 学习与总结

    多目标跟踪方法 NOMT 学习与总结 ALFD NOMT MTT 读 'W. Choi, Near-Online Multi-target Tracking with Aggregated Local ...

  5. 使用POI设置导出的EXCEL锁定指定的单元格

    注:要锁定单元格需先为此表单设置保护密码,设置之后此表单默认为所有单元格锁定,可使用setLocked(false)为指定单元格设置不锁定. sheet.protectSheet("&quo ...

  6. windows下mysql多实例安装

    在学习和开发过程中有时候会用到多个MySQL数据库,比如Master-Slave集群.分库分表,开发阶段在一台机器上安装多个MySQL实例就显得方便不少. 在 MySQL教程-基础篇-1.1-Wind ...

  7. Cache应用/任务Mutex,用于高并发任务处理经过多个项目使用

    <?php /** * Class Cache redis 用于报表的缓存基本存储和读写 2.0 * <pre> * Cache::read("diamond.accoun ...

  8. 建造者模式 build

    引出建造者模式: package com.disign.build; /** * Created by zhen on 2017-05-19. */ public class BuildPersonT ...

  9. docx文件怎样打开 - 转

    如何打开docx文件?在office2007及2010退出多年后,诸如docx.xlsx.pptx类文件越来越多,我们从网络下载或者别人复制过来的这类文件越来越多.docx文件怎样打开呢?下面有图小站 ...

  10. java基础第9天

    抽象 abstract 抽象类和抽象方法必须用abstract关键字修饰 抽象类格式 abstract class 类名{} 抽象方法定义,在返回值钱,或修饰符前加上abstract关键字 方法没有方 ...