Choose Concurrency-Friendly Data Structures
What is a high-performance data structure? To answer that question, we're used to applying the normal considerations: Big-Oh complexity, memory overhead, locality, and traversal order. All of those apply to both sequential and concurrent software.
But in concurrent code, we need to consider two additional things to help us pick a data structure that is also sufficiently concurrency-friendly:
- In parallel code, your performance needs likely include the ability to allow multiple threads to use the data at the same time. If this is (or may become) a high-contention data structure, does it allow for concurrent readers and/or writers in different parts of the data structure at the same time? If the answer is, "No," then you may be designing an inherent bottleneck into your system and inviting lock convoys, as threads wait while only one at a time can use the data structure.
- On parallel hardware, you may also care about minimizing the cost of memory synchronization. When one thread updates one part of the data structure, how much memory needs to be moved to make the change visible to another thread? If the answer is, "More than just the part that has ostensibly changed," then again you're asking for a potential performance penalty, this time due to cache sloshing as more data has to move from the core that performed the update to the core that is reading the result.
It turns out that both of these answers are directly influenced by whether the data structure allows truly localized updates. If making what appears to be a small change in one part of the data structure actually ends up reading or writing other parts of the structure, then we lose locality; those other parts need to be locked, too, and all of the memory that has changed needs to be synchronized.
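As a point of comparison for what follows, here is a minimal sketch of the opposite extreme: a container guarded by a single mutex, so that every operation, however small, excludes every other. The class name CoarseLockedList and its member functions are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical baseline: one mutex protects the entire container.
// Any reader or writer blocks all others, wherever in the list it works.
class CoarseLockedList {
public:
    void push_back(int v) {
        std::lock_guard<std::mutex> guard(m_);  // whole-list exclusion
        data_.push_back(v);
    }

    bool contains(int v) const {
        std::lock_guard<std::mutex> guard(m_);  // readers serialize too
        return std::find(data_.begin(), data_.end(), v) != data_.end();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(m_);
        return data_.size();
    }

private:
    mutable std::mutex m_;  // mutable so const member functions can lock
    std::list<int> data_;
};
```

This design is simple and correct, but it offers no concurrency inside the container: a thread searching near the front still blocks a thread appending at the back.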
To illustrate, let's consider two common data structures: linked lists and balanced trees.
Linked Lists
Linked lists are wonderfully concurrency-friendly data structures because they support highly localized updates. In particular, as illustrated in Figure 1, to insert a new node into a doubly linked list, you only need to touch two existing nodes; namely, the ones immediately adjacent to the position the new node will occupy to splice the new node into the list. To erase a node, you only need to touch three nodes: the one that is being erased, and its two immediately adjacent nodes.
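The following sketch shows how few nodes each operation writes. It assumes a doubly linked list with sentinel nodes at both ends, so every real node always has two neighbors; the names Node, insert_between, and erase_node are hypothetical.

```cpp
#include <cassert>

// Hypothetical node type for a doubly linked list of ints.
struct Node {
    int value;
    Node* prev = nullptr;
    Node* next = nullptr;
    explicit Node(int v) : value(v) {}
};

// Splice n between two adjacent nodes. Only two existing nodes are
// written: left (its next pointer) and right (its prev pointer).
void insert_between(Node& left, Node& right, Node& n) {
    n.prev = &left;
    n.next = &right;
    left.next = &n;
    right.prev = &n;
}

// Unlink n. Three nodes are touched: n and its two neighbors.
// Assumes n has both neighbors (guaranteed here by the sentinels).
void erase_node(Node& n) {
    n.prev->next = n.next;
    n.next->prev = n.prev;
    n.prev = nullptr;
    n.next = nullptr;
}
```

No other node in the list is read or written, no matter how long the list is.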
This locality enables the option of using fine-grained locking: We can allow a potentially large number of threads to be actively working inside the same list, knowing that they won't conflict as long as they are manipulating different parts of the list. Each operation only needs to lock enough of the list to cover the nodes it actually uses.
For example, consider Figure 2, which illustrates the technique of hand-over-hand locking. The basic idea is this: Each segment of the list, or even each individual node, is protected by its own mutex. Each thread that may add or remove nodes from the list takes a lock on the first node, then while still holding that, takes a lock on the next node; then it lets go of the first node and while still holding a lock on the second node, it takes a lock on the third node; and so on. (To delete a node requires locking three nodes.) While traversing the list, each such thread never releases one lock before it has acquired the next, so it holds two locks at every handoff, and the locks are always taken in the same order.

Figure 1: Localized insertion into a linked list.

Figure 2: Hand-over-hand locking in a linked list.
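Hand-over-hand locking as described above might be sketched roughly as follows. To keep the example short it makes several assumptions that are not in the article: the list is singly linked and sorted, it has a sentinel head node, and it uses one std::mutex per node. Because it is singly linked, erasing holds two locks (predecessor and victim) rather than the three needed in the doubly linked list of Figure 2. The class name HandOverHandList and its members are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sorted, singly linked list of ints with one mutex per node.
class HandOverHandList {
    struct Node {
        int value;
        Node* next = nullptr;
        std::mutex m;
        explicit Node(int v) : value(v) {}
    };
    Node head_{0};  // sentinel; its value is never read

public:
    HandOverHandList() = default;
    HandOverHandList(const HandOverHandList&) = delete;
    HandOverHandList& operator=(const HandOverHandList&) = delete;

    ~HandOverHandList() {  // assumes no other thread still uses the list
        for (Node* n = head_.next; n != nullptr;) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    void insert(int v) {
        Node* fresh = new Node(v);
        Node* prev = &head_;
        std::unique_lock<std::mutex> prev_lock(prev->m);
        while (Node* cur = prev->next) {
            std::unique_lock<std::mutex> cur_lock(cur->m);  // two locks held
            if (cur->value >= v) break;       // fresh belongs right after prev
            prev = cur;
            prev_lock = std::move(cur_lock);  // unlocks old prev, keeps cur
        }
        // Only prev is written, and prev is still locked here.
        fresh->next = prev->next;
        prev->next = fresh;
    }

    bool erase(int v) {
        Node* prev = &head_;
        std::unique_lock<std::mutex> prev_lock(prev->m);
        while (Node* cur = prev->next) {
            std::unique_lock<std::mutex> cur_lock(cur->m);
            if (cur->value == v) {
                prev->next = cur->next;  // unlink while holding both locks
                cur_lock.unlock();       // a locked mutex must not be destroyed
                delete cur;              // unreachable: we still hold prev
                return true;
            }
            if (cur->value > v) return false;  // sorted, so v is absent
            prev = cur;
            prev_lock = std::move(cur_lock);
        }
        return false;
    }

    std::vector<int> snapshot() {
        std::vector<int> out;
        Node* prev = &head_;
        std::unique_lock<std::mutex> prev_lock(prev->m);
        while (Node* cur = prev->next) {
            std::unique_lock<std::mutex> cur_lock(cur->m);
            out.push_back(cur->value);
            prev = cur;
            prev_lock = std::move(cur_lock);
        }
        return out;
    }
};
```

Every operation acquires locks front to back, so two threads working on different stretches of the list do not block each other, and a thread that starts behind another stays behind it.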
This technique delivers a number of benefits, including the following:
- Multiple readers and writers can be actively doing work in the same list.
- Readers and writers that are traversing the list in the same order will not pass each other. This can be useful to get deterministic results in concurrent code. In particular, the list's semantics will be the same as if each thread acquired complete exclusion on the list and performed its complete pass in isolation, which is easy to reason about.
- The locks taken on parts of the list won't deadlock with each other, because multiple locks are acquired in the same order.
- We can readily tune the code for better concurrency vs. lower locking overhead by choosing a suitable locking granularity: one lock for the whole list (no concurrency), a lock for each node in the list (maximum concurrency), or a lock for each chunk of some fixed or variable length (something in between).
Aside: If we always traverse the list in the same order, why does the figure show a doubly linked list? Because not all operations need to take multiple locks; those that use individual segments or nodes in-place one at a time without taking more than one node's or chunk's lock at a time can traverse the list in any order without deadlock. (For more on avoiding deadlock, see [1].)
Besides being well suited for concurrent traversal and update, linked lists also are cache-friendly on parallel hardware. When one thread removes a node, for example, the only memory that needs to be transferred to every other core that subsequently reads the list is the memory containing the two adjacent nodes. If the rest of the list hasn't been changed, multiple cores can happily store read-only copies of the list in their caches without expensive memory fetches and synchronization. (Remember, writes are always more expensive than reads because writes need to be broadcast. In turn, "lots of writes" are always more expensive than "limited writes.")
Clearly, one benefit lists enjoy is that they are node-based containers: Each element is stored in its own node, unlike an array or vector where elements are contiguous and inserting or erasing typically involves copying an arbitrary number of elements to one side or the other of the inserted or erased value. We might therefore anticipate that perhaps all node-based containers will be good for concurrency. Unfortunately, we would be wrong.
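One rough way to see the difference between contiguous and node-based storage is to count how many existing elements an insertion relocates. The sketch below uses a hypothetical element type, Counted, that increments a global counter whenever it is copied or moved; the helper function names are likewise made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

std::size_t g_relocations = 0;  // hypothetical global copy/move counter

// Hypothetical element type that records every copy or move of itself.
struct Counted {
    int value;
    explicit Counted(int v) : value(v) {}
    Counted(const Counted& o) : value(o.value) { ++g_relocations; }
    Counted(Counted&& o) noexcept : value(o.value) { ++g_relocations; }
    Counted& operator=(const Counted& o) {
        value = o.value;
        ++g_relocations;
        return *this;
    }
    Counted& operator=(Counted&& o) noexcept {
        value = o.value;
        ++g_relocations;
        return *this;
    }
};

// Copies/moves caused by inserting at the front of an n-element vector.
std::size_t front_insert_cost_vector(std::size_t n) {
    std::vector<Counted> v;
    v.reserve(n + 1);  // rule out reallocation as a source of moves
    for (std::size_t i = 0; i < n; ++i) v.emplace_back(static_cast<int>(i));
    g_relocations = 0;
    v.emplace(v.begin(), -1);  // every existing element shifts by one slot
    return g_relocations;
}

// Copies/moves caused by inserting at the front of an n-element list.
std::size_t front_insert_cost_list(std::size_t n) {
    std::list<Counted> l;
    for (std::size_t i = 0; i < n; ++i) l.emplace_back(static_cast<int>(i));
    g_relocations = 0;
    l.emplace(l.begin(), -1);  // new node is linked in; nothing else moves
    return g_relocations;
}
```

For the vector the count grows with n, because all n existing elements have to shift; for the list it stays at zero. In a concurrent setting, every one of those shifted elements is memory that other threads cannot safely use during the insert and that other cores must re-fetch afterward.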
Balanced Search Trees
The story isn't nearly as good for another popular data structure: the balanced search tree. (Important note: This section refers only to balanced trees; unbalanced trees that support localized updates don't suffer from the problems we'll describe next.)
Consider a red-black tree: The tree stays balanced by marking each node as either "red" or "black," and applying rules that call for optionally rebalancing the tree on each insert or erase to avoid having different branches of the tree become too uneven. In particular, rebalancing is done by rotating subtrees, which involves touching an inserted or erased node's parent and/or uncle node, that node's own parent and/or uncle, and so on to the grandparents and granduncles up the tree, possibly as far as the root.
For example, consider Figure 3. To start with, the tree contains three nodes with the values 1, 2, and 3. To insert the value 4, we simply make it a child of node 3, as we would in a nonbalanced binary search tree. Clearly, that involves writing to node 3, to set its right-child pointer. However, to satisfy the red-black tree mechanics, we must also change node 3's and node 1's color to black. That adds overhead and loses some concurrency; for example, inserting 4 would conflict with adding 1.5 concurrently, because both inserts would need to touch both nodes 1 and 3.

Figure 3: Nonlocalized insertion into a red-black tree.
Next, to insert the value 5, we need to touch all but one of the nodes in the tree: We first make node 4 point to the new node 5 as its right child, then recolor both node 4 and node 3, and then because the tree is out of balance we also rotate 3-4-5 to make node 4 the root of that subtree, which means also touching node 2 to install node 4 as its new right child.
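The rotation step in this example can be sketched as follows. This is only the standard left rotation on a binary tree with parent pointers, not a full red-black insert, and it omits node colors; the names TreeNode and rotate_left are hypothetical.

```cpp
#include <cassert>

// Hypothetical tree node with a parent pointer; colors are omitted.
struct TreeNode {
    int key;
    TreeNode* parent = nullptr;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
    explicit TreeNode(int k) : key(k) {}
};

// Rotate left around x: x's right child y takes x's place and x becomes
// y's left child. Precondition: x->right is not null.
void rotate_left(TreeNode*& root, TreeNode* x) {
    TreeNode* y = x->right;
    x->right = y->left;                // write to x
    if (y->left) y->left->parent = x;  // write to y's former left child
    y->parent = x->parent;             // write to y
    if (!x->parent) {
        root = y;                      // x was the root
    } else if (x->parent->left == x) {
        x->parent->left = y;           // write to x's parent
    } else {
        x->parent->right = y;          // write to x's parent
    }
    y->left = x;
    x->parent = y;
}
```

Applied to the tree in the text with x being node 3, the writes land on node 3, node 4, and node 2 (to install node 4 as its new right child), on top of the earlier write that attached node 5. A thread working elsewhere under node 2 would therefore contend with this insert.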
So red-black trees cause some problems for concurrent code:
- It's hard to run updates truly concurrently because updates arbitrarily far apart in the tree can touch the same nodes—especially the root, but also other higher-level nodes to lesser degrees—and therefore contend with each other. We have lost the ability to make truly localized changes.
- The tree performs extra internal housekeeping writes. This increases the amount of shared data that needs to be written and synchronized across caches to publish what would be a small update in another data structure.
"But wait," I can hear some people saying, "why can't we just put a mutex inside each node and take the locks in a single direction (up the tree) like we could do with the linked list and hand-over-hand locking? Wouldn't that let us regain the ability to have concurrent use of the data structure at least?" Short answer: That's easy to do, but hard to do right. Unlike the linked list case, however: (a) you may need to take many more locks, even all the way up to the root; and (b) the higher-level nodes will still end up being high-contention resources that bottleneck scalability. Also, the code to do this is much more complicated. As Fraser noted in 2004: "One superficially attractive solution is to read-lock down the tree and then write-lock on the way back up, just as far as rebalancing operations are required. This scheme would acquire exclusive access to the minimal number of nodes (those that are actually modified), but can result in deadlock with search operations (which are locking down the tree)." [2] He also proposed a fine-grained locking technique that does allow some concurrency, but notes that it "is significantly more complicated." There are easy answers, but few easy and correct answers.
To get around these limitations, researchers have worked on alternative structures such as skip lists [4], and on variants of red-black trees that can be more amenable to concurrency, such as by doing relaxed balancing instead of rebalancing immediately when needed after each update. Some of these are significantly more complex, which incurs its own costs in both performance and correctness/maintainability (for example, relaxed balancing was first suggested in 1978 but not implemented successfully until five years later). For more information and some relative performance measurements showing how even concurrent versions can still limit scalability, see [3].
Conclusions
Concurrency-friendliness alone doesn't singlehandedly trump other performance requirements. The usual performance considerations of Big-Oh complexity, memory overhead, locality, and traversal order all still apply. Even when writing parallel code, you shouldn't choose a data structure only because it's concurrency-friendly; you should choose the one that meets all your performance needs. Lists may be more concurrency-friendly than balanced trees, but trees are faster to search, and "individual searches are fast" can outweigh "multiple searches can run in parallel." (If you need both, try an alternative like skip lists.)
Remember:
- In parallel code, your performance needs likely include the ability to allow multiple threads to use the data at the same time.
- On parallel hardware, you may also care about minimizing the cost of memory synchronization.
In those situations, prefer concurrency-friendly data structures. The more a container supports truly localized updates, the more concurrency you can have as multiple threads can actively use different parts of the data structure at the same time, and (secondarily but still sometimes importantly) the more you can avoid invisible memory synchronization overhead in your high-performance code.
Notes
[1] H. Sutter. "Use Lock Hierarchies to Avoid Deadlock" (Dr. Dobb's Journal, January 2008).
[2] K. Fraser. "Practical lock-freedom" (University of Cambridge Computer Laboratory Technical Report #579, February 2004).
[3] S. Hanke. "The Performance of Concurrent Red-Black Tree Algorithms" (Lecture Notes in Computer Science, 1668:286-300, Springer, 1999).
[4] M. Fomitchev and E. Ruppert. "Lock-Free Linked Lists and Skip Lists" (PODC '04, July 2004).
Reposted from: http://www.drdobbs.com/parallel/choose-concurrency-friendly-data-structu/208801371?pgno=3