atomic的底层实现

atomic操作

在编程过程中我们经常会使用到原子操作，这种操作即不想互斥锁那样耗时，又可以保证对变量操作的原子性，常见的原子操作有fetch_add、load、increment等。

而对于atomic的实现最基础的解释：原子操作是由底层硬件支持的一种特性。

底层硬件支持，到底是怎么样的一种支持？首先编写一个简单的示例代码：

#include <atomic>

int main()

{

    std::atomic<int> a;

    //a = 1;

    a++;

    return 0;

}

然后进行编译, 查看编译文件：

g++ -S atomic.cc

cat atomic.s

_ZNSt13__atomic_baseIiEppEi:

.LFB362:

	pushq	%rbp

	.seh_pushreg	%rbp

	movq	%rsp, %rbp

	.seh_setframe	%rbp, 0

	subq	$16, %rsp

	.seh_stackalloc	16

	.seh_endprologue

	movq	%rcx, 16(%rbp)

	movl	%edx, 24(%rbp)

	movq	16(%rbp), %rax

	movq	%rax, -8(%rbp)

	movl	$1, -12(%rbp)

	movl	$5, -16(%rbp)

	movl	-12(%rbp), %edx

	movq	-8(%rbp), %rax

	lock xaddl	%edx, (%rax)

	movl	%edx, %eax

	nop

	addq	$16, %rsp

	popq	%rbp

	ret

	.seh_endproc

	.ident	"GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"

我们可以看到在执行自增操作的时候，在xaddl 指令前多了一个lock前缀，而cpu对这个lock指令的支持就是所谓的底层硬件支持。

增加这个前缀后，保证了 load-add-store 步骤的不可分割性。

lock 指令的实现

众所周知，cpu在执行任务的时候并不是直接从内存中加载数据，而是会先将数据加载到L1和L2 cache中（典型的是两层缓存，甚至可能更多），然后再从cache中读取数据进行运算。

而现在的计算机通常都是多核处理器，每一个内核都对应一个独立的L1层缓存，多核之间的缓存数据同步是cpu框架设计的重要部分，MESI是比较常用的多核缓存同步方案。

当我们在单线程内执行 atomic++操作，自然是不会发生多核之间数据不同步的问题，但是我们在多线程多核的情况下，cpu是如何保证Lock特性的？

作者这里以intel x86架构的cpu为例进行说明，首先给出官方的说明文档：

User level locks involve utilizing the atomic instructions of processor to atomically update a memory space.

The atomic instructions involve utilizing a lock prefix on the instruction and having the destination operand assigned to a memory address.

The following instructions can run atomically with a lock prefix on current Intel processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. EnterCriticalSection utilizes atomic instructions to attempt to get a user-land lock before jumping into the kernel. On most instructions a lock prefix must be explicitly used except for the xchg instruction where the lock prefix is implied if the instruction involves a memory address.

In the days of Intel 486 processors, the lock prefix used to assert a lock on the bus along with a large hit in performance.

Starting with the Intel Pentium Pro architecture, the bus lock is transformed into a cache lock.

A lock will still be asserted on the bus in the most modern architectures if the lock resides in uncacheable memory or if the lock extends beyond a cache line boundary splitting cache lines.

Both of these scenarios are unlikely, so most lock prefixes will be transformed into a cache lock which is much less expensive.

上面说明了lock前缀实现原子性的两种方式：

锁bus：性能消耗大，在intel 486处理器上用此种方式实现
锁cache：在现代处理器上使用此种方式，但是在无法锁定cache的时候（如果锁驻留在不可缓存的内存中，或者如果锁超出了划分cache line 的cache boundy），仍然会去锁定总线。

大多数人看到这里可能感觉已经懂了，但实际还不够，bus lock 以及多核之间的cache lock是如何实现的？

The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor environment.

See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description of this prefix

Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

In most IA-32 and all Intel 64 processors, locking may occur without the LOCK# signal being asserted. See the “IA32 Architecture Compatibility” section below for more details.

The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG.

If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated. An undefined opcode exception will also be generated if the LOCK prefix is used with any instruction not in the above list.

The XCHG instruction always asserts the LOCK# signal regardless of the presence or absence of the LOCK prefix.

The LOCK prefix is typically used with the BTS instruction to perform a read-modify-write operation on a memory location in shared memory environment.

The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

在Intel的官方文档上标明，一个LOCK前缀强制性的确保一个操作在多核环境的shared memory中操作。LOCK前缀的完整性不受存储字段对齐的影响，对于任意未对齐的字段内存锁定都可以被观察到。

BUS LOCK

这是Intel官方对bus lock的说明

Intel processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link.

While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. This metric measures the ratio of bus cycles, during which a LOCK# signal is asserted on the bus.

The LOCK# signal is asserted when there is a locked memory access due to uncacheable memory, locked operation that spans two cache lines, and page-walk from an uncacheable page table.

英特尔处理器提供LOCK＃信号，该信号在某些关键内存操作期间会自动断言，以锁定系统总线或等效链接。在断言该输出信号时，来自其他处理器或总线代理的控制总线的请求将被阻止。此度量标准度量总线周期的比率，在此期间，在总线上声明LOCK＃信号。当由于不可缓存的内存，跨越两条缓存行的锁定操作以及来自不可缓存的页表的页面遍历而导致存在锁定的内存访问时，将发出LOCK＃信号。

在这里，锁定进入操作由总线上的一条消息组成，上面写着“好，每个人都退出总线一段时间”（出于我们的目的，这意味着“停止执行内存操作”）。然后，发送该消息的内核需要等待所有其他内核完成正在执行的内存操作，然后它们将确认锁定。只有在其他所有内核都已确认之后，尝试锁定操作的内核才能继续进行。最终，一旦锁定被释放，它再次需要向总线上的每个人发送一条消息，说：“一切都清楚了，您现在就可以继续在总线上发出请求了”。

CACHE LOCK

cache lock 要比bus lock 复杂很多，这里涉及到内核cache同步，还有 memory barriers、cache line和shared memory等概念，后续会持续更新。

其它

LOCK prefix 不仅仅用于atomic的实现，在其他的一些用户层的同步操作也会应用到，比如依赖于LOCK XCHG实现自旋锁等。

参考网址