Getting to the Bottom of It (1): the causes of spurious wakeups, and why pthread_cond_wait and pthread_cond_signal must be used with a while loop
Getting to the bottom of spurious wakeups
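1. Overview
This article covers the following, in order:
- What a spurious wakeup is
- What causes spurious wakeups (two causes)
- Why the system does not eliminate this "bug" at the root (three reasons)
- How developers guard against spurious wakeups (recheck the condition in a while loop)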
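2. What is a spurious wakeup
When a single call to pthread_cond_signal causes more than one thread to return from pthread_cond_wait, the effect is called a "spurious wakeup". (The expected behaviour is that one pthread_cond_signal lets one thread return.)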
The effect is that more than one thread can return from its call to pthread_cond_wait() or pthread_cond_timedwait() as a result of one call to pthread_cond_signal(). This effect is called "spurious wakeup".
Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html
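3. What causes spurious wakeups
I have come across two broad explanations:
- Unavoidable system-level events such as signal interruptions. This is the official explanation, and it matches the definition above.
- Application-level design issues. Strictly speaking this is not a spurious wakeup, but it is a good example for the later discussion of while, so I treat it as a spurious wakeup in a broad sense. (This "broad" definition is my own: a thread is woken up when the developer did not intend it to be.)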
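3.1 System interruptions and similar events
On Linux, pthread_cond_wait is implemented on top of the futex system call. When a process receives a signal, a blocking system call (such as wait, read, or recv) can return immediately with errno set to EINTR. In other words, wait may return even though nobody called pthread_cond_signal.
Like any other code, the thread scheduler may also suffer a temporary blackout because of an abnormal event in the underlying hardware or software.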
The pthread_cond_wait() function in Linux is implemented using the futex system call. Each blocking system call on Linux returns abruptly with EINTR when the process receives a signal.
Source: http://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=289803065
like any code, thread scheduler may experience temporary blackout due to something abnormal happening in underlying hardware / software.
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
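The EINTR behaviour described above can be observed directly on an ordinary blocking call. The following is a minimal sketch, not a look inside pthread_cond_wait itself: it blocks in read() on an empty pipe and lets a periodic SIGALRM interrupt it. The function name blocking_read_errno_after_signal and the 10 ms timer period are illustrative choices of mine, and the handler is installed without SA_RESTART on purpose so that the kernel does not restart the call transparently.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocking call. */
static void on_alarm(int signo) { (void)signo; }

/* Hypothetical helper: blocks in read() on an empty pipe, lets SIGALRM
 * interrupt it, and returns the errno that read() left behind
 * (0 if setup failed or read() did not fail). */
int blocking_read_errno_after_signal(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 0;

    struct sigaction sa, old_sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                 /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, &old_sa);

    /* Fire every 10 ms so the signal cannot be missed if it arrives
     * just before read() starts blocking. */
    struct itimerval tick = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
    };
    struct itimerval off = {
        .it_interval = { .tv_sec = 0, .tv_usec = 0 },
        .it_value    = { .tv_sec = 0, .tv_usec = 0 },
    };
    setitimer(ITIMER_REAL, &tick, NULL);

    /* Nothing is ever written to the pipe, so only a signal can end this. */
    char byte;
    ssize_t n = read(fds[0], &byte, 1);
    int saved_errno = (n < 0) ? errno : 0;

    /* Stop the timer first, then restore the previous handler. */
    setitimer(ITIMER_REAL, &off, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    close(fds[0]);
    close(fds[1]);
    return saved_errno;
}
```

Calling blocking_read_errno_after_signal() from main and comparing the result with EINTR shows the early return. Whether a particular pthread implementation surfaces such an interruption as a return from pthread_cond_wait is up to that implementation; the sketch only illustrates the underlying system-call behaviour.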
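3.2 Application-level issues
Note: I am not sure this counts as a spurious wakeup in the strict sense, because it does not match the definition above: no single signal causes multiple waits to return. So I treat it here as a spurious wakeup in the broad sense.
Consider a producer-consumer queue with three threads (thread1 and thread2 are consumers, thread3 is the producer):
thread1: locks, dequeues directly, then unlocks. Pseudocode:
lock
dequeue // consume
unlock
thread2: unlike thread1, it first checks the queue and waits if it is empty:
lock
if (queue is empty) pthread_cond_wait // this is the bug: the if must be a while, as analysed below
dequeue // consume
unlock
thread3: similar to thread1, but it is the producer: it locks, enqueues, then unlocks:
lock
enqueue // produce
if (queue is not empty) pthread_cond_signal
unlock
The steps are:
- Suppose the queue is initially empty, so thread2 is blocked in wait.
- thread3 produces one item, so the queue is no longer empty; thread3 sends the signal and unlocks.
- thread2's wait has been notified (note that wait needs two things before it can return: 1. receiving the signal, 2. reacquiring the lock), but thread2 and thread1 are still competing for the lock.
- Suppose thread1 gets the lock first, consumes the item so that the queue is empty again, and unlocks.
- Now suppose thread2 gets the lock, so its wait finally returns. It leaves the if statement and goes on to consume, but the queue has long been empty, so the program may fail. The intent was that thread2 consumes only when the queue is non-empty, so thread2 has been spuriously woken up in the broad sense.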
Consider a producer consumer queue, and three threads.
Thread 1 has just dequeued an element and released the mutex, and the queue is now empty. The thread is doing whatever it does with the element it acquired on some CPU.
Thread 2 attempts to dequeue an element, but finds the queue to be empty when checked under the mutex, calls pthread_cond_wait, and blocks in the call awaiting signal/broadcast.
Thread 3 obtains the mutex, inserts a new element into the queue, notifies the condition variable, and releases the lock.
In response to the notification from thread 3, thread 2, which was waiting on the condition, is scheduled to run.
However before thread 2 manages to get on the CPU and grab the queue lock, thread 1 completes its current task, and returns to the queue for more work. It obtains the queue lock, checks the predicate, and finds that there is work in the queue. It proceeds to dequeue the item that thread 3 inserted, releases the lock, and does whatever it does with the item that thread 3 enqueued.
Thread 2 now gets on a CPU and obtains the lock, but when it checks the predicate, it finds that the queue is empty. Thread 1 'stole' the item, so the wakeup appears to be spurious. Thread 2 needs to wait on the condition again.
So since you already always need to check the predicate under a loop, it makes no difference if the underlying condition variables can have other sorts of spurious wakeups.
Source: https://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
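The "stolen item" scenario can be reproduced deterministically with a small sketch. To avoid depending on scheduler timing, it makes two simplifications that are my own and not part of the original example: the queue is reduced to an item counter, and the main thread plays both thread3 (enqueue and signal) and thread1 (steal the item) inside one critical section. The names struct demo, thread2_consumer and stolen_wakeup_underflows are hypothetical.

```c
#include <pthread.h>
#include <sched.h>

struct demo {
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    int count;      /* items currently in the queue */
    int waiting;    /* set once thread2 is about to park in pthread_cond_wait */
    int underflow;  /* set if thread2 reaches the dequeue step with no items */
};

/* thread2 from the text: guards the wait with `if`, not `while`. */
static void *thread2_consumer(void *arg)
{
    struct demo *d = arg;
    pthread_mutex_lock(&d->lock);
    if (d->count == 0) {
        d->waiting = 1;
        pthread_cond_wait(&d->not_empty, &d->lock);
    }
    /* The predicate is NOT rechecked above, so record what a real
     * dequeue would have found instead of popping from an empty queue. */
    if (d->count == 0)
        d->underflow = 1;
    else
        d->count--;
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* Returns 1 if thread2 woke up and found the queue empty, -1 on error. */
int stolen_wakeup_underflows(void)
{
    struct demo d = { .count = 0 };
    pthread_mutex_init(&d.lock, NULL);
    pthread_cond_init(&d.not_empty, NULL);

    pthread_t t2;
    if (pthread_create(&t2, NULL, thread2_consumer, &d) != 0)
        return -1;

    /* Wait until thread2 has released the lock inside pthread_cond_wait. */
    for (;;) {
        pthread_mutex_lock(&d.lock);
        int parked = d.waiting;
        pthread_mutex_unlock(&d.lock);
        if (parked)
            break;
        sched_yield();
    }

    pthread_mutex_lock(&d.lock);
    d.count++;                          /* thread3: enqueue one item */
    pthread_cond_signal(&d.not_empty);  /* thread3: notify the waiter */
    d.count--;                          /* thread1: takes the item first */
    pthread_mutex_unlock(&d.lock);

    pthread_join(t2, NULL);
    int result = d.underflow;
    pthread_cond_destroy(&d.not_empty);
    pthread_mutex_destroy(&d.lock);
    return result;
}
```

In this sketch stolen_wakeup_underflows() returns 1: thread2 was legitimately signalled, yet by the time it reacquired the lock there was nothing left to consume.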
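4. Why spurious wakeups are not eliminated at the root
Broadly, there are three reasons:
- Practical: the problem is hard to fix, and nobody wants to fix it
- Performance: ruling out spurious wakeups could substantially reduce the efficiency of condition variables (a somewhat mysterious explanation, from David R. Butenhof, "Programming with POSIX Threads", p. 80)
- The "Ah Q" attitude (making a virtue of necessity): spurious wakeups force us to anticipate and handle these system-level glitches, which in turn "helps" make our programs more robust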
4.1 Practical reasons
The problem is hard to fix, and nobody wants to fix it.
The first reason is that nobody wants to fix it.
The second reason is that fixing this is supposed to be hard.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
4.2 Performance
In short, it is not worth sacrificing overall efficiency to fix a problem that rarely occurs and is easy to handle in another way.
Spurious wakeups may sound strange, but on some multiprocessor systems, making condition wakeup completely predictable might substantially slow all condition variable operations.
Source: David R. Butenhof, "Programming with POSIX Threads" (p. 80)
While this problem could be resolved, the loss of efficiency for a fringe condition that occurs only rarely is unacceptable, especially given that one has to check the predicate associated with a condition variable anyway. Correcting this problem would unnecessarily reduce the degree of concurrency in this basic building block for all higher-level synchronization operations.
Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html
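4.3 The "Ah Q" attitude
Of course, such events should be made as rare as possible. But since no software is 100% robust, it is reasonable to assume they will happen, and to recover gracefully when the scheduler detects one (for example, by noticing missed heartbeats).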
Of course, care should be taken for this to happen as rare as possible, but since there's no such thing as 100% robust software it is reasonable to assume this can happen and take care on the graceful recovery in case if scheduler detects this (eg by observing missing heartbeats).
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
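5. How to guard against bugs caused by spurious wakeups
First, why one tempting approach does not work, and why while is both workable and necessary:
- Simply calling wait again does not work: if the signal is sent in the meantime, the second wait never receives it and the thread stays suspended
- A while loop is workable and necessary.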
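5.1 Why simply calling wait again does not work
When glibc calls a blocking function such as read, it does so in a loop: if the call fails with errno set to EINTR, it calls read again. Can we handle spurious wakeups the same way? No. Pseudocode:
// read model
readagain:
ret = read(fd, buf, len);
if (ret < 0 && errno == EINTR) // interrupted, so read again
goto readagain;
// wait model
wait // first wait; after it returns, suppose we somehow detect a spurious wakeup and decide to wait again
// a signal sent during this gap is missed
wait // second wait; the signal was missed, so this waits forever
Retrying works for read because read simply takes data from the receive buffer, so it does not matter how many times or for how long it is interrupted. With wait, once it has been interrupted, a signal may be sent in the gap before the next wait and be missed, and the thread may then wait forever.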
... when glibc calls any blocking function, like 'read', it does it in a loop, and if 'read' returns EINTR, calls 'read' again.
Can the same trick be used to conditions? No, because the moment we return from 'futex' call, another thread can send us notification. And since we're not waiting inside 'futex', we'll miss the notification. So, we need to return to the caller, and have it reevaluate the predicate. If another thread indeed set it to true, we'll break out of the loop.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
Now, how could scheduler recover, taking into account that during blackout it could miss some signals intended to notify waiting threads? If scheduler does nothing, mentioned "unlucky" threads will just hang, waiting forever - to avoid this, scheduler would simply send a signal to all the waiting threads.
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
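For comparison, the read-side retry can be written as a small C helper. This is a sketch of the general EINTR retry idiom, not glibc's actual internal code, and the name read_retry is hypothetical. It is safe precisely because the data stays in the kernel buffer between attempts, which is the property a condition variable notification lacks.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call read() again whenever it was interrupted by
 * a signal before transferring any data. Any other result (data, EOF,
 * or a different error) is returned to the caller unchanged. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

A caller uses it exactly like read(), for example read_retry(fd, buf, sizeof buf). No equivalent wrapper can be written around pthread_cond_wait alone, because the library cannot tell whether the condition the caller cares about became true during the gap; only the caller's predicate check can.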
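5.2 Why a while loop is workable and necessary
In short, the whole point of wait and signal is that the programmer wants some condition to hold. Since spurious wakeups exist, we simply check that condition directly. And to cope with repeated spurious wakeups, we check it in a while loop.
Here is the example from 3.2 again (with if changed to while):
thread1: locks, dequeues directly, then unlocks. Pseudocode:
lock
dequeue
unlock
thread2: unlike thread1, it first checks the queue and waits if it is empty:
lock
while (queue is empty) pthread_cond_wait // changed to while
dequeue
unlock
thread3: similar to thread1, but it is the producer: it locks, enqueues, then unlocks:
lock
enqueue
if (queue is not empty) pthread_cond_signal
unlock
The steps are:
- Suppose the queue is initially empty, so thread2 is blocked in wait.
- thread3 produces one item, so the queue is no longer empty; thread3 sends the signal and unlocks.
- thread2's wait has been notified, but thread2 and thread1 are still competing for the lock.
- Suppose thread1 gets the lock first, consumes the item so that the queue is empty again, and unlocks.
- Now suppose thread2 gets the lock, so its wait finally returns. Because of the while loop, it checks the queue again, finds it still empty (thread1 has already stolen and consumed the item), stays in the loop, and waits again. This is the correct behaviour.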
Assumption of spurious wakeups forces thread to be conservative in what it does: set condition when notifying other threads, and liberal in what it accepts: check the condition upon any return from wait and repeat wait if it's not there yet.
Source: https://softwareengineering.stackexchange.com/users/31260/gnat
So, we need to return to the caller, and have it reevaluate the predicate. If another thread indeed set it to true, we'll break out of the loop.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
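Put together, the while-based pattern looks roughly like the following C sketch. It goes slightly beyond the pseudocode above in ways that are my own assumptions: the queue is a fixed-size ring buffer (QUEUE_CAP), so the producer also waits in a while loop on a second condition variable (not_full), and a STOP sentinel value is used to shut the consumers down. All type and function names are illustrative.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 8
#define STOP (-1)   /* sentinel telling a consumer to exit */

struct queue {
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
    int items[QUEUE_CAP];
    int head;    /* index of the oldest item */
    int count;   /* number of items stored */
};

void queue_init(struct queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
    q->head = 0;
    q->count = 0;
}

void queue_put(struct queue *q, int value)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)          /* recheck after every wakeup */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = value;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

int queue_get(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                  /* `while`, not `if` */
        pthread_cond_wait(&q->not_empty, &q->lock);
    assert(q->count > 0);                  /* guaranteed by the loop above */
    int value = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return value;
}

struct worker { struct queue *q; long sum; };

/* Consumer thread: adds up everything it dequeues until it sees STOP. */
static void *consume(void *arg)
{
    struct worker *w = arg;
    for (int v = queue_get(w->q); v != STOP; v = queue_get(w->q))
        w->sum += v;
    return NULL;
}

/* Produces 1..n for two competing consumers and returns the total consumed. */
long run_demo(int n)
{
    struct queue q;
    queue_init(&q);
    struct worker w[2] = { { &q, 0 }, { &q, 0 } };
    pthread_t tid[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, consume, &w[i]);
    for (int i = 1; i <= n; i++)
        queue_put(&q, i);
    for (int i = 0; i < 2; i++)
        queue_put(&q, STOP);               /* one sentinel per consumer */
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    return w[0].sum + w[1].sum;
}
```

With two consumers racing for each item, run_demo(100) returns 5050 (the sum of 1 to 100): whichever consumer loses the race for an item finds the queue empty on its recheck and simply waits again, so nothing is dequeued twice or from an empty queue.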
6. References
- An application-level cause of spurious wakeups: https://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
- System-level causes of spurious wakeups: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
- Why spurious wakeups are deliberately left unfixed:
https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html