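Debugging multithreaded Python code with gdb: tracking down a deadlock
Copyright: This article is jointly owned by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must appear prominently on the page. For questions, email: wangxu198709@gmail.com
Preface
Many people have probably used sqlite3. A year ago, for a project, I wrote a persistent queue library on top of sqlite3 (persist-queue), which has been on PyPI for a while. Recently two issues arrived at once: one asking for in-memory database support, the other about the performance of the sqlite3 queue. Even with a small amount of data, the sqlite queue would sometimes fail with: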
sqlite3.OperationalError: database is locked
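For readers who have not met this error before, it can be provoked on purpose. The sketch below is not taken from persist-queue; the table name q, the helper name provoke_locked_error and the 0.1 second timeout are assumptions made purely for illustration. One connection takes the write lock and holds it, and a second connection with a short timeout then tries to write.

```python
import os
import sqlite3
import tempfile


def provoke_locked_error():
    """Return the error text raised when a second writer cannot get the lock."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/ROLLBACK statements below control the transaction.
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("CREATE TABLE q (id INTEGER PRIMARY KEY, data BLOB)")
    writer.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    writer.execute("INSERT INTO q (data) VALUES (?)", (b"x",))

    # The second connection gives up after 0.1 s instead of the default 5 s.
    other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    try:
        other.execute("INSERT INTO q (data) VALUES (?)", (b"y",))
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        writer.execute("ROLLBACK")
        writer.close()
        other.close()


message = provoke_locked_error()
print(message)
```

With several threads sharing one database file, the same situation arises whenever one thread holds the write lock longer than another thread's timeout.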
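While fixing this, I noticed that the exception is much easier to trigger with multiple threads. Stranger still, the tests would sometimes hang in an apparent deadlock, with CPU usage staying at 100%.
Setting breakpoints in the program (import pdb;pdb.set_trace()) and adding logs did not help; the real obstacle was that I could not see exactly what every thread was doing at that moment.
After trying quite a few approaches, I finally found one that works: the famous gdb.
Compared with pdb, gdb has the following advantages:
- No explicit breakpoint such as "import pdb;pdb.set_trace()" is needed.
- Multithreaded programs are easy to debug, and you can switch between threads during the session. Many Python debuggers, such as winpdb and pydevd, do not support this.
- If the Python interpreter dumps core, the core file can be analyzed directly with gdb, whereas pdb can do nothing with it.
Python ships an extension for gdb that lets you inspect not only the Python interpreter itself but also the user program it is running.
Inside gdb, the Python gdb extension provides the following py-* commands:
- py-list: show the Python source around the line currently executing
- py-bt: show the Python call stack
- py-bt-full: show the Python call stack with details of every frame
- py-print: print a Python variable
- py-locals: print the variables in the current scope
- py-up: move up to the previous (calling) Python frame
- py-down: move down to the next (called) Python frame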
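These commands can also be run non-interactively. The following is a minimal sketch, assuming gdb and the Python debug symbols are installed; the helper names gdb_batch_command and dump_python_stacks are made up for this example. It builds a gdb command line from the standard -p, -batch and -ex options and asks for the Python stack of every thread in one go.

```python
import subprocess


def gdb_batch_command(pid, gdb_commands=("info threads",
                                         "thread apply all py-bt")):
    """Build the argv for a one-shot gdb session attached to pid."""
    argv = ["gdb", "-p", str(pid), "-batch"]
    for command in gdb_commands:
        # Each -ex runs one gdb command, in order, after attaching.
        argv += ["-ex", command]
    return argv


def dump_python_stacks(pid):
    """Run gdb in batch mode and return its raw output.

    Attaching usually needs root, so in practice the command may have
    to be prefixed with sudo.
    """
    return subprocess.check_output(gdb_batch_command(pid),
                                   stderr=subprocess.STDOUT)


print(" ".join(gdb_batch_command(9409)))
```

The batch form is handy for capturing the state of a hung process into a file before killing it; the interactive session is still the better choice for walking frames with py-up and py-down.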
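Environment setup
First, install the packages needed for debugging.
1. Install gdb, the main debugging tool
sudo apt-get install gdb
2. Install python-dbg, which lets gdb show the Python-level call stack while debugging
sudo apt-get install python-dbg
I am using python2.7 throughout; if you use python3.5, install the matching python3.5-dbg instead.
Reproducing the problem (deadlock, hung process)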
$ tox -e py27
py27 develop-inst-nodeps: /home/peter/Documents/persist-queue
py27 installed: configparser==3.5.0,cov-core==1.15.0,coverage==4.4.1,enum-compat==0.0.2,enum34==1.1.6,eventlet==0.21.0,flake8==3.5.0,funcsigs==1.0.2,greenlet==0.4.12,mccabe==0.6.1,mock==2.0.0,nose2==0.6.5,pbr==3.1.1,-e git+https://github.com/peter-wangxu/persist-queue@d26561e3c7c9a35fd75ddedf0a023a7dcbd563a9#egg=persist_queue,pkg-resources==0.0.0,pycodestyle==2.3.1,pyflakes==1.6.0,six==1.11.0,virtualenv==15.1.0
py27 runtests: PYTHONHASHSEED='564480154'
py27 runtests: commands[0] | nose2 --with-coverage --coverage-report xml --coverage-report term
WARNING:persistqueue.sqlbase:auto_commit=False is still experimental,only use it with care.
....
The test run above never returns. Sad...
Loading symbols with gdb
1. First, find the process ID
$ ps -ef | grep tox
peter : pts/ :: /usr/bin/python3 /usr/bin/tox
peter : pts/ :: /home/peter/Documents/persist-queue/.tox/py27/bin/python2. .tox/py27/bin/nose2 --with-coverage --coverage-report xml --coverage-report term
peter : pts/ :: grep --color=auto tox
PID 9409 above is my stuck process.
Next, attach gdb to it: sudo gdb python -p 9409
Remember to use sudo, otherwise the symbols may fail to load.
sudo gdb python -p
GNU gdb (Ubuntu 7.11.-0ubuntu1~16.5) 7.11.
Copyright (C) Free Software Foundation, Inc.
License GPLv3+: GNU GPL version or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.
done.
Attaching to program: /usr/bin/python, process
[New LWP ]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb923247827 in futex_abstimed_wait_cancelable (private=, abstime=0x0, expected=, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:
../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
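Note the line "Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done."
It means the symbols loaded successfully. Otherwise you would see "no debugging symbols found" in the output, most likely because the python2.7 in use and the dbg package are different versions.
2. Install debug symbols for the libraries in use
My gdb output also contained the line "Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...(no debugging symbols found)...done.", which means the sqlite3 symbols were not loaded. Installing the package fixes that: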
sudo apt-get install libsqlite3--dbg
Re-enter gdb and look for this line:
Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so....Reading symbols from /usr/lib/debug/.build-id//5ce90f42d13a60a36853327823ef5f90f2e8f1.debug...done.
Now all the symbols I care about are loaded.
Debugging the program with gdb
Debugging the main thread
First, use info threads to list the current threads:
(gdb) info threads
Id Target Id Frame
* Thread 0x7fb92365b700 (LWP ) "nose2" 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=, abstime=0x0, expected=, futex_word=0x152af60)
at ../sysdeps/unix/sysv/linux/futex-internal.h:
Thread 0x7fb91d4f1700 (LWP ) "nose2" _PyObject_GenericGetAttrWithDict () at ../Objects/object.c:
The thread marked with * is the one currently being debugged.
Finding out why the main thread is stuck
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fb92365b700 (LWP 9409))]
#0 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) py-list
335 waiter.acquire()
336 self.__waiters.append(waiter)
337 saved_state = self._release_save()
338 try: # restore state no matter what (e.g., KeyboardInterrupt)
339 if timeout is None:
>340 waiter.acquire()
341 if __debug__:
342 self._note("%s.wait(): got it", self)
343 else:
344 # Balancing act: We can't afford a pure busy loop, so we
345 # have to sleep; but if we sleep the whole timeout time,
(gdb) py-up
#7 Frame 0x7fb9180068e0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235efd50>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235efd50>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7fb9235effd0>], release=<built-in method release of thread.lock object at remote 0x7fb9235efd50>) at remote 0x7fb91fdb1c10>, timeout=None, waiter=<thread.lock at remote 0x7fb9235effd0>, saved_state=None)
waiter.acquire()
(gdb) py-up
#11 Frame 0x7fb91fdb69b0, for file /usr/lib/python2.7/threading.py, line 940, in join (self=<Thread(_Thread__block=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235efd50>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235efd50>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7fb9235effd0>], release=<built-in method release of thread.lock object at remote 0x7fb9235efd50>) at remote 0x7fb91fdb1c10>, _Thread__args=(3,), _Thread__ident=140433037399808, _Thread__target=<function at remote 0x7fb91fdae1b8>, _Thread__daemonic=False, _Thread__initialized=True, _Thread__started=<_Event(_Event__flag=True, _Verbose__verbose=False, _Event__cond=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235eff90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235eff90>, _Verbose__verbose=False, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fb9235eff90>) at remote 0x7fb91fdb1bd0>) at remote 0x7fb91f...(truncated)
self.__block.wait()
(gdb) py-up
#15 Frame 0x13d7ba0, for file /home/peter/Documents/persist-queue/tests/test_sqlqueue.py, line 179, in test_multiple_consumers (self=<SQLite3QueueInMemory(path=':memory:', auto_commit=True, _testMethodName='test_multiple_consumers', _cleanups=[], _testMethodDoc='Test sqlqueue can be used by multiple consumers.', _type_equality_funcs={<type at remote 0x905840>: 'assertSetEqual', <type at remote 0x9056a0>: 'assertMultiLineEqual', <type at remote 0x906f00>: 'assertDictEqual', <type at remote 0x906d60>: 'assertTupleEqual', <type at remote 0x9049a0>: 'assertSetEqual', <type at remote 0x9073e0>: 'assertListEqual'}, _resultForDoCleanups=<PluggableTestResult(session=<Session(testResult=<...>, logLevel=30, hooks=<PluginInterface(hooks={'wasSuccessful': <Hook(method='wasSuccessful', plugins=[<ResultReporter(descriptions=True, reportCategories={'failures': [], 'expectedFailures': [], 'skipped': [], 'unexpectedSuccesses': [], 'errors': []}, registered=True, dontReport=set(['failures', 'skipped', 'errors', 'passed', 'expected...(truncated)
t.join()
(gdb) py-list
174 t.start()
175 consumers.append(t)
176
177 p.join()
178 for t in consumers:
>179 t.join()
180
181 self.assertEqual(0, queue.qsize())
182 for x in range(1000):
183 self.assertNotEqual(0, counter[x],
184 "not 0 for counter's index %s" % x)
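As shown above, py-list plus a few py-up commands took me to the test I wrote in test_sqlqueue.py.
The main thread is sitting in t.join(), waiting for a worker thread to finish, but oddly the worker never completes its work.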
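The same "main thread blocked in join, worker never returns" picture can be reproduced in a few lines, and when the process can still run Python code, the stacks of all threads can be dumped from inside it. This is only a complementary sketch, not what was used in this investigation: stuck_worker, dump_thread_stacks and the thread name consumer-0 are hypothetical, and sys._current_frames() is a CPython-specific helper.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return {thread name: formatted stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, str(ident))
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


def stuck_worker(release):
    # Stands in for a consumer that never finishes its work.
    release.wait()


release = threading.Event()
t = threading.Thread(target=stuck_worker, args=(release,), name="consumer-0")
t.start()

# join() with a timeout returns even if the worker is still running,
# which gives the main thread a chance to report what the worker is doing.
t.join(timeout=0.2)
report = dump_thread_stacks() if t.is_alive() else {}

release.set()  # let the worker finish so the example exits cleanly
t.join()
print(report.get("consumer-0", "worker finished in time"))
```

This only works while the interpreter is still able to schedule the reporting thread; for a process that is already hung, or for a core dump, attaching gdb as described here remains the tool of choice.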
Debugging the worker thread
Next, use thread 2 to switch to thread 2 from the list above.
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f02f51f9700 (LWP 18749))]
#0 PyType_IsSubtype () at ../Objects/typeobject.c:1184
1184 ../Objects/typeobject.c: No such file or directory.
Use py-list to see where it is currently running:
(gdb) py-list
64 raise Empty
65 elif timeout is None:
66 # block until a put event.
67 pickled = self._pop()
68 while not pickled:
>69 self.put_event.wait()
70 pickled = self._pop()
71 elif timeout < 0:
72 raise ValueError("'timeout' must be a non-negative number")
73 else:
74 # block until the timeout reached
(gdb)
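The worker thread is at "self.put_event.wait()". One possibility is that self.put_event stays False and is never set to True, so the thread just keeps waiting. Let's dig further.
Use py-locals to check the current value of put_event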
(gdb) py-up
#26 Frame 0x7f02f82b3608, for file /home/peter/Documents/persist-queue/persistqueue/sqlqueue.py, line 49, in _pop (self=<SQLiteQueue(name='default', _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=':memory:', action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>)
row = self._select()
(gdb) py-list
44 # Action lock to assure multiple action to be *atomic*
45 self.action_lock = threading.Lock()
46
47 def _pop(self):
48 with self.action_lock:
>49 row = self._select()
50 # Perhaps a sqlite3 bug, sometimes (None, None) is returned
51 # by select, below can avoid these invalid records.
52 if row and row[0] is not None:
53 self._delete(row[0])
54 if not self.auto_commit:
(gdb) py-print self
local 'self' = <SQLiteQueue(name='default', _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=':memory:', action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>
(gdb) py-locals
self = <SQLiteQueue(name='default', _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=':memory:', action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>
put_event is True the whole time (_Event__flag=True). Thinking it over, I don't believe I ever reset it to False...
Looking at the code:
def get(self, block=True, timeout=None):
    if not block:
        pickled = self._pop()
        if not pickled:
            raise Empty
    elif timeout is None:
        # block until a put event.
        pickled = self._pop()
        while not pickled:
            self.put_event.wait(TICK_FOR_WAIT)
            pickled = self._pop()
    elif timeout < 0:
        raise ValueError("'timeout' must be a non-negative number")
    else:
        # block until the timeout reached
        endtime = _time.time() + timeout
        pickled = self._pop()
        while not pickled:
            remaining = endtime - _time.time()
            if remaining <= 0.0:
                raise Empty
            self.put_event.wait(
                TICK_FOR_WAIT if TICK_FOR_WAIT < remaining else remaining)
            pickled = self._pop()
    item = pickle.loads(pickled)
    return item
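If for some reason the queue is empty, lines 9-11 above (the "while not pickled" loop in the "timeout is None" branch) turn into an infinite busy loop: put_event is always set, so wait() returns immediately every time instead of blocking, and CPU usage goes to 100%.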
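The effect is easy to measure in isolation. The sketch below is not persist-queue code: count_polls is a hypothetical helper, the 0.05 second tick only stands in for TICK_FOR_WAIT, and clearing the event when nothing is found is just one possible remedy, not necessarily the fix the library adopted. It counts how many times a polling loop spins in 0.2 seconds when the event is left set, compared with when it is cleared after each empty poll.

```python
import threading
import time


def count_polls(event, clear_when_empty, duration=0.2, tick=0.05):
    """Poll an always-empty queue for `duration` seconds; return the loop count."""
    polls = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        # Returns at once if the event is set, otherwise blocks up to `tick`.
        event.wait(tick)
        polls += 1
        if clear_when_empty:
            # Nothing was found, so wait for the next put() to set it again.
            event.clear()
    return polls


never_cleared = threading.Event()
never_cleared.set()
busy = count_polls(never_cleared, clear_when_empty=False)

cleared = threading.Event()
cleared.set()
calm = count_polls(cleared, clear_when_empty=True)

print("polls with event left set:", busy)
print("polls with event cleared: ", calm)
```

With the event left set the loop spins as fast as the CPU allows, which matches the 100% CPU seen in the hung test; once the event is cleared, each iteration actually sleeps for up to one tick.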
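So that roughly explains why the multithreaded tests sometimes hang with CPU usage at 100%, and why they also report "sqlite3.OperationalError: database is locked": the worker threads keep trying to read the database, and if another thread doing an update or insert holds the sqlite file lock, the read times out.
References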
http://podoliaka.org/2016/04/10/debugging-cpython-gdb
https://wiki.python.org/moin/DebuggingWithGdb