sqlserver 调试WINDBG ---troubleshootingsql.com
https://troubleshootingsql.com/tag/stack-dump/
Debugging that latch timeout
My last post of debugging an assertion didn’t have any cool debugging tips since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even access to the source code. In this post, I am going to not disappoint and show you some more cool things that the windows debugger can do for you with public symbols for a latch timeout issue.
When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of it’s type generates a mini-dump. If there are further occurrences of the same latch timeout, then that is reported as an error message in the SQL Errorlog.
Buffer latch timeouts are typically reported using Error: 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get the an 847 error.
Error # | Error message template (from sys.messages) |
844 | Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait. |
845 | Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d. |
846 | A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait. |
847 |
Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait. |
This is what you will see in the SQL Errorlog when a latch timeout occurs.
spid148 Time out occurred while waiting for buffer latch —type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.
spid148 **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148 ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txtspid148 * Latch timeout
spid148 * Input Buffer 84 bytes –
spid148 * DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.
I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, that there is a large opportunity for due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.
Latch timeouts are typically victims of either a system related issue (hardware or drivers or operating system or a previous error encountered by SQL Server). So the next obvious action item would be to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout issue. I see a number of OS Error 1450 reported by the same SPID 148 for the same file handle but different offsets.
spid148 The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
Additionally, I see prior and post (within 5-15 minutes) the latch timeout issue, multiple other SPIDs reporting the same 1450 error message for different offsets but again on the same file.
spid137 The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
I also see the latch timeout message being reported after every 300 ms for the same page and the database.
spid148 Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.
Notice the waittime above, it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are relevant posts related to the above issue.
http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx
As of now, it is quite clear that the issue is related to a possible sparse file issue related to file fragmentation. Now it is time for me to check if there are other threads in the dump waiting on SyncWritePreemptivecalls.
Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).
Now when you load the dump using WinDBG, you will see the following information:
Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are availableComment: ‘Stack Trace’
Comment: ‘Latch timeout’This dump file has an exception of interest stored in it.
The above tells you that this is a mini-dump for a Latch Timeoutcondition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.
.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols
Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.
.reload /f sqlservr.exe
Next thing is to verify that the symbols were correctly loaded using thelmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.
0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr T (pdb symbols) d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb
If symbols were not loaded, then you will see an output as shown below.
0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe
I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.
0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0
Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3 ret
0:081> kL100ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites
You could get all the callstacks with the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3
If you look at the thread that raised ended up raising the Latch Timeout warning was also performing a CHECKDB.
0:074> .ecxr
0:074> kL100
kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch
…
…
sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute
If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and the searching for the function call in blue above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.
If case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to Latch Timeouts for the SQL Server release that you are using. The symptoms section would have sufficient amount of information for you to compare with your current symptoms and scenario and determine if the KB Article that you are looking at is applicable in your case.
Now is the time, when you need to have some context about the operations that were occurring on the server to actually determine what the actual issue is. Based on what I heard from the system administrators that there was a CHECKDB being executed on the database while the application was executing DML operations on the database. Additionally, the volume on which the disk resides on has fragmentation and the database in question is large (>750GB).
Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is high amount of fragmentation on the disks or if there are a large number of concurrent changes occurring on the database when a CHECKDB is executing on it. A number of Windows and SQL Server updates along with workarounds to run CHECKDB on such databases is mentioned in the second blog post mentioned above. On a separate note, this is not an issue with CHECKDB! It is limitation that you are hitting with sparse files on the OS layer. Remember CHECKDB, starting from SQL Server 2005, creates an internal snapshot (makes sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!
As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It might be quite possible that CSS might ask you to collect additional data as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS Error 1450.
Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.
sqlserver 调试WINDBG ---troubleshootingsql.com的更多相关文章
- .NET 5 程序高级调试-WinDbg
上周和大家分享了.NET 5开源工作流框架elsa,程序跑起来后,想看一下后台线程的执行情况.抓了个进程Dump后,使用WinDbg调试,加载SOS调试器扩展,结果无法正常使用了: 0:000> ...
- SQLServer调试
1.普通调试 直接点击SSMS客户端上的调试按钮即可 2.存储过程调试 2.1 定义存储过程(以Northwind数据库为例) USE [Northwind] GO /****** Object: S ...
- SQLSERVER LATCH WINDBG
https://mssqlwiki.com/2012/09/07/latch-timeout-and-sql-server-latch/
- 使用Windbg和SoS扩展调试分析.NET程序
在博客堂的不是我舍不得 - High CPU in GC(都是+=惹的祸,为啥不用StringBuilder呢?). 不是我舍不得 - .NET里面的Out Of Memory 看到很多人在问如何分析 ...
- Windbg是windows平台上强大的调试器
基础调试命令 - .dump/.dumpcap/.writemem/!runaway Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windb ...
- windbg预览版,windbg preview配置win7x64双机调试
目录 一丶简介 二丶步骤 1.下载Windbg Preview (windbg预览版本) 2.配置虚拟机端口 3.虚拟机设置调试湍口 4.windbg preview开始调试. 一丶简介 Windbg ...
- 使用WinDbg内核调试[转]
Technorati 标签: windbg,内核调试 WINDOWS调试工具很强大,但是学习使用它们并不容易.特别对于驱动开发者使用的WinDbg和KD这两个内核调试器(CDB和NTSD是用户态调试器 ...
- WinDBG 调试命令大全
转载收藏于:http://www.cnblogs.com/kekec/archive/2012/12/02/2798020.html #调试命令窗口 ++++++++++++++++++++++++ ...
- 基础调试命令 - .dump/.dumpcap/.writemem/!runaway
Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windbg可以做内核态调试 Windbg可以脱离源代码进行调试 Windbg可以用来分析dum ...
随机推荐
- [洛谷P1032] 字串变换
洛谷题目链接:字串变换 题目描述 已知有两个字串 A, B 及一组字串变换的规则(至多6个规则): A1 -> B1 A2 -> B2 规则的含义为:在 A$中的子串 A1 可以变换为 B ...
- 51nod 1060 最复杂的数
把一个数的约数个数定义为该数的复杂程度,给出一个n,求1-n中复杂程度最高的那个数. 例如:12的约数为:1 2 3 4 6 12,共6个数,所以12的复杂程度是6.如果有多个数复杂度相等,输出最 ...
- #学习方法2打印为空,说明#延迟加载#解决方案:将nameField等设置改在viewDidLoad中设置
#学习方法#error调试#NSLog调试之无法进行数据传输,Edit无法现实之前编辑 的内容 https://www.evernote.com/shard/s227/sh/c ...
- 【洛谷 P1653】 猴子 (并查集)
题目链接 没删除调试输出,原地炸裂,\(80\)->\(0\).如果你要问剩下的\(20\)呢?答:数组开小了. 这题正向删边判连通性是很不好做的,因为我们并不会并查集的逆操作.于是可以考虑把断 ...
- 渗透测试中如何科学地使用V*P*N
环境说明 Windows7 虚拟机,作为VPN网关,负责拨VPN.VPN可以直接OPENVPN,也可以使用ShadowSocks+SSTap. Kali 虚拟机, 渗透测试工作机 配置步骤 Windo ...
- A trick in Exploit Dev
学习Linux BOF的时候,看了这个文章,https://sploitfun.wordpress.com/2015/06/23/integer-overflow/ ,原文给出的exp无法成功, 此时 ...
- Linux内核中的GPIO系统之(3):pin controller driver代码分析--devm_kzalloc使用【转】
转自:http://www.wowotech.net/linux_kenrel/pin-controller-driver.html 一.前言 对于一个嵌入式软件工程师,我们的软件模块经常和硬件打交道 ...
- Backbone Model 源码简谈 (版本:1.1.0 基础部分完毕)
Model工厂 作为model的主要函数,其实只有12行,特别的简练 var Model = Backbone.Model = function(attributes, options) { va ...
- mpvue-小程序之蹲坑记
1. 不支持 v-html 小程序里所有的 BOM/DOM 都不能用,也就是说 v-html 指令不能用 部分复杂的 JavaScript 渲染表达式 {{}} 双花括号的部分,直接编码到 wxml ...
- 读取pandas修改单列数据类型
import pandas as pd import numpy as np df = pd.read_csv('000917.csv',encoding='gbk') df = df[df['涨跌幅 ...