https://troubleshootingsql.com/tag/stack-dump/

Debugging that latch timeout

Posted on August 26, 2011 by Amit Banerjee

My last post on debugging an assertion didn’t have any cool debugging tips, since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even the source code. In this post, I am not going to disappoint: I will show you some more cool things that the Windows debugger can do for you with public symbols for a latch timeout issue.

When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of its type generates a mini-dump. Further occurrences of the same latch timeout are only reported as error messages in the SQL Errorlog.

Buffer latch timeouts are typically reported as errors 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get an 847 error.

Error #	Error message template (from sys.messages)
844	Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
845	Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846	A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847	Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

This is what you will see in the SQL Errorlog when a latch timeout occurs.

spid148     Time out occurred while waiting for buffer latch —type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.
spid148     **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148     ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

spid148     * Latch timeout
spid148     * Input Buffer 84 bytes –
spid148     *             DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.

I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, there is a lot of due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start firing up the Windows debugger to analyze the memory dump. The first obvious thing that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (page 3:11234374 in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog, which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.
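
The fields called out above can also be pulled out of the Errorlog line programmatically. The following is only a minimal sketch: the regular expression and the parse_latch_timeout helper are my own illustration, not something SQL Server ships, and the sample line is the 844 message from the Errorlog excerpt with the dash written as plain ASCII.

```python
import re

# Pattern for the 844 buffer latch message as it appears in the Errorlog excerpt.
# Only the fields discussed in the post are captured (hypothetical helper).
LATCH_844 = re.compile(
    r"^(?P<spid>spid\d+)\s+Time out occurred while waiting for buffer latch"
    r".*?type (?P<latch_type>\d+)"
    r".*?page (?P<file_id>\d+):(?P<page_id>\d+)"
    r".*?database id: (?P<database_id>\d+)"
    r".*?waittime (?P<waittime>\d+)"
    r".*?owning task (?P<owning_task>0x[0-9A-Fa-f]+)"
)


def parse_latch_timeout(line):
    """Return the interesting fields of an 844 message, or None if it does not match."""
    m = LATCH_844.search(line)
    if m is None:
        return None
    return {
        "spid": m["spid"],
        "latch_type": int(m["latch_type"]),
        "page": (int(m["file_id"]), int(m["page_id"])),  # (file id, page id)
        "database_id": int(m["database_id"]),
        "waittime": int(m["waittime"]),
        "owning_task": m["owning_task"],
    }


SAMPLE = (
    "spid148     Time out occurred while waiting for buffer latch -- type 4, "
    "bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, "
    "allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, "
    "waittime 300, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait."
)

if __name__ == "__main__":
    print(parse_latch_timeout(SAMPLE))
```

Running this over every line of an Errorlog makes it easy to see whether the same page, database and owning task keep showing up across repeated timeouts.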

Latch timeouts are typically victims of either a system-related issue (hardware, drivers or operating system) or a previous error encountered by SQL Server. So the next obvious action item is to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout. I see a number of OS Error 1450 messages reported by the same SPID 148 for the same file handle but different offsets.

spid148 The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

Additionally, within 5-15 minutes before and after the latch timeout, I see multiple other SPIDs reporting the same 1450 error message for different offsets, but again on the same file.

spid137 The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
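
To confirm that all of these 1450 errors really point at the same file, the messages can be grouped by file handle. This is a rough sketch under my own assumptions: the regular expression and the group_by_handle helper are illustrative only, and the two sample lines are shortened copies of the messages shown above.

```python
import re
from collections import defaultdict

# Matches the OS error message from the Errorlog; tolerant of a missing space
# between "error" and the error number (hypothetical helper pattern).
OS_ERROR = re.compile(
    r"^(?P<spid>spid\d+)\s+The operating system returned error\s*(?P<code>\d+)\("
    r".*?during a (?P<op>\w+) at offset (?P<offset>0x[0-9a-fA-F]+)"
    r" in file with handle (?P<handle>0x[0-9a-fA-F]+)"
)


def group_by_handle(lines):
    """Map file handle -> list of (spid, error code, offset) tuples."""
    groups = defaultdict(list)
    for line in lines:
        m = OS_ERROR.search(line)
        if m:
            groups[m["handle"]].append(
                (m["spid"], int(m["code"]), int(m["offset"], 16))
            )
    return dict(groups)


SAMPLE_LINES = [
    "spid148 The operating system returned error1450(Insufficient system resources "
    "exist to complete the requested service.) to SQL Server during a write at "
    "offset 0x0000156bf36000 in file with handle 0x0000000000001358.",
    "spid137 The operating system returned error1450(Insufficient system resources "
    "exist to complete the requested service.) to SQL Server during a write at "
    "offset 0x000007461f8000 in file with handle 0x0000000000001358.",
]

if __name__ == "__main__":
    for handle, hits in group_by_handle(SAMPLE_LINES).items():
        spids = sorted({spid for spid, _, _ in hits})
        print(handle, "->", len(hits), "errors from", spids)
```

A single handle with many SPIDs and many offsets, as in this case, suggests a problem with the file itself rather than with one session.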

I also see the latch timeout message being reported every 300 seconds for the same page and database (the waittime value in the message is expressed in seconds).

spid148 Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.

Notice the waittime above: it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are the relevant posts for the above issue.
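
To put the two waittime values in perspective, here is a small piece of arithmetic. It assumes that waittime is expressed in seconds, which is how this message is usually read; the constant names are my own.

```python
from datetime import timedelta

# Assumption: the waittime field of the 844 message is in seconds.
FIRST_WAITTIME_S = 300     # waittime in the first latch timeout message
LATEST_WAITTIME_S = 82800  # waittime in the later message for the same page

elapsed = timedelta(seconds=LATEST_WAITTIME_S)
intervals = LATEST_WAITTIME_S // FIRST_WAITTIME_S

if __name__ == "__main__":
    print("total wait:", elapsed)
    print("number of", FIRST_WAITTIME_S, "second intervals:", intervals)
```

Under that assumption the thread had been waiting on the same page for 23 hours, which is why this is far more than a passing blip.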

http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx

At this point, it is quite clear that this is likely a sparse file issue caused by file fragmentation. Now it is time for me to check whether there are other threads in the dump waiting on SyncWritePreemptive calls.

Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).

Now when you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Latch timeout’

This dump file has an exception of interest stored in it.

The above tells you that this is a mini-dump for a Latch Timeout condition and the location from where you loaded the dump. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done through the Symbol File Path dialog (CTRL+S) by providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that, in that case, the public symbols for all loaded modules in the dump that are available on the Microsoft Symbol Server will be downloaded.

.reload /f sqlservr.exe
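
If you prefer to script these steps instead of typing them into WinDBG, the console debugger cdb.exe accepts the dump file (-z), the symbol path (-y) and an initial command string (-c) on the command line. The sketch below only builds that command line. The build_cdb_command helper is hypothetical, and it assumes cdb.exe from the Debugging Tools for Windows is on the PATH.

```python
def build_cdb_command(dump_path, symbol_path, commands, cdb_exe="cdb.exe"):
    """Build an argument list that opens a dump in cdb, runs commands, then quits.

    cdb_exe is an assumption: it must point at cdb.exe from the
    Debugging Tools for Windows.
    """
    script = "; ".join(list(commands) + ["q"])  # "q" ends the debugging session
    return [cdb_exe, "-z", dump_path, "-y", symbol_path, "-c", script]


cmd = build_cdb_command(
    r"D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp",
    r"srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols",
    [".reload /f sqlservr.exe", "lmvm sqlservr"],
)

if __name__ == "__main__":
    print(cmd)
    # To actually run it on a machine with the debugger installed:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces in the dump path.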

The next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note that the module is reported with (pdb symbols) and a path into the local symbol cache.

0:005> lmvm sqlservr

start end module name
00000000`01000000 00000000`03668000 sqlservr T (pdb symbols) d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe
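
When the debugger output is captured to a file, the same check can be automated. This is a minimal sketch: the symbol_status helper is my own, and it simply looks for the "(pdb symbols)" or "(export symbols)" marker on the module line of the two outputs shown above.

```python
import re


def symbol_status(lmvm_output, module="sqlservr"):
    """Return the symbol type ("pdb", "export", ...) reported for a module."""
    # A module line starts with the start and end addresses, then the module name.
    module_line = re.compile(
        r"[0-9a-fA-F`]+\s+[0-9a-fA-F`]+\s+" + re.escape(module) + r"\b"
    )
    for line in lmvm_output.splitlines():
        if module_line.match(line.strip()):
            kind = re.search(r"\((\w+) symbols\)", line)
            return kind.group(1) if kind else "unknown"
    return "module not listed"


GOOD = r"""0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr T (pdb symbols) d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb
"""

BAD = r"""0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe
"""

if __name__ == "__main__":
    print("first output :", symbol_status(GOOD))
    print("second output:", symbol_status(BAD))
```

Anything other than "pdb" for sqlservr means the function names in the callstacks that follow cannot be trusted.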

I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.

0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0

Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match

0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3 ret
0:081> kL100

ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites

You could get the complete callstack of every thread that contains the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 2
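
With this many matching threads, it can help to turn the !findstack summary into a list of thread numbers. This is a small sketch with a helper of my own (matching_threads), fed with the summary output shown above.

```python
import re

FINDSTACK_LINE = re.compile(r"^Thread (\d+), (\d+) frame\(s\) match")


def matching_threads(findstack_output):
    """Return the thread numbers listed in a !findstack summary."""
    threads = []
    for line in findstack_output.splitlines():
        m = FINDSTACK_LINE.match(line.strip())
        if m:
            threads.append(int(m.group(1)))
    return threads


OUTPUT = """\
Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match
"""

threads = matching_threads(OUTPUT)

if __name__ == "__main__":
    print(len(threads), "threads are inside SyncWritePreemptive")
    # Commands to switch to each of those threads, in the same form as ~81s above.
    print(" ".join(f"~{t}s" for t in threads))
```

In this dump, that is 17 threads all waiting on synchronous writes, which lines up with the flood of 1450 errors in the Errorlog.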

If you look at the thread that ended up raising the Latch Timeout warning, you will see that it was also performing a CHECKDB.

0:074> .ecxr

0:074> kL100

kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch
…
…
sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute

If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and then search for one of the latch timeout functions in the callstack above (for example, sqlservr!LatchBase::DumpOnTimeoutIfNeeded). It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation, as reported in the input buffer for the Latch Timeout dump in the SQL Errorlog.
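
Searching a saved copy of the ~*kL100 output can be sketched as follows. Treat it as an illustration only: the threads_with_frame helper and the thread header pattern are my assumptions about the layout of the output, and the sample text is made up for the example (the thread header lines are invented, only the frame names are borrowed from the callstacks in this post).

```python
import re

# Assumed shape of a thread header line in "~*k" output, e.g. ". 74  Id: ..."
THREAD_HEADER = re.compile(r"^[.#\s]*(\d+)\s+Id:\s")


def threads_with_frame(all_stacks, needle):
    """Return the thread numbers whose callstack mentions the given function."""
    hits, current = [], None
    for line in all_stacks.splitlines():
        m = THREAD_HEADER.match(line)
        if m:
            current = int(m.group(1))  # a new thread section starts here
        elif current is not None and needle in line and current not in hits:
            hits.append(current)
    return hits


# Hypothetical excerpt: header values are invented for illustration.
SAMPLE = """\
. 74  Id: 1111.0001 Suspend: 0 Teb: 0 Unfrozen
kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!BUF::AcquireLatch
  81  Id: 1111.0002 Suspend: 0 Teb: 0 Unfrozen
ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
"""

if __name__ == "__main__":
    print(threads_with_frame(SAMPLE, "LatchBase::DumpOnTimeoutIfNeeded"))
    print(threads_with_frame(SAMPLE, "FCB::SyncWritePreemptive"))
```

The same function works for any frame name, so it can be reused to list the threads stuck in SyncWritePreemptive as well.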

In case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for Latch Timeout issues that were addressed for the SQL Server release that you are using. The Symptoms section of each KB Article usually has enough information for you to compare with your current symptoms and scenario and determine whether the article is applicable in your case.

Now is the time when you need some context about the operations that were occurring on the server to determine what the actual issue is. From what I heard from the system administrators, a CHECKDB was being executed on the database while the application was executing DML operations on it. Additionally, the volume on which the database resides is fragmented and the database in question is large (>750GB).

Based on the two MSDN blog posts that I mentioned above, it is quite clear that you can run into sparse file limitations when there is a high amount of fragmentation on the disks or when a large number of concurrent changes occur on the database while a CHECKDB is executing on it. A number of Windows and SQL Server updates, along with workarounds to run CHECKDB on such databases, are listed in the second blog post. On a separate note, this is not an issue with CHECKDB! It is a limitation that you are hitting with sparse files at the OS layer. Remember that CHECKDB, starting from SQL Server 2005, creates an internal snapshot (which uses sparse files) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!

As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It is quite possible that CSS will ask you to collect additional data, as most Latch Timeout issues are an after-effect of a previous issue. In this case, it was the OS Error 1450.

Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.
