https://troubleshootingsql.com/tag/stack-dump/

Debugging that latch timeout

 
 
 
 
 
 
6 Votes

My last post of debugging an assertion didn’t have any cool debugging tips since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even access to the source code. In this post, I am going to not disappoint and show you some more cool things that the windows debugger can do for you with public symbols for a latch timeout issue.

When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of it’s type generates a mini-dump. If there are further occurrences of the same latch timeout, then that is reported as an error message in the SQL Errorlog.

Buffer latch timeouts are typically reported using Error: 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get the an 847 error.

Error # Error message template (from sys.messages)
844 Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.
845 Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846 A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847

Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

This is what you will see in the SQL Errorlog when a latch timeout occurs.

spid148     Time out occurred while waiting for buffer latch —type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.
spid148     **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148     ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

spid148     * Latch timeout
spid148     * Input Buffer 84 bytes –
spid148     *             DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.

I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, that there is a large opportunity for due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.

Latch timeouts are typically victims of either a system related issue (hardware or drivers or operating system or a previous error encountered by SQL Server). So the next obvious action item would be to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout issue. I see a number of OS Error 1450 reported by the same SPID 148 for the same file handle but different offsets.

spid148     The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

Additionally, I see prior and post (within 5-15 minutes) the latch timeout issue, multiple other SPIDs reporting the same 1450 error message for different offsets but again on the same file.

spid137     The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

I also see the latch timeout message being reported after every 300 ms for the same page and the database.

spid148     Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.

Notice the waittime above, it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are relevant posts related to the above issue.

http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx

As of now, it is quite clear that the issue is related to a possible sparse file issue related to file fragmentation. Now it is time for me to check if there are other threads in the dump waiting on SyncWritePreemptivecalls.

Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).

Now when you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Latch timeout’

This dump file has an exception of interest stored in it.

The above tells you that this is a mini-dump for a Latch Timeoutcondition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using thelmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:005> lmvm sqlservr

start             end                 module name
00000000`01000000 00000000`03668000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe

I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.

0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0

Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match

0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3              ret
0:081> kL100

ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites

You could get all the callstacks with the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3

If you look at the thread that raised ended up raising the Latch Timeout warning was also performing a CHECKDB.

0:074> .ecxr

0:074> kL100

kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch


sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute

If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and the searching for the function call in blue above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.

If case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to Latch Timeouts for the SQL Server release  that you are using. The symptoms section would have sufficient amount of information for you to compare with your current symptoms and scenario and determine if the KB Article that you are looking at is applicable in your case.

Now is the time, when you need to have some context about the operations that were occurring on the server to actually determine what the actual issue is. Based on what I heard from the system administrators that there was a CHECKDB being executed on the database while the application was executing DML operations on the database. Additionally, the volume on which the disk resides on has fragmentation and the database in question is large (>750GB).

Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is high amount of fragmentation on the disks or if there are a large number of concurrent changes occurring on the database when a CHECKDB is executing on it. A number of Windows and SQL Server updates along with workarounds to run CHECKDB on such databases is mentioned in the second blog post mentioned above. On a separate note, this is not an issue with CHECKDB! It is limitation that you are hitting with sparse files on the OS layer. Remember CHECKDB, starting from SQL Server 2005, creates an internal snapshot (makes sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!

As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It might be quite possible that CSS might ask you to collect additional data as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS Error 1450.

Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.

sqlserver 调试WINDBG ---troubleshootingsql.com的更多相关文章

  1. .NET 5 程序高级调试-WinDbg

    上周和大家分享了.NET 5开源工作流框架elsa,程序跑起来后,想看一下后台线程的执行情况.抓了个进程Dump后,使用WinDbg调试,加载SOS调试器扩展,结果无法正常使用了: 0:000> ...

  2. SQLServer调试

    1.普通调试 直接点击SSMS客户端上的调试按钮即可 2.存储过程调试 2.1 定义存储过程(以Northwind数据库为例) USE [Northwind] GO /****** Object: S ...

  3. SQLSERVER LATCH WINDBG

    https://mssqlwiki.com/2012/09/07/latch-timeout-and-sql-server-latch/

  4. 使用Windbg和SoS扩展调试分析.NET程序

    在博客堂的不是我舍不得 - High CPU in GC(都是+=惹的祸,为啥不用StringBuilder呢?). 不是我舍不得 - .NET里面的Out Of Memory 看到很多人在问如何分析 ...

  5. Windbg是windows平台上强大的调试器

    基础调试命令 - .dump/.dumpcap/.writemem/!runaway Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windb ...

  6. windbg预览版,windbg preview配置win7x64双机调试

    目录 一丶简介 二丶步骤 1.下载Windbg Preview (windbg预览版本) 2.配置虚拟机端口 3.虚拟机设置调试湍口 4.windbg preview开始调试. 一丶简介 Windbg ...

  7. 使用WinDbg内核调试[转]

    Technorati 标签: windbg,内核调试 WINDOWS调试工具很强大,但是学习使用它们并不容易.特别对于驱动开发者使用的WinDbg和KD这两个内核调试器(CDB和NTSD是用户态调试器 ...

  8. WinDBG 调试命令大全

    转载收藏于:http://www.cnblogs.com/kekec/archive/2012/12/02/2798020.html  #调试命令窗口 ++++++++++++++++++++++++ ...

  9. 基础调试命令 - .dump/.dumpcap/.writemem/!runaway

    Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windbg可以做内核态调试 Windbg可以脱离源代码进行调试 Windbg可以用来分析dum ...

随机推荐

  1. iOS 之持久化存储 plist、NSUserDefaults、NSKeyedArchiver、数据库

    1.什么是持久化? 本人找了好多文章都没有找到满意的答案,最后是从孙卫琴写的<精通Hibernate:Java对象持久化技术详解>中,看到如下的解释,感觉还是比较完整的.摘抄如下: 狭义的 ...

  2. spring boot修改内置容器tomcat的服务端口

    方式一 在spring boot的web 工程中,可以使用内置的web container.有时需要修改服务端口,可以通过配置类和@Configuration注解来完成. // MyConfigura ...

  3. 【SPOJ-QTREE】树链剖分

    树链剖分学习 https://blog.csdn.net/u013368721/article/details/39734871 https://www.cnblogs.com/George1994/ ...

  4. 如何使用SDK在Ubuntu设备(包括仿真器和桌面)上运用应用程序

    简介 有三种运行通过SDK创建的应用程序的方式:在桌面上,在联网的Ubuntu设备上,以及在仿真器中.这些方式为互补性方式,因为各有优缺点.您首先将了解如何管理SDK的设备类型,以及哪一个类型用于测试 ...

  5. 关于gsl库出现access violation 0X00000005问题的解决方法

    gsl即GNU SCIENCE LIBRARY是一个强大c/c++的数值计算函数库. 在使用这一库出现access violation 0X00000005问题,尝试方法一在project->C ...

  6. 详解SHOW PROCESSLIST显示哪些线程正在运行列出的状态

    SHOW PROCESSLIST显示哪些线程正在运行.您也可以使用mysqladmin processlist语句得到此信息.如果您有SUPER权限,您可以看到所有线程.否则,您只能看到您自己的线程( ...

  7. Kuangbin 带你飞 最小生成树题解

    整套题都没什么难度. POJ 1251 Jungle Roads #include <map> #include <set> #include <list> #in ...

  8. HDU 4344 大数分解大素数判定

    这里贴个模板吧.反正是不太理解 看原题就可以理解用法!! #include <cstdio> #include <iostream> #include <algorith ...

  9. malloc和new的区别 end

    3. c++中new的几种用法 c++中,new的用法很灵活,这里进行了简单的总结: 1. new() 分配这种类型的一个大小的内存空间,并以括号中的值来初始化这个变量; 2. new[] 分配这种类 ...

  10. 出现“error c4430缺少类型说明符-假定为int。注意C++不支持默认int

    出现这种错误的原因,是因为函数没有写返回值.是在VC6.0的工程转为高版本(VS2010)的时候经常出现的; #include <stdio.h> main() { printf(&quo ...