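Mitigating "SQL Server has encountered 727 occurrence(s) of I/O requests taking longer than 15 seconds"
SQL Server logs any I/O request that takes longer than 15 seconds to complete. When this happens, applications typically see timeouts, and the DBA has to determine whether the cause is workload and concurrency or a suboptimal SQL Server configuration, and roughly how much each contributes.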
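1: Checking whether the SQL Server configuration is optimal is the easier part. Check the following items:
a: Put data files and log files on different drives. This separates random I/O from sequential I/O and greatly relieves read/write pressure. For a write, only the sequential write to the log file has to happen immediately; SQL Server decides on its own when the data page is actually written to the data file. In my own observation, with data files and log files separated there are 2 to 3 I/O stalls a month; with both on the same disk, 8 to 9 I/O stalls a month are common.
b: Set [max server memory (MB)] to a reasonable value; as a rule of thumb, 7/8 of total memory is usually fine.
c: tempdb is best placed on its own disk; if that is not possible, at least separate its data files from its log file.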
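As a minimal sketch of item b, the following applies the 7/8 rule of thumb with sp_configure. It assumes SQL Server 2012 or later, where sys.dm_os_sys_info exposes physical_memory_kb, and a dedicated database server; the variable names are only illustrative.

```sql
-- Sketch: set [max server memory (MB)] to 7/8 of physical RAM.
-- Assumption: SQL Server 2012+ (physical_memory_kb column) and a dedicated server.
DECLARE @total_mb  bigint;
DECLARE @target_mb int;

SELECT @total_mb = physical_memory_kb / 1024
FROM sys.dm_os_sys_info;

SET @target_mb = @total_mb * 7 / 8;

-- 'max server memory (MB)' is an advanced option, so expose it first.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

EXEC sys.sp_configure N'max server memory (MB)', @target_mb;
RECONFIGURE;
```

The new value takes effect without restarting the instance. On a server shared with other services, a lower value than 7/8 may be more appropriate.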
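2: If a problem does occur, how do you determine whether the workload caused it? First, collect I/O activity statistics. Second, ask the developers whether a change in the business caused the I/O to increase. The difficulty is finding out which tables the I/O stall actually involves, and how to get that information.
a: Run the following stored procedure once a minute. It uses the sys.dm_io_virtual_file_stats DMV to record a snapshot of SQL Server's cumulative read and write I/O per file.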
USE [master]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create proc [dbo].[usp_monitor_virtual_file_latency]
as
begin
if OBJECT_ID('master.dbo.tbl_virtual_file_latency','U') is null
begin
create table master.dbo.tbl_virtual_file_latency
(
id int identity(1,1) primary key,
collect_dt datetime not null,
[database_name] [nvarchar](128) NULL,
[file_id] [smallint] NOT NULL,
[type_desc] [nvarchar](60) NULL,
[sample_ms] [bigint] NOT NULL, --bigint also fits newer versions, where the DMV returns sample_ms as bigint
[num_of_reads] [bigint] NOT NULL,
[num_of_bytes_read] [bigint] NOT NULL,
[io_stall_read_ms] [bigint] NOT NULL,
[num_of_writes] [bigint] NOT NULL,
[num_of_bytes_written] [bigint] NOT NULL,
[io_stall_write_ms] [bigint] NOT NULL,
[io_stall] [bigint] NOT NULL,
[size_on_disk_bytes] [bigint] NOT NULL,
[file_handle] [varbinary](8) NOT NULL,
[physical_name] [nvarchar](260) NOT NULL,
[state_desc] [nvarchar](60) NULL
)
--online=on is not needed on a table that was just created and is not supported by every edition
create index idx_collect_dt on master.dbo.tbl_virtual_file_latency(collect_dt)
end
insert into master.dbo.tbl_virtual_file_latency
SELECT
--virtual file latency
GETDATE(),
db_name(vfs.[database_id]) as database_name
,vfs.[file_id]
,mf.[type_desc]
,vfs.[sample_ms]
,vfs.[num_of_reads]
,vfs.[num_of_bytes_read]
,vfs.[io_stall_read_ms]
,vfs.[num_of_writes]
,vfs.[num_of_bytes_written]
,vfs.[io_stall_write_ms]
,vfs.[io_stall]
,vfs.[size_on_disk_bytes]
,vfs.[file_handle]
,mf.[physical_name]
,mf.[state_desc]
FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]
JOIN sys.master_files(nolock) AS [mf]
ON [vfs].[database_id] = [mf].[database_id]
AND [vfs].[file_id] = [mf].[file_id]
WHERE DB_NAME(vfs.database_id) not in('master','model','msdb')
--delete old data
delete from master.dbo.tbl_virtual_file_latency where collect_dt <DATEADD(MONTH,-2,getdate())
end
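The procedure only takes one snapshot per call, so something has to run it every minute. One way to do that, sketched here as an assumption rather than the original setup, is a SQL Server Agent job. The job and schedule names are hypothetical.

```sql
-- Sketch: run usp_monitor_virtual_file_latency once a minute with SQL Server Agent.
-- The job name and schedule name below are hypothetical.
USE [msdb];
GO
EXEC dbo.sp_add_job
     @job_name = N'collect_virtual_file_latency';

EXEC dbo.sp_add_jobstep
     @job_name      = N'collect_virtual_file_latency',
     @step_name     = N'run usp_monitor_virtual_file_latency',
     @subsystem     = N'TSQL',
     @database_name = N'master',
     @command       = N'EXEC dbo.usp_monitor_virtual_file_latency;';

EXEC dbo.sp_add_schedule
     @schedule_name        = N'every_minute',
     @freq_type            = 4,   -- daily
     @freq_interval        = 1,   -- every day
     @freq_subday_type     = 4,   -- sub-day unit: minutes
     @freq_subday_interval = 1;   -- every 1 minute

EXEC dbo.sp_attach_schedule
     @job_name      = N'collect_virtual_file_latency',
     @schedule_name = N'every_minute';

-- Target the local server so the Agent actually runs the job.
EXEC dbo.sp_add_jobserver
     @job_name = N'collect_virtual_file_latency';
GO
```

The counters in sys.dm_io_virtual_file_stats are cumulative since the instance started, which is why the next procedure works on the difference between consecutive snapshots.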
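b: Use the following stored procedure to compute, for each one-minute interval, the I/O counts, latency and average I/O size of a given data file.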
USE [master]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create proc [dbo].[usp_calculate_virtual_file_stats_PerMinutes]
@physical_name nvarchar(1000),
@collect_starttime datetime,
@collect_endtime datetime
as
with snap as
(
select
row_number() over (order by id asc) as row_num,
id
,[collect_dt]
,[database_name]
,[file_id]
,[type_desc]
,[sample_ms]
,[num_of_reads]
,[num_of_bytes_read]
,[io_stall_read_ms]
,[num_of_writes]
,[num_of_bytes_written]
,[io_stall_write_ms]
,[io_stall]
from master.dbo.tbl_virtual_file_latency(nolock) AS [vfs1]
WHERE vfs1.physical_name=@physical_name
and vfs1.collect_dt>=@collect_starttime
and vfs1.collect_dt<=@collect_endtime )
select cur.collect_dt,
read_io_num= cur.num_of_reads-pre.num_of_reads,
write_io_num= cur.num_of_writes-pre.num_of_writes,
read_latency_MS =case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.io_stall_read_ms-pre.io_stall_read_ms)*1./(cur.num_of_reads-pre.num_of_reads) as decimal(10,1)) end ,
write_latency_MS= case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall_write_ms-pre.io_stall_write_ms)*1./(cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end ,
average_latency_MS=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall-pre.io_stall)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end,
averageKB_read=case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read)*1./(cur.num_of_reads-pre.num_of_reads)/1024 as decimal(10,1))end ,
averageKB_write=case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end,
averageKB_Transfer=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read+cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end
from snap as pre join snap as cur
on pre.row_num=cur.row_num-1
c: Finally, use the historical data to look at the trend of read and write I/O.
exec usp_calculate_virtual_file_stats_PerMinutes
'E:\TestDB\test.mdf',
'2014-12-14 13:49:01.000',
'2014-12-14 15:49:01.000'

3: Use the information in the I/O stall message to find out which table SQL Server was reading when the stall occurred.
SQL Server log text:
SQL Server has encountered 727 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Test_1.ndf:MSSQL_DBCC7] in database [Test] (7). The OS file handle is 0x0000000000001A40. The offset of the latest long I/O is: 0x000023101f0000
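a: Convert the offset 0x000023101f0000 to decimal: 150594322432
b: Divide that value by 8192 (the page size in bytes) to get the page number: 150594322432/8192=18383096
c: Then use the page number to find the table name, and discuss the I/O problem with the developers.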
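--run in the affected database (Test), because object_name, sys.partitions and sys.allocation_units are database-scoped
select object_name(sp.object_id) as tblname,*
from sys.partitions (nolock)as sp
where sp.partition_id in (
select au.container_id from
sys.dm_os_buffer_descriptors(nolock) as obd
join sys.allocation_units(nolock) as au
on obd.allocation_unit_id=au.allocation_unit_id
where obd.database_id=7 --database [Test] (7) from the log message; also filter on obd.file_id for the file named there
and obd.page_id=18383096)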
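d: If that page is not in memory, the query returns no rows. In that case, use dbcc page(dbid,fileid,pageid,3) to get the table's object_id (turn on trace flag 3604 first so that the output is returned to the client).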
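Steps a, b and d can be sketched as one script. The database id, file name and offset are taken from the log message above; the variable names are illustrative, and DBCC PAGE is an undocumented command, so treat this as a sketch rather than a supported procedure.

```sql
-- Values taken from the error-log message above.
DECLARE @db_id   int    = 7;                                   -- database [Test] (7)
DECLARE @offset  bigint = CONVERT(bigint, 0x000023101F0000);   -- 150594322432
DECLARE @page_id bigint = @offset / 8192;                      -- 8 KB pages: 18383096
DECLARE @file_id int;

-- Resolve the file_id from the physical file name reported in the message.
SELECT @file_id = mf.file_id
FROM sys.master_files AS mf
WHERE mf.database_id = @db_id
  AND mf.physical_name = N'F:\MSSQL\Test_1.ndf';

IF @file_id IS NULL
    RAISERROR(N'File not found in sys.master_files.', 16, 1);
ELSE
BEGIN
    -- Send DBCC output to the client session instead of the error log.
    DBCC TRACEON (3604);

    -- Print option 0 dumps only the page header, which is enough to read
    -- the "Metadata: ObjectId" value.
    DECLARE @sql nvarchar(200) =
        N'DBCC PAGE (' + CAST(@db_id   AS nvarchar(10)) + N', '
                       + CAST(@file_id AS nvarchar(10)) + N', '
                       + CAST(@page_id AS nvarchar(20)) + N', 0);';
    EXEC sys.sp_executesql @sql;
END
```

Once the header shows the ObjectId, OBJECT_NAME(object_id, database_id) translates it into the table name without switching database context.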
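Conclusion: I/O stalls can only be mitigated, not avoided. At some point the I/O driven by business volume will always exceed what the storage can handle. What the DBA can do is use methods like these, both before and after the fact, to find the root cause and propose a solution based on it. I hope the methods above are helpful.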