Wednesday, 2 November 2016

Troubleshooting with Wait Statistics

Troubleshooting is a fundamental activity for any DBA and when it comes to SQL Server one of the most important tasks is analysing the wait statistics that have occurred within the instance.

So what are wait statistics? In short whenever a task is “waiting” its type of wait and duration are recorded by SQL Server. This information is incredibly useful and we can query DMV's (Dynamic Management Views) to obtain this information that we can use to aid our troubleshooting efforts.

One particular view is the sys.dm_os_wait_stats DMV. This is an aggregated view that shows us the waits that have been occurring in our instance since the last restart or when the wait statistics were cleared down. Queries like the one below are typically sorted by the wait_time_ms column to indicate the most prominent wait types that have occurred over time in our SQL server.

SELECT TOP 10 * FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC

Running this query without as it is will produce a lot of different wait types so it’s important to know that many of them are likely to be a result of completely normal operations that are being undertaken by SQL Server. This isn’t the problem with sys.dm_os_wait_stats that I am going to demonstrate today though so I will keep the queries very simplistic. What I want to show you is that although this view is great as a summary, it isn’t necessarily helpful when it comes to real-time analysis.

I’ll start by creating a new table in a non-production database on a non-production instance of SQL Server and insert a handful of rows.

CREATE TABLE [dbo].[tblPeople](
       [ID] [smallint] IDENTITY(1,1) NOT NULL,
       [FirstName] [varchar](50) NULL,
       [LastName] [varchar](50) NULL
) ON [PRIMARY]

GO

USE [TestingDatabase]
GO

INSERT INTO tblPeople (FirstName, LastName) VALUES ('Peter','Morgan');
INSERT INTO tblPeople (FirstName, LastName) VALUES ('Sarah','Jones');
INSERT INTO tblPeople (FirstName, LastName) VALUES ('Liam','Davies');
INSERT INTO tblPeople (FirstName, LastName) VALUES ('Joanne','Jenkins');

In order to remove any previous information and clear down the wait statistics we have to run the following DBCC command:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

To force a wait in SQL I will create a blocking scenario by issuing an update command on our new table without committing or rolling back the transaction.

BEGIN TRAN

UPDATE tblPeople
SET LastName = 'Davies' WHERE LastName = 'Jenkins'

In another window (to stress, use another window) I will try to select data from that table knowing that I won’t be able to because it is being updated by another transaction. As it cannot acquire a lock my select query it should start to accumulate a LCK wait type.

SELECT TOP 1000 [ID]
      ,[FirstName]
      ,[LastName]
  FROM [TestingDatabase].[dbo].[tblPeople]

I’ll leave that query running for a few seconds and then look at sys.dm_os_wait_satas again and as mentioned previously I’ll order by wait_time_ms descending so I can see the worst offenders in my instance since the stats were cleared down. At this point I might expect that I will see a lock wait type appear very high in the list.

SELECT TOP 10 * FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC



wait_type
waiting_tasks_count
wait_time_ms
max_wait_time_ms
signal_wait_time_ms
SP_SERVER_DIAGNOSTICS_SLEEP
531
300000
299995
300000
LOGMGR_QUEUE
1109
142871
156
6
DIRTY_PAGE_POLL
1417
142842
122
16
SLEEP_SYSTEMTASK
9041
142832
54
98
HADR_FILESTREAM_IOMGR_IOCOMPLETION
285
142670
503
7
LAZYWRITER_SLEEP
150
142097
1008
5
SQLTRACE_INCREMENTAL_FLUSH_SLEEP
35
140008
4000
0
XE_TIMER_EVENT
54
140000
4999
140000
REQUEST_FOR_DEADLOCK_SEARCH
28
139990
5000
139990
QDS_SHUTDOWN_QUEUE
2
120000
60000
0

Odd, no lock waits! We know we have a blocked task because we’ve just created it but it isn’t showing here. Now this might be because my LCK wait type isn’t in the top ten waits based on duration, it’s unlikely, but we can alter our query to just return rows for the sort of wait type that we are after:

SELECT TOP 10 * FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'LCK%'
ORDER BY wait_time_ms DESC



wait_type
waiting_tasks_count
wait_time_ms
max_wait_time_ms
signal_wait_time_ms
LCK_M_UIX
0
0
0
0
LCK_M_SIX
0
0
0
0
LCK_M_SIU
0
0
0
0
LCK_M_IX
0
0
0
0
LCK_M_IU
0
0
0
0
LCK_M_IS
0
0
0
0
LCK_M_X
0
0
0
0
LCK_M_U
0
0
0
0
LCK_M_S
1
0
0
0
LCK_M_SCH_M
0
0
0
0

Here I can see one waiting task, I know in this test scenario its my task and that is the only thing running but ordinarily, because it's a cumulative value, it's difficult to pinpoint what is happening right now. It's also worth noting that although the waiting task count has incremented the wait time hasn't which begs the question, when do these wait times get populated?

In order to check I'm going to roll back the transaction by running the following command underneath the update statement (the same window this time).

ROLLBACK TRAN

Let’s run the query again to check our running totals now our query has finished and I'll leave the LCK type clause in there.

SELECT TOP 10 * FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'LCK%'
ORDER BY wait_time_ms DESC





wait_type
waiting_tasks_count
wait_time_ms
max_wait_time_ms
signal_wait_time_ms
LCK_M_S
1
420051
420051
0
LCK_M_SCH_M
0
0
0
0
LCK_M_SCH_S
0
0
0
0
LCK_M_UIX
0
0
0
0
LCK_M_SIX
0
0
0
0
LCK_M_SIU
0
0
0
0
LCK_M_IX
0
0
0
0
LCK_M_IU
0
0
0
0
LCK_M_IS
0
0
0
0
LCK_M_X
0
0
0
0


Now that the transaction has finished we can see that our cumulative values have been updated. I was initially surprised by this as I always considered the DMV to contain “running” totals but looking on the Books Online article on the DMV you will find the following line:

Returns information about all the waits encountered by threads that executed.

Executed; that would be past tense! As we’ve just demonstrated the duration values are only updated after a thread has finished executing. It does make sense as it is an aggregated view after all but as such it does raise an important point; using the sys.dm_os_wait_stats DMV is not the best option for real time troubleshooting.

In order to do that a bit better we have to use a different DMV and one at the task level rather than the summary view that sys.dm_os_wait_stats gives us. So let’s run the update and select query from our previous test again to create the same blocking scenario and after a few seconds run the following script:

SELECT session_id, wait_type, wait_duration_ms, blocking_session_id
FROM sys.dm_os_waiting_tasks
WHERE session_id > 50



This query selects from the sys.dm_os_waiting_tasks management view and shows the tasks that are currently waiting. Here I’m selecting the type of wait, the duration in milliseconds (that we know isn’t available in the sys.dm_os_wait_stats view in real-time) and finally I’ve included the blocking_session_id.





session_id
wait_type
wait_duration_ms
blocking_session_id
58
LCK_M_S
62790
56

Now we get a much better picture of which executing tasks are currently waiting, what the wait types are and importantly (and what we didn’t get before), for how long. You can refresh the queries results to see the duration increase whilst the select query is being locked.

We can now repeat the rollback transaction command (or commit it depending how optimistic you’re feeling) that we did previously and check to see that sys.dm_os_wait_stats has been updated accordingly.

So here are two different ways that we can use SQL Servers wait statistics for troubleshooting purposes. Both views give us really useful information but both have different purposes. If we wanted to look back over time then the sys.dm_os_wait_stats will give us a view of wait time totals. Typically we would capture the information via a scheduled job and analyse the data for spikes during periods where issues might be suspected.

For performing real-time analysis of wait statistics then we should base queries on the sys.dm_os_waiting_tasks view where we can see accurate wait duration values as they are happening within our instance.

In my opinion wait statistics are the most important piece of information when troubleshooting SQL Server so learning about the different types is vital for anyone using SQL. Thankfully there is a wealth of really useful information about wait statistics out there; I’ve listed some of my favourite links below.


More Reading.


No comments:

Post a Comment