SQL Clarity: November 2016

This post is about a wait type that I have seen being safely ignored by admins in the past but actually it does highlight an particular that I like to check up on in systems that I’m maintaining.

The wait type I’m referring to is TRACEWRITE and it’s easy to see why it often gets classed as being safe to ignore. SQL Server does run a default trace continuously so naturally this wait will show up in wait stat queries for your instances and be quite prominent (like the one below) but that said; whilst it’s relatively safe for the default trace to be ruled out of any further investigation what about traces that have been started by other users?

I typically see a scenario in non-production systems where a trace has been turned on to monitor a particular process for testing and then forgotten about and not stopped again once the process has finished. Tracing is not particularly cost friendly at the best of times so one running way beyond its expected duration is something you do want to know about.

The following query looks at what traces are currently running on the system:

SELECT

start_time,

stop_time,

[status],

[path],

is_default,

reader_spid

FROM sys.traces

This query returns information on all the current traces running on your system. It’s worth adding at this point that tracing to some degree (I still see tracing, a lot! has been replaced by extended events which are much more lightweight in operation, so another reason to be checking. Here's what a default trace looks like:

The columns returned are pretty self-explanatory, a value of 1 in the status column means the trace is currently running (0 = stopped) and you can use the reader_spid column to return information on who is using it, and who to ask about turning the trace off.

To summarise; in most cases trace related waits like TRACEWRITE are safe to ignore but it's good practice to routinely run a check on those systems where people might be running traces or you could even set up jobs that will automatically stop any traces outside of working hours.

One final point, it is very common for monitoring applications to sue some form of trace analysis, these could well be part of routine maintenance so prior to switching anything off ensure you know how the trace has originated and what its purpose is.

Troubleshooting is a fundamental activity for any DBA and when it comes to SQL Server one of the most important tasks is analysing the wait statistics that have occurred within the instance.

So what are wait statistics? In short whenever a task is “waiting” its type of wait and duration are recorded by SQL Server. This information is incredibly useful and we can query DMV's (Dynamic Management Views) to obtain this information that we can use to aid our troubleshooting efforts.

One particular view is the sys.dm_os_wait_stats DMV. This is an aggregated view that shows us the waits that have been occurring in our instance since the last restart or when the wait statistics were cleared down. Queries like the one below are typically sorted by the wait_time_ms column to indicate the most prominent wait types that have occurred over time in our SQL server.

SELECT TOP 10 * FROM sys.dm_os_wait_stats

ORDER BY wait_time_ms DESC

Running this query without as it is will produce a lot of different wait types so it’s important to know that many of them are likely to be a result of completely normal operations that are being undertaken by SQL Server. This isn’t the problem with sys.dm_os_wait_stats that I am going to demonstrate today though so I will keep the queries very simplistic. What I want to show you is that although this view is great as a summary, it isn’t necessarily helpful when it comes to real-time analysis.

I’ll start by creating a new table in a non-production database on a non-production instance of SQL Server and insert a handful of rows.

CREATE TABLE [dbo].[tblPeople](

[ID] [smallint] IDENTITY(1,1) NOT NULL,

[FirstName] [varchar](50) NULL,

[LastName] [varchar](50) NULL

) ON [PRIMARY]

USE [TestingDatabase]

INSERT INTO tblPeople (FirstName, LastName) VALUES ('Peter','Morgan');

INSERT INTO tblPeople (FirstName, LastName) VALUES ('Sarah','Jones');

INSERT INTO tblPeople (FirstName, LastName) VALUES ('Liam','Davies');

INSERT INTO tblPeople (FirstName, LastName) VALUES ('Joanne','Jenkins');

In order to remove any previous information and clear down the wait statistics we have to run the following DBCC command:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

To force a wait in SQL I will create a blocking scenario by issuing an update command on our new table without committing or rolling back the transaction.

BEGIN TRAN

UPDATE tblPeople

SET LastName = 'Davies' WHERE LastName = 'Jenkins'

In another window (to stress, use another window) I will try to select data from that table knowing that I won’t be able to because it is being updated by another transaction. As it cannot acquire a lock my select query it should start to accumulate a LCK wait type.

SELECT TOP 1000 [ID]

,[FirstName]

,[LastName]

FROM [TestingDatabase].[dbo].[tblPeople]

I’ll leave that query running for a few seconds and then look at sys.dm_os_wait_satas again and as mentioned previously I’ll order by wait_time_ms descending so I can see the worst offenders in my instance since the stats were cleared down. At this point I might expect that I will see a lock wait type appear very high in the list.

SELECT TOP 10 * FROM sys.dm_os_wait_stats

ORDER BY wait_time_ms DESC

wait_type	waiting_tasks_count	wait_time_ms	max_wait_time_ms	signal_wait_time_ms
SP_SERVER_DIAGNOSTICS_SLEEP	531	300000	299995	300000
LOGMGR_QUEUE	1109	142871	156	6
DIRTY_PAGE_POLL	1417	142842	122	16
SLEEP_SYSTEMTASK	9041	142832	54	98
HADR_FILESTREAM_IOMGR_IOCOMPLETION	285	142670	503	7
LAZYWRITER_SLEEP	150	142097	1008	5
SQLTRACE_INCREMENTAL_FLUSH_SLEEP	35	140008	4000	0
XE_TIMER_EVENT	54	140000	4999	140000
REQUEST_FOR_DEADLOCK_SEARCH	28	139990	5000	139990
QDS_SHUTDOWN_QUEUE	2	120000	60000	0

Odd, no lock waits! We know we have a blocked task because we’ve just created it but it isn’t showing here. Now this might be because my LCK wait type isn’t in the top ten waits based on duration, it’s unlikely, but we can alter our query to just return rows for the sort of wait type that we are after:

SELECT TOP 10 * FROM sys.dm_os_wait_stats

WHERE wait_type LIKE 'LCK%'

ORDER BY wait_time_ms DESC

wait_type	waiting_tasks_count	wait_time_ms	max_wait_time_ms	signal_wait_time_ms
LCK_M_UIX	0	0	0	0
LCK_M_SIX	0	0	0	0
LCK_M_SIU	0	0	0	0
LCK_M_IX	0	0	0	0
LCK_M_IU	0	0	0	0
LCK_M_IS	0	0	0	0
LCK_M_X	0	0	0	0
LCK_M_U	0	0	0	0
LCK_M_S	1	0	0	0
LCK_M_SCH_M	0	0	0	0

Here I can see one waiting task, I know in this test scenario its my task and that is the only thing running but ordinarily, because it's a cumulative value, it's difficult to pinpoint what is happening right now. It's also worth noting that although the waiting task count has incremented the wait time hasn't which begs the question, when do these wait times get populated?

In order to check I'm going to roll back the transaction by running the following command underneath the update statement (the same window this time).

ROLLBACK TRAN

Let’s run the query again to check our running totals now our query has finished and I'll leave the LCK type clause in there.

SELECT TOP 10 * FROM sys.dm_os_wait_stats

WHERE wait_type LIKE 'LCK%'

ORDER BY wait_time_ms DESC

wait_type	waiting_tasks_count	wait_time_ms	max_wait_time_ms	signal_wait_time_ms
LCK_M_S	1	420051	420051	0
LCK_M_SCH_M	0	0	0	0
LCK_M_SCH_S	0	0	0	0
LCK_M_UIX	0	0	0	0
LCK_M_SIX	0	0	0	0
LCK_M_SIU	0	0	0	0
LCK_M_IX	0	0	0	0
LCK_M_IU	0	0	0	0
LCK_M_IS	0	0	0	0
LCK_M_X	0	0	0	0

Now that the transaction has finished we can see that our cumulative values have been updated. I was initially surprised by this as I always considered the DMV to contain “running” totals but looking on the Books Online article on the DMV you will find the following line:

Returns information about all the waits encountered by threads that executed.

Executed; that would be past tense! As we’ve just demonstrated the duration values are only updated after a thread has finished executing. It does make sense as it is an aggregated view after all but as such it does raise an important point; using the sys.dm_os_wait_stats DMV is not the best option for real time troubleshooting.

In order to do that a bit better we have to use a different DMV and one at the task level rather than the summary view that sys.dm_os_wait_stats gives us. So let’s run the update and select query from our previous test again to create the same blocking scenario and after a few seconds run the following script:

SELECT session_id, wait_type, wait_duration_ms, blocking_session_id

FROM sys.dm_os_waiting_tasks

WHERE session_id > 50

This query selects from the sys.dm_os_waiting_tasks management view and shows the tasks that are currently waiting. Here I’m selecting the type of wait, the duration in milliseconds (that we know isn’t available in the sys.dm_os_wait_stats view in real-time) and finally I’ve included the blocking_session_id.

session_id	wait_type	wait_duration_ms	blocking_session_id
58	LCK_M_S	62790	56

Now we get a much better picture of which executing tasks are currently waiting, what the wait types are and importantly (and what we didn’t get before), for how long. You can refresh the queries results to see the duration increase whilst the select query is being locked.

We can now repeat the rollback transaction command (or commit it depending how optimistic you’re feeling) that we did previously and check to see that sys.dm_os_wait_stats has been updated accordingly.

So here are two different ways that we can use SQL Servers wait statistics for troubleshooting purposes. Both views give us really useful information but both have different purposes. If we wanted to look back over time then the sys.dm_os_wait_stats will give us a view of wait time totals. Typically we would capture the information via a scheduled job and analyse the data for spikes during periods where issues might be suspected.

For performing real-time analysis of wait statistics then we should base queries on the sys.dm_os_waiting_tasks view where we can see accurate wait duration values as they are happening within our instance.

In my opinion wait statistics are the most important piece of information when troubleshooting SQL Server so learning about the different types is vital for anyone using SQL. Thankfully there is a wealth of really useful information about wait statistics out there; I’ve listed some of my favourite links below.

SQL Clarity

Monday, 21 November 2016

TRACEWRITE and querying sys.traces

SQL Server 2016 - SP1: Warning!

Monday, 7 November 2016

What is a DMO?

Wednesday, 2 November 2016

Troubleshooting with Wait Statistics