Thursday 2 February 2017

Torn Pages and using DBCC PAGE

Today's post is about torn pages and using the undocumented DBCC PAGE command within SQL. Occasionally you might find you need to run this command and in this example it's because I have received an error message that looks a bit like this:

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xa9aaaaaa). It occurred during a read of page (1:25756919) in database ID 9 at offset 0x000031209ee000 in file 'F:\DATA\ScratchDatabase.mdf'

This is an error that has been picked up on one of my test systems and indicates that SQL Server has detected a torn page, that is a page that has been incorrectly written by SQL Server and possibly indicates a problem in the IO subsystem.

The problem here is that whilst we know the database and the page where the error has occurred we don't know the specific table the page belongs and importantly what type of page is in error. The reason why the page type is important is because this will drastically impact our recovery process but the first thing we will do is check a system table to see if any other page errors have been reported:

SELECT * FROM msdb..suspect_pages


database_id

file_id

page_id

event_type

error_count

last_update_date

9

1

25756919

3

1

2017-01-29 08:24:05.760

Here we can see that only one row has been returned, the event_type refers to a torn page, here's the full type descriptions and a link for more information:

1 = An 823 error that causes a suspect page (such as a disk error) or an 824 error other than a bad checksum or a torn page (such as a bad page ID).

2 = Bad checksum.

3 = Torn page.

4 = Restored (page was restored after it was marked bad).

5 = Repaired (DBCC repaired the page).

7 = Deallocated by DBCC.


To identify the table in error we will use DBCC PAGE. This command takes a database id (or name), file id and page id and will return the actual page information back to the results window (we need to enable trace flag 3604 to do this though); for my example I would use the following command:

DBCC PAGE (9, 1, 25756919, 0)

This returns a lot of information to the results window (I can use the option WITH TABLERESULTS for a grid view) so I have removed a fair chunk of it and highlighted the two parts that I really need.


PAGE: (1:25756919)
 

PAGE HEADER:


Page @0x0000000728010000


m_pageId = (1:25756919)             m_headerVersion = 1                 m_type = 2


m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x104


m_objId (AllocUnitId.idObj) = 266296m_indexId (AllocUnitId.idInd) = 256


Metadata: AllocUnitId = 72057611489902592                               

Metadata: PartitionId = 72057607060652032                                Metadata: IndexId = 20

Metadata: ObjectId = 567673070      m_prevPage = (1:25756918)           m_nextPage = (1:38369560)

The first is Metadata: ObjectId = 567673070 which refers to the table that this page belongs to. The second is Metadata: IndexId = 20 which tells me as it isn't 0 (heap) or 1 (clustered index) that the page error is within a non-clustered index and I can use the following code to identify the actual index where the problem lies.

SELECT * FROM sys.indexes WHERE index_id = 20 AND object_id = 567673070


This returns the suspect index and to resolve we can very simply drop and recreate the index effectively reallocating the pages, suspect and all. This is potentially much quicker than performing a full restore and I have lost count of the times I have seen people resort to a full restore without checking the actual type of page corruption that they're attempting to resolve.


Another option here is that because we know the effected table we can run a DBCC CHECKTABLE command (rather than a DBCC CHECKDB on the entire database), passing in the full table name as a parameter; to get the name of the table we use our ObjectId again that was returned from DBCC PAGE with the following:

SELECT OBJECT_NAME (567673070)

This would return the table name that we can use to check the structure to see if our page issue has been resolved:

DBCC CHECKTABLE ('dbo.ETL_TestLoad) WITH NO_INFOMSGS

DBCC results for 'ETL_TestLoad'.
There are 341465 rows in 93513 pages for object "ETL_TestLoad".
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


Using DBCC CHECKTABLE in this scenario may have proven that our torn page has been resolved but you should always, without question, perform a full database consistency check to look for further errors. In the event of IO issues this could easily return more and more page problems, maybe irrecoverable data page ones at that. Any page read/write errors must result in both a full consistency check and a check with the storage people to see if they can see any underlying issues.



For some further reading check out my post: How to consistency check large databases

No comments:

Post a Comment