One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Thursday, August 24, 2017

The Transient Database Snapshot Has Been Marked Suspect

Yet another tale from the ticket queue...

DBCC CheckDB was failing on INSTANCE99, and after some investigation it looked like a space issue rather than an actual corruption issue.

http://baddogneedsrottenhome.com/images/emails/55ce060daa58b.jpg
--

The Job Failure error text was this:

--

Executed as user: DOMAIN\svc_acct. Microsoft (R) SQL Server Execute Package Utility  Version 10.50.6000.34 for 64-bit  Copyright (C) Microsoft Corporation 2010. All rights reserved.    Started:  2:00:00 AM  Progress: 2017-08-20 02:00:01.11     Source: {11E1AA7B-A7AC-4043-916B-DC6EABFF772B}      Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  Progress: 2017-08-20 02:00:01.30     Source: Check Database Integrity Task      Executing query "USE [VLDB01]  ".: 50% complete  End Progress  Error: 2017-08-20 03:38:19.28     Code: 0xC002F210     Source: Check Database Integrity Task Execute SQL Task     Description: Executing the query "DBCC CHECKDB(N'VLDB01')  WITH NO_INFOMSGS  " failed with the following error: "Check terminated.  The transient database snapshot for database 'VLDB01' (database ID 5) has been marked suspect due to an IO operation failure.  Refer to the SQL Server error log for details.  A severe error occurred on the current command.  The results, if any, should be discarded.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.  End Error  Warning: 2017-08-20 03:38:19.28     Code: 0x80019002     Source: VLDB01 Integrity      Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED.  The Execution method succeeded, but the number of errors raised (1) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors.  End Warning  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  2:00:00 AM  Finished: 3:38:19 AM  Elapsed:  5899.51 seconds.  The package execution failed.  The step failed.

--

Looking in the SQL Error Log, there were hundreds of occurrences of this combination of messages in the minutes immediately preceding the job failure:

--

The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x000048a123e000 in file 'E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

--

Error: 17053, Severity: 16, State: 1.

--

E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17: Operating system error 665(The requested operation could not be completed due to a file system limitation) encountered.

--

I have seen DBCC snapshot errors in the past and they almost always come back to disk space issues.  If you look at the first listing of the 665 error above you can see it was trying to write to the snapshot file it was creating on the E: drive, which is where the primary DATA/MDF file for VLDB01 was located.

By default, CheckDB and its component commands use a snapshot of the database to perform their work.  As described here by Paul Randal (@PaulRandal/blog) from SQLskills:  http://sqlmag.com/blog/why-can-database-snapshot-run-out-space, snapshot files are “sparse” files that reserve a very small amount of space and then grow as needed to handle the required data.  Because of this mechanism, they do not require the full amount of space up front.

https://technet.microsoft.com/en-us/library/bb457112.f13zs11_big(l=en-us).jpg

A sparse file only uses the physical space required to hold the actual ("meaningful") data.  In the example shown in this diagram from Technet, a regular (non-sparse) file would be 17GB while a comparable sparse file would only be 7GB.

The text of the out-of-space error has since been updated from the error message seen in Paul’s article to the “transient database snapshot suspect” error we see above, as described here: http://www.sqlcoffee.com/Troubleshooting177.htm.
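
If you have a user-created snapshot and want to see how little physical space its sparse files are actually consuming, a query along these lines will show logical size versus size on disk.  This is just a sketch - the VLDB01_Snapshot name is the example manual snapshot created later in this post, and note that the transient snapshot CheckDB creates internally is hidden, so you can't inspect it this way:

-- Logical size vs. actual on-disk (sparse) size for a snapshot's files (pages are 8KB, so /128 = MB)
SELECT
    mf.name AS LogicalFileName,
    mf.physical_name AS Path,
    mf.is_sparse,
    mf.size / 128.0 AS LogicalSizeMB,
    vfs.size_on_disk_bytes / 1024.0 / 1024.0 AS OnDiskMB
FROM sys.master_files mf
JOIN sys.dm_io_virtual_file_stats(DB_ID(N'VLDB01_Snapshot'), NULL) vfs
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
WHERE mf.database_id = DB_ID(N'VLDB01_Snapshot');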

--

Looking at the E: drive, it was a 900GB drive with 112GB currently free.  The catch is that in the 675GB VLDB01 database there are two tables larger than 112GB and another that is almost 100GB!

Top 10 largest tables out of 1261 total tables in VLDB01:

| InstanceName | DatabaseName | TableName | NumberOfRows | SizeinMB | DataSizeinMB | IndexSizeinMB | UnusedSizeinMB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| INSTANCE99 | VLDB01 | BigTable1 | 1011522 | 136548.20 | 136523.80 | 10.71 | 13.69 |
| INSTANCE99 | VLDB01 | BigTable2 | 9805593 | 122060.29 | 114534.34 | 5709.13 | 1816.82 |
| INSTANCE99 | VLDB01 | BigTable3 | 17747326 | 91143.74 | 65405.88 | 25464.23 | 273.63 |
| INSTANCE99 | VLDB01 | BigTable4 | 137138292 | 78046.15 | 39646.33 | 38305.33 | 94.49 |
| INSTANCE99 | VLDB01 | Table01 | 1650232 | 46884.70 | 46422.93 | 419.40 | 42.37 |
| INSTANCE99 | VLDB01 | Table02 | 76827734 | 26780.02 | 9153.05 | 17566.23 | 60.75 |
| INSTANCE99 | VLDB01 | Table03 | 35370640 | 26766.98 | 20936.73 | 5733.40 | 96.86 |
| INSTANCE99 | VLDB01 | Table04 | 12152300 | 22973.11 | 11173.06 | 11764.65 | 35.40 |
| INSTANCE99 | VLDB01 | Table05 | 12604262 | 19292.02 | 7743.06 | 11511.93 | 37.03 |
| INSTANCE99 | VLDB01 | Table06 | 31649960 | 14715.57 | 5350.62 | 9327.30 | 37.65 |
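
For reference, a listing like the one above can be produced with a query along these lines.  This is just a rough sketch using sys.dm_db_partition_stats (not necessarily the exact query behind those numbers), and it rolls everything up into total reserved size rather than breaking out the separate data/index/unused columns:

-- Top 10 largest tables by reserved space in the current database (pages are 8KB, so /128 = MB)
SELECT TOP (10)
    @@SERVERNAME AS InstanceName,
    DB_NAME() AS DatabaseName,
    t.name AS TableName,
    SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS NumberOfRows,
    SUM(ps.reserved_page_count) / 128.0 AS SizeinMB
FROM sys.tables t
JOIN sys.dm_db_partition_stats ps ON ps.object_id = t.object_id
GROUP BY t.name
ORDER BY SUM(ps.reserved_page_count) DESC;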

The biggest unit of work in a CheckDB is the individual DBCC CHECKTABLE of each table, and trying to run a CHECKTABLE of a 133GB table in 112GB of free space was not going to fly.

Note that you don’t need 675GB of free space for the CheckDB snapshot of a 675GB database, just space for the largest object plus a little more.  145GB-150GB free should be sufficient to CheckDB this particular database as it currently stands, but we need to be mindful of these large tables: if they grow over time, they will require more CheckDB snapshot space as well.

--

There are a couple of potential fixes here.

First, and possibly most straightforward, would be to clear more space on E: or to expand the drive.  If we could get the drive to 150+GB free we should be good for the present (acknowledging the threat of future growth of the large tables).  The catch was that there were only three files on E:, and none of them had much useful free space to reclaim:

| DBFileName | Path | FileSizeMB | SpaceUsedMB | FreeSpaceMB |
| --- | --- | --- | --- | --- |
| VLDB01 | E:\SQL_Data\VLDB01.mdf | 654267.13 | 649746.81 | 4520.31 |
| VLDB01_data2 | E:\SQL_Data\VLDB01_1.ndf | 29001.31 | 28892.81 | 108.50 |
| VLDB01_CONFIG | E:\SQL_Data\VLDB01_CONFIG.mdf | 16.25 | 12.06 | 4.19 |
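
The per-file numbers above come from the standard file size and space-used properties; here is a quick sketch of that kind of query, run in the context of VLDB01 (data files only):

-- File size vs. space used for the data files of the current database (pages are 8KB, so /128 = MB)
SELECT
    name AS DBFileName,
    physical_name AS Path,
    size / 128.0 AS FileSizeMB,
    FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS SpaceUsedMB,
    (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS FreeSpaceMB
FROM sys.database_files
WHERE type_desc = 'ROWS';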

This means that going this route would require expanding the E: drive.  I would recommend expanding it by 100GB-150GB; this is more than we immediately need, but it should prevent us from having to ask for more space again in the short term.

ProTip - consider this approach any time you are asking for additional infrastructure resources.  Asking for just the amount of CPU/RAM/Disk/whatever that you need right now means you will probably need to ask again soon, and most infra admins I have known would rather give you more up front than have you bother them every month!

https://imgflip.com/i/1unt0z

(However, be realistic – don’t ask for an insane amount or you will just get shut down completely!)


--

Another option in this case, since INSTANCE99 is SQL Server Enterprise Edition, would be to create a manual snapshot somewhere else with more space and then run CheckDB against that manual snapshot.  This process is described here by Microsoft Certified Master Robert Davis (@SQLSoldier/blog): http://www.sqlsoldier.com/wp/sqlserver/day1of31daysofdisasterrecoverydoesdbccautomaticallyuseexistingsnapshot and is relatively straightforward:

--

1)  Create a snapshot of your database on a different drive.  A snapshot needs a sparse file (with a NAME matching the logical file name) for every data file in the source database, so for VLDB01 it would look something like this (the O:\Snap file names are just examples):

CREATE DATABASE VLDB01_Snapshot ON
(NAME = N'VLDB01', FILENAME = N'O:\Snap\VLDB01_Data.snap'),
(NAME = N'VLDB01_data2', FILENAME = N'O:\Snap\VLDB01_Data2.snap'),
(NAME = N'VLDB01_CONFIG', FILENAME = N'O:\Snap\VLDB01_CONFIG.snap')
AS SNAPSHOT OF VLDB01;

2)  Run CheckDB against the snapshot directly:

DBCC CHECKDB (VLDB01_Snapshot);

3)  Drop the snapshot – because the snapshot is functionally a database, this is just a DROP DATABASE statement:

DROP DATABASE VLDB01_Snapshot;

4)  Modify the existing job to exclude VLDB01 so that it doesn’t continue to try to run with the default internal process!

--

Luckily, in this case there were several drives with sufficient space!

--

I advised the client that if they preferred to go this second way (the manual snapshot), I strongly recommended removing any existing canned maintenance plans and changing this server to the Ola Hallengren scripted maintenance.  Not only is this my general recommendation anyway (#OlaRocks), but it also makes excluding a database much easier and safer.

To exclude a database under a regular maintenance plan you have to edit the job and manually check every database except the offending database, but this causes trouble when new databases are added to the instance as they must then be manually added to the maintenance plans.  Under the Hallengren scripts you can say “all databases except this one” which continues to automatically pick up new databases in the future (there is no “all but this one” option in a regular maintenance plan).

Here is what the command would look like under Ola:

EXECUTE dbo.DatabaseIntegrityCheck
@Databases = 'USER_DATABASES, -VLDB01',
@CheckCommands = 'CHECKDB'


--

If you find yourself in this situation, consider carefully which way you prefer to go, and document, document, document so that future DBAs know what happened (even if that future DBA is just you in 6/12/24 months!)

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuq98MpDcUSCOac_2XIIT6caragbp1G1gd3iAGjzneOWKoCgrTPnlWWFgWyttSVtMUXS1h-XxJthIe6BvSRl7iflQPXCSzS1KgqTtiwZaEKH36nLD8LMCFNlK30r2dT6aL02TCHbguH7Y/s1600/wait+here1b.jpg

Hope this helps!