Yet another tale from the ticket queue...
The DBCC CheckDB job was failing on INSTANCE99, and after some investigation it looked like a space issue, not an actual corruption issue.
--
The job failure error text was this:
--
Executed as user: DOMAIN\svc_acct. Microsoft (R) SQL Server Execute Package Utility Version 10.50.6000.34 for 64-bit Copyright (C) Microsoft Corporation 2010. All rights reserved.
Started: 2:00:00 AM
Progress: 2017-08-20 02:00:01.11 Source: {11E1AA7B-A7AC-4043-916B-DC6EABFF772B} Executing query "DECLARE @Guid UNIQUEIDENTIFIER EXECUTE msdb..sp...".: 100% complete End Progress
Progress: 2017-08-20 02:00:01.30 Source: Check Database Integrity Task Executing query "USE [VLDB01] ".: 50% complete End Progress
Error: 2017-08-20 03:38:19.28 Code: 0xC002F210 Source: Check Database Integrity Task Execute SQL Task Description: Executing the query "DBCC CHECKDB(N'VLDB01') WITH NO_INFOMSGS " failed with the following error: "Check terminated. The transient database snapshot for database 'VLDB01' (database ID 5) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details. A severe error occurred on the current command. The results, if any, should be discarded.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly. End Error
Warning: 2017-08-20 03:38:19.28 Code: 0x80019002 Source: VLDB01 Integrity Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED. The Execution method succeeded, but the number of errors raised (1) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors. End Warning
DTExec: The package execution returned DTSER_FAILURE (1).
Started: 2:00:00 AM Finished: 3:38:19 AM Elapsed: 5899.51 seconds.
The package execution failed. The step failed.
--
Looking in the SQL Server Error Log, there were hundreds of combinations of the following messages in the minutes immediately preceding the job failure:
--
The operating system returned error 665(The requested operation could
not be completed due to a file system limitation) to SQL Server
during a write at offset 0x000048a123e000 in file 'E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17'.
Additional messages in the SQL Server error log and system event log may
provide more detail. This is a severe system-level error condition that
threatens database integrity and must be corrected immediately. Complete a full
database consistency check (DBCC CHECKDB). This error can be caused by many
factors; for more information, see SQL Server Books Online.
--
Error: 17053, Severity: 16, State: 1.
--
E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17:
Operating system error 665(The requested operation could not be completed due to a file system
limitation) encountered.
--
I have seen DBCC snapshot errors in the past, and they almost always come back to disk space issues. If you look at the first listing of the 665 error above, you can see it was trying to write to the snapshot file it was creating on the E: drive, which is where the primary data/MDF file for VLDB01 was located.
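As an aside, you can check the free space on the volumes hosting the database files straight from T-SQL. Here is a minimal sketch using sys.dm_os_volume_stats (available on SQL Server 2008 R2 SP1 and later):

-- Free space on each volume hosting a VLDB01 data file
SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes / 1024 / 1024 / 1024     AS TotalGB,
       vs.available_bytes / 1024 / 1024 / 1024 AS FreeGB
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
WHERE mf.database_id = DB_ID('VLDB01');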
By default, CheckDB
and its component commands use a snapshot of the database to perform their
work. As described here by Paul Randal (@PaulRandal/blog)
from SQLskills: http://sqlmag.com/blog/why-can-database-snapshot-run-out-space,
snapshot files are “sparse” files that reserve a very small amount of space and
then grow as needed to handle the required data. Because of this mechanism,
they do not require the full amount of space up front.
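You can see the sparse behavior for yourself on a snapshot you create manually: sys.dm_io_virtual_file_stats reports the actual bytes on disk, which will be far smaller than the logical file size until changed pages are copied in. A quick sketch (the snapshot name here is just a placeholder):

-- Logical size vs. actual on-disk size of a snapshot's sparse file
-- ('VLDB01_Snapshot' is a placeholder name for illustration)
SELECT mf.name                              AS LogicalFileName,
       mf.size * 8 / 1024                   AS LogicalSizeMB,  -- mf.size is in 8KB pages
       vfs.size_on_disk_bytes / 1024 / 1024 AS OnDiskSizeMB    -- grows only as pages change
FROM sys.master_files AS mf
JOIN sys.dm_io_virtual_file_stats(DB_ID('VLDB01_Snapshot'), NULL) AS vfs
  ON vfs.database_id = mf.database_id
 AND vfs.file_id = mf.file_id;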
The text of the out-of-space error has since been changed from the message seen in Paul’s article to the “transient database snapshot suspect” error we see above, as described here: http://www.sqlcoffee.com/Troubleshooting177.htm.
--
Looking at the E: drive, it was a 900GB drive with 112GB currently free. The catch is that the 675GB VLDB01 database contains two tables larger than 112GB, and a third at roughly 90GB!
Top 10 largest
tables out of 1261 total tables in VLDB01:
InstanceName | DatabaseName | TableName | NumberOfRows | SizeinMB | DataSizeinMB | IndexSizeinMB | UnusedSizeinMB
INSTANCE99 | VLDB01 | BigTable1 | 1011522 | 136548.20 | 136523.80 | 10.71 | 13.69
INSTANCE99 | VLDB01 | BigTable2 | 9805593 | 122060.29 | 114534.34 | 5709.13 | 1816.82
INSTANCE99 | VLDB01 | BigTable3 | 17747326 | 91143.74 | 65405.88 | 25464.23 | 273.63
INSTANCE99 | VLDB01 | BigTable4 | 137138292 | 78046.15 | 39646.33 | 38305.33 | 94.49
INSTANCE99 | VLDB01 | Table01 | 1650232 | 46884.70 | 46422.93 | 419.40 | 42.37
INSTANCE99 | VLDB01 | Table02 | 76827734 | 26780.02 | 9153.05 | 17566.23 | 60.75
INSTANCE99 | VLDB01 | Table03 | 35370640 | 26766.98 | 20936.73 | 5733.40 | 96.86
INSTANCE99 | VLDB01 | Table04 | 12152300 | 22973.11 | 11173.06 | 11764.65 | 35.40
INSTANCE99 | VLDB01 | Table05 | 12604262 | 19292.02 | 7743.06 | 11511.93 | 37.03
INSTANCE99 | VLDB01 | Table06 | 31649960 | 14715.57 | 5350.62 | 9327.30 | 37.65
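If you want a similar listing on your own instance, here is a minimal sketch against sys.dm_db_partition_stats (one of many ways to write it; the column aliases just mirror the output above, and the data/index/unused breakdown is left out for brevity):

-- Top 10 largest tables by total reserved space
SELECT TOP (10)
       @@SERVERNAME                             AS InstanceName,
       DB_NAME()                                AS DatabaseName,
       s.name + '.' + t.name                    AS TableName,
       SUM(CASE WHEN ps.index_id IN (0, 1)
                THEN ps.row_count ELSE 0 END)   AS NumberOfRows,
       SUM(ps.reserved_page_count) * 8 / 1024.0 AS SizeinMB
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.dm_db_partition_stats AS ps ON ps.object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY SizeinMB DESC;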
The biggest unit of work in a CheckDB is the individual DBCC CHECKTABLE of each table, and trying to run a CHECKTABLE of a 133GB table with only 112GB free was not going to fly. Note that you don’t need 675GB of free space for the CheckDB snapshot of a 675GB database, just space for the largest object plus a little more. With the largest table at roughly 133GB, 145GB-150GB free should be sufficient to CheckDB this particular database as it currently stands, but we need to be mindful of these large tables: if they grow over time, they will require more CheckDB snapshot space as well.
--
There are a couple of potential fixes here.
First, and possibly most straightforward, would be to clear more space on E: or to expand the drive. If we could get the drive to 150+GB free, we should be good for the present (acknowledging the threat of future growth of the large tables). The catch was that there were only three files on E:, and none of them had much useful free space to reclaim:
DBFileName | Path | FileSizeMB | SpaceUsedMB | FreeSpaceMB
VLDB01 | E:\SQL_Data\VLDB01.mdf | 654267.13 | 649746.81 | 4520.31
VLDB01_data2 | E:\SQL_Data\VLDB01_1.ndf | 29001.31 | 28892.81 | 108.5
VLDB01_CONFIG | E:\SQL_Data\VLDB01_CONFIG.mdf | 16.25 | 12.06 | 4.19
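For reference, a query along these lines produces that listing; a minimal sketch using FILEPROPERTY against the current database:

-- Per-file size and internal free space for the current database
SELECT name                                                  AS DBFileName,
       physical_name                                         AS Path,
       size * 8 / 1024.0                                     AS FileSizeMB,  -- size is in 8KB pages
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024.0          AS SpaceUsedMB,
       (size - FILEPROPERTY(name, 'SpaceUsed')) * 8 / 1024.0 AS FreeSpaceMB
FROM sys.database_files;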
This means that going this route would require expanding the E: drive. I would recommend expanding it by 100GB-150GB; this is more than we immediately need, but it should prevent us from having to ask for more space again in the short term.
ProTip: consider this approach any time you are asking for additional infrastructure resources. Asking for just the amount of CPU/RAM/disk/whatever that you need right now means you will probably need to ask again soon, and most infra admins I have known would rather give you more up front than have you bother them every month!
(However, be realistic – don’t ask for an insane amount or you will just
get shut down completely!)
--
Another option in this case, since INSTANCE99 is SQL Server Enterprise Edition, would be to create a manual snapshot somewhere else with more space and then run CheckDB against that manual snapshot. This process is described here by Microsoft Certified Master Robert Davis (@SQLSoldier/blog): http://www.sqlsoldier.com/wp/sqlserver/day1of31daysofdisasterrecoverydoesdbccautomaticallyuseexistingsnapshot and is relatively straightforward:
--
1) Create a snapshot of your database on a different drive. Note that the snapshot needs an entry for every data file of the source database, and each NAME must match the source file’s logical name (the logical names here come from the file listing above; the .snap file names themselves can be whatever you like):
CREATE DATABASE VLDB01_Snapshot ON
    (NAME = N'VLDB01', FILENAME = N'O:\Snap\VLDB01_Data.snap'),
    (NAME = N'VLDB01_data2', FILENAME = N'O:\Snap\VLDB01_data2.snap'),
    (NAME = N'VLDB01_CONFIG', FILENAME = N'O:\Snap\VLDB01_CONFIG.snap')
AS SNAPSHOT OF VLDB01;
2) Run CheckDB against the snapshot directly:
DBCC CHECKDB (VLDB01_Snapshot);
3) Drop the snapshot. Because the snapshot is functionally a database, this is just a DROP DATABASE statement:
DROP DATABASE VLDB01_Snapshot;
4) Modify the existing job to exclude VLDB01 so that it doesn’t continue to try to run with the default internal snapshot process!
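As a quick sanity check before and after the run (and to spot any snapshot left behind by a failed job), snapshots show up in sys.databases with a non-NULL source_database_id:

-- List any database snapshots currently on the instance
SELECT name,
       DB_NAME(source_database_id) AS SourceDatabase,
       create_date
FROM sys.databases
WHERE source_database_id IS NOT NULL;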
--
Luckily, in this case
there were several drives with sufficient space!
--
I advised the client that if they preferred to go this second way (the manual snapshot), I strongly recommended removing any existing canned maintenance plans and switching this server to Ola Hallengren's maintenance scripts (https://ola.hallengren.com).
Not only is this my general recommendation anyway (#OlaRocks), but it also makes excluding a database much easier and safer.
To exclude a database under a regular maintenance plan, you have to edit the plan and manually check every database except the offending one, and this causes trouble when new databases are added to the instance, since they must then be manually added to the maintenance plans as well. Under the Hallengren scripts you can say “all databases except this one,” which continues to automatically pick up new databases in the future (there is no “all but this one” option in a regular maintenance plan).
Here is what the
command would look like under Ola:
EXECUTE dbo.DatabaseIntegrityCheck
    @Databases = 'USER_DATABASES, -VLDB01',
    @CheckCommands = 'CHECKDB';
--
If you find yourself in this situation, consider carefully which way you prefer to go, and document, document, document so that future DBAs know what happened (even if that future DBA is just you in 6/12/24 months!)
Hope this helps!