Yet another tale from the ticket queue...
The DBCC CheckDB job was failing on INSTANCE99, and after some investigation it looked like a space issue, not an actual corruption issue.
--
The job failure error text was this:
--
Executed as user: DOMAIN\svc_acct. Microsoft (R) SQL Server Execute Package Utility Version 10.50.6000.34 for 64-bit Copyright (C) Microsoft Corporation 2010. All rights reserved.
Started: 2:00:00 AM
Progress: 2017-08-20 02:00:01.11 Source: {11E1AA7B-A7AC-4043-916B-DC6EABFF772B} Executing query "DECLARE @Guid UNIQUEIDENTIFIER EXECUTE msdb..sp...".: 100% complete End Progress
Progress: 2017-08-20 02:00:01.30 Source: Check Database Integrity Task Executing query "USE [VLDB01] ".: 50% complete End Progress
Error: 2017-08-20 03:38:19.28 Code: 0xC002F210 Source: Check Database Integrity Task Execute SQL Task Description: Executing the query "DBCC CHECKDB(N'VLDB01') WITH NO_INFOMSGS " failed with the following error: "Check terminated. The transient database snapshot for database 'VLDB01' (database ID 5) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details. A severe error occurred on the current command. The results, if any, should be discarded.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly. End Error
Warning: 2017-08-20 03:38:19.28 Code: 0x80019002 Source: VLDB01 Integrity Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED. The Execution method succeeded, but the number of errors raised (1) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors. End Warning
DTExec: The package execution returned DTSER_FAILURE (1).
Started: 2:00:00 AM Finished: 3:38:19 AM Elapsed: 5899.51 seconds.
The package execution failed. The step failed.
--
Looking in the SQL Server Error Log, there were hundreds of combinations of the following messages in the minutes immediately preceding the job failure:
--
The operating system returned error 665(The requested operation could
not be completed due to a file system limitation) to SQL Server
during a write at offset 0x000048a123e000 in file 'E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17'.
Additional messages in the SQL Server error log and system event log may
provide more detail. This is a severe system-level error condition that
threatens database integrity and must be corrected immediately. Complete a full
database consistency check (DBCC CHECKDB). This error can be caused by many
factors; for more information, see SQL Server Books Online.
--
Error: 17053, Severity: 16, State: 1.
--
E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17:
Operating system error 665(The requested operation could not be completed due to a file system
limitation) encountered.
--
I have seen DBCC snapshot errors in the past, and they almost always come back to disk space issues. If you look at the first listing of the 665 error above, you can see it was trying to write to the snapshot file it was creating on the E: drive, which is where the primary data/MDF file for VLDB01 was located.
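As an aside, you can check the free space on the volumes hosting the database files straight from T-SQL. Here is a minimal sketch using sys.dm_os_volume_stats (available on SQL Server 2008 R2 SP1 and later):

-- Free space on each volume hosting a VLDB01 data file
SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes / 1024 / 1024 / 1024     AS TotalGB,
       vs.available_bytes / 1024 / 1024 / 1024 AS FreeGB
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
WHERE mf.database_id = DB_ID('VLDB01');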
By default, CheckDB
and its component commands use a snapshot of the database to perform their
work. As described here by Paul Randal (@PaulRandal/blog)
from SQLskills: http://sqlmag.com/blog/why-can-database-snapshot-run-out-space,
snapshot files are “sparse” files that reserve a very small amount of space and
then grow as needed to handle the required data. Because of this mechanism,
they do not require the full amount of space up front.
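You can see the sparse behavior for yourself on a snapshot you create manually: sys.dm_io_virtual_file_stats reports the actual bytes on disk, which will be far smaller than the logical file size until changed pages are copied in. A quick sketch (the snapshot name here is just a placeholder):

-- Logical size vs. actual on-disk size of a snapshot's sparse file
-- ('VLDB01_Snapshot' is a placeholder name for illustration)
SELECT mf.name                              AS LogicalFileName,
       mf.size * 8 / 1024                   AS LogicalSizeMB,  -- mf.size is in 8KB pages
       vfs.size_on_disk_bytes / 1024 / 1024 AS OnDiskSizeMB    -- grows only as pages change
FROM sys.master_files AS mf
JOIN sys.dm_io_virtual_file_stats(DB_ID('VLDB01_Snapshot'), NULL) AS vfs
  ON vfs.database_id = mf.database_id
 AND vfs.file_id = mf.file_id;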
The text of the out-of-space error has since been changed from the message seen in Paul’s article to the “transient database snapshot suspect” error we see above, as described here: http://www.sqlcoffee.com/Troubleshooting177.htm.
--
Looking at the E: drive, it was a 900GB drive with 112GB currently free. The catch is that the 675GB VLDB01 database contains two tables larger than 112GB, and a third at roughly 90GB!
Top 10 largest
tables out of 1261 total tables in VLDB01:
InstanceName | DatabaseName | TableName | NumberOfRows | SizeinMB | DataSizeinMB | IndexSizeinMB | UnusedSizeinMB
INSTANCE99 | VLDB01 | BigTable1 | 1011522 | 136548.20 | 136523.80 | 10.71 | 13.69
INSTANCE99 | VLDB01 | BigTable2 | 9805593 | 122060.29 | 114534.34 | 5709.13 | 1816.82
INSTANCE99 | VLDB01 | BigTable3 | 17747326 | 91143.74 | 65405.88 | 25464.23 | 273.63
INSTANCE99 | VLDB01 | BigTable4 | 137138292 | 78046.15 | 39646.33 | 38305.33 | 94.49
INSTANCE99 | VLDB01 | Table01 | 1650232 | 46884.70 | 46422.93 | 419.40 | 42.37
INSTANCE99 | VLDB01 | Table02 | 76827734 | 26780.02 | 9153.05 | 17566.23 | 60.75
INSTANCE99 | VLDB01 | Table03 | 35370640 | 26766.98 | 20936.73 | 5733.40 | 96.86
INSTANCE99 | VLDB01 | Table04 | 12152300 | 22973.11 | 11173.06 | 11764.65 | 35.40
INSTANCE99 | VLDB01 | Table05 | 12604262 | 19292.02 | 7743.06 | 11511.93 | 37.03
INSTANCE99 | VLDB01 | Table06 | 31649960 | 14715.57 | 5350.62 | 9327.30 | 37.65
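If you want a similar listing on your own instance, here is a minimal sketch against sys.dm_db_partition_stats (one of many ways to write it; the column aliases just mirror the output above, and the data/index/unused breakdown is left out for brevity):

-- Top 10 largest tables by total reserved space
SELECT TOP (10)
       @@SERVERNAME                             AS InstanceName,
       DB_NAME()                                AS DatabaseName,
       s.name + '.' + t.name                    AS TableName,
       SUM(CASE WHEN ps.index_id IN (0, 1)
                THEN ps.row_count ELSE 0 END)   AS NumberOfRows,
       SUM(ps.reserved_page_count) * 8 / 1024.0 AS SizeinMB
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.dm_db_partition_stats AS ps ON ps.object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY SizeinMB DESC;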
The biggest unit of work in a CheckDB is the individual DBCC CHECKTABLE of each table, and trying to run a CHECKTABLE of a 133GB table with only 112GB free was not going to fly. Note that you don’t need 675GB of free space for the CheckDB snapshot of a 675GB database, just space for the largest object plus a little more. With the largest table at roughly 133GB, 145GB-150GB free should be sufficient to CheckDB this particular database as it currently stands, but we need to be mindful of these large tables: if they grow over time, they will require more CheckDB snapshot space as well.
--
There are a couple of potential fixes here.
First, and possibly most straightforward, would be to clear more space on E: or to expand the drive. If we could get the drive to 150+GB free, we should be good for the present (acknowledging the threat of future growth of the large tables). The catch was that there were only three files on E:, and none of them had much useful free space to reclaim:
DBFileName | Path | FileSizeMB | SpaceUsedMB | FreeSpaceMB
VLDB01 | E:\SQL_Data\VLDB01.mdf | 654267.13 | 649746.81 | 4520.31
VLDB01_data2 | E:\SQL_Data\VLDB01_1.ndf | 29001.31 | 28892.81 | 108.5
VLDB01_CONFIG | E:\SQL_Data\VLDB01_CONFIG.mdf | 16.25 | 12.06 | 4.19
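For reference, a query along these lines produces that listing; a minimal sketch using FILEPROPERTY against the current database:

-- Per-file size and internal free space for the current database
SELECT name                                                  AS DBFileName,
       physical_name                                         AS Path,
       size * 8 / 1024.0                                     AS FileSizeMB,  -- size is in 8KB pages
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024.0          AS SpaceUsedMB,
       (size - FILEPROPERTY(name, 'SpaceUsed')) * 8 / 1024.0 AS FreeSpaceMB
FROM sys.database_files;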
This means that going this route would require expanding the E: drive. I would recommend expanding it by 100GB-150GB; this is more than we immediately need, but it should prevent us from having to ask for more space again in the short term.
ProTip: consider this approach any time you are asking for additional infrastructure resources. Asking for just the amount of CPU/RAM/disk/whatever that you need right now means you will probably need to ask again soon, and most infra admins I have known would rather give you more up front than have you bother them every month!
(However, be realistic – don’t ask for an insane amount or you will just
get shut down completely!)
--
Another option in this case, since INSTANCE99 is SQL Server Enterprise Edition, would be to create a manual snapshot somewhere else with more space and then run CheckDB against that manual snapshot. This process is described here by Microsoft Certified Master Robert Davis (@SQLSoldier/blog): http://www.sqlsoldier.com/wp/sqlserver/day1of31daysofdisasterrecoverydoesdbccautomaticallyuseexistingsnapshot and is relatively straightforward:
--
1) Create a snapshot of your database on a different drive. Note that the snapshot needs an entry for every data file of the source database, and each NAME must match the source file’s logical name (the logical names here come from the file listing above; the .snap file names themselves can be whatever you like):
CREATE DATABASE VLDB01_Snapshot ON
    (NAME = N'VLDB01', FILENAME = N'O:\Snap\VLDB01_Data.snap'),
    (NAME = N'VLDB01_data2', FILENAME = N'O:\Snap\VLDB01_data2.snap'),
    (NAME = N'VLDB01_CONFIG', FILENAME = N'O:\Snap\VLDB01_CONFIG.snap')
AS SNAPSHOT OF VLDB01;
2) Run CheckDB against the snapshot directly:
DBCC CHECKDB (VLDB01_Snapshot);
3) Drop the snapshot. Because the snapshot is functionally a database, this is just a DROP DATABASE statement:
DROP DATABASE VLDB01_Snapshot;
4) Modify the existing job to exclude VLDB01 so that it doesn’t continue to try to run with the default internal snapshot process!
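As a quick sanity check before and after the run (and to spot any snapshot left behind by a failed job), snapshots show up in sys.databases with a non-NULL source_database_id:

-- List any database snapshots currently on the instance
SELECT name,
       DB_NAME(source_database_id) AS SourceDatabase,
       create_date
FROM sys.databases
WHERE source_database_id IS NOT NULL;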
--
Luckily, in this case
there were several drives with sufficient space!
--
I advised the client that if they preferred to go this second way (the manual snapshot), I strongly recommended removing any existing canned maintenance plans and switching this server to Ola Hallengren's maintenance scripts (https://ola.hallengren.com).
Not only is this my general recommendation anyway (#OlaRocks), but it also makes excluding a database much easier and safer.
To exclude a database under a regular maintenance plan, you have to edit the plan and manually check every database except the offending one, and this causes trouble when new databases are added to the instance, since they must then be manually added to the maintenance plans as well. Under the Hallengren scripts you can say “all databases except this one,” which continues to automatically pick up new databases in the future (there is no “all but this one” option in a regular maintenance plan).
Here is what the
command would look like under Ola:
EXECUTE dbo.DatabaseIntegrityCheck
    @Databases = 'USER_DATABASES, -VLDB01',
    @CheckCommands = 'CHECKDB';
--
If you find yourself in this situation, consider carefully which way you prefer to go, and document, document, document so that future DBAs know what happened (even if that future DBA is just you in 6/12/24 months!)
Hope this helps!