One DBA's Ongoing Search for Clarity in the Middle of Nowhere


Yet Another Andy Writing About SQL Server

Thursday, September 14, 2017

Come See Me in Minnesota in October!

The schedule for SQL Saturday #682 Minnesota 2017 is out, and I will be giving my talk on Extended Events:

Getting Started with Extended Events
Speaker: Andy Galbraith
Duration: 60 minutes
Track: Administration
Few subjects in Microsoft SQL Server inspire the same amount of Fear, Uncertainty, and Doubt (FUD) as Extended Events.  Many DBA's continue to use Profiler and SQL Trace even though they have been deprecated for years.  Why is this?
Extended Events started out in SQL Server 2008 with no user interface and only a few voices in the community documenting the features as they found them.  Since then it has blossomed into a full feature of SQL Server and an amazingly low-impact replacement for Profiler and Trace.
Come learn how to get started - the basics of sessions, events, actions, targets, packages, and more.  We will look at some base scenarios where Extended Events can be very useful as well as considering a few gotchas along the way.  You may never go back to Profiler again!

The event is October 7th at St Paul College, and is *free* (with an optional $15 lunch).

I have spoken at numerous SQL Saturdays over the last few years and can definitely say the crew in Minnesota does a great job assembling a schedule and taking care of the attendees (and speakers!)

There is also a great slate of full day pre-conference sessions on Friday the 6th:

The Complete Primer to SQL Server Infrastructure and Cloud
David Klee
Level: Intermediate
Powering Up with Power BI
Brian Larson, John Thompson...
Level: Beginner
DBA Fundamentals - give yourself a solid SQL foundation!
Kevin Hill
Level: Beginner
Azure Data Services Hands On Lab
Matt Stenzel
Level: Intermediate

The pre-cons cost $110 which makes them a great value for a full day of training from some great SQL Server professionals!

Hope to see you there!

Wednesday, September 6, 2017

Toolbox - IO, IO, Why are You So Slow?

This is the next in a series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.


Troubleshooting performance in SQL Server (or almost any other system) is often an iterative process of discovering a bottleneck, fixing it, and then discovering the next bottleneck.
SQL Server bottlenecks frequently fall into one of three resource categories:
  • CPU
  • Memory
  • I/O (Disk)
For example, you may find that "your instance is slow" (Client-speak for "I'm unhappy - FIX IT!") and upon investigation you see in your Perfmon collector (you have a Perfmon collector running, right???) that CPU is consistently at 90%+.  You consider the evidence and realize a set of queries is poorly written (darn SharePoint) but you also realize that in this case you can fix them (whew!), and this helps the CPU numbers drop significantly.  You're the hero, right?

...until you realize the instance is still "slow."
You look at the numbers again, and now you find that disk latency, which had previously been fine, is now completely in the tank during the business day, showing that I/O delays are through the roof.

What happened?

This demonstrates the concept of shifting bottleneck - while CPU use was through the roof, the engine so bogged down that it couldn't generate that much I/O, but once the CPU issue was resolved queries started moving through more quickly until the next choke point was met at the I/O limit.  Odds are once you resolve the I/O situation, you would find a new bottleneck.

How do you ever defeat a bad guy that constantly moves around and frequently changes form?
The next concept in this series is the concept of "good enough" - all computing systems have theoretical top speeds (speed of electrons through the ether, etc.) but you never get there.  At the end of the day the important question is:

How fast does your system need to be to meet the business SLAs and keep your customers happy (aka QUIET)?

Unfortunately in many cases the answer seems to be this:,fl_progressive,g_center,h_450,q_80,w_800/1344998898279631762.jpg

Once you can determine this "good enough" speed (the whole discussion of  "Do you have SLAs?" could fill dozens of blog posts) you can determine where an acceptable end point is to your iterative process.


I have a small handful of things I check when a system is "slow" and one of them is disk latency.  At its core, disk latency is a calculation of how much delay there is between when SQL Server asks for data and when it gets it back.

The data regarding latency comes from the DMV sys.dm_io_virtual_file_stats, and the relevant metrics are io_stall, io_stall_read_ms, and io_stall_write_ms (table from Technet):

Column name Data type Description
database_id smallint ID of database.
file_id smallint ID of file.
sample_ms int Number of milliseconds since the computer was started. This column can be used to compare different outputs from this function.
num_of_reads bigint Number of reads issued on the file.
num_of_bytes_read bigint Total number of bytes read on this file.
io_stall_read_ms bigint Total time, in milliseconds, that the users waited for reads issued on the file.
num_of_writes bigint Number of writes made on this file.
num_of_bytes_written bigint Total number of bytes written to the file.
io_stall_write_ms bigint Total time, in milliseconds, that users waited for writes to be completed on the file.
io_stall bigint Total time, in milliseconds, that users waited for I/O to be completed on the file.
size_on_disk_bytes bigint Number of bytes used on the disk for this file. For sparse files, this number is the actual number of bytes on the disk that are used for database snapshots.
file_handle varbinary Windows file handle for this file.

How do you turn these numbers into something meaningful?  As is almost always the case - someone has already done that work - you just need to find it!

Paul Randal (blog/@PaulRandal) has a great blog post "How to examine IO subsystem latencies from within SQL Server" and in that post he references a query created by Jimmy May (blog/@aspiringgeek).  Jimmy was at Microsoft for many years and is currently at the SanDisk Data Propulsion Lab.

Jimmy's query is so useful that it is also included in Glenn Berry's (blog/@glennalanberry) great Diagnostic Information (DMV) Queries.  Here is my *very slightly* modified version of that query:

-- Drive level latency information (Query 19) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT @@ServerName as instanceName, [Drive], volume_mount_point,
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
WHEN io_stall_write_ms = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
WHEN io_stall_write_ms = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, UPPER (volume_mount_point) as volume_mount_point
,SUM(num_of_reads) AS num_of_reads,
        SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
        SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
        SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall
      FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
      INNER JOIN sys.master_files AS mf WITH (NOLOCK)
      ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
 CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.FILE_ID)
      GROUP BY LEFT(UPPER(mf.physical_name), 2),UPPER (volume_mount_point)) AS tab
-- Shows you the drive-level latency for reads and writes, in milliseconds
-- Latency above 20-25ms is usually a problem

The query performs a calculation comparing the stall in milliseconds to the number of disk operations to create a latency in milliseconds.  A rule of thumb (as noted in the query comments) is that anything >=20ms in latency is pretty bad and worthy of investigation.

Here is a sample output of the query:


instanceName Drive volume_mount_point Read Latency Write Latency Overall Latency Avg Bytes/ Read Avg Bytes/ Write Avg Bytes/ Transfer
INSTANCE35 F: F:\SPARE-01\ 69 7 64 1002982 236026 940840
INSTANCE35 F: F:\DATA01\ 26 1 25 55005 2629 54384
INSTANCE35 F: F:\280532-02_SECOND\ 8 5 8 105124 8257 105114
INSTANCE35 T: T:\ 2 25 3 65483 64899 65187
INSTANCE35 E: E:\LOGS01\ 7 0 0 20080 51327 51198


On this instance you can see that the SPARE-01 and DATA01 mount points on the F: drive have significant overall latency (64ms and 25ms respectively) - significant enough that the users are almost certainly experiencing I/O-related impact.

If you look at the query you will see that the "overall" latency covers both reads and writes and as such is functionally a weighted average of the read and write latencies.  For example you can see that the DATA01 does significantly more reads that writes since the overall latency of 25 is almost equal to the read latency of 26.

One more point from the sample numbers above - look at the T: drive.  The write latency is 25 (bad) but the overall latency is only 3.  Is this a problem?  Usually it is not, because this overall latency shows that this drive does relatively few writes - there are so many reads comparatively that the good read latency heavily outweighs the bad write latency.  In most cases the overall latency is the most important statistic of the three.

(Of course as always this is a big #ItDepends - if you have a known issue with write performance - even for the small number of writes being done on the drive - then the write latency is important!)


What causes latency?  There are a wide variety of issues - some inside SQL but many outside the engine as well.  In my experience I usually follow-up on both sides, tasking infrastructure teams to investigate numbers on the storage back-end *and* the storage network while the DBA investigates SQL Server-side I/O-related issues such as poor query plans and missing or bad indexes that cause excessive unnecessary I/O.  Sometimes you will read guidance to investigate SQL first (Paul makes that recommendation in his post referenced above) but I like starting both investigations at once so that you can get both sets of results in a more timely fashion.

Disk Latency is one of the easiest things to find - run the single simple query shown above and look at the numbers.  Once you determine that you *do* have a latency issue, you can move forward with internal SQL Server and external infrastructure investigations.


Remember - at the end of the day, you want your customer to think this:


Hope this helps!

Thursday, August 24, 2017

The Transient Database Snapshot Has Been Marked Suspect

Yet another tale from the ticket queue...

The DBCC CheckDB was failing on INSTANCE99 and after some investigation it looked like a space issue, not an actual corruption issue.

The Job Failure error text was this:


Executed as user: DOMAIN\svc_acct. Microsoft (R) SQL Server Execute Package Utility  Version 10.50.6000.34 for 64-bit  Copyright (C) Microsoft Corporation 2010. All rights reserved.    Started:  2:00:00 AM  Progress: 2017-08-20 02:00:01.11     Source: {11E1AA7B-A7AC-4043-916B-DC6EABFF772B}      Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  Progress: 2017-08-20 02:00:01.30     Source: Check Database Integrity Task      Executing query "USE [VLDB01]  ".: 50% complete  End Progress  Error: 2017-08-20 03:38:19.28     Code: 0xC002F210     Source: Check Database Integrity Task Execute SQL Task     Description: Executing the query "DBCC CHECKDB(N'VLDB01')  WITH NO_INFOMSGS  " failed with the following error: "Check terminated.  The transient database snapshot for database 'VLDB01' (database ID 5) has been marked suspect due to an IO operation failure.  Refer to the SQL Server error log for details.  A severe error occurred on the current command.  The results, if any, should be discarded.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.  End Error  Warning: 2017-08-20 03:38:19.28     Code: 0x80019002     Source: VLDB01 Integrity      Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED.  The Execution method succeeded, but the number of errors raised (1) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors.  End Warning  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  2:00:00 AM  Finished: 3:38:19 AM  Elapsed:  5899.51 seconds.  The package execution failed.  The step failed.


Looking in the SQL Error Log there were hundreds of these combinations in the minutes immediately preceding the job failure:


The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x000048a123e000 in file 'E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.


Error: 17053, Severity: 16, State: 1.


E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17: Operating system error 665(The requested operation could not be completed due to a file system limitation) encountered.


I have seen DBCC snapshot errors in the past and they almost always come back to disk space issues.  If you look at the first listing of the 665 error above you can see it was trying to write to the snapshot file it was creating on the E: drive, which is where the primary DATA/MDF file for VLDB01 was located.

By default, CheckDB and its component commands use a snapshot of the database to perform their work.  As described here by Paul Randal (@PaulRandal/blog) from SQLskills:, snapshot files are “sparse” files that reserve a very small amount of space and then grow as needed to handle the required data.  Because of this mechanism, they do not require the full amount of space up front.

A sparse file only uses the physical space required to hold the actual ("meaningful") data.  As seen in this diagram from Technet, in this example a regular/non-sparse file would be 17GB while a comparable sparse file would only be 7GB.

The text of the out of space error has since been updated from the error message seen in Paul’s article to the “transient database snapshot suspect” error we see above as described here


Looking at the E: drive it was a 900GB drive with 112GB currently free.  The catch is that in the 675GB VLDB01 database there are two tables larger than 112GB and another that is almost 100GB!

Top 10 largest tables out of 1261 total tables in VLDB01:


The biggest unit of work in a CheckDB is the individual DBCC CHECKTABLE’s of each table, and trying to run a CHECKTABLE of a 133GB table in a 112GB space was not going to fly.

Note that you don’t need 675GB of free space for the CheckDB snapshot of a 675GB database – just space for the largest object and a little more – 145GB-150GB free should be sufficient to CheckDB this particular database as it currently stands, but we need to be mindful of these large tables if they grow over time as they would then require more CheckDB snapshot space as well.


There are a couple of potential fixes here.

First and possibly most straightforward would be to clear more space on E: or to expand the drive – if we could get the drive to 150+GB free we should be good for the present (acknowledging the threat of future growth of the large tables).  The catch was that there were only three files on E: and none of them had much useful free space to reclaim:


This means that going this route would requiring expanding the E: drive.  I would recommend expanding it by 100GB-150GB – this is more than we immediately need but should prevent us from asking for more space in the short term. 

ProTip - consider this method any time you are asking for additional infrastructure resources – asking for just the amount of CPU/RAM/Disk/whatever that you need right now means you will probably need to ask again soon, and most infra admins I have known would rather give you more up front then have you bother them every month!

(However, be realistic – don’t ask for an insane amount or you will just get shut down completely!)


Another option in this case since INSTANCE99 is SQL Server Enterprise Edition would be to create a manual snapshot somewhere else with more space and then to run CheckDB against that manual snapshot.  This process is described here by Microsoft Certified Master Robert Davis (@SQLSoldier/blog): and is relatively straightforward:


1)  Create a snapshot of your database on a different drive – something like:


2)      Run CheckDB against the snapshot directly:


3)      Drop the snapshot – because the snapshot is functionally a database, this is just a DROP DATABASE statement:


4)      Modify the existing job to exclude VLDB01 so that it doesn’t continue to try to run with the default internal process!


Luckily, in this case there were several drives with sufficient space!


I advised the client that if they preferred to go this second way (the manual snapshot) I strongly recommend removing any existing canned maintenance plans and changing this server to the Ola Hallengren scripted maintenance.  Not only is this my general recommendation anyway (#OlaRocks), but it also makes excluding a database much easier and safer. 

To exclude a database under a regular maintenance plan you have to edit the job and manually check every database except the offending database, but this causes trouble when new databases are added to the instance as they must then be manually added to the maintenance plans.  Under the Hallengren scripts you can say “all databases except this one” which continues to automatically pick up new databases in the future (there is no “all but this one” option in a regular maintenance plan).

Here is what the command would look like under Ola:

EXECUTE dbo.DatabaseIntegrityCheck
@Databases = 'USER_DATABASES, -VLDB01',
@CheckCommands = 'CHECKDB'


If you find yourself in this situation consider carefully which way you prefer to go and document, document, document so that future DBA’s know what happened (even if that future DBA is just you in 6/12/24 months!)

Hope this helps!

Monday, July 24, 2017

Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

There were several different backup and status check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after a failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

Log Name:      System
Source:        User32
Date:          7/22/2017 8:31:08 AM
Event ID:      1074
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine

Always fun when an application “unexpectedly” restarts a server without advance warning.

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases” the availability group databases were not being backed up currently and hadn’t been since the failover:



The error messages on the SQL Error Log on the 01 node are telling:


Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    Error: 3041, Severity: 16, State: 1.
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid269
Message    FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:  
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
Date       7/24/2017 1:09:24 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid258
Message    Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server®
© 2016 Dell Inc.
           Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
           Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:

           BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.

 ** IMPORTANT **  - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to determine whether a differential is sufficient or a full backup is required each day), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):


The particular issue here is the fact that the AG’s aren’t weighted evenly, so even though it is “invalid” for the differentials, the AG still prefers NODE01 because it is more heavily weighted:


My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.

The cost to this is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups on NODE01 then requiring the backups to run on the primary node would not be different in most situations.

Note this is the state for this particular AG - in many cases part of the use of AG's is being able to offload backups to the secondary, so in many cases this is a cost to be weighed before making this change.

If you want to run backups on the secondary node then my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression – if you want/need to stay with differentials, we could use a scripted backup solution (like Ola Hallengren's Maintenance Solution) to enable/disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to ignore databases that aren’t the AG primary.  

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

SELECT AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName, 
HARS.role_desc as LocalReplicaRoleDesc, 
DRCS.database_name AS DatabaseName, 
HDRS.synchronization_state_desc as SynchronizationStateDesc, 
HDRS.is_suspended AS IsSuspended, 
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id 
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY, DRCS.database_name

Another option would be to go away from differentials completely and just run FULL’s and LOG’s which would allow for the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.) 


Hope this helps!