One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Wednesday, September 6, 2017

Toolbox - IO, IO, Why are You So Slow?

This is the next in a series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

--

Troubleshooting performance in SQL Server (or almost any other system) is often an iterative process of discovering a bottleneck, fixing it, and then discovering the next bottleneck.

https://memegenerator.net/instance/65795220/grumpy-cat-5-bottleneck-sighting-optimize-the-system
SQL Server bottlenecks frequently fall into one of three resource categories:
  • CPU
  • Memory
  • I/O (Disk)
For example, you may find that "your instance is slow" (Client-speak for "I'm unhappy - FIX IT!") and upon investigation you see in your Perfmon collector (you have a Perfmon collector running, right???) that CPU is consistently at 90%+.  You consider the evidence and realize a set of queries is poorly written (darn SharePoint) but you also realize that in this case you can fix them (whew!), and this helps the CPU numbers drop significantly.  You're the hero, right?

...until you realize the instance is still "slow."

https://hakanforss.files.wordpress.com/2014/08/lawofbottlenecks.png
You look at the numbers again, and now you find that disk latency, which had previously been fine, is now completely in the tank during the business day, showing that I/O delays are through the roof.

What happened?

This demonstrates the concept of shifting bottleneck - while CPU use was through the roof, the engine so bogged down that it couldn't generate that much I/O, but once the CPU issue was resolved queries started moving through more quickly until the next choke point was met at the I/O limit.  Odds are once you resolve the I/O situation, you would find a new bottleneck.

How do you ever defeat a bad guy that constantly moves around and frequently changes form?

https://movietalkexpress.files.wordpress.com/2016/08/3.jpg
The next concept in this series is the concept of "good enough" - all computing systems have theoretical top speeds (speed of electrons through the ether, etc.) but you never get there.  At the end of the day the important question is:

How fast does your system need to be to meet the business SLAs and keep your customers happy (aka QUIET)?

Unfortunately in many cases the answer seems to be this:

https://i.kinja-img.com/gawker-media/image/upload/s--Wtfu6LxV--/c_fill,fl_progressive,g_center,h_450,q_80,w_800/1344998898279631762.jpg
--

Once you can determine this "good enough" speed (the whole discussion of  "Do you have SLAs?" could fill dozens of blog posts) you can determine where an acceptable end point is to your iterative process.

--

I have a small handful of things I check when a system is "slow" and one of them is disk latency.  At its core, disk latency is a calculation of how much delay there is between when SQL Server asks for data and when it gets it back.

The data regarding latency comes from the DMV sys.dm_io_virtual_file_stats, and the relevant metrics are io_stall, io_stall_read_ms, and io_stall_write_ms (table from Technet):

Column name Data type Description
database_id smallint ID of database.
file_id smallint ID of file.
sample_ms int Number of milliseconds since the computer was started. This column can be used to compare different outputs from this function.
num_of_reads bigint Number of reads issued on the file.
num_of_bytes_read bigint Total number of bytes read on this file.
io_stall_read_ms bigint Total time, in milliseconds, that the users waited for reads issued on the file.
num_of_writes bigint Number of writes made on this file.
num_of_bytes_written bigint Total number of bytes written to the file.
io_stall_write_ms bigint Total time, in milliseconds, that users waited for writes to be completed on the file.
io_stall bigint Total time, in milliseconds, that users waited for I/O to be completed on the file.
size_on_disk_bytes bigint Number of bytes used on the disk for this file. For sparse files, this number is the actual number of bytes on the disk that are used for database snapshots.
file_handle varbinary Windows file handle for this file.

How do you turn these numbers into something meaningful?  As is almost always the case - someone has already done that work - you just need to find it!

Paul Randal (blog/@PaulRandal) has a great blog post "How to examine IO subsystem latencies from within SQL Server" and in that post he references a query created by Jimmy May (blog/@aspiringgeek).  Jimmy was at Microsoft for many years and is currently at the SanDisk Data Propulsion Lab.

Jimmy's query is so useful that it is also included in Glenn Berry's (blog/@glennalanberry) great Diagnostic Information (DMV) Queries.  Here is my *very slightly* modified version of that query:

--
-- Drive level latency information (Query 19) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT @@ServerName as instanceName, [Drive], volume_mount_point,
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, UPPER (volume_mount_point) as volume_mount_point
,SUM(num_of_reads) AS num_of_reads,
        SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
        SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
        SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall
      FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
      INNER JOIN sys.master_files AS mf WITH (NOLOCK)
      ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
 CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.FILE_ID)
      GROUP BY LEFT(UPPER(mf.physical_name), 2),UPPER (volume_mount_point)) AS tab
ORDER BY [Overall Latency] DESC OPTION (RECOMPILE);
-- Shows you the drive-level latency for reads and writes, in milliseconds
-- Latency above 20-25ms is usually a problem
--

The query performs a calculation comparing the stall in milliseconds to the number of disk operations to create a latency in milliseconds.  A rule of thumb (as noted in the query comments) is that anything >=20ms in latency is pretty bad and worthy of investigation.

Here is a sample output of the query:

--

instanceName Drive volume_mount_point Read Latency Write Latency Overall Latency Avg Bytes/ Read Avg Bytes/ Write Avg Bytes/ Transfer
INSTANCE35 F: F:\SPARE-01\ 69 7 64 1002982 236026 940840
INSTANCE35 F: F:\DATA01\ 26 1 25 55005 2629 54384
INSTANCE35 F: F:\280532-02_SECOND\ 8 5 8 105124 8257 105114
INSTANCE35 T: T:\ 2 25 3 65483 64899 65187
INSTANCE35 E: E:\LOGS01\ 7 0 0 20080 51327 51198

--

On this instance you can see that the SPARE-01 and DATA01 mount points on the F: drive have significant overall latency (64ms and 25ms respectively) - significant enough that the users are almost certainly experiencing I/O-related impact.

If you look at the query you will see that the "overall" latency covers both reads and writes and as such is functionally a weighted average of the read and write latencies.  For example you can see that the DATA01 does significantly more reads that writes since the overall latency of 25 is almost equal to the read latency of 26.

One more point from the sample numbers above - look at the T: drive.  The write latency is 25 (bad) but the overall latency is only 3.  Is this a problem?  Usually it is not, because this overall latency shows that this drive does relatively few writes - there are so many reads comparatively that the good read latency heavily outweighs the bad write latency.  In most cases the overall latency is the most important statistic of the three.

(Of course as always this is a big #ItDepends - if you have a known issue with write performance - even for the small number of writes being done on the drive - then the write latency is important!)

--

What causes latency?  There are a wide variety of issues - some inside SQL but many outside the engine as well.  In my experience I usually follow-up on both sides, tasking infrastructure teams to investigate numbers on the storage back-end *and* the storage network while the DBA investigates SQL Server-side I/O-related issues such as poor query plans and missing or bad indexes that cause excessive unnecessary I/O.  Sometimes you will read guidance to investigate SQL first (Paul makes that recommendation in his post referenced above) but I like starting both investigations at once so that you can get both sets of results in a more timely fashion.

Disk Latency is one of the easiest things to find - run the single simple query shown above and look at the numbers.  Once you determine that you *do* have a latency issue, you can move forward with internal SQL Server and external infrastructure investigations.

--

Remember - at the end of the day, you want your customer to think this:

http://imgur.com/waLAsxg

--

Hope this helps!


2 comments:

  1. Latency can change throughout the day, so I sample it around the clock so we can see it change. The code for how we do this is here:

    http://www.heraflux.com/resources/utilities/sql-server-storage-latency-collector/

    ReplyDelete