One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Monday, January 25, 2016

Speaking at SQL Saturday #500 in Boston!


SQLSaturday #500 - Boston 2016

I have been selected to speak at SQL Saturday #500 in Boston on March 19th - I will be speaking (at 9am!) on "Getting Started with Extended Events":

--
Few subjects in Microsoft SQL Server inspire the same amount of Fear, Uncertainty, and Doubt (FUD) as Extended Events.  Many DBAs continue to use Profiler and SQL Trace even though they have been deprecated for years.  Why is this?
Extended Events started out in SQL Server 2008 with no user interface and only a few voices in the community documenting the features as they found them.  Since then it has blossomed into a full feature of SQL Server and an amazingly low-impact replacement for Profiler and Trace. 
Come learn how to get started - the basics of sessions, events, actions, targets, packages, and more.  We will look at some base scenarios where Extended Events can be very useful as well as considering a few gotchas along the way.  You may never go back to Profiler again!
--
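
To give a small taste of the basics the session covers, here is a minimal sketch of an event session that captures statements running longer than one second - the session name, duration threshold, and file path are illustrative examples only, not material from the talk:

CREATE EVENT SESSION [CaptureSlowQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(ACTION (sqlserver.database_name, sqlserver.client_hostname)
WHERE duration > 1000000) /* duration is in microseconds, so this is > 1 second */
ADD TARGET package0.event_file
(SET filename = N'C:\Temp\CaptureSlowQueries.xel');

ALTER EVENT SESSION [CaptureSlowQueries] ON SERVER STATE = START;

--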

SQL Saturdays are put on by PASS and are amazing full-day events of free training and networking ($10 for lunch), with sessions presented by a wide array of speakers.  Many of the speakers are SQL Server MVPs, and the topics span fields from DBA to Developer to BI and beyond.

--

If you are interested there are also full day sessions (albeit not free!) Thursday and Friday from SQL Server MVP Jes Borland and MVP Denny Cherry respectively.  Jes will be presenting "How to Get Started Using Microsoft Azure" and Denny is covering "SQL Server Performance Tuning and Optimization" - if you are coming out Saturday and have the availability Thursday and/or Friday, consider them as well.

--

Hope to see you there!


Thursday, January 21, 2016

Counting Your VLFs, or, Temp Tables Inside IF...ELSE Blocks

There are many, many blog posts out there about Virtual Log Files (VLFs) - one of the absolute best is "8 Steps to Better Transaction Log Throughput" from Kimberly Tripp (blog/@KimberlyLTripp) of SQLskills, as well as the several other posts she links to from that post - if you haven't read them, click the link right now and do so - my post will be right here when you get back.

--

VLFs are the logical units that make up your transaction log, and at the end of the day the TL;DR boils down to "usually, too many VLFs are bad" - they can decrease performance by functionally "fragmenting" your transaction log and slowing down everything transaction log-related (so basically, everything).

[Phil Hartman Frankenstein meme - "Too Many VLFs BAD!"]
VLF count is something that most vendors check during their health checks, and many SQL Server pros recommend checking it as well.  Of course "too many VLFs" is a relative term, with people throwing around numbers like 50, 100, or 200 as their threshold of concern.

--

The resource I have always fallen back on to run this check is the script from Michelle Ufford (blog/@sqlfool).  She created it back in 2010 and it is the basis for the VLF script included in Glenn Berry's (blog/@GlennAlanBerry) Diagnostic Information (DMV) Queries.

Michelle's query relies on the undocumented DBCC LogInfo command to gather its VLF data - DBCC LogInfo returns a row for each VLF, so the count(*) of that query gives the number of VLFs for the database.  The catch is that in SQL Server 2012, Microsoft added a column to the front of the resultset (RecoveryUnitID).  As the DBCC command is undocumented, this new column is undocumented as well.
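
As an aside, for a quick one-off look at a single database you can run the DBCC command directly in that database's context and count the rows it returns (the database name below is hypothetical):

--

USE myDatabase;
DBCC LOGINFO; /* one row per VLF - the row count is the VLF count */

--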

Michelle's code uses INSERT...EXEC to populate a temporary table with the VLF info, and the addition of this extra column breaks the original script.  Glenn's versions of the scripts handle this issue easily since they are version-specific - in the SQL 2012/2014/2016 versions of the script, the temp table declaration is modified to include the extra RecoveryUnitID column, which allows the rest of the script to function as designed.

--

My problem was that I wanted a version of the script that could be used across all versions from SQL Server 2005 on, and the extra column made that tricky.  At first I tried to add an IF...ELSE block to the start of the script to handle the differing CREATE TABLE statements:

--

If (select LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10') /* 8.x = SQL 2000, 9.x = 2005, 10.x = 2008/2008R2 */
BEGIN
Create Table #stage
(
FileID int
, FileSize bigint
, StartOffset bigint
, FSeqNo bigint
, [Status] bigint
, Parity bigint
, CreateLSN numeric(38)
);
END
ELSE
BEGIN
Create Table #stage
(
RecoveryUnitID int /* This is the new column as of SQL 2012 */
, FileID int
, FileSize bigint
, StartOffset bigint
, FSeqNo bigint
, [Status] bigint
, Parity bigint
, CreateLSN numeric(38)
);
END

--


Regardless of the SQL version I tested, I received this:

Msg 2714, Level 16, State 1, Line 16
There is already an object named '#stage' in the database.

I played with it a little, including adding an IF EXISTS check to the beginning of the second block (and yes, I directed it at tempdb..#stage to reference the temp table), and none of it worked.  The underlying issue is that temp table names are bound when the batch is compiled, not when it runs, so two CREATE TABLE statements for the same #table in a single batch fail regardless of which IF branch would actually execute.  I poked around a little online and couldn't find a way to make it work - many people said that it couldn't be done with temp tables, and that you should use a "regular" table or maybe a view instead.

My problem is that I am creating a script I want to run on lots of different servers across lots of environments, and I don't want to assume that the table name I am using doesn't already exist.  Is it likely that a client server will have a table named dbo.AndyGVLFCountReallyUniqueTableNameGUIDPurpleMonkeyDishwasher?  Well no, but you never know...  Also, many environments have rules about creating "real" objects without change control - even an object that will be created, exist for <30 seconds, and be dropped.

Besides, at this point it had become a challenge of how to make it work - there had to be a different way of looking at the problem.  I fiddled with a table variable solution and had no better luck, resulting in a similar "already declared" error (a sketch of that attempt is below).
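
For reference, the table variable attempt looked something like this - two DECLAREs of the same variable name in one batch fail at parse time no matter which branch would actually run:

--

If (select LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10')
BEGIN
Declare @stage Table (FileID int, FileSize bigint, StartOffset bigint, FSeqNo bigint, [Status] bigint, Parity bigint, CreateLSN numeric(38));
END
ELSE
BEGIN
Declare @stage Table (RecoveryUnitID int, FileID int, FileSize bigint, StartOffset bigint, FSeqNo bigint, [Status] bigint, Parity bigint, CreateLSN numeric(38));
END
/* Fails with Msg 134: The variable name '@stage' has already been declared. */

--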

I realized part of the problem was how my script was laid out.  I was checking for the lower version as my decision gate (IF 8/9/10...ELSE), and while that was what needed to happen (I didn't want to hard-code 11/12/13 and have it break with future versions), the CREATE didn't need to be part of the check - I just needed to handle the fact that the down-level object couldn't have the offending column:

--

Create Table #stage
(
RecoveryUnitID int /* This is the new column as of SQL 2012 */
, FileID      int
, FileSize    bigint
, StartOffset bigint
, FSeqNo      bigint
, [Status]    bigint
, Parity      bigint
, CreateLSN   numeric(38)
);

If (select LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10')
ALTER TABLE #stage DROP COLUMN RecoveryUnitID

--


In this case, I was able to create the table *with* the extra column up front, and then run a version check to drop the column if the instance is down-level.

With this in hand, I was able to modify Michelle's script to run for all current versions of SQL:

--

/*

VLF Count Script

Modified From Michelle Ufford @sqlfool
http://sqlfool.com/2010/06/check-vlf-counts/

Added version check code due to changes in DBCC LOGINFO

Tested on MSSQL 2005/2008/2008R2/2012/2014

*/

/*
NOTE - the output of DBCC LogInfo adds an extra 
column as of SQL 2012 so there is a version check 
to drop that column for older versions
*/

Create Table #stage
(
RecoveryUnitID int /* This is the new column as of SQL 2012 */
, FileID int
, FileSize bigint
, StartOffset bigint
, FSeqNo bigint
, [Status] bigint
, Parity bigint
, CreateLSN numeric(38)
);

If (select LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10')
ALTER TABLE #stage DROP COLUMN RecoveryUnitID

Create Table #results(
Database_Name sysname
, VLF_count int 
);


Exec sp_msforeachdb N'Use [?]; 

Insert Into #stage 
Exec sp_executeSQL N''DBCC LogInfo([?])''; 

Insert Into #results 
Select DB_Name(), Count(*) 
From #stage; 

Truncate Table #stage;'

Select * 
From #results
Order By VLF_count Desc;

Drop Table #stage;
Drop Table #results;

--

I am happy with the final product (the modified VLF count script), but also with my brief path of discovery on handling temp tables in IF...ELSE blocks.  I know I have hit similar problems before as outputs vary from version to version, and now I have another idea to try the next time it comes up!

Hope this helps!


Tuesday, January 12, 2016

T-SQL Tuesday #74 - Who Grew The Database?



This month T-SQL Tuesday is hosted by Robert Davis (blog/@SQLSoldier) and his topic of choice is "Be The Change".

(If you don't know about T-SQL Tuesday, check out the information here - each month someone offers a new topic, which is a great excuse to write (and to start writing!) since the first step has already been taken for you!)

--

Robert's choice of topic threw me:
The topic for T-SQL Tuesday #74 is Be the Change. More specifically, data changes. How do you track changing data? How do you do your ETL? How do you clean or scrub your data? Anything related to changing data. Give us your most interesting post involving changing data next Tuesday, January 12, 2016.
As a mostly operational DBA, I have very limited contact with data changes and tracking them.  I have some limited experience with Change Data Capture, but others have already written about it and done a very good job (also - read Mickey Stuewe's (blog/@SQLMickey) post "Knowing When Data Changes Occur in Standard Edition" about how to work around CDC being Enterprise-only!)

I had a breakthrough earlier today reading Jason Brimhall's (blog/@sqlrnnr) post "All about the Change" in which he writes about changes as related to SQL Audit.  I realized that a fair amount of the administrative triage that I do is directly caused by data changes, especially large scale ones. Here is a solution that I have modified over the years to help me track the changes in my systems *caused* by data changes.  (see how I spun that?)

--

The changes in question here are file size changes - as operational DBAs one of the problems we constantly deal with is files that grow out of control, often from maintenance work such as index rebuilds or from unusual ad-hoc operations such as the analyst working at 1am trying to create a personal copy of the giant Sales table.

We all know that the "best" thing to do (remember, #ItDepends) is to appropriately size your DATA and LOG files ahead of time, and if possible to manually grow those files after hours so that there isn't any production impact.  Even in this absolute best case (almost theoretical) scenario, it is still usually right to leave auto-growth enabled "just in case."
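
A manual grow during a quiet window is just an ALTER DATABASE...MODIFY FILE with a new, larger SIZE - a quick sketch, with hypothetical database and file names:

--

ALTER DATABASE [myDatabase]
MODIFY FILE (NAME = N'myDatabase_log', SIZE = 16384MB); /* grow the log to 16GB in one controlled step */

--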

Very few of us live in that world, and we size our files as best as we can with almost no real business requirements using our DBA "Spidey Sense."  Our files are then "managed by auto-grow" as we try to find the best steady state for the DATA and LOG files and minimize impact while getting the best performance.

Does that sound familiar?

As a service provider, we monitor the drive free space and SQL Server Error Logs on our client servers (along with dozens of other things) - auto-growth problems can often be seen through messages like this:

Could not allocate space for object 'dbo.Table1'.'PK_Table1' in database 'myDatabase' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
...or...
The transaction log for database 'SomeOtherDatabase' is full due to 'ACTIVE_TRANSACTION'.
These messages usually mean that either the file has hit its MAXSIZE cap or that the drive hosting the file is full (or at least sufficiently full that it can't hold another FILEGROWTH increment worth of space). The first question that comes up in the root cause analysis is often "Who grew the database?"
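
Before chasing the "who," it is worth confirming the "what" - a quick way to check current size, growth increment, and MAXSIZE cap for every file on the instance is sys.master_files (a sketch - the size columns are in 8KB pages, so dividing by 128 yields MB; note a log file's "unlimited" cap shows as 2TB rather than -1):

--

SELECT DB_NAME(database_id) as DatabaseName
, name as LogicalFileName
, type_desc
, size / 128 as CurrentSizeMB /* size is in 8KB pages, so pages/128 = MB */
, CASE max_size WHEN -1 THEN 'Unlimited'
    ELSE cast(max_size / 128 as varchar(20)) + ' MB' END as MaxSize
, CASE is_percent_growth WHEN 1 THEN cast(growth as varchar(20)) + '%'
    ELSE cast(growth / 128 as varchar(20)) + ' MB' END as GrowthIncrement
FROM sys.master_files
ORDER BY DatabaseName, type_desc;

--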

Sometimes there are obvious culprits - is an index maintenance job running?  Is there a nightly ETL running that had an exceptionally large file last night?  Often there is no such smoking gun...and then you need something more.

--

I feel a little dirty writing about the Default Trace in the world of Extended Events, but I also know that many people simply don't know how to use XEvents yet, and the Default Trace can be faster if you already have it in your toolbox.  It also works all the way back to SQL 2005, whereas XEvents didn't arrive until SQL 2008.

I have modified this several times to improve it - I started with a query from Tibor Karaszi (blog/@TiborKaraszi), modified it with some code from Jason Strate (blog/@StrateSQL), and then modified that myself for what is included and what is filtered.  There are links to both Tibor's and Jason's source material in the code below.

--

/*

Default Trace Query

Especially useful for Auto Grow and Shrink Events 
but works for all default trace info

Modified from Tibor Karaszi at 
http://sqlblog.com/blogs/tibor_karaszi/archive/2008/06/19/did-we-have-recent-autogrow.aspx 

Default Trace Path query modified from
http://www.jasonstrate.com/2013/01/determining-default-trace-location/

*/

DECLARE @fn VARCHAR(1000), @df bit
SELECT @fn =REVERSE(SUBSTRING(REVERSE(path),CHARINDEX('\',REVERSE(path)),LEN(path)))+'log.trc',
/* reading log.trc instead of a specific file will query all current log files */
@df = is_default 
FROM sys.traces 
WHERE id = 1

IF @df = 0 OR @df IS NULL
BEGIN
  RAISERROR('No default trace running!', 16, 1)
  RETURN
END

SELECT te.name as EventName
, t.DatabaseName
, t.FileName
, t.TextData
, t.StartTime
, t.EndTime
, t.ApplicationName
, HostName
, LoginName
, Duration
, cast(Duration/1000000.0 as decimal(10,2)) as DurationSeconds
FROM fn_trace_gettable(@fn, DEFAULT) AS t  
INNER JOIN sys.trace_events AS te 
ON t.EventClass = te.trace_event_id  
WHERE 1=1 /* necessary to cleanly build WHERE clause */
/* find autogrowth events */ 
--and te.name LIKE '%grow%'
/* find autoshrink events */ 
--and te.name LIKE '%shrink%'
/* find manual shrink events */ 
--and (te.name = 'Audit DBCC Event' and (TextData like '%shrinkfile%' or TextData like '%shrinkdatabase%'))
--and DatabaseName='tempdb'
--and StartTime>'01/10/2016 00:00'
--and EndTime<='01/11/2016 13:00'
ORDER BY StartTime desc  

--

The base query pulls all events from the Default Trace.  As noted in the comment in the variable assignment for @fn, reading data from log.trc (rather than a specific file like log_05.trc or log_44.trc) combines the rows from all of the current default trace TRC files (up to five).

The WHERE clause is built so that you can uncomment whichever lines you need.  The initial 1=1 is present so that all of the commented out lines can start with an 'AND' to allow them to flow together regardless of which lines you uncomment.

Want to find autogrowth events for the Bob database?  Uncomment the "and te.name LIKE '%grow%'" and "and DatabaseName='Bob'" lines and you are set!  Need to add time filters?  Uncomment those lines and modify the times.  And so on...
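
For example, filtered down to autogrowth events in the hypothetical Bob database during a specific window, the tail of the query would look like this:

--

WHERE 1=1 /* necessary to cleanly build WHERE clause */
and te.name LIKE '%grow%'
and DatabaseName='Bob'
and StartTime>'01/10/2016 00:00'
and EndTime<='01/11/2016 13:00'
ORDER BY StartTime desc

--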

If you run the query, you can see that the autogrowth of Bob was caused by application "Sales" running on AppServer01, or by a .NET application on WebServer99 running as Domain\WebUser, or even by SQLCMD running on the server locally as the Agent service account (in my world this often means the growth is being caused by an Ola Hallengren maintenance job, since his SQL Agent jobs run under SQLCMD).

--

Remember, this can be done via XEvents as well (a topic for another blog post), and since Trace is deprecated that is the "better" way to do it - but the Default Trace still works and is quick to use.
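
For the curious, a minimal sketch of the XEvents equivalent might look something like this on SQL 2012+ (the session name and file path are my own examples, and the full walkthrough will have to wait for that other post):

--

CREATE EVENT SESSION [TrackFileGrowth] ON SERVER
ADD EVENT sqlserver.database_file_size_change
(ACTION (sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.username, sqlserver.sql_text))
ADD TARGET package0.event_file (SET filename = N'C:\Temp\TrackFileGrowth.xel');

ALTER EVENT SESSION [TrackFileGrowth] ON SERVER STATE = START;

--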

Hope this helps!


Friday, January 8, 2016

Why is my SQL Server using all of the RAM on the server?

The TL;DR on this is simple:

"BECAUSE IT'S SUPPOSED TO!"

A frequent complaint we receive comes from a client that has an in-house sysadmin monitoring tool like Microsoft SCOM/SCCM.  They turn the tool on and it starts red-alerting because the Windows server hosting SQL Server is at 90%+ used RAM.  The sysadmin (or local DBA) logs on to the server and finds that there is 5GB free (~4%), and the sqlservr.exe process is using 120GB of the 128GB on the server!
Like almost all things SQL Server, this is a simplification of a situation that requires some serious #ItDepends.

  • What is the baseline - how much RAM is usually free on this server?  This can be seen by tracking the Perfmon counter Memory\Available MBytes.
  • What is the Buffer Manager\Page Life Expectancy number - not right this second, but over time?
  • Is there any actual performance issue on the server?
I have written before about the need for baselines, especially regarding memory and CPU use on the Windows server itself.  As a DBA, I don't care that my Windows server is at 95% used RAM *as long as* neither Windows nor SQL Server is having performance problems *over time*.

I started my IT career on a support/help desk supporting servers and desktops, and 16 years ago being at 5% free RAM probably meant you had 500MB-1GB free, which was definitely an OMG! (although then we probably didn't actually say OMG.)

Today's servers often have 96GB or more of RAM, and the larger the server, the less relevant a percentage-based measurement is - just because your server has 256GB of RAM doesn't mean Windows is suddenly going to need 25GB+ of RAM to run cleanly.


This brings us back to our three questions above.  Rather than worrying about some artificial guideline like 10%, how much memory is usually free on this server?  Track Memory\Available MBytes over time and see where it bottoms out - if the regular operation of your server always leaves 2-4GB free, you are probably OK from a Windows point of view.
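
If you want a quick spot-check from inside SQL Server rather than Perfmon, sys.dm_os_sys_memory (SQL 2008+) exposes the same information:

--

SELECT total_physical_memory_kb / 1024 as TotalPhysicalMemoryMB
, available_physical_memory_kb / 1024 as AvailablePhysicalMemoryMB
, system_memory_state_desc
FROM sys.dm_os_sys_memory;

--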

What is the Page Life Expectancy of the SQL Server instance over time?  As many others have written, PLE is one of the most misused statistics in SQL Server.  A DBA checks Perfmon or runs a DMV query (like this one from the awesome set curated by Glenn Berry (blog/@GlennAlanBerry)):
-- Page Life Expectancy (PLE) value for each NUMA node in current instance  (Query 36) (PLE by NUMA Node)
SELECT @@SERVERNAME AS [Server Name], [object_name], instance_name, cntr_value AS [Page Life Expectancy]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Buffer Node%' -- Handles named instances
AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);
They find that the PLE is some relatively small number and freak out, or worse, they find it is a large number and think "my SQL Server doesn't need this much RAM, I can cut back."

Like Available MBytes, PLE is a number that needs to be measured over time.  On the servers I touch I always recommend setting up a "permanent" Perfmon collector of SQL Server-related counters so that you can look at the numbers over time. For example, a usual PLE chart over a week can look like this:


...or it can look like this:


Even over time, some people look at this second chart and say one of two things:

  • My PLE goes down to 14, what do I do?
  • My PLE maxes out at 50,000, what do I do?
Again, #ItDepends - what is going on in SQL Server during each of these times?  If you look at the above chart you can see that PLE generally climbs during the day (with some minor fluctuations), but then tanks overnight pretty regularly.

Look at when the PLE dips to 14 - is that when reindexing or CheckDB runs?  If so, is performance still good enough?  If CheckDB runs in two hours and your window is three to four hours, isn't that acceptable?  Are other things running at the same time that are negatively impacted, or is that window clear?

If you do need to improve the processes running when PLE dips, what is Available MBytes at that same time and over time?  If it is consistently high, there is headroom on your server to consider increasing Max Server Memory, which will help PLE stay higher.  If Available MBytes is not consistently high, you can't consider increasing Max Server Memory unless you add actual RAM to the server.

This is the 60-second version of quick memory troubleshooting, but think about it - isn't this (even just this small piece of it) a lot more complicated than "I have less than 10% free RAM"?


Is anybody or any process complaining (besides SCCM)?  Are any numbers out of line - for example is a virus scan that normally takes 2 hours suddenly taking 4-5?  If nothing else is abnormal, is there a problem?

--

As I mentioned at the start of this post, quite often your SQL Server process(es) are simply *supposed* to be using the lion's share of the resources on the Windows server - especially RAM.  As long as you reserve sufficient headroom for the operating system (usually 4GB-16GB, as described here by Jonathan Kehayias (blog/@SQLPoolBoy), and adjusted based on monitoring Available MBytes), the percentage of RAM used probably doesn't matter.
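
If your monitoring shows consistent headroom, adjusting Max Server Memory is a quick sp_configure change - the value below is purely an illustration for a 128GB server, so pick yours based on your own monitoring:

--

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 120832; /* ~118GB, leaving ~10GB for the OS - illustration only */
RECONFIGURE;

--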

...and if you have good sysadmins they will listen to you and realize that SQL Servers (like Oracle, Exchange, and other database servers) are simply different from web servers or file servers.

Hope this helps!