One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Thursday, December 13, 2018

Things I Learned at Summit v20 - Visual Studio Code

Part of a series of posts on cool stuff I learned at PASS Summit v20 (2018) - in this post we'll look at a code editing tool that I hadn't seen before - Visual Studio Code.

--

When I attend a session at a technical event - Summit/SQL Saturday/etc. - the question inevitably comes: are you a DBA or a Developer?

(Pictured: ScaryDBA and current PASS President Grant Fritchey at SQL Saturday Iowa City 2018)

I always answer that I'm a DBA, but really we're all Developers as well - it's just a question of which language we use.

I have been developing in T-SQL for almost 20 years, and for much of that I have been using the ultimate editing tool...Microsoft Notepad.

Yup...Notepad.

In the early days I developed in Query Analyzer (Yes, I'm old) and then Management Studio, but over time most of the work I have done is now at client sites on client servers, so I don't run Management Studio on my laptop very often (pretty much just when I write blogs and work on presentations actually).

I write my code in Notepad and then copy-paste into the client window and execute...and it has worked for me for some time.

Sure, there's no IntelliSense, but I lived with it.  There have been times in the past (although I haven't re-tested in some time) when Management Studio became a resource hog - especially with multiple sessions open - and made my laptop unhappy, so running little tiny Notepad was the trade-off, and it worked for me.

As a "T-SQL Developer" I never had enough need for full-on MS Visual Studio to justify the cost to myself or my employers.

--

In several sessions at Summit v20 I saw the presenters open a different tool; you could tell from the appearance that it wasn't commercial MS Visual Studio, but it wasn't Management Studio either.


The second time I saw it, I asked the speaker what they were running, and they gave a three-minute demo of an open source tool called Visual Studio Code.

https://code.visualstudio.com/

It is an open-source tool...

Yes, yes...OPEN SOURCE!  #TheWaveOfTheFuture

As I was saying, it is an open-source tool, and it runs on multiple operating systems, including Windows, macOS, and Linux.

It turned out that was why the speakers I saw were running it - it was a tool they could run on the MacBooks they use to present.

(Not going to get into a philosophical MS vs Mac argument here, but the next time you are at a SQL Server event, watch how many high-end presenters are using Macs - it may surprise you!)

I have played with it some since I returned from Summit and it is growing on me - the first time I opened a TSQL file it prompted me to install a SQL extension, which wasn't an issue.

I can see how this tool will be useful to all DBAs, but especially those who also develop in some other language - Visual Basic/Python/etc. - as well as those who prefer Mac or Linux.

--

Hope this helps!


Thursday, November 29, 2018

Things I Learned at Summit v20 - Trace Flag 4199

Part of a series of posts on cool stuff I learned at PASS Summit v20 (2018) - in this first post we'll look at a trace flag I had never even heard of, let alone deployed - 4199.

What is a Trace Flag?  According to Microsoft:
Trace flags are used to temporarily set specific server characteristics or to switch off a particular behavior. 
At the most basic, you can set or unset a trace flag in order to modify how SQL Server works.  This can be done globally (instance-wide) or for a particular session.
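
For example, here is a quick sketch of the standard commands (DBCC TRACEON/TRACEOFF/TRACESTATUS) - note that flags set this way don't survive a restart; for that you use the -T startup parameter:

-- Enable TF 4199 for the whole instance (-1 = global)
DBCC TRACEON (4199, -1);

-- Enable it for the current session only
DBCC TRACEON (4199);

-- See which trace flags are currently enabled globally
DBCC TRACESTATUS (-1);

-- Turn it back off globally
DBCC TRACEOFF (4199, -1);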

There are a LOT of trace flags, and several are widely used, such as 1204 and 1222 (Deadlock Info) and 1118 (Disable Single Page Allocations), but I haven't heard a Microsoft employee tell me I should "almost always" turn any trace flag on - until now.

--

On Tuesday of Summit week, I was in the pre-conference session put on by the Microsoft MSSQL Tiger Team - the Senior Product Managers of SQL Server Engineering (blog/@mssqltiger).

The session was titled "Modernize Your SQL Server with Bob Ward, the Tiger Team, and CSS Escalation Engineers" and it...was...awesome!

One of the presenters was Pedro Lopes (blog/@SQLPedro), a Senior PM for the Relational Engine.  In his part of the day he talked about several features of the engine and the optimizer, but the "What...what did he say?" moment for me was when he talked about trace flag 4199...and how we should have it turned on pretty much everywhere.

Wait...what?


Trace Flag 4199 enables query optimizer fixes that are included in CUs (and the CU breakpoint releases formerly known as "service packs") after RTM.

What does this mean?  If you aren't running with TF 4199, then your Query Optimizer (QO) is running the same as RTM, regardless of what CU level you have applied to your engine.

For example, if you are running SQL Server 2012 SP4 (11.0.7001.0) but you aren't running with TF 4199, then your QO is running the same as someone whose instance is running 2012 RTM (11.0.2100.60).  There are plenty of other things changed by applying the CU, but Query Optimizer function isn't one of them...without 4199.

That melted my mind a little, and I could tell the same was true for the attendees sitting near me - we can religiously apply CUs and fixes, but unless we're running this TF (which none of us had ever even *heard* of) we aren't actually fixing (improving) our Query Optimizer.
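
If you want to see what 4199 does to a single query before flipping it on instance-wide, the QUERYTRACEON hint scopes it to one statement (a sketch - the table below is made up, and the hint requires sysadmin unless it's wrapped in a plan guide):

SELECT CustomerID, SUM(OrderTotal) AS Total
FROM dbo.Orders               -- hypothetical table
GROUP BY CustomerID
OPTION (QUERYTRACEON 4199);   -- QO hotfixes on for this statement only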


Query Optimizer fixes are shipped as OFF by default in order to maintain compatibility and performance when CUs are deployed (so there aren't plan regressions).  Originally there was a new trace flag for each fix, so to turn on multiple fixes you would have to enable multiple trace flags.

As described at https://support.microsoft.com/en-us/help/974006/sql-server-query-optimizer-hotfix-trace-flag-4199-servicing-model, starting way back in SQL2005-SP3-CU6, SQL2008-SP1-CU7, and SQL2008R2 RTM, the ability to enable query optimizer fixes was rolled into one switch, TF 4199. 

While reading the article I tripped over one sentence:
Because trace flag 4199 is recommended only for customers who are seeing specific performance issues, customers are advised to remove trace flag 4199 after they migrate their databases to the latest compatibility level because trace flag 4199 will be reused for future fixes that may not apply to your application and could cause unexpected plan performance changes on a production system
This was interesting to see, because it ran contrary to the advice from this session and several other places I found references to enabling 4199, but this sentence reads like standard lawyerly boilerplate which is enough to make me treat it with healthy distrust.

(For this blog post I took a step I never have for a blog post before - I reached out to Pedro directly to verify my interpretation of his message to make sure I wasn't misrepresenting his content.  He graciously responded with a few comments, including the fact that he is trying to get the above highlighted sentence updated to reflect changes in how TF 4199 works under SQL 2016+ (see below)).

--

At the pre-conference session, Pedro's comments were basically that while there was a time when we all "waited for SP1" to let everyone else test, that isn't how the modern SQL Server development model works.  With the extensive testing that occurs these days as well as the "Cloud First" model that has everything live in production in Azure long before it gets to the on-premises product, there simply isn't that much risk to deploying RTM software or to deploying CUs as they come out.

IMPORTANT NOTE - There is still always (*always*) due diligence - patch DEV before PROD (we all do *that* at least, right?) - but by the time we get a new CU for an on-prem SQL Server it has already been used by thousands (hundreds of thousands) of users in Microsoft testing and on Azure.

In effect, we are already waiting for other people to test it, even without waiting at all.


While researching this topic I found another well-put opinion in one of the forum answers here by SQL MVP Gianluca Sartori (blog/@spaghettidba):

--
Personally, whenever I build a new server for a new project I always enable TF4199 globally. The same applies when I upgrade existing instances to newer versions.
The TF enables new fixes that would affect the behaviour of the application, but for new projects the risk of regression is not an issue. For instances upgraded from previous versions, the differences between old and new version are a concern on their own and having to deal with plan regression is expected anyway, so I prefer fighting with it with TF4199 enabled. <emphasis mine>
AS far as existing databases is concerned, there is only one way to know: test it. You can capture a workload on the existing setup and replay it after enabling the flag. RML Utilities can help you automate the process, as described in this answer.
Obviously, the flag affects the whole instance, so you'll have to test all the databases sitting there.
--

Another piece of the TF 4199 story is how it relates to database compatibility level.  Prior to SQL 2016, 4199 was all-or-nothing - if you were running SQL 2014 SP1 with 4199 enabled, you got the Query Optimizer fixes included in 2014 SP1, regardless of the DB Compat level.

Starting in SQL 2016, DB Compat Level is relevant with 4199 enabled:

https://sqlbits.com/Downloads/550/SQLBits%20-%20Enhancements%20that%20will%20make%20your%20SQL%20database%20engine%20Part%202.pdf

On a SQL Server 2016 server, running in 130 compatibility automatically enables fixes from previous compatibility levels, even without running 4199; the purpose of 4199 in SQL 2016 is to enable fixes released *after 2016 RTM* for databases in compatibility 130 - fixes from before 2016 RTM are already enabled by compatibility 130 itself.

This model is continued in 2017+ as described here.
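
Related to that newer model - starting in SQL 2016 there is also a per-database way to opt in to post-RTM optimizer fixes without flipping the instance-wide flag:

-- SQL Server 2016+: the per-database equivalent of TF 4199
ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = ON;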

This is consistent with another piece of info I learned at Summit v20 from the Tiger Team - start certifying for Compatibility Level, not SQL Server version.  When developing and testing, verify your code for DBCompat 130, *not* SQL Server 2016. (I may write more about that later.)
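
Checking (and setting) DBCompat is straightforward - the database name below is just a placeholder:

-- Where are my databases today?
SELECT name, compatibility_level FROM sys.databases;

-- Certify against a compat level, not a version
ALTER DATABASE [MyDatabase] SET COMPATIBILITY_LEVEL = 130;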

--

One other item Pedro highlighted in his email response to me was that there are multiple trace flags that Microsoft recommends for various systems, especially high-traffic/high-performance systems.  These still aren't "turn on everywhere" but are frequently useful.  

They are laid out in a pair of MS support articles:
Recommended updates and configuration options for SQL Server 2012 and SQL Server 2014 with high-performance workloads
Recommended updates and configuration options for SQL Server 2017 and 2016 with high-performance workloads
Many of these items and other related high-performance configurations are detailed at:
http://aka.ms/sqlperfcenter
Thanks Pedro for your response - and thanks Tiger Team for your amazing pre-con!


--

Hope this helps!

Wednesday, October 17, 2018

Invalid value given for parameter PASSWORD. Specify a valid parameter value.

I still use sp_help_revlogin to transfer logins between servers.  I know there are other options, including a set of Powershell commands contained in dbatools (#YouCanDoAnythingWithPowershell) but this is a case where the old method works so I don't mess with it.


Having said that, I know a lot of people still use sp_help_revlogin, so I wanted to document something I tripped over.

--

This morning, while trying to use a saved script to transfer a login to a new server, I ran into this:
Msg 15021, Level 16, State 2, Line 1 
Invalid value given for parameter PASSWORD. Specify a valid parameter value.

This was a new one on me, but a little Google led me to the problem, and it was described in the notes for the sp_help_revlogin procedure itself at https://support.microsoft.com/en-us/help/918992/how-to-transfer-logins-and-passwords-between-instances-of-sql-server:

--
A password can be hashed in the following ways:
  • VERSION_SHA1: This hash is generated by using the SHA1 algorithm and is used in SQL Server 2000 through SQL Server 2008 R2.
  • VERSION_SHA2: This hash is generated by using the SHA2 512 algorithm and is used in SQL Server 2012 and later versions.
--

Sure enough, I checked, and the server I was trying to deploy onto was a SQL 2008 R2 server (yes, our clients still have lots of 2005/2008/2008 R2).

The script I had saved looked like this:
CREATE LOGIN [MyLogin]
WITH PASSWORD = 0x0200A52324E0FFEBABFAD9327A62EECD2A5D648FF3B98D97C5D020FC16552BA728464BAAC04C8EFDE3BB0235A1F3648E419B038C04C8EFDE3BB0235A1F3648E419B0385A5933 HASHED
/* Hash slightly character masked :) */
, SID = 0x7FCDC703D88B5C48A47D7D13699658E2
, DEFAULT_DATABASE = [master]
, CHECK_POLICY = OFF
, CHECK_EXPIRATION = OFF
As a test, I ran sp_help_revlogin for the same login (same password) on a SQL 2008 server:
CREATE LOGIN [MyLogin]
WITH PASSWORD = 0x0100E02F8267F6081A5AB316F8A6891C81A5AB316F8A6891CCC HASHED
/* Hash slightly character masked :) */
, SID = 0x14B5D0191E908D499C83C1695F13F7F3
, DEFAULT_DATABASE = [master]
, CHECK_POLICY = OFF
, CHECK_EXPIRATION = OFF
Note that the second password hash is significantly shorter (which makes perfect sense, since SHA1 produces a shorter hash than SHA2 512, as described here).  That relative weakness is one of the key reasons SHA1 has fallen out of favor - there have been known issues with SHA1 for years.
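
If you want to check which flavor of hash a login currently has before you script it out, LOGINPROPERTY will show you the raw hash - the first two bytes give away the version (0x0100 = SHA1, pre-2012; 0x0200 = SHA2 512, 2012+), just like in the two scripts above:

-- Returns varbinary - look at the first two bytes for the hash version
SELECT LOGINPROPERTY('MyLogin', 'PasswordHash') AS PasswordHash;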

(Another reason to get off SQL 2005/2008/2008R2...as if you should need another reason in the year 2018!)


The bottom line is that to have login scripts like this, you need two versions, a SQL2005-2008R2 SHA1 version, and a SQL 2012+ SHA2 version.

*OR*

If you truly only want one copy of your script, it needs to be a SHA1 (short hash) version.  SQL 2005/2008/2008 R2 don't understand the 2012+ SHA2 long hash, but SQL 2012+ is backward-compatible to the SHA1 short hash.

Having said that, I would still use the SHA2 hash script whenever possible.

--

Hope this helps!


Wednesday, September 26, 2018

Coming Soon to Minnesota...Lincoln...and Seattle!!!

It's that time of year when SQLSaturdays start to slow down (in the U.S. anyway) as we march towards Summit and the holiday season.

Having said that, I have three speaking engagements in the near future - two SQLSaturdays and...SUMMIT!!!


I will be giving slight variations on the same talk at all three events as I tweak possible changes for my Summit talk.

--

SQLSaturday #796 - Minnesota - St Paul, MN - October 6th

SQLSaturday #767 - Lincoln - Lincoln, NE - October 27th

PASS Summit v.20 - Seattle, WA - November 5th-9th

--

Minnesota and Lincoln are always amazing events, and this is only my second time speaking at Summit.  I have spoken at more than a dozen SQLSaturdays and yet I get edgy thinking about being on stage at Summit (it's just a really big SQLSaturday, right?  ...right?)

--

The talk I am giving is my Extended Events talk:

--
Getting Started with Extended Events 
Few subjects in Microsoft SQL Server inspire the same amount of Fear, Uncertainty, and Doubt (FUD) as Extended Events. Many DBAs continue to use Profiler and SQL Trace even though they have been deprecated for years. Why is this? 
Extended Events started out in SQL Server 2008 with no user interface and only a few voices in the community documenting the features as they found them. Since then, it has blossomed into a full feature of SQL Server and an amazingly low-impact replacement for Profiler and Trace.
Come learn how to get started - the basics of sessions, events, actions, targets, packages, and more. We will look at some base scenarios where Extended Events can be very useful as well as considering a few gotchas along the way. You may never go back to Profiler again!
--

I have been tweaking it to add more demos and extend the length (Summit sessions are 75 minutes and most SQLSaturday sessions are 60-70 minutes.)  Even if you have seen me present this session before I hope you will get something out of it.

--

Another important part of SQLSaturdays and Summit to me is the pre-conference sessions/workshops.  They do cost an extra fee, but it is worth it for a full day of in-depth training on a topic by a dedicated speaker (events like this don't give preconference slots to just anybody).

All three of these events have preconference sessions, and as someone who is primarily a systems DBA, I recommend the following:

--

SQLSaturday Minnesota (Full List at https://www.eventbrite.com/e/sqlsaturday-796-minnesota-pre-cons-tickets-49272636783?aff=SQLSaturday)

Data Security for the Cloud by Ed Leighton-Dick from Kingfisher Technologies - I have known Ed for some time and he has put a lot of time into becoming a rising star in the security area.

--

SQLSaturday Lincoln (Full List at https://www.sqlsaturday.com/767/eventhome.aspx)

Level UP! Your Cloud Infrastructure Skills by David Klee of Heraflux - David is one of the top (if not *the* top) speakers on SQL Server on VMs and the cloud.

Data Security for the Cloud by Ed (see above) - if you don't see Ed in Minnesota you have another chance in Lincoln!

--

Summit (Full List at https://www.pass.org/summit/2018/Learn/Pre-ConferenceSessions.aspx)

SQL Server Security by Denny Cherry of Denny Cherry & Associates Consulting (DCAC) - Denny is a security (and all-around) SQL Server expert who always gives a great talk.

Modernize your SQL Server with Bob Ward, the Tiger Team, and CSS Escalation Engineers by Bob Ward and the SQL Tiger Team from Microsoft - nobody can melt your brain like Bob Ward.  Any time you can see Bob speak, you should...period.

Fixing Query Performance Problems from Estimates, Statistics, Heuristics, and Cardinality by Kimberly L. Tripp of SQLskills.com - Kimberly knows more about statistics and the related topics than anyone I know, and she shares it so well. 

Kimberly and Bob are in the same time slot (all-day Tuesday) and in this case I am going to see Bob but I have already had weeks of training from Kimberly and if her topic sounds more interesting to you it will be worth it!

--

Hope to see you at these events and in the preconference sessions!



Tuesday, September 4, 2018

Did Not Know That - Redirecting CMD to the Clipboard

One of the items I frequently deal with as a Production DBA is drive alarms.  I have written previously about looking for space inside database files, but what if your problem is non-database files?

The go-to tool in Windows for me has always been du (Disk Usage), created by Mark Russinovich (blog/@markrussinovich) as part of his Sysinternals project.

du is a very straightforward command-line tool that traverses the directory structure below the current path and returns the size of each folder it finds.  For example, the output for C:\Windows on a given server looks like this:


Fine to look at, but a bit onerous to manipulate - after all this is just a screenshot.

The first way I knew to deal with this was via the >> redirect:
du -l 1 >> Output4.txt
This creates a text file at the path being searched containing the output from the command, which can then be copy-pasted wherever you need it:



To clean it up for manipulation (in Excel for example), you can add a parameter to du to turn the output to comma- or tab-delimited format (either -c or -ct):
du -ct -l 1 >> output5.txt

--

The new trick I learned is that you can use a pipe - just as in Powershell - to redirect the output of the du command to another command, and that there is a built-in command that sends output directly to the Windows clipboard (called, unbelievably enough, "clip"):
du -ct -l 1 | clip
I found this trick in a helpful blog post here.

Once it's in the clipboard, it can be pasted straight into Excel where it can be easily sorted, for example to find the largest directories:


Pretty cool, huh?

This same process of piping into clip works with a wide variety of command line commands, but this is the one for which I see the most immediate value.

--

Hope this helps!


Thursday, August 2, 2018

The Mystery of the Exploding T-Log

There was a stack of errors overnight in the DB123 database on SQL01, including one horror show error:

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:39 AM
Event ID:      3314
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
During undoing of a logged operation in database 'DB123', an error occurred at log record ID (195237:96178:4). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

--

The catch is that when I checked the server, the database was online.

Say what?

--

It started with the LOG drive filling up, generating a long string of 9002's:

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:39 AM
Event ID:      9002
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
The transaction log for database 'DB123' is full due to 'ACTIVE_TRANSACTION'.

--

Followed by a “crash” including the horror show error:

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:39 AM
Event ID:      9001
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
The log for database 'DB123' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:39 AM
Event ID:      3314
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
During undoing of a logged operation in database 'DB123', an error occurred at log record ID (195237:96178:4). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

--

…which seems scary, but then the database seemed to magically recover on its own:

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:41 AM
Event ID:      17137
Task Category: Server
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
Starting up database 'DB123'.

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 12:49:45 AM
Event ID:      3450
Task Category: Server
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
Recovery of database 'DB123' (5) is 0% complete (approximately 4502 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

--

There was a huge string of Error 18456 (login failures) since DB123 was unavailable during regular recovery, followed eventually by:

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          5/24/2018 1:08:12 AM
Event ID:      3421
Task Category: Server
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      SQL01.client99.com
Description:
Recovery completed for database DB123 (database ID 5) in 1107 second(s) (analysis 52027 ms, redo 238039 ms, undo 700990 ms.) This is an informational message only. No user action is required.

--

…and all was well.


I went looking for similar situations and found this:


In this situation the drive filled and the database crashed, going through recovery once there was additional free space on the drive, which sounded exactly like the situation here.

In short, there was no actual corruption crash, just insufficient disk space.

--

Using the default trace query described here, I checked for autogrowths, and here was what they looked like:

EventName          | DatabaseName | FileName  | ApplicationName              | HostName | LoginName  | Duration | TextData
-------------------|--------------|-----------|------------------------------|----------|------------|----------|---------
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 1013000  | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 920000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 873000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 1083000  | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 646000   | NULL
Log File Auto Grow | DB123        | DB123_log | Ivanti                       | WEB05    | WebLogin01 | 903000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 880000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 846000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 1176000  | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 846000   | NULL
Log File Auto Grow | DB123        | DB123_log | .Net SqlClient Data Provider | WEB05    | WebLogin01 | 1416000  | NULL

Since the TEXTDATA is NULL I can’t say what process caused the growths for sure, but they started at 12:22am local server time. 

I could tell the client that since they came from the .NET provider they aren’t caused by Index/Statistics maintenance or anything else internal.
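
(The default trace query I use is along these lines - a sketch, where EventClass 93 is Log File Auto Grow and 92 is Data File Auto Grow:)

DECLARE @tracefile nvarchar(260);

-- Find the default trace file for this instance
SELECT @tracefile = [path]
FROM sys.traces
WHERE is_default = 1;

SELECT te.name AS EventName, t.DatabaseName, t.FileName, t.ApplicationName,
       t.HostName, t.LoginName, t.Duration, t.TextData, t.StartTime
FROM sys.fn_trace_gettable(@tracefile, DEFAULT) AS t
JOIN sys.trace_events AS te ON t.EventClass = te.trace_event_id
WHERE t.EventClass IN (92, 93)   -- Data/Log File Auto Grow
ORDER BY t.StartTime DESC;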

--

To try to run down what might be causing the autogrowths I set up an Extended Events session to capture statement completion events (rpc_completed, sp_statement_completed, and sql_batch_completed) to see what was actually running during the time slice.
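
For reference, the capture session was something like this minimal sketch - the session name and output path are placeholders:

CREATE EVENT SESSION [StatementCapture] ON SERVER
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.client_hostname, sqlserver.sql_text)),
ADD EVENT sqlserver.sp_statement_completed
    (ACTION (sqlserver.client_hostname, sqlserver.sql_text)),
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.client_hostname, sqlserver.sql_text))
ADD TARGET package0.event_file
    (SET filename = N'C:\Temp\StatementCapture.xel')
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);

ALTER EVENT SESSION [StatementCapture] ON SERVER STATE = START;

Sure enough, there was a smoking gun in the results: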



Check out the row near the middle with a very large duration, logical_reads count, *and* writes count.

Durations in SQL 2012 Extended Events are in microseconds – a very small unit.

The catch is the duration for this statement is 6,349,645,772 (6 billion) microseconds…105.83 minutes!

--

The DB123 database is in SIMPLE recovery (note - when I find this I question it - isn't point-in-time recovery important? - but that is another discussion).  The relevance here is that in SIMPLE recovery, LOG backups are irrelevant (and actually *can't* be run) - once a transaction or batch completes (and a checkpoint occurs), the LDF/LOG file space is marked available for re-use, meaning that *usually* the LDF file doesn't grow very large. (Usually!)

A database in SIMPLE recovery growing a large LDF/LOG file almost always means a long-running unit of work or an accidentally open transaction (a BEGIN TRAN with no subsequent COMMIT) - looking at the errors in the SQL Error Log last night, the 9002 log-file-full errors stopped at 1:45:51am server time, which means the offending unit of work ended then, one way or another (crash or success).
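
Two quick checks will confirm this scenario while it is happening (matching the ACTIVE_TRANSACTION reason in the 9002 errors above):

-- Why can't the log clear?  ACTIVE_TRANSACTION = an open/long-running unit of work
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'DB123';

-- Which is the oldest active transaction holding things up?
DBCC OPENTRAN (N'DB123');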

Sure enough, when I filtered the XEvents event file to events with a duration > 100 microseconds and then scanned down to 1:45:00, I quickly saw the row shown above.  Note this doesn't mean the unit of work was excessively large in CPU/RAM/IO/etc. (in fact the query errored out due to lack of LOG space) - but the excessive duration meant all of the tiny units of work from the previous 105 minutes had to persist in the transaction LDF/LOG file until this unit of work completed, preventing any of the LOG from that window from being marked for re-use until this statement ended.

--

The query in question is this:

--

exec sp_executesql N'DELETE from SecurityAction 
WHERE ActionDate < @1 AND 
(ActionCode != 102 AND ActionCode != 103 AND ActionCode != 129 
AND ActionCode != 130)',N'@1 datetime',@1='2017-09-14 00:00:00'

--

Stripping off the sp_executesql wrapper and the parameter replacement turns it into this:

--

DELETE from SecurityAction
WHERE ActionDate < '2017-09-14 00:00:00'
AND
(
ActionCode != 102
AND ActionCode != 103
AND ActionCode != 129
AND ActionCode != 130
)

--

Checking out DB123.dbo.SecurityAction, the table is 13GB with 14 million rows.  Looking at a random TOP 1000 rows I see rows that would satisfy this query that are at least 7 months old (from February 2017) and there may be even older rows – I didn’t want to spend the resources to run an ORDER BY query to find the absolute oldest.  I don’t know if this means this individual statement has been failing for a long time or maybe this DELETE statement is a recent addition to the overnight processes and has been failing since then.

The DELETE looks like it must be a scheduled/regular operation from a .NET application running on WEB05 since we are seeing it each night in a similar time window.

The immediate term action they chose was to run a batched DELETE to purge this table down – once it has been purged down maybe the nightly operation will run with a short enough duration to not break the universe.

I told them it should be able to be handled with a TOP DELETE something like this:

--

WHILE 1=1
BEGIN
       DELETE TOP (10000) from SecurityAction
       WHERE ActionDate < '2017-09-14 00:00:00'
       AND
       (
       ActionCode != 102
       AND ActionCode != 103
       AND ActionCode != 129
       AND ActionCode != 130
       )
END
      
--

This would DELETE rows in 10,000 row batches; wrapping it in the WHILE 1=1 allows the batch to run over and over without ending.  Once there are no more rows to satisfy the criteria the query *WILL CONTINUE TO RUN* but will simply delete -0- rows each time, and then when the user is ready to stop the batch they simply cancel it.

You could write a WHILE clause that actually checks for the existence of rows that meet the criteria, but in this case there were so many rows that meet the criteria that running the check itself is more intensive than simply allowing the DELETE to run over and over.
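
That said, if you do want the loop to stop on its own, checking @@ROWCOUNT right after the DELETE is essentially free (SQL Server tracks it anyway) - a variant of the same batch:

WHILE 1=1
BEGIN
       DELETE TOP (10000) from SecurityAction
       WHERE ActionDate < '2017-09-14 00:00:00'
       AND
       (
       ActionCode != 102
       AND ActionCode != 103
       AND ActionCode != 129
       AND ActionCode != 130
       )

       -- Stop once a batch deletes zero rows
       IF @@ROWCOUNT = 0 BREAK;
END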

It took several hours of running the "batched" loop, but eventually the table got down from 14 million rows to 900,000 rows without growing the LDF file past 2GB!


--

Lesson #1 - don't run giant deletes without considering the impact.

Lesson #2 (and maybe the more important) - "horror show" errors aren't always a horror show - always investigate fully!

--

Hope this helps!