One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Tuesday, February 11, 2014

T-SQL Tuesday #051: Always Hedge Your Bets

 TSQL2sDay 
It's time for T-SQL Tuesday again and this month's host is my former co-worker, the @SQLRNNR himself, Jason Brimhall.  Jason's chosen theme is Poker Face, asking us to describe situations where someone took a bet that was just a little too risky for our liking.

When I think about users doing risky (AKA boneheaded) things, it almost always comes back to backups - or more accurately, the *lack* of backups.

"We are running a major migration *and* upgrade of our credit card processing system this weekend, but we only have a four hour window, so we won't take backups - we can just revert to last night's Windows backups, right?"

"We don't backup our system databases because they aren't important (even though they have hundreds of SSIS packages in msdb)."

"We backup our databases to the local X: drive, but we don't sweep them off to the network or an appliance - taking the backup is good enough, right?"

"We take full backups every single day and log backups every hour - but we don't have time or storage to test them."

To me, this last one is the most dangerous "risky thing" of all, because by my experience it is easily the most pervasive problem related to backups.  Almost all DBAs take regular backups of most or all of their databases, but very few seem to have regular test restores.

As Paul Randal (blog/@PaulRandal) noted in his excellent blog post (I don't know who said it first, because it is credited to too many different SQL Server professionals to count):

"You need to design a restore strategy, not a backup strategy."

One of the most important parts of a restore strategy is testing your backups.  Unfortunately in my 13+ years as a SQL Server DBA (including 4+ years as a Managed Services DBA/Consultant), I can count on one hand the businesses/clients that had a regular restore test environment and schedule, and even those businesses only did it for several "critical" databases.

Why does this happen?  I can think of a couple "reasons" (excuses) I have heard over the years:

Excuse #1 - "We can't afford the servers/time/licenses/whatever to test our backups."

BZZZZZ - False!  In reality you can't afford *not* to test your backups.  Especially in today's ever-increasingly virtualized world, it is becoming easier and easier to throw up more servers, and this often problematic fact (which can lead to server sprawl and other manageability issues for the DBA) can be leveraged in a positive way to create a backup/restore test environment that can be taken up and down or cloned as quickly as a mouse click. 

As Brent Ozar (blog/@BrentO) notes in his post "Dev, Test and Production SQL Server environments" the ideal situation is to perform nightly restores of your PROD environment into QA (with limitations related to sensitive data, etc. as Brent describes) which not only gives you the benefit of current data in your QA environment, but also tests your backup/restore process as well.

Paul Randal ("Mr. DBCC") notes in his thorough blog series on "CHECKDB From Every Angle" that an additional benefit of performing this type of backup/restore testing is the ability to run your regular CHECKDB - BTW - you do run regular CHECKDB's on all of your databases, user and system, right?  That's another poor DBA practice for another blog post - many DBAs only run CHECKDB once in a while or only when something goes wrong (which is usually too late).  You need to run CHECKDB as often as your system can handle it - daily is great but weekly is definitely better than many DBAs do.  As Paul succinctly puts it in SQLskills's great Immersion Event classes, your data is only as clean as of the time of your last successful CHECKDB. </rant>

As I was saying, with a regular backup/restore strategy, you can run your DBCC CHECKDB against that restored copy and a clean result will show you that your PROD database is clean *as of the time of the backup*.  This last part is especially relevant - if your PROD backup runs at 10pm and you restore it to QA at 4am and get a clean CHECKDB against QA, but then find out later that morning that there was a problem at 1am, your clean CHECKDB doesn't mean anything other than helping put a box around the problem - the corruption occurred sometime between the 10pm backup that resulted in the clean CHECKDB and 1am when the problem was discovered.

Excuse #2 - "We only need to test restores on our critical databases, right?  The other databases don't mean as much."

BZZZZZ - False!  At the risk of another rant, this leads me to another common misconception many DBAs/managers/carbon-based lifeforms have - "It's just DEV, so it doesn't matter."

It took me a few years to get it, but for some time my take on this scenario is:

"Every system is PROD to somebody."

What about a DEV system?  It's PROD for the development team - and heaven help you when the DEV databases that haven't been backed up for six months (you know, the bi-annual DEV backup scheme) goes down and loses all of the code for the latest release that was supposed to be checked into Source Control, but you know, they didn't have time and it shouldn't matter because the DBA backs up the databases, RIGHT?  To management and other IT teams, it is always the Default Blame Acceptor's fault - and you are kidding yourself if you think otherwise.

What about a QA system?  It's PROD to the QA team and their testers - and unless you have the nightly backup/restore to QA cycle described above in place, do you know how long it will take to refresh QA from PROD?  Especially if you have to scrub out the sensitive data - a restore of an already-scrubbed backup of QA will almost certainly be significantly faster than a "scrub & refresh" from PROD. How many hours of lost QA work will there be, possibly impacting an upcoming release date, before QA is back online?

I know that these comments are mostly written from a development shop point of view, but even if you don't build your systems - you just buy third-party software for everything - you still probably (should) have at least one layer of DEV/TEST/QA system to test regular patches and new vendor code releases on, and the same statements made above apply to those systems as well.  Consider this comment that could easily cost you your job: "Sorry, Mr. CIO, but the SharePoint upgrade for tomorrow can't happen because the QA team hasn't finished their work yet and QA is down with no database backups."

--

Too make a long story short (I know - too late) - without regular test restores, your backups do not provide much of a guarantee of recoverability.  (Even successful test restores don't 100% guarantee recoverability, but it's much closer to 100%).

WHATEVER YOU DO - DON'T STOP TAKING BACKUPS JUST BECAUSE YOU CAN'T REGULARLY TEST THEM - THIS WOULD BE WRONG-WRONG-WRONG.

Instead, use this a reminder that backups are good, but backups with test restores are much, much better!

1 comment:

  1. Hedging bets is something that is talked about more than it is understood. It’s also a concept that can be very dangerous because it can easily be used incorrectly in ways that negatively impact your bottom line.

    Sports Cash System Reviews

    ReplyDelete