One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Tuesday, December 22, 2015

The transaction log for database 'ABC' is full due to 'ACTIVE_BACKUP_OR_RESTORE'

I recently had a client whose processes were failing each morning with this message in the SQL Server Error Log:
Date       12/01/2015 5:58:48 AM
Log        SQL Server (Current - 12/01/2015 1:00:00 PM)
Source     spid118
Message    Error: 9002, Severity: 17, State: 3.

--

Date       12/01/2015 5:58:48 AM
Log        SQL Server (Current - 12/01/2015 1:00:00 PM)
Source     spid118
Message    The transaction log for database 'ABC' is full due to 'ACTIVE_BACKUP_OR_RESTORE'.
Their question was how the log could be full, since the database is in SIMPLE recovery.

Their particular situation: database ABC is a 650GB OLTP database, and the nightly backup to a remote location was taking 7-8 hours over the period in question:

server_name  database_name  backup_start_date  backup_finish_date  physical_device_name          type  BackupSizeGB
-----------  -------------  -----------------  ------------------  ----------------------------  ----  ------------
Server1      ABC            12/4/2015 23:30    12/5/2015 7:11      VNBU0-29756-4248-1450758615   D     630
Server1      ABC            12/3/2015 0:00     12/3/2015 7:46      VNBU0-33644-31396-1450499421  D     630
Server1      ABC            12/1/2015 23:30    12/2/2015 6:22      VNBU0-30500-35052-1450413013  D     424
Server1      ABC            11/30/2015 23:30   12/1/2015 6:37      VNBU0-18236-33032-1450326613  D     468
Server1      ABC            10/30/2015 23:30   10/31/2015 4:51     VNBU0-5696-14276-1447734610   D     386
Server1      ABC            10/29/2015 0:05    10/29/2015 5:27     VNBU0-14976-21580-1447475427  D     378
Server1      ABC            10/27/2015 23:31   10/28/2015 4:59     VNBU0-18040-27960-1447389025  D     367
Server1      ABC            10/26/2015 23:31   10/27/2015 4:34     VNBU0-20180-26980-1447302625  D     356
Server1      ABC            10/25/2015 23:31   10/26/2015 5:00     VNBU0-22808-28180-1447216223  D     372
Server1      ABC            10/24/2015 23:31   10/25/2015 5:21     VNBU0-6160-15336-1447129821   D     372
Server1      ABC            10/23/2015 23:31   10/24/2015 5:01     VNBU0-5396-24044-1447043425   D     359
Server1      ABC            10/22/2015 23:31   10/23/2015 7:47     VNBU0-8796-18884-1446957027   D     375
Server1      ABC            10/21/2015 23:37   10/22/2015 5:29     VNBU0-18032-28004-1446870623  D     364
Server1      ABC            10/20/2015 23:31   10/21/2015 5:00     VNBU0-8692-19836-1446784216   D     371
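
For reference, history like the above comes from the backup tables in msdb - this is a sketch of the kind of query involved, where type 'D' marks a FULL database backup and the GB math matches the BackupSizeGB column:

SELECT
    @@SERVERNAME AS server_name,
    bs.database_name,
    bs.backup_start_date,
    bs.backup_finish_date,
    bmf.physical_device_name,
    bs.type,
    CAST(bs.backup_size / 1024.0 / 1024.0 / 1024.0 AS DECIMAL(10, 0)) AS BackupSizeGB
FROM msdb.dbo.backupset bs
INNER JOIN msdb.dbo.backupmediafamily bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = 'ABC'   /* the example database above */
    AND bs.type = 'D'            /* D = FULL database backup */
ORDER BY bs.backup_start_date DESC;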


The way FULL backups work in SQL Server, the transaction log is not released for re-use during a FULL backup, even if regular LOG backups are occurring or the database is in SIMPLE recovery.  This is because the portion of the LOG used during the FULL has to be persisted for the duration of the FULL so it can be backed up at the end - that is, the FULL backup includes the data as of the start of the FULL (23:30) *plus* the LOG used until the end of the FULL (in the case of the 12/04-12/05 backup, the LOG used from 23:30 to 07:11).  This is the meaning of the ACTIVE_BACKUP_OR_RESTORE message - the LOG is waiting for the end of the active FULL backup before it can be released for re-use, and in this case that was causing the LOG/LDF file to grow until it filled its mount point.
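
You can confirm this state while it is happening by checking log_reuse_wait_desc in sys.databases - for the duration of the FULL backup it will show ACTIVE_BACKUP_OR_RESTORE for the affected database:

SELECT
    name,
    recovery_model_desc,
    log_reuse_wait_desc
FROM sys.databases
WHERE name = 'ABC';   /* the example database above */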

This means the LOG file has to be large enough (or be able to auto-grow large enough) to hold all of the work done during the FULL backup.  On this system (like many others) maintenance runs overnight - the largest piece being a purge job that runs for 7-8 hours each night against database ABC, almost exactly during this same window.  That maintenance was generating hundreds of GB of LOG each night during the FULL backup, resulting in the errors shown above.
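
To see how close the LOG is to filling during the backup window, a quick check you can sample periodically while the FULL runs is DBCC SQLPERF, which reports log size and percent used for every database:

DBCC SQLPERF(LOGSPACE);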

For this client, the problem turned out to be that the NetBackup process performing these particular backups had been replaced by a snapshot-style backup and disabled at the end of October, but it had recently been accidentally re-enabled.  Shutting off this NetBackup schedule killed these backups, directly resolving the issue.

The alternative would be to have a LOG/LDF file large enough to hold the work done during the FULL backup - even in SIMPLE recovery this is required.  Another option would be to examine the nightly maintenance and move it to a different window, away from the FULL backup, to minimize the amount of LOG that needs to persist during the FULL.

Many people don't consider this situation, and it doesn't come up frequently - usually only in this specific case: a relatively large database with enough write traffic to fill the LOG during the FULL.

Hope this helps!

--

NOTE: as I was finishing polishing this for release I noticed that Erik Darling (blog/@dumb_chauffeur) released a similar post earlier today "What happens to transaction log backups during full backups?" on the Brent Ozar Unlimited blog.  His post has good detail on proving the situation's existence via the Log Sequence Numbers (LSNs) stored in the FULL and LOG backups.




Tuesday, December 8, 2015

T-SQL Tuesday #73 - Not Running CHECKDB is Naughty!


This month T-SQL Tuesday is hosted by Bradley Ball (blog/@SQLBalls) and his topic of choice is “Naughty or Nice.”

(If you don’t know about T-SQL Tuesday, check out the information here – each month someone offers a new topic, which is a great excuse to write regularly (and to start writing!) since the first step has already been taken for you.)

--

Probably the number one problem (Naughty, Naughty!) I find when I run across a new server is DBCC CHECKDB - or rather, the complete lack of CHECKDB.



I have heard any number of excuses over the years, but they are almost all some variant of this:

It takes too much time/CPU/RAM/IO/Wheatons to run CHECKDB on my databases.

To this I say one thing - <BLEEP!>

At the end of the day you can't afford *not* to run CHECKDB - you just can't.  Your database objects are only as reliable as your last clean CHECKDB (meaning your last CHECKALLOC + CHECKCATALOG + individual CHECKTABLEs on all of your tables - but more on that in a moment).

If you are not running regular CHECKDB, you may have any amount of unknown corruption across your databases, and you won't find out until someone tries to access the data.

Ask yourself - wouldn't you rather find out about data corruption from a regular maintenance process than from the Senior Vice President when she tries to run her Management Big Bonuses Report and it fails?
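
If you don't know when (or whether) CHECKDB last completed cleanly on a database, the undocumented-but-widely-known DBCC DBINFO exposes the last clean CHECKDB date - substitute your own database name:

DBCC DBINFO (N'YourDatabase') WITH TABLERESULTS;
/* look for the row where Field = 'dbi_dbccLastKnownGood' */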

My recommendation is always to run CHECKDB as often as is plausible, with a practical ceiling of once per day (there is no need to run CHECKDB every minute just because you can).

If you can't run daily CHECKDB's, then run it weekly (or even monthly), but run it.
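
If you are starting from zero, even the plain command in a scheduled job is a huge improvement - NO_INFOMSGS and ALL_ERRORMSGS keep the output focused on real problems (the database name here is just a placeholder):

DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;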

One thing I have done at multiple clients that don't have "sufficient" resources to run full CHECKDB on a regular basis is to split the job into the component parts mentioned above.  This comes from an idea originally shared by Paul Randal (blog/@PaulRandal) and works out like this:

First, analyze the size of your tables to create groupings of comparable size.  I like a personal variant of the script from "marc_s" at http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server that breaks out the objects:
SELECT
    s.Name AS SchemaName,
    t.Name AS TableName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 AS TotalSpaceKB,
    SUM(a.used_pages) * 8 AS UsedSpaceKB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables t
INNER JOIN sys.schemas s
    ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
    ON t.object_id = i.object_id
INNER JOIN sys.partitions p
    ON i.object_id = p.object_id
    AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
WHERE t.Name NOT LIKE 'dt%'    /* filter out system tables for diagramming */
    AND t.is_ms_shipped = 0
    AND i.object_id > 255
GROUP BY t.Name, s.Name, p.rows
ORDER BY TotalSpaceKB DESC
Next, after you have divided the tables into two to four groups, schedule jobs that run CHECKTABLE on each group, with a final job running a catch-all step for all tables not explicitly checked - something like this (I like Ola Hallengren's maintenance solution, so this syntax is from there, but the concept should be clear even if you aren't familiar with his code):


Job 1 - run CHECKTABLE on three big tables in BigDatabase01:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'BigDatabase01.dbo.BigTable01, BigDatabase01.dbo.BigTable02, BigDatabase01.dbo.BigTable03',
    @LogToTable = 'Y'

Job 2 - run CHECKTABLE on the next two big tables in BigDatabase01:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'BigDatabase01.dbo.BigTable04, BigDatabase01.dbo.BigTable05',
    @LogToTable = 'Y'

Job 3 - Step 1 - run CHECKTABLE on all tables in BigDatabase01 except the five big tables from Jobs 1 and 2.  This "catch-all" step is very important because you need to make sure new tables added to the database get checked - if all you have are steps running against individual tables, new tables never get checked!
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'ALL_OBJECTS, -BigDatabase01.dbo.BigTable01, -BigDatabase01.dbo.BigTable02, -BigDatabase01.dbo.BigTable03, -BigDatabase01.dbo.BigTable04, -BigDatabase01.dbo.BigTable05',
    @LogToTable = 'Y'

Job 3 - Step 2 - run CHECKTABLE on all tables in all databases except BigDatabase01.  The same "catch-all" logic applies here - if you run checks against individually named databases (rather than all databases except BigDatabase01), new databases never get picked up:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES, -BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'ALL_OBJECTS',
    @LogToTable = 'Y'

Job 3 - Step 3 - run CHECKALLOC on all databases:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKALLOC',
    @LogToTable = 'Y'

Job 3 - Step 4 - run CHECKCATALOG on all databases:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKCATALOG',
    @LogToTable = 'Y'

Over the course of three jobs (usually scheduled over three nights), this plan runs CHECKTABLE on all tables in all databases, runs CHECKALLOC on all databases, and runs CHECKCATALOG on all databases - BINGO! CHECKDB!

This is a relatively simple example with a server that has one large database with multiple large tables. (Of course, this is a fairly common situation on many of these "can't CHECKDB" servers.)  The same principle - divide the CHECKDB operation into component parts - allows me to schedule CHECKDB on multiple systems when a client says "oh we can *never* run CHECKDB here."

This same principle of exclusion is useful for starting to run CHECKDB (or its components) on systems with large databases or objects, even if you can't include the offending objects at first.  Usually when I find one of these servers, there are no CHECKDBs in place on *any* databases even though the problem is ReallyBigTable01 in BiggestDatabase01.  A great first step is to set up CHECKDB or its components on everything except the offending object(s) - as above, set up CHECKTABLE on everything on the server except ReallyBigTable01, plus CHECKCATALOG and CHECKALLOC everywhere, and you have much more protection while you figure out what to do about ReallyBigTable01.

--

At the end of the day - stay off the naughty list - 
RUN CHECKDB!







Thursday, December 3, 2015

What Do You Mean There is No Current Database Backup?!?

Yet another late night/early morning tale...

The LOG backups for multiple databases on a server failed overnight (not all of the databases, but many of them).  Looking in the output file from the LiteSpeed maintenance plan (the client's tool of choice), I found entries like this:

Database Database01: Transaction Log Backup...
Destination: "I:\MSSQL\Transaction_Log_Backups\Database01\Database01_tlog_201511211012.TRN"

SQL Litespeed™ Version 4.8.4.00086
Copyright (C) 2004-2006, Quest Software Inc.
Quest Software.
Registered Name: ServerA

Msg 62309, Level 19, State 1, Line 0
SQL Server has returned a failure message to LiteSpeed which has prevented the operation from succeeding.

The following message is not a LiteSpeed message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP LOG is terminating abnormally.
BACKUP LOG cannot be performed because there is no current database backup.

I checked the FULL backup job, and it had completed normally that night.

Looking back in the SQL Server Error Log, I saw that the problems started at or around 9 pm local server time the previous night.  At that time I found paired messages like these for *all* of the databases - not just the offending ones:

Date       11/20/2015 9:08:18 PM
Log        SQL Server (Current - 11/21/2015 10:16:00 AM)
Source     spid110
Message    Setting database option RECOVERY to SIMPLE for database Database01

--

Date       11/20/2015 9:08:22 PM
Log        SQL Server (Current - 11/21/2015 10:16:00 AM)
Source     spid111
Message    Setting database option RECOVERY to FULL for database Database01.

These messages showed that a process of some kind ran just after 9 pm that switched the databases from FULL recovery to SIMPLE and then back again.  This broke the LOG recovery chain and required new FULL backups before any LOG backups could succeed, which is why the LOG backup job was failing.
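
To see the mechanics in isolation, here is a minimal sketch (the database name and backup paths are hypothetical) of how the switch breaks the chain and how a new FULL repairs it:

ALTER DATABASE Database01 SET RECOVERY SIMPLE;
ALTER DATABASE Database01 SET RECOVERY FULL;

/* This now fails with Msg 4214 - BACKUP LOG cannot be performed
   because there is no current database backup */
BACKUP LOG Database01 TO DISK = N'I:\Backups\Database01.trn';

/* A new FULL (or DIFF) backup re-starts the LOG chain... */
BACKUP DATABASE Database01 TO DISK = N'I:\Backups\Database01.bak';

/* ...and LOG backups succeed again */
BACKUP LOG Database01 TO DISK = N'I:\Backups\Database01.trn';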

The regular nightly FULL backup job on this server runs at 6 pm and normally takes 3-4 hours to complete (that night it took 4.25).  The databases whose backups completed prior to the FULL>>SIMPLE>>FULL switch were the ones failing; the databases that weren't backed up until after the switch were still OK, because they had a FULL *after* the switch to re-start their LOG recovery chain.
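
When this happens you don't have to hunt through the error log to find the affected databases - for a database in FULL or BULK_LOGGED recovery, a NULL last_log_backup_lsn in sys.database_recovery_status means there is no current LOG chain and a FULL (or DIFF) backup is needed before LOG backups can succeed:

SELECT
    d.name,
    d.recovery_model_desc,
    drs.last_log_backup_lsn
FROM sys.databases d
INNER JOIN sys.database_recovery_status drs
    ON d.database_id = drs.database_id
WHERE d.recovery_model_desc <> 'SIMPLE'
    AND drs.last_log_backup_lsn IS NULL;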

The fix in this case was to run new FULL backups of the impacted databases, which thankfully I was able to do without much impact during the morning (although DIFF backups could have worked too, as discussed here by Paul Randal (blog/@PaulRandal)).

I never could get the users to admit who had made a change (although mysteriously it hasn't happened again :))
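
For next time, a server-level DDL trigger can at least record who ran the ALTER DATABASE - a minimal sketch, assuming a pre-created audit table (the DBAUtility.dbo.AuditDDL name here is hypothetical):

CREATE TRIGGER trg_Audit_AlterDatabase
ON ALL SERVER
FOR ALTER_DATABASE
AS
BEGIN
    DECLARE @evt XML = EVENTDATA();
    /* DBAUtility.dbo.AuditDDL is a hypothetical audit table - create it first */
    INSERT INTO DBAUtility.dbo.AuditDDL (EventTime, LoginName, CommandText)
    SELECT @evt.value('(/EVENT_INSTANCE/PostTime)[1]', 'datetime'),
           @evt.value('(/EVENT_INSTANCE/LoginName)[1]', 'sysname'),
           @evt.value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]', 'nvarchar(max)');
END;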

--

The cautionary tale part of this relates to the recovery switch.  It is not uncommon to see recommendations to switch the recovery model prior to a large data load, or even as part of a nightly import/export process, to manage transaction LOG growth (although depending on your operations, a switch to BULK_LOGGED can be just as effective without breaking the LOG chain - the trade-off is that you lose point-in-time recovery for times *during* any actual minimally logged operations).  I am not a fan of switching like this, but it is out there.

The takeaway is that you need to plan your backup/restore schedule accordingly - if you use a recovery model switch mechanism, make sure to schedule your FULL backups *after* the recovery switch so that this problem doesn't bite you.

Also - always scan vendor-provided scripts for recovery model changes.  We all know it is good practice to back up databases before and after major changes (we know that, right?), but it also often doesn't happen for one reason or another.  If a script changes the recovery model to SIMPLE and back, you will have this problem if you don't take a new FULL or DIFF backup after the script runs.

Hope this helps!