One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Friday, February 24, 2017

How many Backups is Too Many?

Does this title seem strange?  Of course we know this is the true answer:

https://cdn.meme.am/cache/instances/folder633/58097633.jpg

But here is a story of a client that may have too many backups after all.

--

Our monitoring system found that the backups on a client server had suddenly started running very long.  When I signed on to look I saw that the nightly backups were still running even the next morning!  These were the three currently running processes related to backups:

--

LINE
dd hh:mm:ss.mss
session_id
sql_text
login_name
1
00 15:24:36.674
134
backup database [BobData] to virtual_device = N'CA_BAB_MSSQL_a786700_10002_4c59b51f_BobData' with differential, blocksize = 65536, buffercount = 1, maxtransfersize = 2097152
DOMAIN1\sa_arcbackup
2
00 11:27:32.043
165
BACKUP DATABASE [BobData] TO  DISK = N'B:\SQL_BAC\BobData\BobData_backup_2017_02_23_020011_9582786.bak' WITH  RETAINDAYS = 10, NOFORMAT, NOINIT,  NAME = N'BobData_backup_2017_02_23_020011_9582786', SKIP, REWIND, NOUNLOAD, COMPRESSION,  STATS = 10
DOMAIN1\sqlcluster
3
00 02:58:20.540
84
backup database [FredData] to virtual_device = N'CA_BAB_MSSQL_a786080_10001_4f06c827_FredData' with blocksize = 65536, buffercount = 1, maxtransfersize = 2097152
DOMAIN1\sa_arcbackup



LINE
wait_info
CPU
tempdb_allocations
tempdb_current
blocking_session_id
1
(55475564ms)ASYNC_IO_COMPLETION
236
0
0
NULL
2
(41252214ms)LCK_M_U
0
0
0
134
3
(10699785ms)ASYNC_IO_COMPLETION
357
0
0
NULL



LINE
percent_complete
host_name
database_name
program_name
start_time
login_time
1
79.8087
Server5
BobData
Arcserve Backup
2/22/2017 22:02
2/22/2017 22:02
2
NULL
Server1
BobData
Microsoft SQL Server Management Studio
2/23/2017 2:00
2/23/2017 2:00
3
18.4672
Server5
FredData
Arcserve Backup
2/23/2017 10:30
2/23/2017 1:05


The middle process (Line 2 – session ID/SPID 165) is a regular BACKUP DATABASE statement that had been running since 2am local server time.  As you can see from the wait info, the entire life of this SPID has been spent waiting for an update lock (LCK_M_U).  Also notice that the Percent Complete is still NULL because it is being blocked by the first SPID (134) and hasn’t done anything.

The first and third SPID’s (134 and 84) are Arcserve backups of the BobData and FredData databases respectively.  Looking in the backup history in msdb for the BobData database I see this:

database_name
physical_device_name
type
backup_start_date
backup_finish_date
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_22_020015_7258925.bak
D
02/22/2017 04:29:47
02/22/2017 04:40:41
BobData
CA_BAB_MSSQL_8ea33d0_10002_4732d2d8_BobData
I
02/21/2017 22:01:34
02/22/2017 04:29:44
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_21_020010_3116691.bak
D
02/21/2017 11:54:41
02/21/2017 12:05:22
BobData
CA_BAB_MSSQL_8c6ff40_10002_420ce264_BobData
I
02/20/2017 22:02:02
02/21/2017 11:54:39
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_18_020007_3945711.bak
D
02/19/2017 23:39:43
02/19/2017 23:50:38
BobData
CA_BAB_MSSQL_8d4b360_10001_329872eb_BobData
D
02/17/2017 22:00:34
02/19/2017 23:39:41

The Arcserve backup was kicking off around 10pm local server time, and taking a widely variable amount of time to complete – the first backup (start date 02/17 22:00) was a FULL backup (type D) and took just over *2 days* to complete (02/17 22:00 to 02/19 23:39)!

https://s-media-cache-ak0.pinimg.com/736x/25/b2/82/25b2825a9b57026e1932ba48909699e9.jpg
The catch is the next row up – the regular FULL backup to the B: drive looks like it kicks off right when the Arcserve backup completes, at 23:39 – this is most likely because it was being blocked by the Arcserve backup (just like the current activity above) and when the Arcserve backup completes, the FULL backup to the B: drive begins.

This history shows this same pattern of the B: drive backup kicking off right when the Arcserve backup completes through the following days.

The SQL “regular” Agent job kicks off at 2am, so it is hanging as a blocked SPID each day until the Arcserve job completes and then running in a very reasonable amount of time – BobData is a 340GB database that backs up into a 36GB file (nice compression!) and as seen in the start/finish times above only takes 10-30 minutes to run, which by my experience is very good for a database this size.

From the DBA end we can’t speak to why the Arcserve backup takes such a long time – completely guessing I would assume it isn’t compressing and therefore is sending the full 340GB across the network pipe, but this still seems very slow.

The SPID’s for the Arcserve backups are waiting on “ASYNC_IO_COMPLETION” which means it is waiting on storage read/writes, in this case certainly *writes*.

--

There are two pieces to this for follow-up – one very important and another less so.

First and foremost – as a DBA I strongly recommend not having “competing” backups like this.  The backups should either run to disk (very *very* strongly preferred) or to a tool like Arcserve.  Running to two different tools leads to exactly the kind of problem seen here.

https://imgflip.com/i/1k9t97
Note that this is not related to Arcserve in particular - any third-party backup tool that uses a database agent to directly run backups will display this behavior.

A conversation showed that the Arcserve backups had been recently added and it was definitely a match to the problems.

The workaround to this is to back the databases up to one tool and then to *copy* those backup files to the other tool.  The best recommendation I have in these situation is always to run “regular” SQL Server Agent job backups to a file location (either a local or network drive) and then to have your third-party tool use file backup for the actual BAK/TRN files from that location *rather* than running the “database agent” on the third party tool to backup the database directly.

In this model the third party tool never directly touches SQL Server – in this client's environment you would run SQL Server Agent jobs similar to the current maintenance plan (although something better than maintenance plans, such as Ola Hallengren's scripts or maybe MinionBackup) to backup to actual disk (in thiscase the B: drive) and then use Arcserve to backup the files that have been written on the B: drive.

Another advantage to this backup model is it makes it easier for the DBAs managing your SQL Servers to access the backups for disaster recovery/data copy/etc.  If you backup to the third-party tool alone it usually requires some third party (usually a server admin or SAN admin) to perform a database restore – with regular SQL backups directly to disk, the DBA can perform that work by themselves which can greatly decrease recovery time in case of a disaster.

If you absolutely *must* backup directly to a tool like Arcserve, it needs to be made more reliable than what this client was seeing – wide swings of duration from 6 hours to 13 hours to 49 hours can’t be acceptable – and the bottleneck must be determined and resolved – there is no way it can be acceptable for a backup to take over 2 days on a 340GB database.  Fixing this requires work across your network, storage, and admin teams.

--

Second – as you can see from the limited number of rows in my history grid above, the current SQL Maintenance Plan is purging backup history, etc. that is older than 4 days


I understand a desire to have backup file cleanup of 4 days, as seen in the next item in the maintenance plan, as the actual backups can take up a lot of space:


However I strongly recommend editing the “Clean Up History” task the retain records for a much longer interval, such as 35-40 days.  There are often processes that happen monthly that require troubleshooting (for example, an end-of-month data load that slows the system down), and only keeping 30-31 days can often cause an issue where something happens on a Friday or Saturday but isn’t investigated until Monday (or Tuesday) – by the time the investigation begins, the 30th or 31st day of history has rolled off and the troubleshooting is hampered by lack of data...

http://i0.kym-cdn.com/entries/icons/original/000/019/630/ihnmotp.jpg

Maintenance/Backup history takes up very little space and can easily be retained for even several months (such as 90-100 days) without disk space issues.

--

At the end of the day watch out for duplicative backups - some times they are necessary, but in most cases you can meet your requirements to offshore your backups, send them to a DR site, etc. using a simple file copy of BAK and TRN files.

Also watch your history retentions closely - don't swing the other way - I can't count the number of servers I have found with huge msdb's because no cleanup is happening - but set it to a realistic number that won't hamper your troubleshooting.  Consider the frequency of your key processes; in almost all environments there are at least monthly considerations, but often there are also bi-monthly or quarterly concerns that would require you to keep up to 100 days of history for effective troubleshooting.

--

Hope this helps!









4 comments:

  1. In this case of competing backup processes, is the spid 134 (the blocker) doing a differential ?
    Curious, how did this all resolved itself over time ? Thank you for the article and your time to detail it out.

    ReplyDelete
    Replies
    1. Yes 134 is doing a DIFF - for the purposes of database backup concurrency a FULL and a DIFF are treated the same - you can only have one FULL or DIFF running at a time. (Remember this does *not* impact LOG backups, which *can* run during a FULL or DIFF).

      The client still has not decided what to do, so the collisions are ongoing. I made the recommendatiins outlined in this post and they are deciding :)

      Delete
  2. Thanks for the article. You're advocating taking all backups to drive storage and then using the 3rd party to take selective backups to its storage. So it's taking a SQL backup and just converting it into its own format. At this point it seems you're getting very little out of the tool, just that you have a an additional offsite backup.

    ReplyDelete
    Replies
    1. Exactly - I recommend taking native/standard SQL Server backups to disk and then using the tool to do file backups just like all of the other file backups it already does - don't use the custom "agent" that the vendors provide that functions separately from the SQL Server basics.

      Delete