Does this title seem strange? Of course we know this is the true answer:
But here is a story of a client that may have too many backups after all.
--
Our monitoring system found that the backups on a client server had suddenly started running very long. When I signed on to look I saw that the nightly backups were still running even the next morning! These were the three currently running processes related to backups:
https://cdn.meme.am/cache/instances/folder633/58097633.jpg |
But here is a story of a client that may have too many backups after all.
--
Our monitoring system found that the backups on a client server had suddenly started running very long. When I signed on to look I saw that the nightly backups were still running even the next morning! These were the three currently running processes related to backups:
--
LINE
|
dd hh:mm:ss.mss
|
session_id
|
sql_text
|
login_name
|
1
|
00 15:24:36.674
|
134
|
backup database [BobData] to virtual_device =
N'CA_BAB_MSSQL_a786700_10002_4c59b51f_BobData' with differential, blocksize =
65536, buffercount = 1, maxtransfersize = 2097152
|
DOMAIN1\sa_arcbackup
|
2
|
00 11:27:32.043
|
165
|
BACKUP DATABASE [BobData] TO DISK = N'B:\SQL_BAC\BobData\BobData_backup_2017_02_23_020011_9582786.bak'
WITH RETAINDAYS = 10, NOFORMAT, NOINIT, NAME = N'BobData_backup_2017_02_23_020011_9582786',
SKIP, REWIND, NOUNLOAD, COMPRESSION, STATS = 10
|
DOMAIN1\sqlcluster
|
3
|
00 02:58:20.540
|
84
|
backup database [FredData] to virtual_device =
N'CA_BAB_MSSQL_a786080_10001_4f06c827_FredData' with blocksize = 65536,
buffercount = 1, maxtransfersize = 2097152
|
DOMAIN1\sa_arcbackup
|
LINE
|
wait_info
|
CPU
|
tempdb_allocations
|
tempdb_current
|
blocking_session_id
|
1
|
(55475564ms)ASYNC_IO_COMPLETION
|
236
|
0
|
0
|
NULL
|
2
|
(41252214ms)LCK_M_U
|
0
|
0
|
0
|
134
|
3
|
(10699785ms)ASYNC_IO_COMPLETION
|
357
|
0
|
0
|
NULL
|
LINE
|
percent_complete
|
host_name
|
database_name
|
program_name
|
start_time
|
login_time
|
1
|
79.8087
|
Server5
|
BobData
|
Arcserve Backup
|
2/22/2017 22:02
|
2/22/2017 22:02
|
2
|
NULL
|
Server1
|
BobData
|
Microsoft SQL Server Management Studio
|
2/23/2017 2:00
|
2/23/2017 2:00
|
3
|
18.4672
|
Server5
|
FredData
|
Arcserve Backup
|
2/23/2017 10:30
|
2/23/2017 1:05
|
The middle process (Line 2 – session ID/SPID 165) is a
regular BACKUP DATABASE statement that had been running since 2am local server
time. As you can see from the wait info, the entire life of this SPID has
been spent waiting for an update lock (LCK_M_U). Also notice that the
Percent Complete is still NULL because it is being blocked by the first SPID
(134) and hasn’t done anything.
The first and third SPID’s (134 and 84) are Arcserve backups
of the BobData and FredData databases respectively. Looking in the backup
history in msdb for the BobData database I see this:
database_name
|
physical_device_name
|
type
|
backup_start_date
|
backup_finish_date
|
BobData
|
B:\SQL_BAC\BobData\BobData_backup_2017_02_22_020015_7258925.bak
|
D
|
02/22/2017 04:29:47
|
02/22/2017 04:40:41
|
BobData
|
CA_BAB_MSSQL_8ea33d0_10002_4732d2d8_BobData
|
I
|
02/21/2017 22:01:34
|
02/22/2017 04:29:44
|
BobData
|
B:\SQL_BAC\BobData\BobData_backup_2017_02_21_020010_3116691.bak
|
D
|
02/21/2017 11:54:41
|
02/21/2017 12:05:22
|
BobData
|
CA_BAB_MSSQL_8c6ff40_10002_420ce264_BobData
|
I
|
02/20/2017 22:02:02
|
02/21/2017 11:54:39
|
BobData
|
B:\SQL_BAC\BobData\BobData_backup_2017_02_18_020007_3945711.bak
|
D
|
02/19/2017 23:39:43
|
02/19/2017 23:50:38
|
BobData
|
CA_BAB_MSSQL_8d4b360_10001_329872eb_BobData
|
D
|
02/17/2017 22:00:34
|
02/19/2017 23:39:41
|
The Arcserve backup was kicking off around 10pm local server time, and taking a widely variable amount of time to complete – the first backup (start date 02/17 22:00) was a FULL backup (type D) and took just over *2 days* to complete (02/17 22:00 to 02/19 23:39)!
https://s-media-cache-ak0.pinimg.com/736x/25/b2/82/25b2825a9b57026e1932ba48909699e9.jpg |
This history shows this same pattern of the B: drive backup kicking
off right when the Arcserve backup completes through the following days.
The SQL “regular” Agent job kicks off at 2am, so it is
hanging as a blocked SPID each day until the Arcserve job completes and then
running in a very reasonable amount of time – BobData is a 340GB database that
backs up into a 36GB file (nice compression!) and as seen in the start/finish
times above only takes 10-30 minutes to run, which by my experience is very
good for a database this size.
From the DBA end we can’t speak to why the Arcserve backup
takes such a long time – completely guessing I would assume it isn’t
compressing and therefore is sending the full 340GB across the network pipe,
but this still seems very slow.
The SPID’s for the Arcserve backups are waiting on
“ASYNC_IO_COMPLETION” which means it is waiting on storage read/writes, in this
case certainly *writes*.
--
There are two pieces to this for follow-up – one very
important and another less so.
First and foremost – as a DBA I strongly recommend not
having “competing” backups like this. The backups should either run to
disk (very *very* strongly preferred) or to a tool like Arcserve.
Running to two different tools leads to exactly the kind of problem seen here.
Note that this is not related to Arcserve in particular - any third-party backup tool that uses a database agent to directly run backups will display this behavior.
A conversation showed that the Arcserve backups had been recently added and it was definitely a match to the problems.
https://imgflip.com/i/1k9t97 |
A conversation showed that the Arcserve backups had been recently added and it was definitely a match to the problems.
The workaround to this is to back the databases up to one
tool and then to *copy* those backup files to the other tool. The
best recommendation I have in these situation is always to run “regular” SQL
Server Agent job backups to a file location (either a local or network drive)
and then to have your third-party tool use file backup for the actual BAK/TRN
files from that location *rather* than running the “database agent” on
the third party tool to backup the database directly.
In this model the third party tool never directly touches SQL Server –
in this client's environment you would run SQL Server Agent jobs similar to the current
maintenance plan (although something better than maintenance plans, such as Ola Hallengren's scripts or maybe MinionBackup) to backup to actual disk (in thiscase the B: drive) and then use Arcserve to backup the files that have been written on the B: drive.
Another advantage to this backup model is it makes it easier
for the DBAs managing your SQL Servers to access the backups for disaster
recovery/data copy/etc. If you backup to the third-party tool alone it usually
requires some third party (usually a server admin or SAN admin) to perform a
database restore – with regular SQL backups directly to disk, the DBA can perform that
work by themselves which can greatly decrease recovery time in case of a
disaster.
If you absolutely *must* backup directly to a tool like Arcserve,
it needs to be made more reliable than what this client was seeing – wide swings of duration from 6 hours to 13
hours to 49 hours can’t be acceptable – and the bottleneck must be determined
and resolved – there is no way it can be acceptable for a backup to take over 2
days on a 340GB database. Fixing this requires work across your network, storage, and admin teams.
--
Second – as you can see from the limited number of rows in
my history grid above, the current SQL Maintenance Plan is purging backup
history, etc. that is older than 4 days
However I strongly recommend editing the “Clean Up History” task the
retain records for a much longer interval, such as 35-40 days. There are
often processes that happen monthly that require troubleshooting (for example,
an end-of-month data load that slows the system down), and only keeping 30-31
days can often cause an issue where something happens on a Friday or Saturday
but isn’t investigated until Monday (or Tuesday) – by the time the
investigation begins, the 30th or 31st day of history has
rolled off and the troubleshooting is hampered by lack of data...
http://i0.kym-cdn.com/entries/icons/original/000/019/630/ihnmotp.jpg |
Maintenance/Backup history takes up very little space and
can easily be retained for even several months (such as 90-100 days) without
disk space issues.
--
At the end of the day watch out for duplicative backups - some times they are necessary, but in most cases you can meet your requirements to offshore your backups, send them to a DR site, etc. using a simple file copy of BAK and TRN files.
Also watch your history retentions closely - don't swing the other way - I can't count the number of servers I have found with huge msdb's because no cleanup is happening - but set it to a realistic number that won't hamper your troubleshooting. Consider the frequency of your key processes; in almost all environments there are at least monthly considerations, but often there are also bi-monthly or quarterly concerns that would require you to keep up to 100 days of history for effective troubleshooting.
--
Hope this helps!
Also watch your history retentions closely - don't swing the other way - I can't count the number of servers I have found with huge msdb's because no cleanup is happening - but set it to a realistic number that won't hamper your troubleshooting. Consider the frequency of your key processes; in almost all environments there are at least monthly considerations, but often there are also bi-monthly or quarterly concerns that would require you to keep up to 100 days of history for effective troubleshooting.
--
Hope this helps!
In this case of competing backup processes, is the spid 134 (the blocker) doing a differential ?
ReplyDeleteCurious, how did this all resolved itself over time ? Thank you for the article and your time to detail it out.
Yes 134 is doing a DIFF - for the purposes of database backup concurrency a FULL and a DIFF are treated the same - you can only have one FULL or DIFF running at a time. (Remember this does *not* impact LOG backups, which *can* run during a FULL or DIFF).
DeleteThe client still has not decided what to do, so the collisions are ongoing. I made the recommendatiins outlined in this post and they are deciding :)
Thanks for the article. You're advocating taking all backups to drive storage and then using the 3rd party to take selective backups to its storage. So it's taking a SQL backup and just converting it into its own format. At this point it seems you're getting very little out of the tool, just that you have a an additional offsite backup.
ReplyDeleteExactly - I recommend taking native/standard SQL Server backups to disk and then using the tool to do file backups just like all of the other file backups it already does - don't use the custom "agent" that the vendors provide that functions separately from the SQL Server basics.
Delete