--
'Alert: SQL Failed Jobs on InstanceName is Failed'
Job name: LSRestore_InstanceName_DatabaseName
Date & Message: 2014-07-31 05:24:00.0 The job failed. The Job was invoked by User DOMAIN\LOGIN. The last step to run was step 1 (Log shipping restore log job step.).
--
Sigh.
I am not a fan of log shipping but I do acknowledge that it has its place. In this case the client is using it to have a read-only standby database on a Standard Edition SQL Server (hence no mirroring+snapshot or Availability Group with readable secondary). The database in question is large (>1TB) but with relatively small daily churn (~5GB of LOG backups per day). Due to the traffic on the standby database during the day they have their log shipping on this database configured to only run restores during the overnight, keeping the standby database static and available for their users during the business day.
By the time I received the page, the LSRestore job has been failing for much of the overnight, with service desk'ers trying various things to try to resolve the issue.
When I signed on to the system I drilled into the job step history and found the following:
--
Microsoft (R) SQL Server Log
Shipping Agent [Assembly Version =
10.0.0.0, File Version = 10.50.1600.1 ((KJ_RTM).100402-1539 )] Microsoft Corporation. All rights
reserved. 2014-07-31 03:45:00.32 -----
START OF TRANSACTION LOG RESTORE -----
2014-07-31 03:45:00.39 Starting
transaction log restore. Secondary ID:
'8f45136f-fb73-4815-b849-c7c1b391831b'
2014-07-31 03:45:00.39 Retrieving restore settings. Secondary ID:
'8f45136f-fb73-4815-b849-c7c1b391831b'
2014-07-31 03:45:00.41 Retrieved common restore settings. Primary
Server: 'InstanceName', Primary Database: 'DatabaseName', Backup Destination Directory:
'H:\LogShip\DatabaseName', File Retention Period: 1440 minute(s) 2014-07-31 03:45:00.42 Retrieved database
restore settings. Secondary Database: 'DatabaseName', Restore Delay: 10, Restore All:
True, Restore Mode: Standby, Disconnect Users: False, Last Restored File:
H:\LogShip\DatabaseName\DatabaseName_20140730071500.trn, Block Size: Not Specified, Buffer
Count: Not Specified, Max Transfer Size: Not Specified
2014-07-31 03:45:20.77 ***
Error: Could not apply log backup file 'H:\LogShip\DatabaseName\DatabaseName_20140730073000.trn'
to secondary database 'DatabaseName'.(Microsoft.SqlServer.Management.LogShipping)
*** 2014-07-31 03:45:20.77 *** Error: Exclusive access could not be obtained because
the database is in use. RESTORE LOG is
terminating abnormally.(.Net SqlClient Data Provider) ***
2014-07-31 03:45:21.05 ***
Error: The log backup file 'H:\LogShip\DatabaseName\DatabaseName_20140730073000.trn' was
verified but could not be applied to secondary database
'DatabaseName'.(Microsoft.SqlServer.Management.LogShipping) *** 2014-07-31 03:45:21.05 Deleting old log
backup files. Primary Database: 'DatabaseName'
2014-07-31 03:45:21.06 The restore operation completed with errors.
Secondary ID: '8f45136f-fb73-4815-b849-c7c1b391831b' 2014-07-31 03:45:21.06 ----- END OF
TRANSACTION LOG RESTORE -----
Exit Status: 1 (Error)
--
Seeing the "Exclusive access could not be obtained because the database is in use" error (buried in the middle of the messages), I went looking for something connected to the DatabaseName database. Sure enough there it was:
spid
|
blocked
|
dbid
|
uid
|
login_time
|
last_batch
|
status
|
hostname
|
69
|
0
|
10
|
1
|
7/30/2014 11:07
|
7/30/2014 11:07
|
sleeping
|
ServerName
|
|
|||||||
program_name
|
hostprocess
|
cmd
|
loginame
|
stmt_start
|
stmt_end
|
request_id
|
|
Microsoft SQL Server Management Studio - Query
|
5404
|
AWAITING COMMAND
|
Login1
|
0
|
0
|
0
|
Basically someone
opened a Management Studio window on InstanceName as loginname Login1 at 11:07am the previous day, ran one or two queries (last_batch only one second after the
login_time), and then left the window open, maintaining a connection to the
DatabaseName database.
Log Shipping requires
exclusive access (as noted in the error) to apply the log backups, so this
single Management Studio connection was effectively breaking log shipping.
Because the SPID was
"Awaiting Command" with a Last_Batch of some time ago I went ahead
and killed SPID 69 and manually started the "LSRestore_ServerName_DatabaseName" job. The job completed successfully, restoring the
last day's worth of logs in about ten minutes.
--
How could this be prevented? What was configured incorrectly?
When setting up a log shipping secondary as a read-only standby, there is an option to disconnect any existing connections before attempting log shipping restores. This option is available whether you use the Management Studio GUI or T-SQL commands.
In Management Studio, the option is "Disconnect users in the database before restoring backups." This option is on the "Restore Transaction Log" tab on the "Secondary Database Settings" window after you select to "Add" a secondary database. The option is only exposed if you select Standby mode - there is no need for the option in No Recovery mode since users won't be able to connect to the secondary database anyway:
Via T-SQL, the option is a parameter of the "sp_add_log_shipping_secondary_database" stored procedure. Set "@disconnect_users = 1" to enable the disconnect functionality.
IMPORTANT NOTE - as seen in the screenshot above, this option is not enabled by default in Management Studio, and it is also not enabled by default by the stored procedure (default value is 0). If you wish to use this functionality you need to go "out of your way" to turn it on. Be aware of the impact of this option *before* turning it on. If use pattern/business rules of your standby database is such that queries against the standby are more important than successful restores, do *not* enable this option as it will terminate your queries with extreme prejudice.
http://sd.keepcalm-o-matic.co.uk/i/keep-calm-and-terminate-with-extreme-prejudice.png |
Hope this helps!