Nebraska SQL from @DBA_ANDY: Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

https://cdn.meme.am/cache/instances/folder984/400x/59300984.jpg

There were several different backup and status check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after a failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

Log Name: System
Source: User32
Date: 7/22/2017 8:31:08 AM
Event ID: 1074
Task Category: None
Level: Information
Keywords: Classic
User: SYSTEM
Computer: NODE01.COMPANY1.com
Description:
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine

Always fun when an application “unexpectedly” restarts a server without advance warning.

https://i.imgflip.com/w6x8g.jpg

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases” the availability group databases were not being backed up currently and hadn’t been since the failover:

NODE01:

NODE02:

The error messages on the SQL Error Log on the 01 node are telling:

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source Backup
Message Error: 3041, Severity: 16, State: 1.

--

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source Backup
Message BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.

--

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source spid269
Message FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.

--

Date 7/24/2017 1:09:24 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source spid258
Message Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server® 8.2.1.42
© 2016 Dell Inc.
Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.

** IMPORTANT ** - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to determine whether a differential is sufficient or a full backup is required each day), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):

https://sqlsailor.com/2012/05/15/backups-on-secondary-replicas-always-on-availability-groups/

The particular issue here is the fact that the AG’s aren’t weighted evenly, so even though it is “invalid” for the differentials, the AG still prefers NODE01 because it is more heavily weighted:

My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.

The cost to this is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups on NODE01 then requiring the backups to run on the primary node would not be different in most situations.

Note this is the state for this particular AG - in many cases part of the use of AG's is being able to offload backups to the secondary, so in many cases this is a cost to be weighed before making this change.

If you want to run backups on the secondary node then my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression – if you want/need to stay with differentials, we could use a scripted backup solution (like Ola Hallengren's Maintenance Solution) to enable/disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to ignore databases that aren’t the AG primary.

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

SELECT AG.name AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName,
HARS.role_desc as LocalReplicaRoleDesc,
DRCS.database_name AS DatabaseName,
HDRS.synchronization_state_desc as SynchronizationStateDesc,
HDRS.is_suspended AS IsSuspended,
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY AG.name, DRCS.database_name

Another option would be to go away from differentials completely and just run FULL’s and LOG’s which would allow for the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.)

Hope this helps!

Monday, July 24, 2017

Sudden Differential Backup Failures after an Availability Group Failover

No comments:

Post a Comment