One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Thursday, April 17, 2014

Why does my Failover Cluster Instance Keep Failing Back and Forth?

A few months ago I was paged by a client because their SQL Server Failover Cluster Instance (FCI) was experiencing multiple failovers when they performed routine Windows patching.

The client's request which generated the work ticket was this:

At some point recently this clustered node failed over from the A node to the B node however it seems to be the only instance that did as INSTANCE_NAME_2 still resides on COMPUTER_NAME. With that, we would like to open a ticket to investigate the cause/date/time of this failover.

This ticket was generated on 02/11, but there was another failover (with the file share “flapping” described below) on the morning of 02/12, immediately after the request. At the time it appeared that someone at the client had initiated it to move the service back from the B node to the A node (their chosen primary node), which did happen after some flapping.  There is no evidence of a server reboot or other error on 02/12, but someone did RDP to the server about ten minutes before the failover, so I assumed it was manually triggered.

--

The failover prior to 02/12 (the one the client was asking about in their ticket request) happened on 01/14, and it turned out that there were multiple failover events on 01/14.

The first relevant event on 01/14 was this:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    3:18:18 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

DOMAIN\LOGIN rebooted the A node at 3:18pm with the comment of “reboot.”

--

The next relevant event was this:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    5:50:26 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

At 5:50pm, INSTANCE_NAME was taken offline on the B node without a server restart or other error, which probably means it was user-initiated.  This resulted in 10+ minutes of file share “flapping” until the instance finally ended up on the B node:

--

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7036
Date:                     1/14/2014
Time:                    6:01:35 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service entered the running state.

--

At 10:50pm, the SQL Server cluster resource group was again failed over, possibly in preparation for the B node reboot that was about to happen:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    10:50:56 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

--

At 10:55pm, the same user rebooted the B node, again with the comment of “reboot”:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    10:55:24 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

--

This resulted in the SQL Server resource group trying to fail over to A, but due to the same file share “flapping” the group finally ended up back on B after B was back up from its reboot:

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7035
Date:                     1/14/2014
Time:                    11:05:56 PM
User:                     DOMAIN\SERVICE_ACCOUNT_LOGIN
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service was successfully sent a start control.

--

The issue behind all of this flapping was the configuration of a cluster file share resource:

 

The share is on the Y: drive (Y:\FOLDER_NAME) but as seen in the resource properties screenshot above there is no clustering dependency established to that drive.  On a file share cluster resource like this there should be a dependency on the relevant drive, like this example from the “SQL Server MSSQL Share INSTANCE_NAME” share:

 

Without a dependency on the drive resource, the file share resource tries to come online as soon as the resource group fails over.  Unfortunately, since the file share actually *does* depend on the drive being online (even without an established dependency relationship), the file share fails with the following errors if the drive isn’t online yet:

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:  File Share Resource
Event ID:              1068
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster file share resource SHARE_NAME failed to start with error 21.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    File Share Resource
Event ID:              1053
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster File Share SHARE_NAME cannot be brought online because the share could not be created.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    Failover Mgr
Event ID:              1069
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster resource 'SHARE_NAME' in Resource Group 'INSTANCE_NAME' failed.

--

The middle error is the most telling one – the file share cannot be brought online because the share could not be created – because the Y: drive isn’t online yet!
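
To make the timing problem concrete, here is a minimal sketch of the start-order logic, not how the Cluster Service is actually implemented. It assumes a resource is started as soon as all of its *declared* dependencies are online. The resource name "Disk Y:" and the timings are hypothetical values chosen for illustration.

```python
def started_at(resource, depends_on, seconds_to_online):
    """Time (seconds after the group moves) at which the cluster starts
    `resource`: as soon as every declared dependency is online, or at
    t=0 if it has no declared dependencies."""
    return max(
        (online_at(dep, depends_on, seconds_to_online)
         for dep in depends_on.get(resource, [])),
        default=0.0,
    )


def online_at(resource, depends_on, seconds_to_online):
    """Time at which `resource` finishes coming online."""
    return (started_at(resource, depends_on, seconds_to_online)
            + seconds_to_online[resource])


def share_can_start(depends_on, seconds_to_online,
                    share="SHARE_NAME", disk="Disk Y:"):
    """True if the disk is already online when the cluster starts the share.
    The share physically needs the disk whether or not that is declared."""
    return (started_at(share, depends_on, seconds_to_online)
            >= online_at(disk, depends_on, seconds_to_online))


# Hypothetical timings: the disk needs 5 s to come online, the share 1 s.
timings = {"Disk Y:": 5.0, "SHARE_NAME": 1.0}

print(share_can_start({}, timings))                           # False: no dependency declared
print(share_can_start({"SHARE_NAME": ["Disk Y:"]}, timings))  # True: share waits for the disk
```

With no declared dependency the share is started at t=0, before the disk is ready, which is the situation behind events 1068 and 1053.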

The dangerous catch here is that by default the file share will try three times to come online, and if it can’t (because the drive isn’t online yet) the failed resource fails the whole resource group, causing another failover of the SQL Server resource group even if the SQL Server resource itself comes online cleanly.  This results in the group “flapping” back and forth between the nodes until you get lucky and the Y: drive resource happens to come online before the SHARE_NAME resource makes it through its three retry attempts and kills the group again.
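
The flapping can be illustrated with a small simulation. This is a simplified model under stated assumptions: the share makes three start attempts at a fixed interval, and the group moves to the other node if none of them lands after the disk is online. The retry interval and the disk timings are hypothetical, and the real retry timing of the Cluster Service is not modeled.

```python
def simulate_flapping(first_target, disk_ready_after,
                      attempts=3, retry_interval=1.0,
                      share_depends_on_disk=False):
    """Return the nodes the group lands on, in order.

    first_target      -- node ("A" or "B") the group is first moved to
    disk_ready_after  -- seconds the disk needs to come online on each
                         successive landing (hypothetical values)
    If the list runs out before a start succeeds, the last entry in the
    returned path is a node whose outcome was not simulated.
    """
    other = {"A": "B", "B": "A"}
    path = [first_target]
    if share_depends_on_disk:
        # The share is not started until the disk is online, so it never
        # fails for this reason and the group stays where it was moved.
        return path
    node = first_target
    for ready in disk_ready_after:
        attempt_times = [i * retry_interval for i in range(attempts)]
        if any(t >= ready for t in attempt_times):
            return path  # one attempt came after the disk was online
        node = other[node]  # all attempts failed: the group fails over
        path.append(node)
    return path


# Disk slow twice (5 s), then fast enough (1.5 s) for the third attempt.
print(simulate_flapping("B", [5.0, 5.0, 1.5]))         # ['B', 'A', 'B']
print(simulate_flapping("B", [5.0, 5.0, 1.5],
                        share_depends_on_disk=True))  # ['B']
```

In this model the node the group finally settles on depends only on when the disk happens to win the race, not on where the administrator intended the group to go.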

The missing dependency is not the cause of the initial failover event (in their case the initial failover was intentional, as part of the Windows patching), but it is the reason every SQL Server cluster failover turns into multiple failovers, as seen in this SQL Server error log stack:


As can be seen here, on 01/14/2014 there was a failover, and it resulted in multiple failovers (each SQL Error Log is a new start of the MSSQLServer service) over the course of several minutes as the SHARE_NAME resource tried to come online and failed.  This happened again on 02/12.  This can be seen in the Windows System Log as well (shown filtered below):


Events 1068 and 1053 are the relevant events, and as you can see on 01/14 from 5:51pm-5:58pm there were multiple failovers with the file share failing, and then again that same night between 10:59pm and 11:02pm, and then the most recent event on 02/12 between 7:15am and 7:16am.
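
If the System Log has been exported as text in the same layout as the entries quoted in this post, the relevant events can also be pulled out with a short script. This is a rough sketch that assumes entries are separated by `--` lines and carry `Event ID:`, `Date:`, and `Time:` fields. The sample text is abbreviated from the 02/12 entries shown earlier, and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Matches the three fields we need, one per line, in the layout shown above.
FIELD = re.compile(r"^(Event ID|Date|Time):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def share_failures(log_text, event_ids=(1068, 1053)):
    """Return sorted (timestamp, event_id) pairs for the given event IDs."""
    hits = []
    for block in re.split(r"^--\s*$", log_text, flags=re.MULTILINE):
        fields = dict(FIELD.findall(block))
        if not {"Event ID", "Date", "Time"} <= fields.keys():
            continue  # not a complete event entry
        event_id = int(fields["Event ID"])
        if event_id in event_ids:
            when = datetime.strptime(
                f'{fields["Date"]} {fields["Time"]}', "%m/%d/%Y %I:%M:%S %p"
            )
            hits.append((when, event_id))
    return sorted(hits)


SAMPLE = """\
Event Source:    ClusSvc
Event ID:              1068
Date:                     2/12/2014
Time:                    7:16:37 AM
--
Event Source:    ClusSvc
Event ID:              1053
Date:                     2/12/2014
Time:                    7:16:37 AM
--
Event Source:    ClusSvc
Event ID:              1069
Date:                     2/12/2014
Time:                    7:16:37 AM
"""

for when, event_id in share_failures(SAMPLE):
    print(when, event_id)
```

Clusters of these timestamps a few minutes apart are the signature of the flapping described above.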

There are configuration options on the SHARE_NAME file share cluster object to change the number of retries or to make it so it doesn’t fail the group when its object fails, but the most correct fix is to add the dependency on the Y: drive to the file share object.

--


The ultimate answer to why the INSTANCE_NAME instance ended up on the B node is the file share flapping. DOMAIN\LOGIN was performing regular server maintenance (Windows patches): mid-afternoon on 01/14 they tried to fail everything from A to B and then reboot A, and later in the evening tried to fail everything from B back to A and reboot B (which is how I would patch a cluster like this).

The catch is that the file share flapping caused many failover events each time, and left the cluster instance on the “wrong” node (after the last planned failover everything would normally be on A, but because of the flapping A>B>A>B>A>B it actually ended up back on B).

--
 
The ultimate answer to prevent this "flapping" is to add a dependency on the drive resource to the file share cluster resource.

**This issue should be corrected as soon as this situation is discovered even though it will require a file share downtime, which may require a SQL Server downtime depending on what the SHARE_NAME share is used for.**
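
The dependency can be added in the cluster administration GUI. For a scripted change, the sketch below only builds the `cluster.exe` command lines and does not run them. It assumes the share has to be taken offline for the change (the downtime mentioned above) and that the disk resource is named "Disk Y:", which is a hypothetical name; check the real resource name in your cluster first. On newer Windows Server versions the `Add-ClusterResourceDependency` PowerShell cmdlet does the same job.

```python
def add_dependency_commands(resource, provider):
    """Build cluster.exe invocations that take `resource` offline, make it
    depend on `provider`, and bring it back online.

    Returned as argument lists and NOT executed; review them, then run
    them by hand or pass each list to subprocess.run on a cluster node.
    """
    return [
        ["cluster", "resource", resource, "/offline"],
        ["cluster", "resource", resource, f"/adddependency:{provider}"],
        ["cluster", "resource", resource, "/online"],
    ]


# "Disk Y:" is a hypothetical disk resource name used for illustration.
for command in add_dependency_commands("SHARE_NAME", "Disk Y:"):
    print(" ".join(command))
```

Printing the commands first makes it easy to review them before a maintenance window.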
 
While this example is about a file share resource, the same concept (and resolution) applies to any clustered resource that relies on something else, for example a SQL Server instance cluster resource that is missing its dependency on the Network Name.

Hope this helps!

1 comment:

  1. Configuring "Maximum failures in the specified period" is a good way to prevent flapping too.
    http://technet.microsoft.com/en-us/library/cc755151.aspx
