A few months ago I was paged by a client because their SQL
Server Failover Cluster Instance (FCI) was experiencing multiple failovers when
they performed routine Windows patching.
The client's request which generated the work ticket was this:
At some point recently this clustered node failed
over from the A node to the B node however it seems to be the only instance that
did as INSTANCE_NAME_2 still resides on COMPUTER_NAME. With that, we would like to
open a ticket to investigate the cause/date/time of this failover.
This ticket was generated on 02/11, but then there was a failover (with the file share “flapping” described below) on the morning of 02/12, immediately after the request. It appeared at the time that this was initiated by someone at the client trying to move the service back from the B node to the A node (their chosen primary node), which did happen after some flapping.
There is no evidence of a server reboot or other error on 02/12, but someone did RDP to the server about ten minutes before the failover, so I assumed it was manually triggered.
--
The failover prior to 02/12 (the one the
client was asking about in their ticket request) happened on 01/14, and
it turned out that there were multiple failover events on 01/14.
The first relevant event on 01/14 was this:
Event Type: Information
Event Source: USER32
Event Category: None
Event ID: 1074
Date: 1/14/2014
Time: 3:18:18 PM
User: DOMAIN\LOGIN
Computer: COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot
DOMAIN\LOGIN rebooted the A node at 3:18pm with the comment
of “reboot.”
--
The next relevant event was this:
Event Type: Information
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1203
Date: 1/14/2014
Time: 5:50:26 PM
User: N/A
Computer: COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".
At 5:50pm, INSTANCE_NAME was offlined on the B node, without
a server restart or other error, which probably means it was
user-initiated. This resulted in 10+ minutes of file share “flapping” until
the instance finally ended up on the B node:
--
Event Type: Information
Event Source: Service Control Manager
Event Category: None
Event ID: 7036
Date: 1/14/2014
Time: 6:01:35 PM
User: N/A
Computer: COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service entered the running state.
--
At 10:50pm, the SQL Server cluster resource group was again
failed over, possibly in preparation for the B node reboot that was about to
happen:
Event Type: Information
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1203
Date: 1/14/2014
Time: 10:50:56 PM
User: N/A
Computer: COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".
--
At 10:55pm, the same user rebooted the B node, again with
the comment of “reboot”:
Event Type: Information
Event Source: USER32
Event Category: None
Event ID: 1074
Date: 1/14/2014
Time: 10:55:24 PM
User: DOMAIN\LOGIN
Computer: COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot
--
This resulted in the SQL Server resource group trying to
fail over to A, but due to the same file share “flapping” the group finally ended up
back on B after B was back up from its reboot:
Event Type: Information
Event Source: Service Control Manager
Event Category: None
Event ID: 7035
Date: 1/14/2014
Time: 11:05:56 PM
User: DOMAIN\SERVICE_ACCOUNT_LOGIN
Computer: COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service was successfully sent a start control.
--
The underlying problem in their situation was a cluster file share resource:
The share is on the Y: drive (Y:\FOLDER_NAME), but as seen in the resource properties screenshot above, there is no cluster dependency established on that drive. A file share cluster resource like this should have a dependency on the relevant drive, like this example from the “SQL Server MSSQL Share INSTANCE_NAME” share:
Without a dependency on the drive resource, the file share
resource tries to come online as soon as the resource group fails over.
Unfortunately, since the file share actually *does* depend on the drive being
online (even without an established dependency relationship), the file share
fails with the following errors if the drive isn’t online yet:
--
--
Event Type: Error
Event Source: ClusSvc
Event Category: File Share Resource
Event ID: 1068
Date: 2/12/2014
Time: 7:16:37 AM
User: N/A
Computer: COMPUTER_NAME
Description:
Cluster file share resource SHARE_NAME failed to start with error 21.
--
Event Type: Error
Event Source: ClusSvc
Event Category: File Share Resource
Event ID: 1053
Date: 2/12/2014
Time: 7:16:37 AM
User: N/A
Computer: COMPUTER_NAME
Description:
Cluster File Share SHARE_NAME cannot be brought online because the share could not be created.
--
Event Type: Error
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1069
Date: 2/12/2014
Time: 7:16:37 AM
User: N/A
Computer: COMPUTER_NAME
Description:
Cluster resource 'SHARE_NAME' in Resource Group 'INSTANCE_NAME' failed.
--
The middle error is the most telling one – the file share
cannot be brought online because the share could not be created – because the
Y: drive isn’t online yet!
The dangerous catch here is that by default the file share will try three times to come online, and if it can’t (because the drive isn’t online yet), the resource group fails, causing another SQL Server resource group failover even if the SQL Server resource itself comes online cleanly. This results in the group “flapping” back and forth between the nodes until you get lucky and the Y: drive resource happens to come online before the SHARE_NAME resource exhausts its three attempts and kills the group again.
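The race between the share's attempts and the drive coming online can be illustrated with a toy model. This is only a sketch in Python, not the cluster service's actual logic: the function name, the retry interval, and the disk delay values are hypothetical, and only the three-attempt default comes from the behavior described above.

```python
def simulate_flapping(disk_online_delays, retries=3, retry_interval=1.0,
                      start_node="A"):
    """Toy model of a resource group whose file share has no disk dependency.

    disk_online_delays: hypothetical number of seconds the disk needs to come
    online each time the group lands on a node. The share starts trying
    immediately, because nothing tells it to wait for the disk.
    Returns (final_node, nodes_visited).
    """
    other = {"A": "B", "B": "A"}
    node = start_node
    visited = [node]
    for delay in disk_online_delays:
        # The share attempts to come online at t = 0, interval, 2 * interval, ...
        attempt_times = [i * retry_interval for i in range(retries)]
        if any(t >= delay for t in attempt_times):
            # At least one attempt ran after the disk was ready: the group stays.
            return node, visited
        # Every attempt ran before the disk was ready: the group fails over.
        node = other[node]
        visited.append(node)
    return node, visited


# Hypothetical delays: the disk loses the race twice, then wins.
final_node, path = simulate_flapping([5.0, 5.0, 1.5])
print(final_node, path)
```

In this model, where the group finally settles depends on how many times the disk loses the race, which is why the final node after a flapping episode is effectively a matter of luck.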
This may not seem important, and it is not the cause of the initial failover event (in their case the initial failover was intentional, as part of the Windows patching), but it is the reason why every SQL Server cluster failover event turns into multiple failovers, as seen in this SQL Server error log stack:
As can be seen here, on 01/14/2014 there was a failover, and
it resulted in multiple failovers (each SQL Error Log is a new start of the
MSSQLServer service) over the course of several minutes as the SHARE_NAME resource tried to come
online and failed. This happened again on 02/12. This can be seen
in the Windows System Log as well (shown filtered below):
Events 1068 and 1053 are the relevant events, and as you can
see on 01/14 from 5:51pm-5:58pm there were multiple failovers with the file share
failing, and then again that same night between 10:59pm and 11:02pm, and then
the most recent event on 02/12 between 7:15am and 7:16am.
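If the System Log has been exported, this kind of filtering can also be scripted. The following is a minimal sketch in Python, assuming the events have already been parsed into (timestamp, event ID) tuples; that record layout, the function name, and the 15-minute grouping gap are all assumptions made for illustration. The sample records are the three 02/12 events quoted earlier in this post.

```python
from datetime import datetime, timedelta

# Event IDs logged when the file share resource fails to come online.
SHARE_FAILURE_IDS = {1068, 1053}


def flapping_windows(events, max_gap=timedelta(minutes=15)):
    """Group file share failure events into (first_seen, last_seen) windows.

    events: iterable of (timestamp, event_id) tuples (a hypothetical layout).
    Failures separated by more than max_gap start a new window.
    """
    times = sorted(ts for ts, event_id in events
                   if event_id in SHARE_FAILURE_IDS)
    windows = []
    for ts in times:
        if windows and ts - windows[-1][1] <= max_gap:
            # Close enough to the previous failure: extend the current window.
            windows[-1] = (windows[-1][0], ts)
        else:
            windows.append((ts, ts))
    return windows


# The three 02/12 events quoted earlier; 1069 is filtered out.
sample = [
    (datetime(2014, 2, 12, 7, 16, 37), 1068),
    (datetime(2014, 2, 12, 7, 16, 37), 1053),
    (datetime(2014, 2, 12, 7, 16, 37), 1069),
]
for first, last in flapping_windows(sample):
    print(first, "to", last)
```

Each window that spans more than a single failure is a candidate flapping episode worth checking against the SQL Server error logs.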
There are configuration options on the SHARE_NAME file share
cluster object to change the number of retries or to make it so it doesn’t fail
the group when its object fails, but the most correct fix is to add the
dependency on the Y: drive to the file share object.
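To see why the dependency is the cleaner fix, it helps to think of cluster dependencies as an ordering constraint: a resource is not brought online until everything it depends on is online. The sketch below models that idea with Python's standard `graphlib` module (Python 3.9 or later). It is an illustration of dependency ordering in general, not of how the cluster service is implemented, and the "Disk Y:" resource name and the function name are hypothetical labels.

```python
from graphlib import TopologicalSorter


def online_order(dependencies):
    """Return an order in which resources can be brought online.

    dependencies maps each resource name to the set of resources it
    depends on; dependencies always come before their dependents.
    """
    return list(TopologicalSorter(dependencies).static_order())


# With the dependency in place, the disk must come online before the share.
fixed = {
    "Disk Y:": set(),
    "SHARE_NAME": {"Disk Y:"},
}
order = online_order(fixed)
print(order)
```

Without the `"SHARE_NAME": {"Disk Y:"}` edge nothing constrains the order, which corresponds to the share racing the disk every time the group moves.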
--
The ultimate answer to why the INSTANCE_NAME instance ended up on the B node is the file share flapping. DOMAIN\LOGIN was performing regular server maintenance (Windows patches) and mid-afternoon on 01/14 tried to fail everything from A to B and then reboot A, and then later in the evening tried to fail everything from B back to A and reboot B (and this is how I would patch a cluster like this).
The catch is that the file share flapping caused many, many
failover events to occur each time, and resulted in the cluster instance ending
up on the “wrong” node (after a failover everything would normally be on A,
but due to the flapping A>B>A>B>A>B it actually ended up back on
B).
--
The ultimate fix to prevent this "flapping" is to add a dependency on the drive object to the file share cluster resource.
**This issue should be corrected as soon as this situation is discovered even though it
will require a file share downtime, which may require a SQL Server downtime
depending on what the SHARE_NAME share is used for.**
While this example is about a file share resource, the same concept (and resolution) is valid for any clustered resource that relies on something else - for example, if your SQL Server instance cluster resource is missing its cluster dependency on the Network Name.
Hope this helps!
Configuring "Maximum failures in the specified period" is a good way to prevent flapping too.
http://technet.microsoft.com/en-us/library/cc755151.aspx