The TL;DR - beware of Failover Cluster Manager trying to steal your non-shared storage!
--
At a client recently, two availability groups on a single Windows cluster went down simultaneously. Apparently the server that was the primary for the AGs (Server1) had mysteriously lost its DATA and LOG drives. By the time the client got us involved, they had faked the application into coming up by pointing it at the single SQL Server instance that was still up (Server2) via the instance name rather than the availability group listeners.
I found that two of the drives on Server1 had gone offline, causing the issues – sample errors from the Windows System and Application Logs respectively:
--
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 3/31/2015 2:36:25 PM
Event ID: 1635
Task Category: Resource Control Manager
Level: Information
Keywords:
User: SYSTEM
Computer: server1.mydomain.com
Description:
Cluster resource 'Cluster Disk 2' of type 'Physical Disk' in
clustered role 'Available Storage' failed.
--
Log Name: Application
Source: MSSQLSERVER
Date: 3/31/2015 2:36:25 PM
Event ID: 9001
Task Category: Server
Level: Error
Keywords: Classic
User: N/A
Computer: server1.mydomain.com
Description:
The log for database 'Database1' is not available.
Check the event log for related error messages. Resolve any errors and restart
the database.
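As an aside, if you want to pull these events from both servers without clicking through Event Viewer, something roughly like this does the trick (just a sketch; the server names and start time are placeholders from this example):

# Sketch: pull the clustering and SQL Server events from both replicas
# around the time of the outage - adjust server names and the time window
$start = Get-Date '3/31/2015 2:00 PM'
foreach ($server in 'Server1','Server2') {
    # Failover Clustering events from the System log (like event 1635 above)
    Get-WinEvent -ComputerName $server -FilterHashtable @{
        LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'; StartTime = $start
    } | Select-Object MachineName, TimeCreated, Id, Message

    # SQL Server events from the Application log (like error 9001 above)
    Get-WinEvent -ComputerName $server -FilterHashtable @{
        LogName = 'Application'; ProviderName = 'MSSQLSERVER'; StartTime = $start
    } | Select-Object MachineName, TimeCreated, Id, Message
}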
--
Since this is an Availability Group (AG) setup, I was surprised that there were "Cluster Disk" resources at all – AGs do not rely on shared disk (it is one of their many advantages) and most AG clusters don't have any shared disk at all (occasionally a quorum drive).
This is what I saw in Failover Cluster Manager: Cluster Disk 1 was the Quorum, but the presence of disks 2-7 did not make sense to me in a regular AG arrangement. The two disks that were online (Disk 6 and Disk 7) were the two disks that were currently "live" on Server2, but there was still no reason for them to be in Failover Cluster Manager.
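For reference, the same view is available without the GUI; assuming the FailoverClusters PowerShell module is on the node, a quick sketch like this lists the Physical Disk resources along with their state and owner:

# Sketch: list the cluster's Physical Disk resources with state, owning group, and owner node
Import-Module FailoverClusters
Get-ClusterResource |
    Where-Object { $_.ResourceType -eq 'Physical Disk' } |
    Select-Object Name, State, OwnerGroup, OwnerNode |
    Format-Table -AutoSize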
The service provider assured me that none of the drives except the Quorum are presented to more than one server from the back-end storage.
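To sanity-check that claim from the Windows side, one option (a rough sketch, using the server names from this example) is to compare the disks each node can actually see; a serial number showing up under more than one node would point to a LUN that really is presented to multiple servers:

# Sketch: compare the physical disks visible to each node
# (a serial number that appears under more than one node = multi-presented storage)
Invoke-Command -ComputerName Server1, Server2 -ScriptBlock {
    Get-Disk | Select-Object Number, FriendlyName, SerialNumber, Size, OperationalStatus
} | Select-Object PSComputerName, Number, FriendlyName, SerialNumber, OperationalStatus |
    Sort-Object SerialNumber |
    Format-Table -AutoSize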
There was one reported event that happened at 2:36pm, a time that coincided with the failures – the client added a new node, Server3, to the cluster (it was evicted 30 minutes later with no further impact, positive or negative).
My best theory at this point was that when the engineer tried to add Server3 to the Windows cluster, they mistakenly tried to add the disks as Cluster Disk resources. For a traditional SQL Server Failover Cluster Instance (FCI) this would be correct – for a SQL FCI almost all disk is shared, and all nodes need to have access to all of the shared disk (although only one node can "own" it at any one time).
A cluster will "lock" disks – if cluster MySuperHugeAndAmazingCluster01 owns a particular drive then no other server or cluster can use it; the only way for a server to access it is through the cluster. I considered that this may be the cause of the issue – even though several of the drives were flagged with "clustered storage is not connected to the node," that may simply have been because the storage wasn't presented to the current "owner" of the Cluster Disk objects, Server2.
--
After an application downtime was scheduled, I signed on to the server. After saving the AGs' settings for later re-creation, I deleted the AGs, shut down SQL Server, and deleted the cluster disk objects. This, combined with a disk rescan in the Computer Management console on each server, did indeed return control of the "missing" drives to the servers. I dropped six of the seven cluster disk objects (all of them except the Quorum object), which meant I needed to rescan disks on all of the servers. This validated that the only reason things had kept working on Server2 was that the cluster thought Server2 owned the disk objects (my guess is that the Add Node wizard to add Server3 to the cluster the other day was probably run from Server2 rather than Server1 – more to follow on that).
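For what it's worth, the cluster-disk portion of that cleanup can also be scripted; a rough sketch (the resource and server names are from this example, and you would obviously want to double-check the list from Get-ClusterResource before removing anything):

# Sketch: remove the six offending cluster disk resources (everything except the Quorum disk)
'Cluster Disk 2','Cluster Disk 3','Cluster Disk 4','Cluster Disk 5','Cluster Disk 6','Cluster Disk 7' |
    ForEach-Object { Remove-ClusterResource -Name $_ -Force }

# Rescan storage on each node so Windows picks the drives back up
# (the GUI equivalent is the disk rescan in Computer Management / Disk Management)
Invoke-Command -ComputerName Server1, Server2 -ScriptBlock { Update-HostStorageCache }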
I recreated the two AGs and then performed a failover test of each of the two availability groups from Server2 to Server1 and back again, so that at the end of the process Server2 was the primary for both availability groups. Making Server2 the primary was necessary because of the changes the client had made to the front-end applications and processes to get them working – they had redirected the applications to talk directly to Server2 rather than to the two availability group names (this works since the availability group name is really just a redirect to the server name/IP itself under the hood). A final step for the client was to redirect the apps to once again talk to the availability group listeners.
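The failover test itself is a single statement per hop, run against the replica that is about to become the primary; for example, via Invoke-Sqlcmd (assuming the SQL PowerShell module is available, synchronous-commit replicas, and placeholder AG names AG1 and AG2):

# Sketch: fail each AG over to Server1 and then back to Server2
# (run the FAILOVER against the replica that is about to become the primary)
Invoke-Sqlcmd -ServerInstance 'Server1' -Query 'ALTER AVAILABILITY GROUP [AG1] FAILOVER;'
Invoke-Sqlcmd -ServerInstance 'Server1' -Query 'ALTER AVAILABILITY GROUP [AG2] FAILOVER;'
Invoke-Sqlcmd -ServerInstance 'Server2' -Query 'ALTER AVAILABILITY GROUP [AG1] FAILOVER;'
Invoke-Sqlcmd -ServerInstance 'Server2' -Query 'ALTER AVAILABILITY GROUP [AG2] FAILOVER;'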
I then added the new node (Server3) to the cluster, and stepping through the Add Node wizard showed me the likely cause of the original issue (below). As of the end of the call, the client was satisfied with the end state – SQL Servers online, availability groups running, and new cluster node added.
--
Here is what *I* learned today, brought to light by adding the new node, and what was almost certainly the cause of the problem:
As I noticed when adding Server3 to the cluster, on the Confirmation screen of the Add Node wizard in Windows Server 2012 there is a check box to "Add all eligible storage to the cluster" – by default it is *CHECKED*.
As described here by Clustering MVP David Bermingham, this can really cause problems:
On the confirmation screen you will see the name and IP address you selected. You will also see an option which is new with Windows Server 2012 failover clustering…"Add all eligible storage to the cluster". Personally I'm not sure why this is selected by default, as this option can really confuse things. By default, this selection will add all shared storage (if you have it configured) to the cluster, but I have also seen it add just local, non-shared disks, to the cluster as well. I suppose they want to make it easy to support symmetric storage, but generally any host based or array based replication solutions are going to have some pretty specific instructions on how to add symmetric storage to the cluster and generally this option to add all disks to the cluster is more of a hindrance than a help when it comes to asymmetric storage. For our case, since I have no shared storage configured and I don't want the cluster adding any local disks to the cluster for me automatically I have unchecked the Add all eligible storage to the cluster option.
(emphasis mine)
I have seen a cluster disk object reserve/"lock" a resource so that the actual servers can't access it other than through the cluster, but I hadn't run into this specific situation (the check box) before. The explanation from David above shows the most likely reason *why* this happened in this case – with the offending box checked by default, whoever was adding the node probably clicked right past it, and when the process to actually add the node started, it grabbed all of the storage for the cluster, locking everybody out. This would have impacted Server3 as well, but since it was a new server with no user databases (or anything else) on its D: and E: drives, unless someone was looking in My Computer and saw the drives disappear there wouldn't be any immediately apparent problem on that server.
The reason I believe the Add Node wizard was run from Server2 (not that it is important, just explanatory) is that the disk objects showed as being owned by Server2. Since Server2 owned the cluster disk objects, it could still access them, which is why it was able to keep accessing its user databases on the two drives.
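A quick way to see that ownership is to look at the "Available Storage" group (the clustered role the failed disks were sitting in, per the event log entry above); a sketch:

# Sketch: check which node currently owns the 'Available Storage' group -
# in this case it was Server2, which is why Server2 could still reach its drives
Get-ClusterGroup -Name 'Available Storage' | Select-Object Name, OwnerNode, State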
--
At the end of the day, if you are working on a cluster with no shared storage, make sure to uncheck the "Add all eligible storage to the cluster" check box. Even if you do have shared storage, it may not be a bad practice to uncheck the box anyway – it isn't that hard to add the disks manually afterward, and it makes your cluster creation process consistent.
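If you uncheck the box and later decide you do want the cluster to pick up some disks after all, the manual add is just a couple of cmdlets (again assuming the FailoverClusters module; review the output of the first command before piping it anywhere):

# Sketch: see what the cluster considers eligible, then add it deliberately
Get-ClusterAvailableDisk
Get-ClusterAvailableDisk | Add-ClusterDisk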
--
BONUS - I am not a PowerShell freak myself (I keep telling myself I need to become one, since #YouCanDoAnythingWithPowerShell), but if you like PS the relevant cmdlet has a flag that is functionally equivalent to unchecking the box:
PS C:\> Add-ClusterNode -Name Server3 -NoStorage
--
#TheMoreYouKnow
--
Just gone through today what you experienced and described above.
Your useful blogpost and help is much appreciated.
Geezer, you saved my bacon – added a node and bam, the whole cluster down... :( Only done this a couple of times, so was in complete panic, as it should have had no impact.
In the end, just removed the clustered drives, brought the physical disks online, rebooted, and SQL came back online and was happily running the existing AGs.
Hoorah! Then went through the node addition again, this time ensuring I had un-ticked the clustered storage option....
I have been playing with clustering and AGs in my test environment. I was effectively hitting the same situation because I was trying to have 2 nodes that were FCI and 1 node that was an AG to the FCI cluster. The problem is that, being in the same cluster, the AG needed to see the D: drive to add replicas; however, the D: drive was being used by the FCI virtual instance for its data drive. I tested adding the D: drive of the AG node to the cluster and I lost my entire D: drive on the AG server. So it would appear that if you are going to build an FCI and an AG you need to really consider what drives you have mapped where.