One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Thursday, April 2, 2015

Availability Groups - Where Did My Disks Go?



The TL;DR - beware of Failover Cluster Manager trying to steal your non-shared storage!

--
At a client recently, two availability groups on a single Windows cluster went down simultaneously.  Apparently the server that was the primary for the AGs (Server1) had mysteriously lost its DATA and LOG drives.  By the time the client got us involved, they had faked the application into coming up by pointing it at the single SQL Server instance that was still up (Server2) via the instance name rather than the availability group listeners.

I found that two of the drives on Server1 had gone offline, causing the issues – sample errors from the Windows System and Application Logs respectively:



--

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          3/31/2015 2:36:25 PM
Event ID:      1635
Task Category: Resource Control Manager
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      server1.mydomain.com
Description:
Cluster resource 'Cluster Disk 2' of type 'Physical Disk' in clustered role 'Available Storage' failed.

--

Log Name:      Application
Source:        MSSQLSERVER
Date:          3/31/2015 2:36:25 PM
Event ID:      9001
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      server1.mydomain.com
Description:
The log for database 'Database1' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

--
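
As an aside, the same entries can be pulled from PowerShell instead of Event Viewer.  A quick sketch, assuming the provider names and event IDs shown above (MSSQLSERVER matches a default instance; a named instance logs as MSSQL$InstanceName):

# Cluster disk failures from the System log (Failover Clustering provider)
Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-FailoverClustering'} -MaxEvents 50 |
    Format-Table TimeCreated, Id, Message -AutoSize -Wrap

# SQL Server 9001 errors ("The log for database ... is not available") from the Application log
Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='MSSQLSERVER'; Id=9001} |
    Format-Table TimeCreated, Id, Message -AutoSize -Wrap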



Since this is an Availability Group (AG), I was surprised that there were “Cluster Disk” resources at all – AGs do not rely on shared disk (it is one of their many advantages), and most AG clusters don’t have any shared disk at all (aside from, occasionally, a quorum drive).



This is what I saw in Failover Cluster Manager:



Cluster Disk 1 was the Quorum, but the presence of disks 2-7 did not make sense to me in a regular AG arrangement.  The two disks that were online (Disk 6 and Disk 7) were the two disks that were currently “live” on Server2, but there was still no reason for them to be in Failover Cluster Manager.
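
For reference, the same picture is available from PowerShell on any node of the cluster.  A quick sketch of the equivalent query (the FailoverClusters module ships with the failover clustering feature):

# List every Physical Disk resource in the cluster, with its state and current owner
Import-Module FailoverClusters
Get-ClusterResource |
    Where-Object { $_.ResourceType -like 'Physical Disk' } |
    Format-Table Name, State, OwnerGroup, OwnerNode -AutoSize

The State and OwnerNode columns are the interesting ones here – they come back into the story later.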

The service provider assured me that none of the drives except the Quorum was presented to more than one server from the back-end storage.

There was one reported event that coincided with the failures – at 2:36 PM the client had added a new node, Server3, to the cluster (it was evicted 30 minutes later with no further impact, positive or negative).

My best theory at this point was that when the engineer tried to add Server3 to the Windows cluster, they mistakenly tried to add the disks as Cluster Disk resources.  For a traditional SQL Server Failover Cluster Instance (FCI) this would be correct – in an FCI almost all disk is shared, and every node needs access to all of the shared disk (although only one node can “own” it at any one time).

A cluster will “lock” disks – if cluster MySuperHugeAndAmazingCluster01 owns a particular drive, then no other server or cluster can use it; the only way for a server to access it is through the cluster.  I considered that this might be the cause of the issue – even though several of the drives were flagged with “clustered storage is not connected to the node,” that may simply have been because the storage wasn’t presented to the current “owner” of the Cluster Disk objects, Server2.

--

After an application downtime was scheduled, I signed on to the server, scripted out the AG settings for later re-creation, dropped the AGs, and shut down SQL Server.  I then deleted six of the seven Cluster Disk objects (all of them except the Quorum object), which meant rescanning disks in the Computer Management console on each server.  This did indeed return control of the “missing” drives to the servers.  It also validated that the only reason things had kept working on Server2 was that the cluster thought Server2 owned the disk objects (my guess is that the Add Node wizard run to add Server3 to the cluster the other day was started from Server2 rather than Server1 – more to follow on that).
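
For anyone facing the same cleanup, here is a rough PowerShell sketch of that step (SQL already stopped, AG definitions already scripted out).  The 'Available Storage' filter is what excludes the Quorum disk, which lives in the core Cluster Group role – check the names in your own cluster before trusting it:

# Remove every Physical Disk resource sitting in Available Storage
# (the quorum/witness disk lives in "Cluster Group" and is left alone)
Get-ClusterResource |
    Where-Object { $_.ResourceType -like 'Physical Disk' -and $_.OwnerGroup -like 'Available Storage' } |
    Remove-ClusterResource -Force

# Then, on EACH node, rescan storage and bring the freed drives back online
# (roughly the scripted equivalent of the rescan in Computer Management)
# Check Get-Disk output first so you only touch the drives you expect
Update-HostStorageCache
Get-Disk |
    Where-Object { $_.IsOffline } |
    Set-Disk -IsOffline $false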

I recreated the two AGs and then performed a failover test of each availability group from Server2 to Server1 and back again, so that at the end of the process Server2 was the primary for both.  Making Server2 the primary was necessary because of the changes the client had made to get the front-end applications and processes working – they had redirected the applications to talk directly to Server2 rather than to the two availability group listeners (this works because, under the hood, the listener name is really just a redirect to the server name/IP).  A final step for the client was to point the apps back at the availability group listeners.
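
The failover tests themselves are just a manual failover issued on the replica that is about to become primary.  A sketch using Invoke-Sqlcmd and a placeholder AG name (the real AG and listener names belong to the client), assuming synchronous-commit replicas in a SYNCHRONIZED state:

# Fail over to Server1 (run against the replica that is about to become primary)...
Invoke-Sqlcmd -ServerInstance 'Server1' -Query 'ALTER AVAILABILITY GROUP [AG1] FAILOVER;'

# ...then fail back, leaving Server2 as the primary for the AG
Invoke-Sqlcmd -ServerInstance 'Server2' -Query 'ALTER AVAILABILITY GROUP [AG1] FAILOVER;'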

I then added the new node (Server3) to the cluster, and stepping through the Add Node wizard showed me the likely cause of the original issue (below).
 
As of the end of the call, the client was satisfied with the end state – SQL Servers online, availability groups running, and new cluster node added.

--

Here is what *I* learned today – brought to light while adding the new node, and almost certainly the cause of the original problem:


As I noticed when adding Server3 to the cluster, on the Confirmation screen of the Add Node wizard in Windows Server 2012 there is a check box to “Add all eligible storage to the cluster” – by default it is *CHECKED*.

As described here by Clustering MVP David Bermingham, this can really cause problems:

On the confirmation screen you will see the name and IP address you selected. You will also see an option which is new with Windows Server 2012 failover clustering…”Add all eligible storage to the cluster”. Personally I’m not sure why this is selected by default, as this option can really confuse things. By default, this selection will add all shared storage (if you have it configured) to the cluster, but I have also seen it add just local, non-shared disks, to the cluster as well. I suppose they want to make it easy to support symmetric storage, but generally any host based or array based replication solutions are going to have some pretty specific instructions on how to add symmetric storage to the cluster and generally this option to add all disks to the cluster is more of a hindrance than a help when it comes to asymmetric storage. For our case, since I have no shared storage configured and I don’t want the cluster adding any local disks to the cluster for me automatically I have unchecked the Add all eligible storage to the cluster option.

(emphasis mine)

I have seen a cluster disk object reserve/“lock” a resource so that the actual servers can’t access it other than through the cluster, but I hadn’t run into this specific situation (the check box) before.  The above explanation from David shows the most likely reason *why* this happened in this case – with the offending box checked by default, whoever was adding the node probably clicked right past it, and when the process to actually add the node started, it grabbed all of the storage for the cluster, locking everybody out.  This would have impacted Server3 as well, but since it was a new server with no user databases (or anything else) on its D: and E: drives, there wouldn’t have been any immediately apparent problem there unless someone was looking in My Computer and saw the drives disappear.

The reason I believe the Add Node wizard was run from Server2 (not that it is important, just explanatory) is that the disk objects showed as being owned by Server2.  Since Server2 owned the cluster disk objects, it could still access them, which is why it was able to keep accessing its user databases on the two drives.

--

At the end of the day, if you are working on a cluster with no shared storage, make sure to uncheck the "Add all eligible storage to the cluster" check box - and even if you do have shared storage, it may not be a bad practice to uncheck the box anyway - it isn't that hard to add the disks manually afterward, and it makes your cluster creation process consistent.

--
BONUS - I am not a PowerShell freak myself (I keep telling myself I need to become one, since #YouCanDoAnythingWithPowerShell), but if you like PS, the relevant cmdlet has a switch that is functionally equivalent to unchecking the box:
PS C:\> Add-ClusterNode -Name Server3 -NoStorage
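
And if you build the cluster itself from PowerShell, New-Cluster takes the same switch, so no eligible disks get grabbed at creation time either (names below are placeholders):

PS C:\> New-Cluster -Name MyCluster01 -Node Server1,Server2 -NoStorage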
 --

#TheMoreYouKnow


 

3 comments:

  1. just gone through today what you experienced and described above.
    your useful blogpost and help is much appreciated

  2. Geezer you saved my bacon, added a node and bam the whole cluster down... :( Only done this a couple of times, so was in complete panic, as it should have had no impact.

    In the end, just removed the clustered drives, brought the physical disks online, rebooted and SQL came back online and was happily running the existing AGs.

    Hoorah!, then went through the node addition again, this time insuring I had un-ticked the clustered storage option....

  3. I have been playing with clustering and AG in my test environment. I was effectively hitting the same situation because I was trying to have 2 nodes that were FCI and 1 node that was a AG to the FCI cluster. The problem is that being in the same cluster the AG needed to see the D: drive to add replicas however the D drive was being used by the FCI virtual instance for its data drive. I tested adding the D drive of the AG node to the cluster and I lost my entire d drive on the AG server. So it would appear that if you are going to build a FCI and an AG you need to really consider what drives you have mapped where.
