One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Wednesday, April 23, 2014

Where are my 32-bit Perfmon Counters?

I have run into this before, but not in this same way, so I decided it would be worth writing about.

On a client server they are running 32-bit SQL Server 2005 (I know, not supported - that's a different rant/post) on a 64-bit Windows 2008 installation.  When I logged onto the server for the first time to try to pull some basic data, I was greeted with one of the most-hated items of dealing with 32-bit SQL Server on a 64-bit operating system:

NO PERFMON COUNTERS!!!!!!

Public Service Announcement - Always *always* install your applications with the same architecture as your O/S - meaning always install only 64-bit applications on your 64-bit Windows Server!

Having seen this before, I went to the first item in my 32-bit toolbox - the WOW64 version of Perfmon located at C:\Windows\WOW64\perfmon.msc.  Many people don't seem to know this exists, but it allows you to access the 32-bit version of Perfmon required to access the counters for 32-bit SQL Server (or any other 32-bit application with performance counters) on a 64-bit Windows machine.

No Good.

I placed a shortcut for the 32-bit Perfmon on the desktop for future use (regardless of how I eventually resolved the issue, I knew I would definitely need to use the 32-bit Perfmon) and went to Management Studio to see if the counters were in sys.dm_os_performance_counters, which they were.  This showed my that the counters did exist - if they were completely missing or broken, they wouldn't be visible in the DMV either.

In the past when I had run into this issue, it had been on SQL Express and the counters didn't exist at all - not even in the DMV.  In those situations the answer was to load the counters via lodctr after copying them from another machine of the same version.  Since the counters existed I didn't believe that would be the answer (although eventually it was close - keep reading) so I turned to my Google-Fu to see where I should go next...

...and didn't find much.  A few forum postings and blog articles with different symptoms and different resolutions, but nothing AHA.  In my frustration I went through the steps to repair or reload the existing counters in case some of them were corrupted somehow.

First, I tried to just rebuild the existing counters via lodctr /R but that did not resolve the issue.

Next I went through the full steps to unload and reload the counters as described here by Alex Pinsker (blog/@alex_pinsker).  The catch is that near the end of the process Alex calls for a server reboot, and this was a production server.  I tried several things that I could do without impacting production, including:
  • Re-syncing the counters with WMI (WINMGMT.EXE /RESYNCPERF
  • Restarting the "Performance Logs and Alerts" service
  • Restarting "Windows Management Instrumentation" (WMI) service
  • Restarting the "Remote Registry" service (which shouldn't have mattered since I was trying to access the counters locally, but it isn't impactful to try)
...but it didn't work.  I gave up for the afternoon after coordinating a reboot for overnight with the client's server team.

A new day, and BINGO - I could see the SQL Server counters in the 32-bit Perfmon via the interactive ("real-time") Performance Monitor after the reboot. 

With counters firmly in hand, I went to the next step - trying to set up a data collector.  I went through the basic steps to set up the collector and started it, and then a short time later I stopped and re-started the collector to view the contents of the perf file.

ARGH- no SQL Server counters.  The Windows hives of counters (Processor, Logical Disk, etc.) were there, but no SQL Server counters.

I had initially opened the perf file by double-clicking it, so, just to make sure, I opened 32-bit Perfmon manually via my shortcut and changed its data source to my perf file just in case that was the issue, but no good.

More than a little Google-Fu later and I finally found something that described my situation pretty well, from a Microsoft CSS Engineer in Romania: http://blogs.technet.com/b/roplatforms/archive/2010/05/03/creating-a-custom-data-collector-set-with-sql-x86-counters-on-an-x64-os.aspx

The engineer, Bogdan Palos, was writing about a situation on Small Business Server (SBS) compared to Windows 2008 X64, but the basic scenario was similar - trying to create a data collector of 32-bit counters on a Windows 2008 X64 installation.

The suggested fix seemed more than a little bizarre:
...copy the 64bit version of “sqlctr90.dll” from another machine running a x64 SQL instance, to the “c:\windows\system32” directory of the Server 2008 x64 OS machine running SQL Server 2005 x86. We recommend using the DLL from similar SP levels of SQL Server
Apparently the files from the x86 SQL Server installation are not enough - you actually need a file from an X64 install?

But guess what - it worked!  After simply copying the sqlctr90.dll file from another SQL Server 2005 X64 of the same version number into the c:\Windows\system32 directory on this server and re-starting my Perfmon collector, it began collecting the SQL Server hives of counters.

Just one more little thing to file under "Seriously, Microsoft?"

What did I learn?
  • 32-bit is evil (already knew this, but it was definitely reinforced).
  • Microsoft will never cease to amaze me in how some things need to be hacked to work.
  • There is always something else to learn!

Hope this helps!

Thursday, April 17, 2014

Why does my Failover Cluster Instance Keep Failing Back and Forth?

A few months ago I was paged by a client because their SQL Server Failover Cluster Instance (FCI) was experiencing multiple failovers when they performed routine Windows patching.

The client's request which generated the work ticket was this:

At some point recently this clustered node failed over from the A node to the B node however it seems to be the only instance that did as INSTANCE_NAME_2 still resides on COMPUTER_NAME. With that, we would like to open a ticket to investigate the cause/date/time of this failover.

This ticket was generated on 02/11, but then there was a failover (with file share “flapping” as described above) on the morning of 02/12, immediately after the request – it appeared at the time that this was initiated by someone at the client trying to move the service back from the B node to the A node (their chosen primary node), which did happen after some flapping.  There is no evidence of server reboot or other error on 02/12, but someone did RDP to the server about ten minutes before the failover, so I assumed it was manually triggered.

--

The failover prior to 02/12 (the one the client was asking about in their ticket request) happened on 01/14, and it turned out that there were multiple failover events on 01/14.

The first relevant event on 01/14 was this:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    3:18:18 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

DOMAIN\LOGIN rebooted the A node at 3:18pm with the comment of “reboot.”

--

The next relevant event was this:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    5:50:26 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

At 5:50pm, INSTANCE_NAME was offlined on the B node, without a server restart or other error, which probably means it was user-initiated.  This resulted in 10+ minutes of file share “flapping” until the instance finally ended up on the B node:

--

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7036
Date:                     1/14/2014
Time:                    6:01:35 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service entered the running state.

--

At 10:50pm, the SQL Server cluster resource group was again failed over, possibly in preparation for the B node reboot that was about to happen:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    10:50:56 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

--

At 10:55pm, the same user rebooted the B node, again with the comment of “reboot”:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    10:55:24 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

--

This resulted in the SQL Server resource group trying to fail over to A, but due to the same file share “flapping” the group finally ended up back on B after B was back up from its reboot:

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7035
Date:                     1/14/2014
Time:                    11:05:56 PM
User:                     DOMAIN\SERVICE_ACCOUNT_LOGIN
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service was successfully sent a start control.

--

The issue relevant to their situation related to a cluster file share resource:

 

The share is on the Y: drive (Y:\FOLDER_NAME) but as seen in the resource properties screenshot above there is no clustering dependency established to that drive.  On a file share cluster resource like this there should be a dependency on the relevant drive, like this example from the “SQL Server MSSQL Share INSTANCE_NAME” share:

 

Without a dependency on the drive resource, the file share resource tries to come online as soon as the resource group fails over.  Unfortunately, since the file share actually *does* depend on the drive being online (even without an established dependency relationship), the file share fails with the following errors if the drive isn’t online yet:

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:  File Share Resource
Event ID:              1068
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster file share resource SHARE_NAME failed to start with error 21.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    File Share Resource
Event ID:              1053
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster File Share SHARE_NAME cannot be brought online because the share could not be created.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    Failover Mgr
Event ID:              1069
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster resource 'SHARE_NAME' in Resource Group 'INSTANCE_NAME' failed.

--

The middle error is the most telling one – the file share cannot be brought online because the share could not be created – because the Y: drive isn’t online yet!

The dangerous catch here is that by default the file share will try three times to come online, and if it can’t (because the drive isn’t online yet) it results in the resource group failing, causing another SQL Server resource group failover, even if the SQL Server resource comes online cleanly.  This results in the group “flapping” back and forth between the nodes until you get lucky and the Y: drive resource happens to come online before the SHARE_NAME resource makes it through its three retry attempts and kills the group again.

This may not seem important and is not the cause of the actual initial failover event (in their case the initial failover was intentional as part of the Windows patching), but it is the reason why every time you do have a SQL Server cluster failover event there are multiple failovers, as seen in this SQL Server error log stack:


As can be seen here, on 01/14/2014 there was a failover, and it resulted in multiple failovers (each SQL Error Log is a new start of the MSSQLServer service) over the course of several minutes as the SHARE_NAME resource tried to come online and failed.  This happened again on 02/12.  This can be seen in the Windows System Log as well (shown filtered below):


Events 1068 and 1053 are the relevant events, and as you can see on 01/14 from 5:51pm-5:58pm there were multiple failovers with the file share failing, and then again that same night between 10:59pm and 11:02pm, and then the most recent event on 02/12 between 7:15am and 7:16am.

There are configuration options on the SHARE_NAME file share cluster object to change the number of retries or to make it so it doesn’t fail the group when its object fails, but the most correct fix is to add the dependency on the Y: drive to the file share object.

--


The ultimate answer of why the INSTANCE_NAME instance ended up on the B node is the file share flapping – DOMAIN\LOGIN was performing regular server maintenance (Windows patches) and mid-afternoon on 01/14 tried to fail everything from A to B and then reboot A, and then later in the evening tried to fail everything from B back to A and reboot B (and this is how I would patch a cluster like this).

The catch is that the file share flapping caused many, many failover events to occur each time, and resulted in the cluster instance ending up on the “wrong” node (after a failover everything would normally be on A, but due to the flapping A>B>A>B>A>B it actually ended up back on B).

-
 
The ultimate answer to prevent this "flapping" is to add the file share cluster resource dependency on the drive object.

**This issue should be corrected as soon as this situation is discovered even though it will require a file share downtime, which may require a SQL Server downtime depending on what the SHARE_NAME share is used for.**
 
While this example is about a file share resource, the same concept (and resolution) is valid for any clustered resource that relies on something else - for example, if your SQL Server instance cluster resource is missing its cluster dependency on the Network Name.

Hope this helps!