One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Monday, April 28, 2014

Why Can't I Reinitialize My Subscriber?



I recently had a colleague request help with a replication problem he was having with a client.  The client was running transactional replication from a SQL Server 2005 Publisher to multiple SQL Server 2005 Subscribers and was having issues after attempting to add several articles (tables) to an existing publication.

--

The initial error reported on multiple subscribers looked like this in Replication Monitor:




“Unable to replicate a view or function because the referenced objects or columns are not present on the Subscriber.”

To me this said that some of the article changes made during the publication change hadn’t been properly sent to the subscribers, so I decided to reinitialize all subscriptions:


(Another colleague pointed out that I could have reinitialized only the broken subscriptions one-by-one, but in this case a majority of the subscriptions were broken and the publication was relatively small.)
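
If you prefer to script it rather than clicking through Replication Monitor, the T-SQL equivalent looks roughly like this (a sketch only - the publication and database names below are placeholders), run at the Publisher in the publication database:

-- Run at the Publisher, in the publication database.
-- Marks the subscription(s) for re-initialization from a new snapshot.
EXEC sp_reinitsubscription
    @publication    = N'Publication1',   -- placeholder publication name
    @article        = N'all',
    @subscriber     = N'all',            -- or a specific subscriber, e.g. N'SubServer1'
    @destination_db = N'all';            -- or the specific subscription database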

--
Three of the four subscriptions failed re-initialization with two different errors: one with an object reference error and the other two with a replication error.

--

The first subscription, on SubServer1, failed re-initialization with the following error:

Cannot DROP TABLE 'dbo.Table99' because it is being referenced by object 'View1'.

I found that the referenced view was created with SCHEMABINDING, which was preventing the related table from being dropped and recreated.  The view didn’t have any explicit permissions on it, so I scripted out the view, dropped it, re-ran the initialization for that subscription (which succeeded), and then recreated the view (again with SCHEMABINDING, as it was originally).
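
The sequence looked roughly like this (a sketch using the anonymized names from above - the real column list came from the scripted-out view definition):

-- 1. Script out the view definition first, then drop it so dbo.Table99
--    is no longer schema-bound:
DROP VIEW dbo.View1;
GO

-- 2. Re-run the re-initialization for that subscription and let the
--    snapshot drop and recreate dbo.Table99.

-- 3. Recreate the view exactly as it was, including SCHEMABINDING:
CREATE VIEW dbo.View1
WITH SCHEMABINDING
AS
SELECT SomeColumn          -- placeholder; the original column list goes here
FROM dbo.Table99;
GO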

--

SubServer2 and SubServer3 had a different, but related, issue:



In both of these cases, the re-initialization was failing because it couldn’t drop a table on the subscriber that was itself being used in replication.  What this essentially means is:

In Publication #1, PubServer1 is publishing to SubServer2
In Publication #2, SubServer2 is publishing one or more of the tables from Publication #1 to some other server or servers.

In essence, SubServer2 is both a subscriber to Publication #1 and a publisher of Publication #2 (often called a “re-publisher”).  This results in the error seen above, because part of the default options for a re-initialization of a subscription is to drop and recreate the tables, and when a table is an article in a publication (in this case Publication #2 above) it cannot be dropped to reinitialize the subscription to Publication #1.
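
A quick way to check for this on a suspect subscriber is to look for tables flagged as published in the subscription database (a sketch; any table returned here is an article in a publication on that server and can’t be dropped out from under it):

-- Run at the Subscriber (the suspected re-publisher), in the subscription database.
-- Any table returned is itself an article in a publication on this server,
-- so a "drop and recreate" re-initialization cannot drop it.
SELECT name, is_published, is_merge_published
FROM sys.tables
WHERE is_published = 1
   OR is_merge_published = 1;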

For information on re-publishing and the order of steps required to use it, see the TechNet article here and the blog post here.  There are many limitations on both the publications and the subscriptions involved in such an arrangement.

The fix for this is to alter the properties of the offending articles, or of all articles, in the publication (in this case the change was made to all articles, but it could have been made one-by-one as well):


 


Changing article properties in this fashion automatically marks all subscriptions for re-initialization (assuming you don’t cancel out of the above dialog box).  Your options are to accept the default as shown, in which case you need to manually run the snapshot agent job on the Distributor before the re-initialization will occur, or to check the “Generate the new snapshot now” box, which will reach out to the Distributor and immediately fire the snapshot agent job, beginning the re-initialization process right away.
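
The same change can be scripted rather than made through the dialog; a hedged sketch (placeholder names again), run at the Publisher in the publication database, would look something like this:

-- Change the article property so the subscriber table is NOT dropped
-- during re-initialization (the "Delete Data" behavior) instead of
-- being dropped and recreated.
EXEC sp_changearticle
    @publication               = N'Publication1',  -- placeholder
    @article                   = N'Table99',       -- placeholder; repeat per article
    @property                  = N'pre_creation_cmd',
    @value                     = N'delete',
    @force_invalidate_snapshot = 1,                 -- a new snapshot will be required
    @force_reinit_subscription = 1;                 -- subscriptions get marked for re-init

-- The scripted equivalent of checking "Generate the new snapshot now":
EXEC sp_startpublication_snapshot
    @publication = N'Publication1';                 -- placeholder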

The decision on what to do here should be based on how critical the replication is to your infrastructure (if you need it fixed *now* you may have no alternative but to generate it right away) and how large the publication itself is (generating and transferring a snapshot for a large publication can be time- and resource-intensive during business hours).

In this case, based on the tables included in the publication and its relatively small size, it was possible to “Generate the new snapshot now.”  This re-initialization completed successfully on all subscribers with the “Delete Data” option configured.

--

The final item here is that my notes above about the SCHEMABINDING view would not have been completely relevant had I discovered the republishing situation first.  The SCHEMABINDING view broke that single subscriber because it prevented the table from being dropped, but had the articles’ property been set to “Delete Data” rather than “Drop and Recreate” from the start, the table would never have been dropped and the view would not have needed to be dropped and recreated at all.

--

Hope this helps!

Wednesday, April 23, 2014

Where are my 32-bit Perfmon Counters?

I have run into this before, but not in this same way, so I decided it would be worth writing about.

A client's server is running 32-bit SQL Server 2005 (I know, not supported - that's a different rant/post) on a 64-bit Windows Server 2008 installation.  When I logged onto the server for the first time to try to pull some basic data, I was greeted with one of the most-hated aspects of dealing with 32-bit SQL Server on a 64-bit operating system:

NO PERFMON COUNTERS!!!!!!

Public Service Announcement - Always *always* install your applications with the same architecture as your O/S - meaning always install only 64-bit applications on your 64-bit Windows Server!
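
(As an aside, if you aren't sure whether an instance is 32-bit running on a 64-bit OS, you can check from Management Studio before you ever touch Perfmon - a quick sanity-check query like this, where a 64-bit build should identify itself in both the edition and version strings:)

-- 64-bit builds report "(64-bit)" in the edition and "(X64)" in @@VERSION;
-- a 32-bit instance on 64-bit Windows shows "(Intel X86)" against a
-- 64-bit Windows version string.
SELECT SERVERPROPERTY('Edition')         AS Edition,
       SERVERPROPERTY('ProductVersion')  AS ProductVersion,
       @@VERSION                         AS VersionString;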

Having seen this before, I went to the first item in my 32-bit toolbox - the WOW64 version of Perfmon located at C:\Windows\SysWOW64\perfmon.msc.  Many people don't seem to know this exists, but it gives you the 32-bit version of Perfmon required to view the counters for 32-bit SQL Server (or any other 32-bit application with performance counters) on a 64-bit Windows machine.

No Good.

I placed a shortcut for the 32-bit Perfmon on the desktop for future use (regardless of how I eventually resolved the issue, I knew I would definitely need to use the 32-bit Perfmon) and went to Management Studio to see if the counters were in sys.dm_os_performance_counters, which they were.  This showed me that the counters did exist - if they were completely missing or broken, they wouldn't be visible in the DMV either.
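
For reference, the check was nothing fancy - something along these lines, where any rows coming back for a well-known counter object means the counter definitions are at least registered inside SQL Server:

-- If the SQL Server counters are loaded at all, they show up here even when
-- Perfmon can't see them; zero rows would point instead to missing/unloaded
-- counters (the lodctr scenario mentioned below).
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%';   -- any well-known counter object will do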

In the past when I had run into this issue, it had been on SQL Express and the counters didn't exist at all - not even in the DMV.  In those situations the answer was to load the counters via lodctr after copying them from another machine of the same version.  Since the counters existed, I didn't believe that would be the answer (although eventually it was close - keep reading), so I turned to my Google-Fu to see where I should go next...

...and didn't find much.  A few forum postings and blog articles with different symptoms and different resolutions, but nothing AHA.  In my frustration I went through the steps to repair or reload the existing counters in case some of them were corrupted somehow.

First, I tried to just rebuild the existing counters via lodctr /R but that did not resolve the issue.

Next I went through the full steps to unload and reload the counters as described here by Alex Pinsker (blog/@alex_pinsker).  The catch is that near the end of the process Alex calls for a server reboot, and this was a production server.  I tried several things that I could do without impacting production, including:
  • Re-syncing the counters with WMI (WINMGMT.EXE /RESYNCPERF)
  • Restarting the "Performance Logs and Alerts" service
  • Restarting the "Windows Management Instrumentation" (WMI) service
  • Restarting the "Remote Registry" service (which shouldn't have mattered since I was trying to access the counters locally, but it isn't impactful to try)
...but it didn't work.  I gave up for the afternoon after coordinating a reboot for overnight with the client's server team.

A new day, and BINGO - I could see the SQL Server counters in the 32-bit Perfmon via the interactive ("real-time") Performance Monitor after the reboot. 

With counters firmly in hand, I went to the next step - trying to set up a data collector.  I went through the basic steps to set up the collector and started it, and then a short time later I stopped and re-started the collector to view the contents of the perf file.

ARGH- no SQL Server counters.  The Windows hives of counters (Processor, Logical Disk, etc.) were there, but no SQL Server counters.

I had initially opened the perf file by double-clicking it, so, just to make sure that wasn't the issue, I opened 32-bit Perfmon manually via my shortcut and changed its data source to my perf file - but no good.

More than a little Google-Fu later and I finally found something that described my situation pretty well, from a Microsoft CSS Engineer in Romania: http://blogs.technet.com/b/roplatforms/archive/2010/05/03/creating-a-custom-data-collector-set-with-sql-x86-counters-on-an-x64-os.aspx

The engineer, Bogdan Palos, was writing about a situation on Small Business Server (SBS) compared to Windows 2008 X64, but the basic scenario was similar - trying to create a data collector of 32-bit counters on a Windows 2008 X64 installation.

The suggested fix seemed more than a little bizarre:
...copy the 64bit version of “sqlctr90.dll” from another machine running a x64 SQL instance, to the “c:\windows\system32” directory of the Server 2008 x64 OS machine running SQL Server 2005 x86. We recommend using the DLL from similar SP levels of SQL Server
Apparently the files from the x86 SQL Server installation are not enough - you actually need a file from an X64 install?

But guess what - it worked!  After simply copying the sqlctr90.dll file from another SQL Server 2005 X64 installation of the same version number into the c:\Windows\system32 directory on this server and re-starting my Perfmon collector, it began collecting the SQL Server hives of counters.

Just one more little thing to file under "Seriously, Microsoft?"

What did I learn?
  • 32-bit is evil (already knew this, but it was definitely reinforced).
  • Microsoft will never cease to amaze me in how some things need to be hacked to work.
  • There is always something else to learn!

Hope this helps!

Thursday, April 17, 2014

Why does my Failover Cluster Instance Keep Failing Back and Forth?

A few months ago I was paged by a client because their SQL Server Failover Cluster Instance (FCI) was experiencing multiple failovers when they performed routine Windows patching.

The client's request which generated the work ticket was this:

At some point recently this clustered node failed over from the A node to the B node however it seems to be the only instance that did as INSTANCE_NAME_2 still resides on COMPUTER_NAME. With that, we would like to open a ticket to investigate the cause/date/time of this failover.

This ticket was generated on 02/11, but then there was a failover (with file share “flapping” as described below) on the morning of 02/12, immediately after the request.  At the time it appeared that this was initiated by someone at the client trying to move the service back from the B node to the A node (their chosen primary node), which did happen after some flapping.  There is no evidence of a server reboot or other error on 02/12, but someone did RDP to the server about ten minutes before the failover, so I assumed it was manually triggered.

--

The failover prior to 02/12 (the one the client was asking about in their ticket request) happened on 01/14, and it turned out that there were multiple failover events on 01/14.

The first relevant event on 01/14 was this:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    3:18:18 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

DOMAIN\LOGIN rebooted the A node at 3:18pm with the comment of “reboot.”

--

The next relevant event was this:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    5:50:26 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

At 5:50pm, INSTANCE_NAME was offlined on the B node, without a server restart or other error, which probably means it was user-initiated.  This resulted in 10+ minutes of file share “flapping” until the instance finally ended up on the B node:

--

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7036
Date:                     1/14/2014
Time:                    6:01:35 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service entered the running state.

--

At 10:50pm, the SQL Server cluster resource group was again failed over, possibly in preparation for the B node reboot that was about to happen:

Event Type:        Information
Event Source:    ClusSvc
Event Category:                Failover Mgr
Event ID:              1203
Date:                     1/14/2014
Time:                    10:50:56 PM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
The Cluster Service is attempting to offline the Resource Group "INSTANCE_NAME".

--

At 10:55pm, the same user rebooted the B node, again with the comment of “reboot”:

Event Type:        Information
Event Source:    USER32
Event Category:                None
Event ID:              1074
Date:                     1/14/2014
Time:                    10:55:24 PM
User:                     DOMAIN\LOGIN
Computer:          COMPUTER_NAME
Description:
The process Explorer.EXE has initiated the restart of computer COMPUTER_NAME on behalf of user DOMAIN\LOGIN for the following reason: Other (Planned)
Reason Code: 0x85000000
Shutdown Type: restart
Comment: reboot

--

This resulted in the SQL Server resource group trying to fail over to A, but due to the same file share “flapping” the group finally ended up back on B after B was back up from its reboot:

Event Type:        Information
Event Source:    Service Control Manager
Event Category:                None
Event ID:              7035
Date:                     1/14/2014
Time:                    11:05:56 PM
User:                     DOMAIN\SERVICE_ACCOUNT_LOGIN
Computer:          COMPUTER_NAME
Description:
The SQL Server (INSTANCE_NAME) service was successfully sent a start control.

--

The issue relevant to their situation involved a cluster file share resource:

 

The share is on the Y: drive (Y:\FOLDER_NAME), but as seen in the resource properties screenshot above, there is no cluster dependency established on that drive.  A file share cluster resource like this should have a dependency on the relevant drive resource, as in this example from the “SQL Server MSSQL Share INSTANCE_NAME” share:

 

Without a dependency on the drive resource, the file share resource tries to come online as soon as the resource group fails over.  Unfortunately, since the file share actually *does* depend on the drive being online (even without an established dependency relationship), the file share fails with the following errors if the drive isn’t online yet:

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:  File Share Resource
Event ID:              1068
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster file share resource SHARE_NAME failed to start with error 21.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    File Share Resource
Event ID:              1053
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster File Share SHARE_NAME cannot be brought online because the share could not be created.

--

Event Type:        Error
Event Source:    ClusSvc
Event Category:    Failover Mgr
Event ID:              1069
Date:                     2/12/2014
Time:                    7:16:37 AM
User:                     N/A
Computer:          COMPUTER_NAME
Description:
Cluster resource 'SHARE_NAME' in Resource Group 'INSTANCE_NAME' failed.

--

The middle error is the most telling one – the file share cannot be brought online because the share could not be created – because the Y: drive isn’t online yet!

The dangerous catch here is that by default the file share will try three times to come online, and if it can’t (because the drive isn’t online yet) it results in the resource group failing, causing another SQL Server resource group failover, even if the SQL Server resource comes online cleanly.  This results in the group “flapping” back and forth between the nodes until you get lucky and the Y: drive resource happens to come online before the SHARE_NAME resource makes it through its three retry attempts and kills the group again.

This may not seem important, and it is not the cause of the actual initial failover event (in their case the initial failover was intentional as part of the Windows patching), but it is the reason why every SQL Server cluster failover event turns into multiple failovers, as seen in this SQL Server error log stack:


As can be seen here, on 01/14/2014 a single failover turned into multiple failovers (each SQL Error Log is a new start of the MSSQLServer service) over the course of several minutes as the SHARE_NAME resource tried to come online and failed.  This happened again on 02/12.  This can be seen in the Windows System Log as well (shown filtered below):


Events 1068 and 1053 are the relevant events, and as you can see on 01/14 from 5:51pm-5:58pm there were multiple failovers with the file share failing, and then again that same night between 10:59pm and 11:02pm, and then the most recent event on 02/12 between 7:15am and 7:16am.
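
A quick way to see this restart pattern from the SQL Server side, without opening each error log individually, is to list the error logs and their creation dates - a burst of logs created only minutes apart is the giveaway.  A sketch, using the undocumented xp_enumerrorlogs procedure:

-- Undocumented but widely used: returns one row per SQL Server error log
-- (archive number, date, size).  A cluster of logs dated minutes apart on
-- 01/14 and again on 02/12 lines up with the repeated service restarts
-- caused by the failing file share resource.
EXEC master.dbo.xp_enumerrorlogs;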

There are configuration options on the SHARE_NAME file share cluster object to change the number of retries, or to keep it from failing the whole group when the resource fails, but the most correct fix is to add the dependency on the Y: drive to the file share object.

--


The ultimate answer to why the INSTANCE_NAME instance ended up on the B node is the file share flapping – DOMAIN\LOGIN was performing regular server maintenance (Windows patches) and mid-afternoon on 01/14 tried to fail everything from A to B and then reboot A, and then later in the evening tried to fail everything from B back to A and reboot B (and this is exactly how I would patch a cluster like this).

The catch is that the file share flapping caused many, many failover events to occur each time, and resulted in the cluster instance ending up on the “wrong” node (after the planned maintenance everything would normally have ended up back on A, but due to the flapping A>B>A>B>A>B it actually ended up back on B).
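
If you ever need to confirm which physical node a clustered instance is currently sitting on, you can ask SQL Server directly rather than digging through the cluster administration tools - a quick check like this would have shown the instance on the B node:

-- Returns the Windows computer name of the node currently hosting the instance;
-- on a failover cluster this is the active physical node, while MachineName
-- returns the virtual server/network name that clients connect to.
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS ActiveNode,
       SERVERPROPERTY('MachineName')                 AS VirtualServerName;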

--
 
The ultimate answer to prevent this “flapping” is to add a dependency on the drive object to the file share cluster resource.

**This issue should be corrected as soon as this situation is discovered even though it will require a file share downtime, which may require a SQL Server downtime depending on what the SHARE_NAME share is used for.**
 
While this example is about a file share resource, the same concept (and resolution) is valid for any clustered resource that relies on something else - for example, if your SQL Server instance cluster resource is missing its cluster dependency on the Network Name.

Hope this helps!