One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Wednesday, May 21, 2014

Why Won't My Clustered Instance Start?

Last weekend I had a client page me in the middle of the night (because, of course, these things always happen in the middle of the night) with the complaint that their SQL Server Failover Cluster Instance (FCI) wouldn't come online on their B node after multiple attempts.  When they failed it back to the A node, the instance came up without issue.

The error they reported was this:
Error: 17750, Severity: 16, State: 0. Could not load the DLL SQLJDBC_XA.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.). 
If you don't recognize it outright, SQLJDBC_XA.dll is part of the JDBC (Java Database Connectivity) driver specifically for handling distributed transactions.

At first I went digging for references to this specific error, but didn't see much except for the fact that the Application Log on the server had roughly 2.3 Gajillion (a  technical term roughly equivalent to a Googolplex of Googolplexes) of these 17750 errors.

I tried to go to the SQL Server Error Log but of course the server wasn't changed from its default setting of six plus current, meaning there were only seven logs.  Since the client has tried multiple times to fail the instance back and forth, they had easily burned through the seven tries and therefore the original error.
PRO TIP: I strongly recommend changing the default log retention to a much larger number (30-45) and then setting up a job to cycle the error log every day by running sp_cycle_errorlog.  This works in all versions of SQL Server, makes the logs smaller and easier to open when you want to troubleshoot an issue, and often most importantly, protects you from the issue described above of multiple SQL Server service restarts aging out your log too quickly, or at least mitigates it (if you restart your instance 46 times, you will still be stuck).
Right-click on "SQL Server Logs" and select Configure.  Check the box for "Limit the number of error log files before they are recycled" and fill in the Maximum Number of Error Log files.  Many people are faked into believing that there is no limit because by default this box isn't checked, but if it isn't checked all you get is the *default* limit of six. #TheMoreYouKnow
I did the simple thing - I searched the B node for the file SQLJDBC_XA.dll to see if it was in the wrong directory or simply didn't exist - surprise surprise, it didn't exist.  The other possibility with this particular error (for a DLL file anyway) would be that it wasn't registered with Windows, but in this case it wasn't there at all.

I looked at the A node and searched for the file and there it was:


This showed me that the problem was probably related to a piece of application code installed on the A node, which I found without too much effort:


 I then looked on the B node and found the this ATG application wasn't there:



 No wonder the application DLL wasn't there!

As you can see the problem is that the application code was not installed on the B node and therefore the relevant database couldn't be started.  Failing the instance back to the A node allowed everything to run because the application code had been installed there.  The Windows Application Log also showed a particular database related to the situation (with a suspiciously similar name):
Date     5/17/2014 2:08:16 AM
Log     SQL Server
Source     spid7s
Message
Recovery is complete. This is an informational message only. No user action is required.
--
Date     5/17/2014 2:08:16 AM
Log     SQL Server
Source     spid51
Message

Starting up database 'atg'.
-- 
Date     5/17/2014 2:08:16 AM
Log     SQL Server
Source     spid63
Message
Attempting to load library 'SQLJDBC_XA.dll' into memory. This is an informational message only. No user action is required.
--
Date     5/17/2014 2:08:16 AM
Log     SQL Server
Source     spid63
Message
Error: 17750, Severity: 16, State: 0.
--
Date     5/17/2014 2:08:16 AM
Log     SQL Server
Source     spid63
Message
Could not load the DLL SQLJDBC_XA.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.)
 ...which led into a near infinite cycle of Attempting to load - Count not load - Attempting to load - Could not load - etc. etc. etc.  Only by making it all the way back to the instance restart could you see the "starting up database" message.

--

Lessons learned for the client (and all of us):

  • Don't install application code on a SQL Server unless you absolutely have to.  Ever.  When you do have to do so, make sure all impacted SQL Servers (cluster nodes, mirroring hosts, etc.) have the same version of the application with all of the same options enabled.  (This last part gets people as well - the application is installed on both nodes but one is installed with a particular flag enabled which adds extra DLLs to the server which causes the same problem seen above).
  • Set the SQL Server Log to extra history retention so that your DBAs and other impacted parties have a better chance of seeing the true cause of the initial failure - much of the information is in the Windows Application Log but not all of it, and it is easier to find the SQL Server-specific items in the SQL Server log.  Also, on many servers the Application Log cycles more frequently than every thirty days (although you can increase the size of the Application Log as well, and this also can be a good idea.)
  • Most importantly - call for help before trying the same thing over and over.  When I try something once and it doesn't work, I do often try it a second time to make sure I didn't make a simple fat-finger mistake the first time.  After that, throw up the warning flag and ask for help via another DBA, or #sqlhelp on Twitter, or Google-Fu - whatever helps you get a second opinion.
Hope this helps!


No comments:

Post a Comment