One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Wednesday, June 21, 2017

9003 is a Scary Number


Recently I moved up to the next consecutive error number, a 9003 error:

-- 
Log Name:      Application
Source:        MSSQLSERVER
Date:          6/15/2017 4:49:59 AM
Event ID:      9003
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Server1.domain.net
Description:
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
 
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). 
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

--

<Cue Scary Music>

http://www.scaryforkids.com/pics/scary-music.jpg

--

Like the 9002, this is a seemingly scary error with the word “corruption” in it, but when I checked the server it appeared the database was online and I saw connections to DBStuff once I signed on.

The server was throwing a series of the errors constantly in the SQL Error Log (and Windows Application Log) for more than an hour by the time it was escalated to me – constantly every minute or less!

The DBStuff database was a relatively small database (13GB) so I went to run CHECKDB against it manually from a query window to look for corruption, but I couldn’t even get that far:

--

Msg 5901, Level 16, State 1, Line 1
One or more recovery units belonging to database 'DBStuff' failed to generate a checkpoint.  This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.

Msg 1823, Level 16, State 8, Line 1
A database snapshot cannot be created because it failed to start.

Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

Msg 5030, Level 16, State 12, Line 1
The database could not be exclusively locked to perform the operation.

Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details.

Msg 9003, Level 20, State 9, Line 1
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf).
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
 --

Note that the last error in the series is again the 9003 error shown in the initial Log entry above.

Could there be corruption?

http://s2.quickmeme.com/img/a6/a646c45633cfab95a2aaf727480883aba7de652148fb3601db61853a62d3b3da.jpg

At that point I went Googling and found this from Nathan Courtine at dbi Services:


At first I dismissed the idea because it is talking about replication and I didn’t see any replication on Server1, but when I went to look at the log_reuse_wait_desc as the article mentions, sure enough I found this:

--
SELECT name, log_reuse_wait_desc 
FROM sys.databases
WHERE name = N'DBStuff'
  
name
log_reuse_wait_desc
DBStuff
REPLICATION

--

As described in the article I tried running sp_removedbreplication against DBStuff since there is no apparent replication (publications or subscriptions) but as the author notes this did not resolve the issue in my case.

The fix described in the article involves setting up a new dummy table in the database in question and creating a publication for that dummy table, which will supposedly goose the system and reset the log_reuse_wait_desc from REPLICATION to some “normal” type such as NOTHING or CHECKPOINT.  After this you can remove the replication and the dummy table and be done.

The catch for us was that the replication components weren’t installed on the Server1 instance during its original install!


Microsoft SQL Server Management Studio is unable to access replication components because replication is not installed on this instance of SQL Server. For information about installing replication, see the topic Installing Replication in SQL Server Books Online. 
Replication components are not installed on this server. Run SQL Server Setup again and select the option to install replication. (Microsoft SQL Server, Error: 21028)

Sigh...

--

After mounting media and installing the Replication components the rest was relatively easy - add a dummy table, create a publication, then drop the publication and dummy table - just as described in the referenced post above.

Eventually we determined that the DBStuff database was restored from another server where it had been involved in a replication topology, which was the source of the offending bits.

--

Hope this helps!