One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Monday, July 24, 2017

Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

https://cdn.meme.am/cache/instances/folder984/400x/59300984.jpg

There were several backup and status-check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after the failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

--
Log Name:      System
Source:        User32
Date:          7/22/2017 8:31:08 AM
Event ID:      1074
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
Computer:      NODE01.COMPANY1.com
Description:
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine
--

Always fun when an application “unexpectedly” restarts a server without advance warning.

https://i.imgflip.com/w6x8g.jpg
--

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases,” the availability group databases were not currently being backed up and hadn’t been since the failover:

NODE01:



NODE02:



The error messages in the SQL Error Log on the 01 node are telling:

--

Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    Error: 3041, Severity: 16, State: 1.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid269
Message    FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:  
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
 
-- 
Date       7/24/2017 1:09:24 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid258
Message    Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server® 8.2.1.42
© 2016 Dell Inc.
           Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
           Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:

           BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
--

 ** IMPORTANT **  - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to decide each day whether a differential is sufficient or a full backup is required), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):


--

The particular issue here is that the replicas aren’t weighted evenly, so even though NODE01 is now an “invalid” target for the differentials, the backups still prefer it because it is more heavily weighted:




--
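
For reference, the AG-level backup preference and the per-replica weights can also be checked from T-SQL - this is just a quick look at the availability group catalog views (my assumption being that the LiteSpeed “Any Replica”/weight options ultimately map to these settings):

--
SELECT AG.name AS AvailabilityGroupName,
AG.automated_backup_preference_desc AS BackupPreference,
AR.replica_server_name AS ReplicaServerName,
AR.backup_priority AS BackupPriority
FROM master.sys.availability_groups AS AG
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
ORDER BY AG.name, AR.backup_priority DESC
--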

My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.

The cost is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups running on NODE01, requiring the backups to run on the primary node would not change anything in most situations.

Note that this is the state for this particular AG - part of the appeal of AGs is often the ability to offload backups to a secondary, so in many environments this is a cost to be weighed before making the change.
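
For what it’s worth, if the same change were made at the AG level itself (whether the LiteSpeed option maps to this setting or is tracked separately is an assumption on my part), the equivalent T-SQL is a one-liner:

--
ALTER AVAILABILITY GROUP AG01
SET (AUTOMATED_BACKUP_PREFERENCE = PRIMARY)
--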

If you want to run backups on the secondary node, my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression.  If you want or need to stay with differentials, a scripted backup solution (like Ola Hallengren's Maintenance Solution) could enable or disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to skip databases that aren’t the AG primary.

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

--
SELECT AG.name AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName, 
HARS.role_desc as LocalReplicaRoleDesc, 
DRCS.database_name AS DatabaseName, 
HDRS.synchronization_state_desc as SynchronizationStateDesc, 
HDRS.is_suspended AS IsSuspended, 
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id 
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY AG.name, DRCS.database_name
--
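
As a rough sketch of that “skip databases that aren’t the AG primary” check - not LiteSpeed-specific, and assuming SQL Server 2014 or later for sys.fn_hadr_is_primary_replica - a differential backup step could bail out when the local replica isn’t primary (the database name and backup path below are just placeholders):

--
DECLARE @dbname sysname = N'AG_DATABASE22'

IF sys.fn_hadr_is_primary_replica(@dbname) = 1
OR NOT EXISTS (SELECT 1 FROM master.sys.dm_hadr_database_replica_states AS HDRS
INNER JOIN master.sys.databases AS D
ON D.database_id = HDRS.database_id
WHERE D.name = @dbname)  -- the database is not in an AG at all
BEGIN
BACKUP DATABASE @dbname
TO DISK = N'\\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.diff.bkp'
WITH DIFFERENTIAL, INIT
END
ELSE
PRINT @dbname + N' is not the AG primary here - skipping the differential backup'
--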

Another option would be to move away from differentials completely and just run FULLs and LOGs, which would allow the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.) 

--

Hope this helps!



Friday, July 21, 2017

Toolbox - Which Tables are Using All of My Space?

This is the next in a new series of posts about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA's.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

--

In the previous post in this series "Toolbox - Where Did All My Space Go?" I shared a script for finding which database files consumed the most space and which of those files had free space in them.  The next step after finding out which databases are using the space is determining which tables in those databases are occupying that space to consider purge/archive opportunities.  In many cases you will find a single large table (often called a "mother table") taking up most of the space in your database.

https://www.newslinq.com/wp-content/uploads/2014/05/table.png
(That's a *big* table!)

I found a script created by a developer from Switzerland in a question/answer on StackOverflow.com and modified it slightly to return the specifics I wanted.  Among other things I added the InstanceName and DatabaseName columns because in my job I frequently create documentation or reports for external clients who wouldn't otherwise know which instance and database a result set came from:

--
/*
Object Sizes
Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/
SELECT TOP 50
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, s.NAME AS SchemaName
, t.NAME  AS TableName
, SUM(p.rows) AS RowCounts -- summed across all of the table's indexes, so tables with nonclustered indexes can show inflated counts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'    -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY t.Name
, s.Name
ORDER BY TotalSpaceMB DESC
--

The results look like this:

--

InstanceName DatabaseName SchemaName TableName RowCounts TotalSpaceMB TotalSpaceGB UsedSpaceMB UnusedSpaceMB
Instance01 Database15 SI_USER TRANS_DATA 17730150 16263.72 15.88 16234.14 29.58
Instance01 Database15 SI_USER WORKFLOW_CONTEXT 50785680 7623.27 7.44 7622.52 0.75
Instance01 Database15 PTM EI_HISTORY_MSG 22704543 3701.59 3.61 3701.19 0.40
Instance01 Database15 SI_USER WORKFLOW_LINKAGE 72643908 2657.21 2.59 2657.06 0.15
Instance01 Database15 SI_USER CORRELATION_SET 13762284 2542.87 2.48 2542.59 0.27
Instance01 Database15 SI_USER DOCUMENT 9833616 1445.55 1.41 1445.32 0.23
Instance01 Database15 PTM EI_HISTORY_TRANDTL 30673680 1257.23 1.23 1256.99 0.24
Instance01 Database15 SI_USER ACT_SESSION_GUID 18728114 1246.77 1.22 1246.70 0.07
Instance01 Database15 SI_USER DATA_FLOW_GUID 13635838 908.08 0.89 907.98 0.10
Instance01 Database15 PTM EI_HISTORY_TRAN 18560608 699.97 0.68 699.73 0.24
Instance01 Database15 SI_USER DATA_TABLE 174048 460.91 0.45 460.45 0.46
Instance01 Database15 SI_USER DOCUMENT_EXTENSION 1630855 363.54 0.36 363.15 0.39
Instance01 Database15 SI_USER ACT_NON_XFER 1579422 284.48 0.28 284.13 0.35
Instance01 Database15 SI_USER ACT_XFER 804270 217.98 0.21 217.61 0.38
Instance01 Database15 SI_USER ACT_SESSION 1008875 209.04 0.20 208.66 0.38
Instance01 Database15 SI_USER ARCHIVE_INFO 4203976 113.34 0.11 113.10 0.24
Instance01 Database15 SI_USER WF_INST_S 1061373 101.52 0.10 101.37 0.16
Instance01 Database15 SI_USER ACT_AUTHORIZE 298908 70.27 0.07 70.11 0.16
Instance01 Database15 SI_USER DATA_FLOW 420930 56.59 0.06 56.35 0.23
Instance01 Database15 SI_USER ACT_AUTHENTICATE 269200 45.80 0.04 45.31 0.48
Instance01 Database15 SI_USER EDIINTDOC 182672 43.83 0.04 43.69 0.14
Instance01 Database15 SI_USER MSGMDNCORRELATION 74656 27.86 0.03 26.42 1.44
Instance01 Database15 SI_USER EDI_DOCUMENT_STATE 57498 22.19 0.02 18.21 3.98
Instance01 Database15 SI_USER ENVELOPE_PARMS 134691 19.50 0.02 19.34 0.16
Instance01 Database15 SI_USER SAP_TID 81598 14.13 0.01 14.00 0.13
Instance01 Database15 PTM EI_PARTNER_REPORT 74617 13.39 0.01 13.38 0.02
Instance01 Database15 SI_USER EDI_ELEMENT_CODES 89583 10.63 0.01 10.55 0.08
Instance01 Database15 SI_USER EDI_COMPLIANCE_RPT 37500 10.23 0.01 9.52 0.70
Instance01 Database15 SI_USER DOCUMENT_LIFESPAN 29454 10.10 0.01 8.44 1.66
Instance01 Database15 SI_USER ACTIVITY_INFO 43269 6.00 0.01 5.70 0.30
Instance01 Database15 SI_USER YFS_USER_ACT_AUDIT 14025 4.90 0.00 4.18 0.72
Instance01 Database15 SI_USER BPMV_LS_WRK2 19110 4.20 0.00 2.70 1.50
Instance01 Database15 SI_USER CODELIST_XREF_ITEM 14332 3.50 0.00 2.76 0.74
Instance01 Database15 SI_USER DOC_STATISTICS 8948 3.43 0.00 3.01 0.42
Instance01 Database15 SI_USER MAP 4848 2.37 0.00 2.26 0.11
Instance01 Database15 SI_USER YFS_RESOURCE 5628 1.98 0.00 1.82 0.16
Instance01 Database15 SI_USER MDLR_PAL_ITEM_DESC 4767 1.38 0.00 1.35 0.03
Instance01 Database15 SI_USER SERVICE_PARM_LIST 6290 1.28 0.00 1.12 0.16
Instance01 Database15 SI_USER ADMIN_AUDIT 2100 1.15 0.00 0.76 0.39
Instance01 Database15 SI_USER DOC_STAT_KEY 2860 1.08 0.00 0.82 0.26
Instance01 Database15 SI_USER MAPPER_ERL_XREF 4317 1.07 0.00 1.02 0.05
Instance01 Database15 PTM EI_MSG_PROFILE 4830 0.89 0.00 0.76 0.13
Instance01 Database15 SI_USER WF_INST_S_WRK 2708 0.86 0.00 0.78 0.08
Instance01 Database15 SI_USER YFS_ORGANIZATION 909 0.84 0.00 0.38 0.46
Instance01 Database15 SI_USER YFS_STATISTICS_DETAIL 792 0.83 0.00 0.48 0.35
Instance01 Database15 SI_USER RESOURCE_CHECKSUM 5827 0.76 0.00 0.73 0.02
Instance01 Database15 SI_USER YFS_RESOURCE_PERMISSION 1611 0.73 0.00 0.56 0.16
Instance01 Database15 PTM EI_COMM_PROFILE_AUDIT 1512 0.72 0.00 0.63 0.09
Instance01 Database15 SI_USER WFD 3406 0.72 0.00 0.63 0.09
Instance01 Database15 SI_USER XMLSCHEMAS 1678 0.70 0.00 0.69 0.01

--

This allows you to easily find the largest tables (and you can modify the ORDER BY to find the tables with the most free space, which helps spot inefficient indexing or design).
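
For example, a compact variant of the same query that surfaces the tables carrying the most unused space (my own reshuffling of the script above, not part of the original) looks like this:

--
SELECT TOP 20
s.NAME AS SchemaName
, t.NAME AS TableName
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY t.Name
, s.Name
ORDER BY UnusedSpaceMB DESC
--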

Once you have the largest tables in hand you have the starting point for a discussion on potential purges or archives.

Hope this helps!


Tuesday, July 11, 2017

T-SQL Tuesday #92 - Trust But Verify


It's T-SQL Tuesday time again, and this month the host is Raul Gonzalez (blog/@SQLDoubleG).  His chosen topic is Lessons Learned the Hard Way. (Everybody should have a story on this one, right?)

When considering this topic, the thing that spoke most to me was a broad generalization (and no, it isn't #ItDepends):

https://cdn.meme.am/cache/instances/folder589/500x/64701589/grumpy-cat-5-trust-but-verify.jpg

Yes, I am old enough (full disclosure) to know that many people associate the phrase "Trust But Verify" with a certain Republican President of the United States, but it really is an important part of what we do as DBAs.

When a peer sends you code to review, do you just trust it's OK?

https://cdn.meme.am/cache/instances/folder809/66278809.jpg

When a client says they're backing up their databases, do you trust them?

https://cdn.meme.am/cache/instances/folder686/500x/64677686/willy-wonka-trust-but-verify.jpg

When *SQL SERVER ITSELF* says it's backing up your databases, do you trust it?

https://imgflip.com/i/1sc4nl

(Catching a theme yet?)

At the end of the day, as the DBA we are the Default Blame Acceptors, which means whatever happens, we are Guilty Until Proven Innocent - and even then we're guilty half the time.

This isn't about paranoia, it's about being thorough and doing your job - making sure the systems stay up and the customers stay happy.
  • Verify your backup jobs are running, and then run test restores to make sure the backup files are good (a quick msdb sanity check appears after this list).
  • Verify your SQL Server services are running (when they're supposed to be)
  • Verify your databases are online (again, when they're supposed to be)
  • Verify your logins and users have the appropriate security access (#SysadminIsNotAnAppropriateDefault)
  • Read through your code and that of your peers - don't just trust the syntax checker!
  • When you find code online in a blog or a Microsoft forum, read through it and check it - don't just blindly execute it because it was written by a Microsoft employee or an MVP - they're all human too! (This almost bit me once on an old forum post where thankfully I did read through the code before I pushed Execute - the post was two years old with hundreds of reads so it would be fine, right?  There was a pretty simple mistake in the WHERE clause and none of the hundreds of readers had seen it or been polite enough to point it out to the author!)
  • Run periodic checks to make sure all of your Perfmon traces (you are running Perfmon traces, right?), Windows Tasks, SQL Agent Jobs, XEvents sessions, etc. are installed properly and configured with the current settings - just because the instance has a job named "Online Status Check" on it doesn't mean the version is anything like your current code!
...and so many, many, many more.
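
As one concrete example for the first bullet above, a quick sanity check against msdb shows the most recent backup of each type per database - "the job exists" is not the same as "the backups are happening" (and even then, only a test restore proves the files are good):

--
SELECT D.name AS DatabaseName,
BS.type AS BackupType,  -- D = full, I = differential, L = log
MAX(BS.backup_finish_date) AS LastBackupFinish
FROM master.sys.databases AS D
LEFT OUTER JOIN msdb.dbo.backupset AS BS
ON BS.database_name = D.name
WHERE D.name <> N'tempdb'
GROUP BY D.name, BS.type
ORDER BY D.name, BS.type
--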

My belief has always been that people are generally good - contrary to popular belief among some of my peers, developers/managers/QA staff/etc. are not inherently bad people - but they are people.  Sometimes they make mistakes (we all do).  Sometimes they don't have a full grasp of our DBA-speak (just like I don't understand a lot of Developer-Speak - I definitely don't speak C#) - and that makes it our responsibility as DBAs, the subject matter experts in this area, to make sure things are covered.

As I said above, this isn't about paranoia - you do need to entrust other team members with shared responsibilities and division of labor, because as Yoda says:

http://blog.idonethis.com/wp-content/uploads/2016/04/yoda.jpg

...but when the buck stops with you (as it so often does) and it is within your power to check something - do it!

Hope this helps!

Monday, July 3, 2017

Toolbox - Where Did All My Space Go?

This is the first in a new series of posts about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA's.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

How do I find these scripts, you may ask?


Google is still the top tool in my toolbox - it never ceases to amaze me what you can find by Googling an error number, a snippet of a message, or simply an object name.

(Disclaimer - I use Google out of familiarity, but on the occasions I have used Bing the results haven't been much different.)

--

The first script I want to share comes from MSSQLTips.com and was created by frequent MSSQLTips author Ken Simmons (Blog/@KenSimmons) - it shows the free space in each of the database files on the instance and by default comes back sorted by free space descending:

--
/*
https://www.mssqltips.com/sqlservertip/1510/script-to-determine-free-space-to-support-shrinking-sql-server-database-files/
*/
USE MASTER
GO 
 
CREATE TABLE #TMPFIXEDDRIVES (
DRIVE CHAR(1),
MBFREE INT) 
 
INSERT INTO #TMPFIXEDDRIVES
EXEC xp_FIXEDDRIVES 
 
CREATE TABLE #TMPSPACEUSED (
DBNAME VARCHAR(500),
FILENME VARCHAR(500),
SPACEUSED FLOAT) 
 
INSERT INTO #TMPSPACEUSED
EXEC( 'sp_msforeachdb''use [?]; Select ''''?'''' DBName, Name FileNme,
fileproperty(Name,''''SpaceUsed'''') SpaceUsed from sysfiles''') 
 
SELECT
C.DRIVE,
A.NAME AS DATABASENAME,
B.NAME AS FILENAME,
CASE B.TYPE
WHEN 0 THEN 'DATA'
ELSE TYPE_DESC
END AS FILETYPE,
CASE
WHEN (B.SIZE * 8 / 1024.0) > 1000
THEN CAST(CAST(((B.SIZE * 8 / 1024) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((B.SIZE * 8 / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS FILESIZE,
CASE
WHEN (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) > 1000
THEN CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0)) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0))) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS SPACEFREE,
B.PHYSICAL_NAME
FROM SYS.DATABASES A
JOIN SYS.MASTER_FILES B
ON A.DATABASE_ID = B.DATABASE_ID
JOIN #TMPFIXEDDRIVES C
ON LEFT(B.PHYSICAL_NAME,1) = C.DRIVE
JOIN #TMPSPACEUSED D
ON A.NAME = D.DBNAME
AND B.NAME = D.FILENME
WHERE a.database_id>4
--and DRIVE = 'F'
ORDER BY (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) DESC

DROP TABLE #TMPFIXEDDRIVES 
 
DROP TABLE #TMPSPACEUSED
--

The result set displays like this:


--

I use this all of the time to troubleshoot space issues - not only to find out what is taking up space but also to see if there are files with excessive empty space that can be examined for cleanup.

http://s2.quickmeme.com/img/16/167911b7d66e2736356e5680d23bc9d5111120af5aada10bdb714c70874cc3b1.jpg

Do not willy-nilly shrink files - it causes all kinds of problems with file fragmentation and potentially with performance - but there are times when you just don't have any choice, and this is an easy way to find candidates.
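
When you truly have no choice, a minimal sketch of a targeted shrink looks like this (the database and logical file names are hypothetical - check sys.database_files for the real logical names first, and shrink to a specific target size rather than to zero):

--
USE Database15
GO

-- logical file names and current sizes (size is stored in 8KB pages)
SELECT name, type_desc, size/128.0 AS SizeMB
FROM sys.database_files

-- shrink one specific file to a target size in MB
DBCC SHRINKFILE (N'Database15_Data2', 2048)
--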

--

Hope this helps!

Wednesday, June 21, 2017

9003 is a Scary Number


Recently I moved up to the next consecutive error number, a 9003 error:

-- 
Log Name:      Application
Source:        MSSQLSERVER
Date:          6/15/2017 4:49:59 AM
Event ID:      9003
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Server1.domain.net
Description:
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
 
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). 
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

--

<Cue Scary Music>

http://www.scaryforkids.com/pics/scary-music.jpg

--

Like the 9002, this is a seemingly scary error with the word “corruption” in it, but when I checked the server the database appeared to be online, and I saw active connections to DBStuff once I signed on.

The server had been throwing these errors constantly in the SQL Error Log (and Windows Application Log) for more than an hour by the time it was escalated to me – at least one every minute!

The DBStuff database was a relatively small database (13GB) so I went to run CHECKDB against it manually from a query window to look for corruption, but I couldn’t even get that far:

--

Msg 5901, Level 16, State 1, Line 1
One or more recovery units belonging to database 'DBStuff' failed to generate a checkpoint.  This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.

Msg 1823, Level 16, State 8, Line 1
A database snapshot cannot be created because it failed to start.

Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

Msg 5030, Level 16, State 12, Line 1
The database could not be exclusively locked to perform the operation.

Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details.

Msg 9003, Level 20, State 9, Line 1
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf).
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
 --

Note that the last error in the series is again the 9003 error shown in the initial Log entry above.

Could there be corruption?

http://s2.quickmeme.com/img/a6/a646c45633cfab95a2aaf727480883aba7de652148fb3601db61853a62d3b3da.jpg

At that point I went Googling and found this from Nathan Courtine at dbi Services:


At first I dismissed the idea because the post is about replication and I didn’t see any replication on Server1, but when I checked the log_reuse_wait_desc as the article suggests, sure enough I found this:

--
SELECT name, log_reuse_wait_desc 
FROM sys.databases
WHERE name = N'DBStuff'
  
name        log_reuse_wait_desc
DBStuff     REPLICATION

--

As described in the article, I tried running sp_removedbreplication against DBStuff since there was no apparent replication (no publications or subscriptions), but as the author notes, this did not resolve the issue in my case.

The fix described in the article involves setting up a new dummy table in the database in question and creating a publication for that dummy table, which will supposedly goose the system and reset the log_reuse_wait_desc from REPLICATION to some “normal” type such as NOTHING or CHECKPOINT.  After this you can remove the replication and the dummy table and be done.

The catch for us was that the replication components weren’t installed on the Server1 instance during its original install!


Microsoft SQL Server Management Studio is unable to access replication components because replication is not installed on this instance of SQL Server. For information about installing replication, see the topic Installing Replication in SQL Server Books Online. 
Replication components are not installed on this server. Run SQL Server Setup again and select the option to install replication. (Microsoft SQL Server, Error: 21028)

Sigh...

--

After mounting media and installing the replication components, the rest was relatively easy - add a dummy table, create a publication, then drop the publication and the dummy table - just as described in the referenced post above.
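
For anyone who hits the same thing, here is a rough sketch of that sequence - the table and publication names are made up, the exact parameters are my own reading rather than anything from the referenced post, and a configured distributor is assumed:

--
USE DBStuff
GO

-- enable the database for publishing (requires the replication components)
EXEC sp_replicationdboption @dbname = N'DBStuff', @optname = N'publish', @value = N'true'

-- dummy table plus a throwaway publication/article
CREATE TABLE dbo.ReplDummy (ID int NOT NULL PRIMARY KEY)

EXEC sp_addpublication @publication = N'ReplDummyPub', @status = N'active'
EXEC sp_addarticle @publication = N'ReplDummyPub', @article = N'ReplDummy',
@source_owner = N'dbo', @source_object = N'ReplDummy'

-- tearing it all back down is what resets log_reuse_wait_desc
EXEC sp_droppublication @publication = N'ReplDummyPub'
DROP TABLE dbo.ReplDummy
EXEC sp_replicationdboption @dbname = N'DBStuff', @optname = N'publish', @value = N'false'
--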

Eventually we determined that the DBStuff database had been restored from another server where it had been involved in a replication topology, which was the source of the offending bits.

--

Hope this helps!


Friday, May 5, 2017

It's Just Another 9002...another error like all the rest...


11:30pm on a Saturday night, and the pager went off…

Error: 5901, Severity: 16, State: 1.
One or more recovery units belonging to database 'i3_dialer' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

That phrase – “due to database corruption” – always makes my stomach turn just a little…

https://img.memesuper.com/9b7a67d76f5de79439a92cc9ad703dda_im-coming-elizabeth-its-the-im-coming-elizabeth-meme_500-384.jpeg

Thankfully, this turned out to *not* be corruption. <Whew!>

There was a related message:

Could not write a checkpoint record in database DB_1 because the log is out of space. Contact the database administrator to truncate the log or allocate more space to the database log files.

…and then the real problem…

Error: 9002, Severity: 17, State: 9.
The transaction log for database DB_1 is full due to 'AVAILABILITY_REPLICA'.

--

Upon cursory examination, DB_1 was suspended (not moving data) in Sales_AG, with the AG being primary on node ServerA.  The other databases in the AG were all transferring data without apparent issue.

I went to check the drive hosting the LDF log file and almost broke out laughing at midnight – it was a mount point that was 367MB.

That’s right, megabytes.

The DB_1 database was just over 1GB, with a 350MB LDF file.

Since this was an Availability Group there was no shared storage, so I went to the B node to see if something was sitting on the drive on that side – some extra file taking up space and preventing the log file from auto-growing – but I didn’t find one.

--

The regular Database Index Maintenance instigated this situation.  The largest object (table) in DB_1 is only 69MB, but the transaction LOG backup job only runs every fifteen minutes.  In a database this small it is quite possible for the index rebuilds to cycle through all of the tables in the database between LOG backups, which can drive the LDF’s used space out toward the used space of the MDF data file - in this case about 1GB.

In this specific situation, the LDF log file grew by 10% repeatedly until it hit 350MB.  At that point, there wasn’t room for another 10%/35MB so it couldn’t grow any more.

On a regular (non-availability-group) database this would manifest as 9002 “log full” errors with a log_reuse_wait_desc of LOG_BACKUP – the most likely fix would be to take LOG backups, shrink the LDF log file if needed, and be done.  That was not the case here.

(Note that the log_reuse_wait_desc here wasn’t LOG_BACKUP, it was AVAILABILITY_REPLICA.  This shows that something was breaking the situation from the AG side before the lack of LOG_BACKUP could even be raised as a problem, but I didn’t notice that at the time.)

--

I added a second LDF log file (not a great answer but often a useful triage to get things moving) and that allowed DB_1 to function, but it did not allow the availability group data movement to resume.  To try to repair the situation, I removed DB_1 from the Sales_AG availability group.  (Note that this does not prevent applications from accessing the database, but rather it would not allow DB_1 to fail over to the B node if a failover occurred – but in the current state where data movement was suspended and non-resumable, a failover wouldn’t work anyway.)
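
A minimal sketch of those two triage steps (run on the primary, ServerA - the logical file name, path, and sizes here are hypothetical):

--
-- temporary second log file so DB_1 can keep writing
ALTER DATABASE DB_1
ADD LOG FILE (NAME = N'DB_1_log2',
FILENAME = N'D:\TempLog\DB_1_log2.ldf',
SIZE = 100MB, FILEGROWTH = 10MB)

-- remove the database from the availability group (it stays online locally)
ALTER AVAILABILITY GROUP Sales_AG REMOVE DATABASE DB_1
--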

I ran the LOG backup job manually multiple times and then tried to re-add the DB_1 database to the availability group – that action gave me a telling message that there wasn’t room on the B node to restore the LDF log file:

System.Data.SqlClient.SqlError: There is insufficient free space on disk volume 'X:\LOG' to create the database. The database requires 368050176 additional free bytes, while only 356515840 bytes are available.

What I found surprised me even more than finding a 367MB mount point on ServerA in the first place – the mount point on ServerB was only 343MB!

The availability group is what made this case especially nasty – by growing to 350MB, the LDF log file was now larger than the 343MB mount point on the B node.  This automatically suspended data movement for the DB_1 database (not the whole availability group – just the single database) while it waited for sufficient space to grow the LDF on the secondary node.

I have dealt with lots of clustering/mirroring/AGs over the years, and I have never tripped over this particular issue before.  With traditional failover clustering you have shared storage, so this isn’t an issue; with a “no shared storage” design, this is a real risk for both mirroring and AGs, and in this case it was exacerbated by the tiny size of the mount.

I was able to shrink the LDF log file of DB_1 to 310MB, which fit inside the 343MB mount point on the B node.  At that point I was able to re-add DB_1 to Sales_AG, and after it was successfully back in the availability group the database showed in the monitor as successfully moving data from A>>B.
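
A sketch of those recovery steps (the logical log file name is again hypothetical):

--
-- on the primary (ServerA): shrink the log back under the smaller mount point
USE DB_1
GO
DBCC SHRINKFILE (N'DB_1_log', 310)   -- target size in MB

-- re-add the database to the availability group on the primary...
ALTER AVAILABILITY GROUP Sales_AG ADD DATABASE DB_1

-- ...and, if the copy on ServerB is still in RESTORING state, re-join it there with:
-- ALTER DATABASE DB_1 SET HADR AVAILABILITY GROUP = Sales_AG
--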

I removed the second LDF log file that I had added, since running with multiple log files is not recommended as a regular operating state. #TheMoreYouKnow

As a final step I shrank the primary LDF log file a little further (shrinking is not great but is sometimes necessary) and then modified the FILEGROWTH for the file to grow in fixed 10MB increments (rather than by a percentage).  I then manually capped the MAXSIZE of the file at 335MB so that if the file did fill again, it wouldn’t break the availability group in this same way.  The database would still throw 9002s, but they would be for LOG_BACKUP, not AVAILABILITY_REPLICA.
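
The corresponding file changes look roughly like this (hypothetical logical file name again):

--
ALTER DATABASE DB_1
MODIFY FILE (NAME = N'DB_1_log', FILEGROWTH = 10MB, MAXSIZE = 335MB)
--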

I warned the client that they really needed to expand the mount points a little (storage is cheap, right?) to cover the needs of the index maintenance and to make the two sides match.  I also noted that once a mount point is expanded, the manual cap will need to be lifted before the DB_1 LDF log file can take advantage of any extra space.

--

The cautionary tale here is about the mismatched storage – as unlikely as it may seem to have 343MB and 367MB mount points, the exact same thing could happen on a system with 343GB and 367GB mounts and a file trying to grow past the smaller of the two.

Configurations on AG partners don’t have to be exactly the same, but they should be darned close…and for storage – just make them the same, and save yourself this particular headache.

http://img01.deviantart.net/668d/i/2004/260/e/c/headache_by_cat_lovers.jpg


Hope this helps!