One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Monday, July 24, 2017

Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

https://cdn.meme.am/cache/instances/folder984/400x/59300984.jpg

There were several different backup and status check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after a failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

--
Log Name:      System
Source:        User32
Date:          7/22/2017 8:31:08 AM
Event ID:      1074
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
Computer:      NODE01.COMPANY1.com
Description:
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine
--

Always fun when an application “unexpectedly” restarts a server without advance warning.

https://i.imgflip.com/w6x8g.jpg
--

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases” the availability group databases were not being backed up currently and hadn’t been since the failover:

NODE01:



NODE02:



The error messages on the SQL Error Log on the 01 node are telling:

--

Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    Error: 3041, Severity: 16, State: 1.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid269
Message    FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:  
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
 
-- 
Date       7/24/2017 1:09:24 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid258
Message    Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server® 8.2.1.42
© 2016 Dell Inc.
           Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
           Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:

           BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
--

 ** IMPORTANT **  - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to determine whether a differential is sufficient or a full backup is required each day), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):


--

The particular issue here is the fact that the AG’s aren’t weighted evenly, so even though it is “invalid” for the differentials, the AG still prefers NODE01 because it is more heavily weighted:




--

My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.

The cost to this is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups on NODE01 then requiring the backups to run on the primary node would not be different in most situations.

Note this is the state for this particular AG - in many cases part of the use of AG's is being able to offload backups to the secondary, so in many cases this is a cost to be weighed before making this change.

If you want to run backups on the secondary node then my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression – if you want/need to stay with differentials, we could use a scripted backup solution (like Ola Hallengren's Maintenance Solution) to enable/disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to ignore databases that aren’t the AG primary.  

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

--
SELECT AG.name AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName, 
HARS.role_desc as LocalReplicaRoleDesc, 
DRCS.database_name AS DatabaseName, 
HDRS.synchronization_state_desc as SynchronizationStateDesc, 
HDRS.is_suspended AS IsSuspended, 
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id 
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY AG.name, DRCS.database_name
--

Another option would be to go away from differentials completely and just run FULL’s and LOG’s which would allow for the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.) 

--

Hope this helps!



Friday, July 21, 2017

Toolbox - Which Tables are Using All of My Space?

This is the next in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

--

In the previous post in this series "Toolbox - Where Did All My Space Go?" I shared a script for finding which database files consumed the most space and which of those files had free space in them.  The next step after finding out which databases are using the space is determining which tables in those databases are occupying that space to consider purge/archive opportunities.  In many cases you will find a single large table (often called a "mother table") taking up most of the space in your database.

https://www.newslinq.com/wp-content/uploads/2014/05/table.png
(That's a *big* table!)

I found a script created by a developer from Switzerland in a question/answer on StackOverflow.com and modified it slightly to return the specifics I wanted.  Among other things I added the InstanceName and DatabaseName because in my job I frequently create documentation or reports for external clients who don't necessarily know the result set came from a particular instance and a particular database:

--
/*
Object Sizes
Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/
SELECT TOP 50
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, s.NAME AS SchemaName
, t.NAME  AS TableName
, SUM(p.rows) AS RowCounts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'    -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY t.Name
, s.Name
ORDER BY TotalSpaceMB DESC
--

The results look like this:

--

InstanceName DatabaseName SchemaName TableName RowCounts TotalSpaceMB TotalSpaceGB UsedSpaceMB UnusedSpaceMB
Instance01 Database15 SI_USER TRANS_DATA 17730150 16263.72 15.88 16234.14 29.58
Instance01 Database15 SI_USER WORKFLOW_CONTEXT 50785680 7623.27 7.44 7622.52 0.75
Instance01 Database15 PTM EI_HISTORY_MSG 22704543 3701.59 3.61 3701.19 0.40
Instance01 Database15 SI_USER WORKFLOW_LINKAGE 72643908 2657.21 2.59 2657.06 0.15
Instance01 Database15 SI_USER CORRELATION_SET 13762284 2542.87 2.48 2542.59 0.27
Instance01 Database15 SI_USER DOCUMENT 9833616 1445.55 1.41 1445.32 0.23
Instance01 Database15 PTM EI_HISTORY_TRANDTL 30673680 1257.23 1.23 1256.99 0.24
Instance01 Database15 SI_USER ACT_SESSION_GUID 18728114 1246.77 1.22 1246.70 0.07
Instance01 Database15 SI_USER DATA_FLOW_GUID 13635838 908.08 0.89 907.98 0.10
Instance01 Database15 PTM EI_HISTORY_TRAN 18560608 699.97 0.68 699.73 0.24
Instance01 Database15 SI_USER DATA_TABLE 174048 460.91 0.45 460.45 0.46
Instance01 Database15 SI_USER DOCUMENT_EXTENSION 1630855 363.54 0.36 363.15 0.39
Instance01 Database15 SI_USER ACT_NON_XFER 1579422 284.48 0.28 284.13 0.35
Instance01 Database15 SI_USER ACT_XFER 804270 217.98 0.21 217.61 0.38
Instance01 Database15 SI_USER ACT_SESSION 1008875 209.04 0.20 208.66 0.38
Instance01 Database15 SI_USER ARCHIVE_INFO 4203976 113.34 0.11 113.10 0.24
Instance01 Database15 SI_USER WF_INST_S 1061373 101.52 0.10 101.37 0.16
Instance01 Database15 SI_USER ACT_AUTHORIZE 298908 70.27 0.07 70.11 0.16
Instance01 Database15 SI_USER DATA_FLOW 420930 56.59 0.06 56.35 0.23
Instance01 Database15 SI_USER ACT_AUTHENTICATE 269200 45.80 0.04 45.31 0.48
Instance01 Database15 SI_USER EDIINTDOC 182672 43.83 0.04 43.69 0.14
Instance01 Database15 SI_USER MSGMDNCORRELATION 74656 27.86 0.03 26.42 1.44
Instance01 Database15 SI_USER EDI_DOCUMENT_STATE 57498 22.19 0.02 18.21 3.98
Instance01 Database15 SI_USER ENVELOPE_PARMS 134691 19.50 0.02 19.34 0.16
Instance01 Database15 SI_USER SAP_TID 81598 14.13 0.01 14.00 0.13
Instance01 Database15 PTM EI_PARTNER_REPORT 74617 13.39 0.01 13.38 0.02
Instance01 Database15 SI_USER EDI_ELEMENT_CODES 89583 10.63 0.01 10.55 0.08
Instance01 Database15 SI_USER EDI_COMPLIANCE_RPT 37500 10.23 0.01 9.52 0.70
Instance01 Database15 SI_USER DOCUMENT_LIFESPAN 29454 10.10 0.01 8.44 1.66
Instance01 Database15 SI_USER ACTIVITY_INFO 43269 6.00 0.01 5.70 0.30
Instance01 Database15 SI_USER YFS_USER_ACT_AUDIT 14025 4.90 0.00 4.18 0.72
Instance01 Database15 SI_USER BPMV_LS_WRK2 19110 4.20 0.00 2.70 1.50
Instance01 Database15 SI_USER CODELIST_XREF_ITEM 14332 3.50 0.00 2.76 0.74
Instance01 Database15 SI_USER DOC_STATISTICS 8948 3.43 0.00 3.01 0.42
Instance01 Database15 SI_USER MAP 4848 2.37 0.00 2.26 0.11
Instance01 Database15 SI_USER YFS_RESOURCE 5628 1.98 0.00 1.82 0.16
Instance01 Database15 SI_USER MDLR_PAL_ITEM_DESC 4767 1.38 0.00 1.35 0.03
Instance01 Database15 SI_USER SERVICE_PARM_LIST 6290 1.28 0.00 1.12 0.16
Instance01 Database15 SI_USER ADMIN_AUDIT 2100 1.15 0.00 0.76 0.39
Instance01 Database15 SI_USER DOC_STAT_KEY 2860 1.08 0.00 0.82 0.26
Instance01 Database15 SI_USER MAPPER_ERL_XREF 4317 1.07 0.00 1.02 0.05
Instance01 Database15 PTM EI_MSG_PROFILE 4830 0.89 0.00 0.76 0.13
Instance01 Database15 SI_USER WF_INST_S_WRK 2708 0.86 0.00 0.78 0.08
Instance01 Database15 SI_USER YFS_ORGANIZATION 909 0.84 0.00 0.38 0.46
Instance01 Database15 SI_USER YFS_STATISTICS_DETAIL 792 0.83 0.00 0.48 0.35
Instance01 Database15 SI_USER RESOURCE_CHECKSUM 5827 0.76 0.00 0.73 0.02
Instance01 Database15 SI_USER YFS_RESOURCE_PERMISSION 1611 0.73 0.00 0.56 0.16
Instance01 Database15 PTM EI_COMM_PROFILE_AUDIT 1512 0.72 0.00 0.63 0.09
Instance01 Database15 SI_USER WFD 3406 0.72 0.00 0.63 0.09
Instance01 Database15 SI_USER XMLSCHEMAS 1678 0.70 0.00 0.69 0.01

--

This allows you to easily find the largest tables (you can modify the ORDER BY to find the tables with the most free space as well to look for inefficient indexing or design).

Once you have the largest tables in hand you have the starting point for a discussion on potential purges or archives.

Hope this helps!


Tuesday, July 11, 2017

T-SQL Tuesday #92 - Trust But Verify


It's T-SQL Tuesday time again, and this month the host is Raul Gonzalez (blog/@SQLDoubleG).  His chosen topic is Lessons Learned the Hard Way. (Everybody should have a story on this one, right?)

When considering this topic the thing that spoke most to me was a broad generalization (and no, it isn't #ItDepends)

https://cdn.meme.am/cache/instances/folder589/500x/64701589/grumpy-cat-5-trust-but-verify.jpg

Yes I am old enough (Full Disclosure) to know many people associate the phrase "Trust But Verify" with a certain Republican President of the United States, but it really is an important part of what we do as DBA's.

When a peer sends you code to review, do you just trust it's OK?

https://cdn.meme.am/cache/instances/folder809/66278809.jpg

When a client says they're backing up their databases, do you trust them?

https://cdn.meme.am/cache/instances/folder686/500x/64677686/willy-wonka-trust-but-verify.jpg

When *SQL SERVER ITSELF* says it's backing up your databases, do you trust it?

https://imgflip.com/i/1sc4nl

(Catching a theme yet?)

At the end of the day, as the DBA we are the Default Blame Acceptors, which means whatever happens, we are Guilty Until Proven Innocent - and even then we're guilty half the time.

This isn't about paranoia, it's about being thorough and doing your job - making sure the systems stay up and the customers stay happy.
  • Verify your backup jobs are running, and then run test restores to make sure the backup files are good.
  • Verify your SQL Server services are running (when they're supposed to be)
  • Verify your databases are online (again, when they're supposed to be)
  • Verify your logins and users have the appropriate security access (#SysadminIsNotAnAppropriateDefault)
  • Read through your code and that of your peers - don't just trust the syntax checker!
  • When you find code online in a blog or a Microsoft forum, read through it and check it - don't just blindly execute it because it was written by a Microsoft employee or an MVP - they're all human too! (This almost bit me once on an old forum post where thankfully I did read through the code before I pushed Execute - the post was two years old with hundreds of reads so it would be fine, right?  There was a pretty simple mistake in the WHERE clause and none of the hundreds of readers had seen it or been polite enough to point it out to the author!)
  • Run periodic checks to make sure all of your Perfmon traces (you are running Perfmon traces, right?), Windows Tasks, SQL Agent Jobs, XEvents sessions, etc. are installed properly and configured with the current settings - just because the instance has a job named "Online Status Check" on it doesn't mean the version is anything like your current code!
...and so many, many, many more.

My belief has always been that people are generally good - contrary to popular belief among some of my peers, developers/managers/QA staff/etc. are not inherently bad people - but they are people.  Sometimes they make mistakes (we all do).  Sometimes, they don't have a full grasp of our DBA-speak (just like I don't understand a lot of Developer-Speak - I definitely don't speak C#) - and that makes it our responsibility to make sure things are covered as a DBA - we are the subject matter expert in this area, so we need to make sure it's covered.

As I said above, this isn't about paranoia - you do need to entrust other team members with shared responsibilities and division of labor, because as Yoda says:

http://blog.idonethis.com/wp-content/uploads/2016/04/yoda.jpg

...but when the buck stops with you (as it so often does) and it is within your power to check something - do it!

Hope this helps!

Monday, July 3, 2017

Toolbox - Where Did All My Space Go?

This is the first in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

How do I find these scripts, you may ask?


Google is still the top tool in my toolbox - it never ceases to amaze me what you can find by Googling an error number, or the snippet of a message, or simply an object name.

(Disclaimer - I use Google from familiarity but the occasions I have used Bing haven't shown much different results.)

--

The first script I want to share comes from MSSQLTips.com and was created by frequent MSSQLTip author Ken Simmons (Blog/@KenSimmons) - it shows the free space in each of the database files on the instance and default comes back sorted by free space descending:

--
/*
https://www.mssqltips.com/sqlservertip/1510/script-to-determine-free-space-to-support-shrinking-sql-server-database-files/
*/
USE MASTER
GO 
 
CREATE TABLE #TMPFIXEDDRIVES (
DRIVE CHAR(1),
MBFREE INT) 
 
INSERT INTO #TMPFIXEDDRIVES
EXEC xp_FIXEDDRIVES 
 
CREATE TABLE #TMPSPACEUSED (
DBNAME VARCHAR(500),
FILENME VARCHAR(500),
SPACEUSED FLOAT) 
 
INSERT INTO #TMPSPACEUSED
EXEC( 'sp_msforeachdb''use [?]; Select ''''?'''' DBName, Name FileNme,
fileproperty(Name,''''SpaceUsed'''') SpaceUsed from sysfiles''') 
 
SELECT
C.DRIVE,
A.NAME AS DATABASENAME,
B.NAME AS FILENAME,
CASE B.TYPE
WHEN 0 THEN 'DATA'
ELSE TYPE_DESC
END AS FILETYPE,
CASE
WHEN (B.SIZE * 8 / 1024.0) > 1000
THEN CAST(CAST(((B.SIZE * 8 / 1024) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((B.SIZE * 8 / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS FILESIZE,
CASE
WHEN (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) > 1000
THEN CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0)) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0))) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS SPACEFREE,
B.PHYSICAL_NAME
FROM SYS.DATABASES A
JOIN SYS.MASTER_FILES B
ON A.DATABASE_ID = B.DATABASE_ID
JOIN #TMPFIXEDDRIVES C
ON LEFT(B.PHYSICAL_NAME,1) = C.DRIVE
JOIN #TMPSPACEUSED D
ON A.NAME = D.DBNAME
AND B.NAME = D.FILENME
WHERE a.database_id>4
--and DRIVE = 'F'
ORDER BY (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) DESC

DROP TABLE #TMPFIXEDDRIVES 
 
DROP TABLE #TMPSPACEUSED
--

The result set displays like this:


--

I use this all of the time to troubleshoot space issues - not only to find out what is taking up space but also to see if there are files with excessive empty space that can be examined for cleanup.

http://s2.quickmeme.com/img/16/167911b7d66e2736356e5680d23bc9d5111120af5aada10bdb714c70874cc3b1.jpg

Do not willy-nilly shrink files - it causes all kinds of problems with file fragmentation and potentially with performance - but there are times where you just don't have any choice, and this is an easy way to find candidates.

--

Hope this helps!