Nebraska SQL from @DBA_ANDY: July 2017

Monday, July 24, 2017

Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

https://cdn.meme.am/cache/instances/folder984/400x/59300984.jpg

There were several different backup and status check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after a failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

Log Name: System
Source: User32
Date: 7/22/2017 8:31:08 AM
Event ID: 1074
Task Category: None
Level: Information
Keywords: Classic
User: SYSTEM
Computer: NODE01.COMPANY1.com
Description:
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine

Always fun when an application “unexpectedly” restarts a server without advance warning.

https://i.imgflip.com/w6x8g.jpg

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases” the availability group databases were not being backed up currently and hadn’t been since the failover:

NODE01:

NODE02:

The error messages on the SQL Error Log on the 01 node are telling:

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source Backup
Message Error: 3041, Severity: 16, State: 1.

--

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source Backup
Message BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.

--

Date 7/24/2017 1:09:23 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source spid269
Message FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.

--

Date 7/24/2017 1:09:24 AM
Log SQL Server (Current - 7/24/2017 12:59:00 PM)
Source spid258
Message Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server® 8.2.1.42
© 2016 Dell Inc.
Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.

** IMPORTANT ** - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to determine whether a differential is sufficient or a full backup is required each day), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):

https://sqlsailor.com/2012/05/15/backups-on-secondary-replicas-always-on-availability-groups/

The particular issue here is the fact that the AG’s aren’t weighted evenly, so even though it is “invalid” for the differentials, the AG still prefers NODE01 because it is more heavily weighted:

My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.

The cost to this is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups on NODE01 then requiring the backups to run on the primary node would not be different in most situations.

Note this is the state for this particular AG - in many cases part of the use of AG's is being able to offload backups to the secondary, so in many cases this is a cost to be weighed before making this change.

If you want to run backups on the secondary node then my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression – if you want/need to stay with differentials, we could use a scripted backup solution (like Ola Hallengren's Maintenance Solution) to enable/disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to ignore databases that aren’t the AG primary.

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

SELECT AG.name AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName,
HARS.role_desc as LocalReplicaRoleDesc,
DRCS.database_name AS DatabaseName,
HDRS.synchronization_state_desc as SynchronizationStateDesc,
HDRS.is_suspended AS IsSuspended,
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY AG.name, DRCS.database_name

Another option would be to go away from differentials completely and just run FULL’s and LOG’s which would allow for the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.)

Hope this helps!

Friday, July 21, 2017

Toolbox - Which Tables are Using All of My Space?

This is the next in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA. I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles. I will always give attribution back to the original source and note when I have made modifications.

--

In the previous post in this series "Toolbox - Where Did All My Space Go?" I shared a script for finding which database files consumed the most space and which of those files had free space in them. The next step after finding out which databases are using the space is determining which tables in those databases are occupying that space to consider purge/archive opportunities. In many cases you will find a single large table (often called a "mother table") taking up most of the space in your database.

https://www.newslinq.com/wp-content/uploads/2014/05/table.png

(That's a *big* table!)

I found a script created by a developer from Switzerland in a question/answer on StackOverflow.com and modified it slightly to return the specifics I wanted. Among other things I added the InstanceName and DatabaseName because in my job I frequently create documentation or reports for external clients who don't necessarily know the result set came from a particular instance and a particular database:

--

/*
Object Sizes
Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/
SELECT TOP 50
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, s.NAME AS SchemaName
, t.NAME AS TableName
, SUM(p.rows) AS RowCounts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%' -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY t.Name
, s.Name
ORDER BY TotalSpaceMB DESC

--

The results look like this:

--

InstanceName	DatabaseName	SchemaName	TableName	RowCounts	TotalSpaceMB	TotalSpaceGB	UsedSpaceMB	UnusedSpaceMB
Instance01	Database15	SI_USER	TRANS_DATA	17730150	16263.72	15.88	16234.14	29.58
Instance01	Database15	SI_USER	WORKFLOW_CONTEXT	50785680	7623.27	7.44	7622.52	0.75
Instance01	Database15	PTM	EI_HISTORY_MSG	22704543	3701.59	3.61	3701.19	0.40
Instance01	Database15	SI_USER	WORKFLOW_LINKAGE	72643908	2657.21	2.59	2657.06	0.15
Instance01	Database15	SI_USER	CORRELATION_SET	13762284	2542.87	2.48	2542.59	0.27
Instance01	Database15	SI_USER	DOCUMENT	9833616	1445.55	1.41	1445.32	0.23
Instance01	Database15	PTM	EI_HISTORY_TRANDTL	30673680	1257.23	1.23	1256.99	0.24
Instance01	Database15	SI_USER	ACT_SESSION_GUID	18728114	1246.77	1.22	1246.70	0.07
Instance01	Database15	SI_USER	DATA_FLOW_GUID	13635838	908.08	0.89	907.98	0.10
Instance01	Database15	PTM	EI_HISTORY_TRAN	18560608	699.97	0.68	699.73	0.24
Instance01	Database15	SI_USER	DATA_TABLE	174048	460.91	0.45	460.45	0.46
Instance01	Database15	SI_USER	DOCUMENT_EXTENSION	1630855	363.54	0.36	363.15	0.39
Instance01	Database15	SI_USER	ACT_NON_XFER	1579422	284.48	0.28	284.13	0.35
Instance01	Database15	SI_USER	ACT_XFER	804270	217.98	0.21	217.61	0.38
Instance01	Database15	SI_USER	ACT_SESSION	1008875	209.04	0.20	208.66	0.38
Instance01	Database15	SI_USER	ARCHIVE_INFO	4203976	113.34	0.11	113.10	0.24
Instance01	Database15	SI_USER	WF_INST_S	1061373	101.52	0.10	101.37	0.16
Instance01	Database15	SI_USER	ACT_AUTHORIZE	298908	70.27	0.07	70.11	0.16
Instance01	Database15	SI_USER	DATA_FLOW	420930	56.59	0.06	56.35	0.23
Instance01	Database15	SI_USER	ACT_AUTHENTICATE	269200	45.80	0.04	45.31	0.48
Instance01	Database15	SI_USER	EDIINTDOC	182672	43.83	0.04	43.69	0.14
Instance01	Database15	SI_USER	MSGMDNCORRELATION	74656	27.86	0.03	26.42	1.44
Instance01	Database15	SI_USER	EDI_DOCUMENT_STATE	57498	22.19	0.02	18.21	3.98
Instance01	Database15	SI_USER	ENVELOPE_PARMS	134691	19.50	0.02	19.34	0.16
Instance01	Database15	SI_USER	SAP_TID	81598	14.13	0.01	14.00	0.13
Instance01	Database15	PTM	EI_PARTNER_REPORT	74617	13.39	0.01	13.38	0.02
Instance01	Database15	SI_USER	EDI_ELEMENT_CODES	89583	10.63	0.01	10.55	0.08
Instance01	Database15	SI_USER	EDI_COMPLIANCE_RPT	37500	10.23	0.01	9.52	0.70
Instance01	Database15	SI_USER	DOCUMENT_LIFESPAN	29454	10.10	0.01	8.44	1.66
Instance01	Database15	SI_USER	ACTIVITY_INFO	43269	6.00	0.01	5.70	0.30
Instance01	Database15	SI_USER	YFS_USER_ACT_AUDIT	14025	4.90	0.00	4.18	0.72
Instance01	Database15	SI_USER	BPMV_LS_WRK2	19110	4.20	0.00	2.70	1.50
Instance01	Database15	SI_USER	CODELIST_XREF_ITEM	14332	3.50	0.00	2.76	0.74
Instance01	Database15	SI_USER	DOC_STATISTICS	8948	3.43	0.00	3.01	0.42
Instance01	Database15	SI_USER	MAP	4848	2.37	0.00	2.26	0.11
Instance01	Database15	SI_USER	YFS_RESOURCE	5628	1.98	0.00	1.82	0.16
Instance01	Database15	SI_USER	MDLR_PAL_ITEM_DESC	4767	1.38	0.00	1.35	0.03
Instance01	Database15	SI_USER	SERVICE_PARM_LIST	6290	1.28	0.00	1.12	0.16
Instance01	Database15	SI_USER	ADMIN_AUDIT	2100	1.15	0.00	0.76	0.39
Instance01	Database15	SI_USER	DOC_STAT_KEY	2860	1.08	0.00	0.82	0.26
Instance01	Database15	SI_USER	MAPPER_ERL_XREF	4317	1.07	0.00	1.02	0.05
Instance01	Database15	PTM	EI_MSG_PROFILE	4830	0.89	0.00	0.76	0.13
Instance01	Database15	SI_USER	WF_INST_S_WRK	2708	0.86	0.00	0.78	0.08
Instance01	Database15	SI_USER	YFS_ORGANIZATION	909	0.84	0.00	0.38	0.46
Instance01	Database15	SI_USER	YFS_STATISTICS_DETAIL	792	0.83	0.00	0.48	0.35
Instance01	Database15	SI_USER	RESOURCE_CHECKSUM	5827	0.76	0.00	0.73	0.02
Instance01	Database15	SI_USER	YFS_RESOURCE_PERMISSION	1611	0.73	0.00	0.56	0.16
Instance01	Database15	PTM	EI_COMM_PROFILE_AUDIT	1512	0.72	0.00	0.63	0.09
Instance01	Database15	SI_USER	WFD	3406	0.72	0.00	0.63	0.09
Instance01	Database15	SI_USER	XMLSCHEMAS	1678	0.70	0.00	0.69	0.01

--

This allows you to easily find the largest tables (you can modify the ORDER BY to find the tables with the most free space as well to look for inefficient indexing or design).

Once you have the largest tables in hand you have the starting point for a discussion on potential purges or archives.

Hope this helps!

Tuesday, July 11, 2017

T-SQL Tuesday #92 - Trust But Verify

It's T-SQL Tuesday time again, and this month the host is Raul Gonzalez (blog/@SQLDoubleG). His chosen topic is Lessons Learned the Hard Way. (Everybody should have a story on this one, right?)

When considering this topic the thing that spoke most to me was a broad generalization (and no, it isn't #ItDepends)

https://cdn.meme.am/cache/instances/folder589/500x/64701589/grumpy-cat-5-trust-but-verify.jpg

Yes I am old enough (Full Disclosure) to know many people associate the phrase "Trust But Verify" with a certain Republican President of the United States, but it really is an important part of what we do as DBA's.

When a peer sends you code to review, do you just trust it's OK?

https://cdn.meme.am/cache/instances/folder809/66278809.jpg

When a client says they're backing up their databases, do you trust them?

https://cdn.meme.am/cache/instances/folder686/500x/64677686/willy-wonka-trust-but-verify.jpg

When *SQL SERVER ITSELF* says it's backing up your databases, do you trust it?

https://imgflip.com/i/1sc4nl

(Catching a theme yet?)

At the end of the day, as the DBA we are the Default Blame Acceptors, which means whatever happens, we are Guilty Until Proven Innocent - and even then we're guilty half the time.

This isn't about paranoia, it's about being thorough and doing your job - making sure the systems stay up and the customers stay happy.

Verify your backup jobs are running, and then run test restores to make sure the backup files are good.
Verify your SQL Server services are running (when they're supposed to be)
Verify your databases are online (again, when they're supposed to be)
Verify your logins and users have the appropriate security access (#SysadminIsNotAnAppropriateDefault)
Read through your code and that of your peers - don't just trust the syntax checker!
When you find code online in a blog or a Microsoft forum, read through it and check it - don't just blindly execute it because it was written by a Microsoft employee or an MVP - they're all human too! (This almost bit me once on an old forum post where thankfully I did read through the code before I pushed Execute - the post was two years old with hundreds of reads so it would be fine, right? There was a pretty simple mistake in the WHERE clause and none of the hundreds of readers had seen it or been polite enough to point it out to the author!)
Run periodic checks to make sure all of your Perfmon traces (you are running Perfmon traces, right?), Windows Tasks, SQL Agent Jobs, XEvents sessions, etc. are installed properly and configured with the current settings - just because the instance has a job named "Online Status Check" on it doesn't mean the version is anything like your current code!

...and so many, many, many more.

My belief has always been that people are generally good - contrary to popular belief among some of my peers, developers/managers/QA staff/etc. are not inherently bad people - but they are people. Sometimes they make mistakes (we all do). Sometimes, they don't have a full grasp of our DBA-speak (just like I don't understand a lot of Developer-Speak - I definitely don't speak C#) - and that makes it our responsibility to make sure things are covered as a DBA - we are the subject matter expert in this area, so we need to make sure it's covered.

As I said above, this isn't about paranoia - you do need to entrust other team members with shared responsibilities and division of labor, because as Yoda says:

http://blog.idonethis.com/wp-content/uploads/2016/04/yoda.jpg

...but when the buck stops with you (as it so often does) and it is within your power to check something - do it!

Hope this helps!

Monday, July 3, 2017

Toolbox - Where Did All My Space Go?

This is the first in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA. I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles. I will always give attribution back to the original source and note when I have made modifications.

How do I find these scripts, you may ask?

Google is still the top tool in my toolbox - it never ceases to amaze me what you can find by Googling an error number, or the snippet of a message, or simply an object name.

(Disclaimer - I use Google from familiarity but the occasions I have used Bing haven't shown much different results.)

--

The first script I want to share comes from MSSQLTips.com and was created by frequent MSSQLTip author Ken Simmons (Blog/@KenSimmons) - it shows the free space in each of the database files on the instance and default comes back sorted by free space descending:

--

/*
https://www.mssqltips.com/sqlservertip/1510/script-to-determine-free-space-to-support-shrinking-sql-server-database-files/
*/
USE MASTER
GO

CREATE TABLE #TMPFIXEDDRIVES (
DRIVE CHAR(1),
MBFREE INT)

INSERT INTO #TMPFIXEDDRIVES
EXEC xp_FIXEDDRIVES

CREATE TABLE #TMPSPACEUSED (
DBNAME VARCHAR(500),
FILENME VARCHAR(500),
SPACEUSED FLOAT)

INSERT INTO #TMPSPACEUSED
EXEC( 'sp_msforeachdb''use [?]; Select ''''?'''' DBName, Name FileNme,
fileproperty(Name,''''SpaceUsed'''') SpaceUsed from sysfiles''')

SELECT
C.DRIVE,
A.NAME AS DATABASENAME,
B.NAME AS FILENAME,
CASE B.TYPE
WHEN 0 THEN 'DATA'
ELSE TYPE_DESC
END AS FILETYPE,
CASE
WHEN (B.SIZE * 8 / 1024.0) > 1000
THEN CAST(CAST(((B.SIZE * 8 / 1024) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((B.SIZE * 8 / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS FILESIZE,
CASE
WHEN (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) > 1000
THEN CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0)) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0))) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS SPACEFREE,
B.PHYSICAL_NAME
FROM SYS.DATABASES A
JOIN SYS.MASTER_FILES B
ON A.DATABASE_ID = B.DATABASE_ID
JOIN #TMPFIXEDDRIVES C
ON LEFT(B.PHYSICAL_NAME,1) = C.DRIVE
JOIN #TMPSPACEUSED D
ON A.NAME = D.DBNAME
AND B.NAME = D.FILENME
WHERE a.database_id>4
--and DRIVE = 'F'
ORDER BY (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) DESC

DROP TABLE #TMPFIXEDDRIVES

DROP TABLE #TMPSPACEUSED

The result set displays like this:

--

I use this all of the time to troubleshoot space issues - not only to find out what is taking up space but also to see if there are files with excessive empty space that can be examined for cleanup.

http://s2.quickmeme.com/img/16/167911b7d66e2736356e5680d23bc9d5111120af5aada10bdb714c70874cc3b1.jpg

Do not willy-nilly shrink files - it causes all kinds of problems with file fragmentation and potentially with performance - but there are times where you just don't have any choice, and this is an easy way to find candidates.

--

Hope this helps!