One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Monday, March 31, 2014

We Are Hiring!

My employer Ntirety (a division of HOSTING) is currently looking for multiple SQL Server DBAs.  As you have probably seen from me on Twitter (or on this blog) previously, I joined Ntirety last November as a 100% work-from-home Senior SQL Server DBA.

Ntirety is based out of Boston, but we have WFH employees across the country.  We provide MSSQL, Oracle, and MySQL services, both as a managed service provider that works with clients on day-to-day production support and ongoing architecture work and as a professional service provider that works with clients on a project-by-project basis.  My group is the Managed Services group (and that is where all of these positions are).

In Managed Services, we provide 24/7 support to clients of all shapes and sizes on things as simple as failed backups and maintenance jobs and as complicated as transactional replication and failover clustering.  We consult with our clients on plans to expand their infrastructure, implement Disaster Recovery, and migrate from one SQL Server version to another (just to provide a few examples).  While we generally work 8-5 local time, we do provide 24/7 on-call with a regular DBA rotation and a service desk in front of the on-call to handle basic items without paging the DBA.

Some of our clients have their own DBA teams, and we partner with those teams to provide expertise and extra hands, while other clients have no local DBA at all, and therefore use us for all of their database needs.

In the interests of full disclosure, Ntirety was recently acquired by HOSTING.  HOSTING is a networking and infrastructure managed services provider out of Colorado that saw Ntirety as a strategic acquisition since they did not have any database services in their portfolio.  We have been told that Ntirety will function "business as usual" as a division of HOSTING.

The MSSQL Managed Services team currently has the following *four* openings:
  • Mid-Senior SQL Server DBA (must be within New England area) - 5+ years experience (prefer 8+ years), with the ability to travel to Boston weekly.  Technically this position can be WFH, but there is a need to travel roughly weekly to the offices of a new Ntirety client in South Boston, so you would still need to be located in the general Boston area.
  • SQL Server DBA (two openings) - 8+ years experience, with the ability to WFH from anywhere in the United States.  (We do not explicitly exclude candidates outside the US as far as I know, but a large majority of our clients are in US time zones, so there is a limitation in that way.)
  • SQL Server DBA (Dedham MA) - 5+ years experience (prefer 8+ years) - this position is required to work in the Dedham (Boston) office daily.
One of these positions is a refill of a previous employee, and three of them are new positions due to company growth.

The links with the positions above will take you to jobvite.com, the site our corporate recruiter uses.  You can apply through that site or by passing your resume to me.

Please contact me via Twitter message or DM (@DBA_ANDY), LinkedIn (http://www.linkedin.com/in/dbaandy), or via email (agalbraith@ntirety.com or andyhusker@gmail.com) with any questions you may have about Ntirety, about our Managed Services team, or about any of the particular positions.  The duties of each position are basically identical aside from whether you can WFH, and the mid-level and senior-level DBAs perform the same client work, with the Seniors also providing mentoring and assistance for the Mid-levels.

Hope to hear from you soon!

Tuesday, March 25, 2014

Backup Checksums and Trace Flag 3023

One of the items I always flag when performing a health check on a server is whether backups are being run with CHECKSUM enabled.  As Paul Randal (@PaulRandal/blog) writes in his post "Example 2000/2005 corrupt databases and some more info on backup, restore, page checksums and IO errors":

If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they’re read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup – in much the same way that page checksums work.
As Paul describes, BACKUP WITH CHECKSUM has two advantages:
  • If you have page checksums enabled, BACKUP WITH CHECKSUM will countercheck against the page checksums to verify the pages are valid - this is your first defense against general page corruption.
    • NOTE - page checksums are the default for new databases created in MSSQL 2005+, but this option is not updated when you migrate older databases up to a new version of MSSQL, so check your databases (select name, page_verify_option_desc from master.sys.databases) and then ALTER DATABASE...SET PAGE_VERIFY CHECKSUM to fix any that are currently set to TORN_PAGE_DETECTION or NONE - see the sketch after this list.
  • Even if you don't have page checksums enabled (although you definitely should - if you aren't convinced by me, check out Microsoft Certified Master Kendra Little's (@Kendra_Little) blog post on the topic, which has some great examples), BACKUP WITH CHECKSUM generates a checksum for that individual backup which is used to verify the validity of that backup in the future.
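Here is a minimal sketch of both pieces - the page verify check-and-fix and a BACKUP run WITH CHECKSUM (the database name and backup path are placeholders):

-- Find databases not protected by CHECKSUM page verification
select name, page_verify_option_desc
from master.sys.databases
where page_verify_option_desc <> 'CHECKSUM'

-- Fix each database the query returns
ALTER DATABASE [YourDatabase] SET PAGE_VERIFY CHECKSUM

-- Take the backup with both checksum protections in play
BACKUP DATABASE [YourDatabase]
TO DISK = N'L:\Backup\YourDatabase.bak'
WITH CHECKSUM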
This all sounds great, right?

Just add WITH CHECKSUM to your BACKUP statements and everything is great.

Right?

Right?

The fly in the ointment comes when you don't explicitly control the syntax of your BACKUP statements.  The two major examples of this I run into while providing DBA services are:
  • SQL Server Maintenance Plans - no matter how much I evangelize Ola's amazing Maintenance Solution, there are still lots and lots of companies out there using the out-of-the-box Maintenance Plans you design in SQL Server Management Studio.
  • Third Party Backup Products - I'm not talking SAN snapshots and the like, but database-related products like Quest (now Dell) LiteSpeed.  I am not widely familiar with many of these products, but several of the ones I have run into don't have the option of performing BACKUP WITH CHECKSUM from within their maintenance plans (although, to be fair, most of them do offer it from the command line via extended stored procedures).
OK, fine....so I just can't BACKUP WITH CHECKSUM.  

Sigh.

But wait - Trace Flags to the rescue!

Recently I found this article: "How to enable the CHECKSUM option if backup utilities do not expose the option", which outlines the exact solution to my problem - outstanding!

I have researched this in the past (several years ago) and either my Google-Fu wasn't as strong as it is now (many years I have honed this craft, young ones) or the article didn't exist the last time I looked. (To be fair to myself, the article is date-stamped February 2012 as revision 1.0, so maybe it really didn't exist back then!)

Trace flag 3023 causes all backups to be written with CHECKSUM, unless you explicitly call BACKUP WITH NO_CHECKSUM (which isn't the case for any of the examples above).  This trace flag is available in all Microsoft SQL Server versions since 2005.

This trace flag can be enabled on the fly (DBCC TRACEON(3023,-1) where the (-1) switches on the trace flag globally rather than just for your session) which incurs no downtime, or by adding the -T3023 parameter to the MSSQL Server service startup parameters via SQL Server Configuration Manager (on the Advanced tab of the service properties) and then restarting the service (which *does* incur a downtime!)

I would advise that if you intend to start using this, you go the startup parameter route - a DBCC TRACEON only remains in effect until the next time the service restarts, so the startup parameter is the only way to make the trace flag persist.
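For reference, here are the on-the-fly enable and its verification (standard DBCC syntax):

DBCC TRACEON(3023,-1)    -- enable the trace flag globally, no restart required
DBCC TRACESTATUS(3023,-1)  -- confirm the flag shows as enabled globally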

I am now using this on several servers, and sure enough the clients' maintenance plans are writing their backups with CHECKSUM - you can verify that new backups are being created with has_backup_checksums = 1 with the following query:

select database_name, type,
MAX(backup_start_date) as MAX_Backup_Start_Date,
has_backup_checksums
from msdb.dbo.backupset
group by database_name, type, has_backup_checksums
order by database_name, type

One thing I have learned about Microsoft SQL Server is that regardless of your years of experience with the product or any given part of it (clustering, SSIS, etc.), there is always something left to learn - even things that in retrospect seem obvious (well, duh!) were new at some point, and there is always - always - something out there waiting for you to discover.
 
Hope this helps!

Thursday, March 20, 2014

Error 701 - Insufficient System Memory - Now what?

This story begins, as so many often do, with a few key elements:
1) A client who experienced a problem on Thursday and waited until Friday evening to report it, even though the problem was "critical" enough to ask us to work on it all weekend if we couldn't solve it.
2) A co-worker who got locked out of the client machine, thereby getting me involved.
3) A relatively obscure error message that I hadn't seen before.
In this case the error message was an Application Log error 701 - "There is insufficient system memory in resource pool 'internal' to run this query":


...followed by lots of 18053/17300's:


The next thing that happened was DBCC MEMORYSTATUS dumps appearing in the SQL Server error log:
 

The client reported an interesting problem - they were scripting stored procedures for modification (scripting out the ALTER PROCEDURE, editing it, and running it to apply changes) and they found that the process worked for a brief time, but then after a few successful runs, executing the ALTER statements started throwing the "insufficient system memory in resource pool 'internal'" messages.

Looking at the MEMORYSTATUS dumps in the Error Logs I had available, I saw that the only unusual numbers were very large "Pages Allocated" values for MEMORYCLERK_XE.

The client had tried rebooting the server and restarting the MSSQLServer service multiple times before reporting the issue to us, and each time they found that it helped for a short time but within an hour or so the problem had returned.

This behavior (let's reboot - reboot always fixes it!) resulted in the SQL Server Error Logs having been cycled out so that I couldn't see the original log from the original failure.

--

BEGIN RANT

Every SQL Server should be set up to retain more than the default six logs - there are too many situations (like the one I am describing here) where six restarts (and therefore six historical logs) just aren't good enough.


Best case is to set the retention to 40-50 logs *and* to create a SQL Server Agent job to run sp_cycle_errorlog on a nightly basis (say at 11:59:59 or 12:00:01 every night), which generates a new log every day.  In most scenarios this will give you just over a month of error logs, and in a scenario such as this one, it will give multiple days' logs *and* allow for multiple MSSQLServer service restarts.
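A sketch of that setup - the xp_instance_regwrite call below writes the same NumErrorLogs registry value that the SSMS error log configuration dialog sets, and sp_cycle_errorlog would be the lone step of the nightly Agent job:

USE master

-- Retain 50 error logs instead of the default six
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
N'Software\Microsoft\MSSQLServer\MSSQLServer',
N'NumErrorLogs', REG_DWORD, 50

-- Schedule this as the single step of a nightly SQL Server Agent job
EXEC sp_cycle_errorlog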

Even if you can't/don't want to implement the daily cycle job, you should definitely increase the retention to some larger number, such as 15 or 20 - the log files are relatively small text files, and even if your SQL service doesn't restart regularly (because you don't apply Windows patches, etc., in which case that is a different rant) the files don't get that large. (They can be unwieldy to open in the log viewer, but they aren't so large that they will fill your disk.)

END RANT

--


Without the SQL Server Error Log from the time of the original failure, I went into the Windows Application and System Logs and the SQL Server Default Trace trying to find anything else of value.  The only other item I found in the available logs was "Failed Allocate Pages: FAIL_PAGE_ALLOCATION 1", which told me nothing more than that the failure was related to a Single Page Allocation (1) rather than a Multi-Page Allocation.

Possible scenarios I was able to rule out at this point:

1) SQL Server 2012 SP1 Memory Leak - Fixed by 2012SP1CU2, and they were already at 2012SP1CU8 (a quick build check is shown below this list)
2) Service Broker Misconfiguration - Service Broker was not enabled.
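For reference, confirming an instance's build/CU level is a quick SERVERPROPERTY query (a generic check, not anything specific to this client):

select SERVERPROPERTY('ProductVersion') as ProductVersion,
SERVERPROPERTY('ProductLevel') as ProductLevel,
@@VERSION as VersionString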
Having exhausted my Google-Fu, I turned to my next avenue - #sqlhelp on Twitter:


--


Note my rookie mistake in the middle (I blame it on working on the problem for a couple of hours on a Friday night when I wasn't really the primary on-call, but that's just ducking the blame) - I forgot to convert the pages value from the MEMORYSTATUS dump from 8KB pages into actual bytes/KB/MB/GB.  I have done this correctly many times in the past, but in this case forgetting to make the conversion made what I already knew to be a bad situation...worse.

So I knew the instance had a memory problem - but why did this suddenly start, and what was the cause?

Trying to mitigate the issue, I recommended that the client increase the memory on their VM. (IMPORTANT NOTE - this was not in any way a VM vs. physical server issue!)  The client increased the memory from 4GB to 12GB and I set the Max Server Memory to 11000MB (too high for long-term but set there to test against the 1409032 pages shown in the Tweet above), monitored the situation for another 30 minutes, and tried to go to bed, planning to review the situation in the morning.
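For reference, the 8KB-page math behind that Max Server Memory test value works out like this:

-- 1,409,032 pages x 8KB/page = 11,272,256KB, or roughly 11,008MB (~10.75GB)
select 1409032 * 8 as SizeKB, 1409032 * 8 / 1024 as SizeMB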

No Good.

Pages came in overnight and when I got online I found a tweet that provided a whole new insight on the situation:



That made me pause and consider something that I hadn't thought about - in all of my Googling of the various problems earlier in the evening, I hadn't considered *which* memory clerk (MEMORYCLERK_XE) was complaining.

I checked the Extended Events section of Management Studio, and sure enough, there was a user-defined Extended Events session named "Locks" alongside the default system health session.

I checked the XE sessions DMV and found that the Locks session was almost as large as the system health session:

SELECT name, total_buffer_size
FROM sys.dm_xe_sessions
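If you also want to see which targets each session is feeding, a join to the companion DMV sys.dm_xe_session_targets works (a generic sketch, not the exact query I ran that night):

SELECT s.name, s.total_buffer_size, t.target_name, t.execution_count
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t ON t.event_session_address = s.address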
I opened the properties of the Locks session and, not being an XE person myself, took the most obvious route (for me, anyway) - I started comparing it to the default system health session properties to see what was different - and then it hit me.

The system health session writes to a ring buffer (memory) and an event_file (disk) - the ring buffer is a memory target, while the event_file target persists the completed buffers to disk.

The user-defined Locks session was set up to write to a ring buffer (OK - or so I thought) and a histogram (hmm) but not to an event file.  I looked up the histogram and found that it too was a memory target.  So the session was set up to write to memory and memory, and I have a memory problem...I see.

To investigate further I scripted out the two sessions and looked at the definitions of the targets.   For the system health session:
ADD TARGET package0.event_file(SET filename=N'system_health.xel',max_file_size=(5),max_rollover_files=(4)),
ADD TARGET package0.ring_buffer(SET max_events_limit=(5000),max_memory=(4096))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=120 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=OFF, STARTUP_STATE=ON)
For the Locks session:
ADD TARGET package0.histogram(SET filtering_event_name=N'sqlserver.lock_acquired', source=N'sqlserver.query_hash'),
ADD TARGET package0.ring_buffer(SET max_events_limit=(1000000))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=ON, STARTUP_STATE=OFF)
Originally I thought the histogram was the issue, but after following up with Jonathan Kehayias (@sqlpoolboy/blog) of SQLskills - when you think of Extended Events, you should think of Jonathan - I found I was mistaken.  I thought that the MAX_MEMORY=4096KB would throttle the ring buffer to 4MB.  Instead, Jonathan explained to me that the MAX_MEMORY parameter in the WITH clause just determines the "event session memory buffer space for buffering events before they are dispatched to the targets for consumption" - the buffer *before* the ring buffer.  There is a MAX_MEMORY parameter for the target, but it needs to be included in the SET clause (like the max_events_limit) to affect the size of the target.  If the SET MAX_MEMORY is set, it will cause the buffer to cycle - effectively having the oldest events overwrite themselves - once the size is reached.  With no parameter defined, there is *no* cap on the size of the ring buffer other than the indirect limit imposed by the max_events_limit parameter.

As you can see above, the default system_health session has its ring buffer set to a max_events_limit of 5,000 events, while the user-defined "Locks" session has a max_events_limit of 1,000,000 events! (Say it with me - one Millllllion events).  The catch Jonathan described to me is that events can be of any size, so you can't directly convert a certain number of events to any particular MB/GB size.
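To make Jonathan's explanation concrete, here is a sketch of how a session like Locks could cap its ring buffer - the session name is hypothetical, and the SET max_memory inside the target is the piece the original session was missing:

CREATE EVENT SESSION [Locks_Capped] ON SERVER
ADD EVENT sqlserver.lock_acquired
ADD TARGET package0.ring_buffer(SET max_memory=(4096),max_events_limit=(5000))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS)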

As he so often does, Jonathan went above and beyond my basic question and found he could easily reproduce this problem (the XE session causing the 701 errors) on a SQL 2012 instance in his lab using the same XE session configuration.  As Jonathan notes in this blog post on sqlperformance.com, this is an issue especially with SQL 2012 and the way XE works in 2012.

--

The catch to all of this is that, had I realized it was an XE problem right away, there might (*might*) have been enough data still present in the default trace for me to see who had "Object:Created" the "Locks" object - but by the time I realized it was an XE situation, the default trace had spun past the time of the original problem when the XE session was created, so there was no way (that I know of) to find the login responsible without someone at the client confessing.

The cautionary tale here is don't use a tool in Production that you don't understand (and remember, almost every environment is Production to someone!) - I have not been able to find out what the unknown admin was trying to accomplish with this session, but had they understood how XE sessions and ring buffers work, they never would have set a max_events_limit so high without an accompanying max_memory parameter to cap the size of the ring buffer itself.

Thanks to Denny Cherry (@mrdenny/blog), Robert Davis (@sqlsoldier/blog), @sql_handle, and especially Jonathan Kehayias for their insights and #sqlhelp - community is amazing and I am proud to be part of it.

This was several hours for me on a Friday night and a Saturday early morning - hopefully this will help someone else in a similar situation in the future!

Tuesday, March 18, 2014

A New Day at #Ntirety

We - the rank-and-file employees at Ntirety - received word yesterday that we had been acquired by Hosting, a cloud and network services provider based out of Denver, CO.  Apparently Hosting did not have a database services division and decided it made more sense to buy one than to build one.  Ntirety had been looking for a partner or a purchaser and liked what they saw at Hosting.


As always, the employees in the middle of such a change are the ones who have to make the day-to-day operations appear as seamless as possible to each other and to our clients.

While I have never been in this situation myself (but I did stay at a Holiday Inn Express last night), I have had many friends who have gone through this scenario, and most of their stories have ended one of two ways:

  • Mass Carnage, where the purchasing firm becomes a conquering overlord and dismembers the purchased firm in a quick and painful fashion (photo: http://www.flickr.com/photos/topcat_angel/2166383816/)
  • Friendly Cooperation, where the purchasing firm basically leaves the purchased firm alone and the two companies help each other succeed (photo: http://www.flickr.com/photos/vblibrary/8471702001/)
Unfortunately, for many corporate partnerships/acquisitions, the first option seems more common than the second, but in our case it feels like we are actually in the second scenario.

The most comforting thing we learned yesterday in the "Congratulations - you work for someone else effective last Friday!" presentation is that our CEO, Michael Corey, is staying on as president of "Ntirety, a division of Hosting."  Another positive thing is that there will be no staff reductions *and* that all currently open positions will remain open and be filled. (I had to ask that during the presentation because I have seen that first-hand before - divisions juggle and while no one gets cut suddenly all of the open positions disappear, leaving you just as short-handed as if people had been cut!)

I am always cautious (a good trait in a DBA, right?) but looking back the morning after I think this will work out well for all involved - and as an employee, I am betting my career and my family's well-being on it.

Wednesday, March 12, 2014

Why Did My Job Fail Anyway?

One of my biggest pet peeves when I check a new SQL Server for a client is the fact that by default there is no way to tell why a job has failed.

My philosophy is that all job steps should have output files, no matter how mundane or simple the job.

Sure you can look at the job history and see something like this:
Log        Job History (User Databases.Reorganize Indexes)
Step ID        1
Server        Server1
Job Name        User Databases.Reorganize Indexes
Step Name        Reorganize Indexes
Duration        04:20:28
Sql Severity        0
Sql Message ID        0
Operator Emailed       
Operator Net sent       
Operator Paged       
Retries Attempted        0

Message
Executed as user: Domain\Bob. ....0.2531.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  2:00:01 AM  Progress: 2014-03-09 02:00:02.90     Source: {656CCD7A-0959-4B4F-9A86-81B67288F87D}      Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  Progress: 2014-03-09 02:02:38.54     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 6% complete  End Progress  Progress: 2014-03-09 02:02:38.96     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_at_jobs] ON [dbo].[at_jobs] REORGA...".: 12% complete  End Progress  Progress: 2014-03-09 02:02:38.97     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 18% complete  End Progress  Progress: 2014-03-09 02:02:39.16     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_at_records] ON [dbo].[at_records] ...".: 25% complete  End Progress  Progress: 2014-03-09 02:02:39.16     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 31% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_license_data] ON [dbo].[license_da...".: 37% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 43% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_counter] ON [dbo].[tf_counter] ...".: 50% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 56% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_gen_jobs] ON [dbo].[tf_gen_jobs...".: 62% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 68% complete  End Progress  Progress: 2014-03-09 02:02:39.24     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_jobs] ON [dbo].[tf_jobs] REORGA...".: 75% complete  End Progress  Progress: 2014-03-09 02:02:39.25     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 81% complete  End Progress  Progress: 2014-03-09 02:02:39.32     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_keyviol] ON [dbo].[tf_keyviol] ...".: 87% complete  End Progress  Progress: 2014-03-09 02:02:39.32     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 93% complete  End Progress  Progress: 2014-03-09 02:02:41.51     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_records] ON [dbo].[tf_records] ...".: 100% complete  End Progress  Progress: 2014-03-09 02:02:41.53     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 1% complete  End Progress  Progress: 2014-03-09 02:02:41.80     Source: Reorganize Index Task      Executing query "ALTER INDEX [affiliate_pk] ON [ap].[affiliate_rang...".: 2% complete  End Progress  Progress: 2014-03-09 02:02:41.81     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 4% complete  End Progress  Progress: 2014-03-09 02:02:41.88     Source: Reorganize Index Task      Executing query "ALTER INDEX [approver_backup_pk] ON [ap].[approver...".: 5% complete  End Progress  Progress: 2014-03-09 02:02:41.88     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 7% complete  End Progress  Progress: 2014-03-09 02:02:41.92 
    Source: Reorganize Index Task      Executing query "ALTER INDEX [approver_expense_map_pk] ON [ap].[app...".: 8% complete  End Progress  Progress: 2014-03-09 02:02:41.92     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 10% complete  End Progress  Progress: 2014-03-09 02:02:41.97     Source: Re...  The package execution fa...  The step failed.

Tells you a lot, doesn't it?

The next worse thing is the DBA who has set up an output file, but just sets it to a fixed path and filename:
At least there is output...for the last run of the job.  This may be useful for a daily or weekly job, but for a job that runs multiple times a day, all you will see is the output from the last successful run of the job:

Job 'Hourly2' : Step 1, 'SnapManager for SQL Server Backup Step' : Began Executing 2014-01-01 12:55:00
(0 rows(s) affected)
Not really useful when you need to figure out why the job failed twice last night but has run successfully since then, is it?

Is there a better answer?

As you can see in the above screenshot, it is possible to check a box to "Append output to the existing file," but this has some downsides as well.  It results in one big file, which can be difficult to search for any particular job run.  It also makes it nearly impossible to automate the history cleanup, since you either have to manually purge the old rows within the file or automatically delete the whole file, which loses *all* of the history, not just that older than X days.

Isn't there a better way?  Anybody?  Bueller?

I was clued into my favorite solution by my work using Ola Hallengren's amazing Maintenance Solution.  Ola uses date-stamped output files in his jobs like this:
CommandLogCleanup_0xE6828CE9E32A244A87B89241D6F20938_1_20140312_132320.txt
So to learn how this worked, I went into Ola's procedure code and found that it is all based on Escape Macros and Tokens:
L:\Log\CommandLogCleanup_$(ESCAPE_SQUOTE(JOBID))_$(ESCAPE_SQUOTE(STEPID))_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
Pretty amazing - it resolves the issue above of overwriting the same file over and over *and* also heads off the problem of appending to the same file and having one giant log file.  Having a format like this results in a unique text file for each run of the job, stamped with the date and time of the job run.

What I don't like about Ola's output file format is the inclusion of the ugly GUID-type JOBID - I don't see the need to know that CommandLogCleanup is really Job 0xE6828CE9E32A244A87B89241D6F20938.

To this end, I have come up with a slightly modified format for the output file name:
L:\Log\JobName_Step1_StepName_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
The biggest downside to this is that there apparently is no Escape Macro for the job name or the step name, so these have to be hardcoded into the output file parameter as follows:
L:\Log\MyJob_Step1_DoStuff_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
L:\Log\MyJob_Step2_DoMoreStuff_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
What I do to maintain this format is to keep a text file named OutputFileWithDateStamp.txt with this templated information in it so that when I create a new job or find a job with no proper output file, I can open the txt file and quickly modify the job and step names and apply the file name.
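If you would rather apply this via T-SQL than through the job step properties dialog, msdb's sp_update_jobstep accepts the output file as a parameter - the job and step names here are the same hypothetical ones from the examples above:

EXEC msdb.dbo.sp_update_jobstep
@job_name = N'MyJob',
@step_id = 1,
@output_file_name = N'L:\Log\MyJob_Step1_DoStuff_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt'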

--

SQL Server Maintenance Plans are a little different (if you are still using them - a better option in almost all cases is Ola's solution mentioned above - and if you don't believe me check out Ola's references).  You don't configure output files on each job step, but rather on the maintenance plan itself via the Reporting and Logging button:




If you click this button when creating or designing a maintenance plan, you can configure a path for output files for all jobs created by the maintenance plan, and they are automatically datestamped similar to (but not exactly the same as) the format above.  Again, this is not on by default (sigh), so you need to configure it on every maintenance plan.

Something to realize is that even if Reporting and Logging is configured, you will not see anything on the Advanced Properties of the job step where you would normally see an output file - it is managed through the Plan, not by the job.

--

The last catch to this is that the designated directory can fill up with these files over time, making the directory unwieldy to browse (and, depending on the pre-existing free space, possibly even filling the drive).  The final step to enacting a plan like this is to create a job or Windows Task to purge the chosen output directory based on the age of the files.  The easiest way to do this is to use Ola's Solution and rely on his Output File Cleanup job to purge files older than a certain age (by default 30 days) from the designated path.  Similarly, if you are still using Maintenance Plans, there is a Maintenance Cleanup task you can configure to clean up report files older than a selected age.

If you aren't using Ola's code or SQL Server Maintenance Plans, you can use a variant of the same code to create your own cleanup job.  Create a SQL Agent job and in the job step use the following code:
cmd /q /c "For /F "tokens=1 delims=" %v In ('ForFiles /P "L:\Log" /m *_*_*_*.txt /d -30 2^>^&1') do if EXIST "L:\Log"\%v echo del "L:\Log"\%v& del "L:\Log"\%v"
/d -30 tells the code to delete files older than 30 days, and L:\Log is the output file path to clean up.

--
 
Next, I hope to create a script to cycle through all of the jobs on a server and to configure an output file for any job step that doesn't already have one - look for this in an upcoming blog post!