One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Wednesday, March 12, 2014

Why Did My Job Fail Anyway?

One of my biggest pet peeves when I check a new SQL Server for a client is the fact that by default there is no method of telling why a job has failed.

My philosophy is that all job steps should have output files, no matter how mundane or simple the job.

Sure you can look at the job history and see something like this:
Log        Job History (User Databases.Reorganize Indexes)
Step ID        1
Server        Server1
Job Name        User Databases.Reorganize Indexes
Step Name        Reorganize Indexes
Duration        04:20:28
Sql Severity        0
Sql Message ID        0
Operator Emailed       
Operator Net sent       
Operator Paged       
Retries Attempted        0

Message
Executed as user: Domain\Bob. ....0.2531.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  2:00:01 AM  Progress: 2014-03-09 02:00:02.90     Source: {656CCD7A-0959-4B4F-9A86-81B67288F87D}      Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  Progress: 2014-03-09 02:02:38.54     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 6% complete  End Progress  Progress: 2014-03-09 02:02:38.96     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_at_jobs] ON [dbo].[at_jobs] REORGA...".: 12% complete  End Progress  Progress: 2014-03-09 02:02:38.97     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 18% complete  End Progress  Progress: 2014-03-09 02:02:39.16     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_at_records] ON [dbo].[at_records] ...".: 25% complete  End Progress  Progress: 2014-03-09 02:02:39.16     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 31% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_license_data] ON [dbo].[license_da...".: 37% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 43% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_counter] ON [dbo].[tf_counter] ...".: 50% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 56% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_gen_jobs] ON [dbo].[tf_gen_jobs...".: 62% complete  End Progress  Progress: 2014-03-09 02:02:39.18     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 68% complete  End Progress  Progress: 2014-03-09 02:02:39.24     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_jobs] ON [dbo].[tf_jobs] REORGA...".: 75% complete  End Progress  Progress: 2014-03-09 02:02:39.25     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 81% complete  End Progress  Progress: 2014-03-09 02:02:39.32     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_keyviol] ON [dbo].[tf_keyviol] ...".: 87% complete  End Progress  Progress: 2014-03-09 02:02:39.32     Source: Reorganize Index Task      Executing query "USE [db1]  ".: 93% complete  End Progress  Progress: 2014-03-09 02:02:41.51     Source: Reorganize Index Task      Executing query "ALTER INDEX [PK_tf_records] ON [dbo].[tf_records] ...".: 100% complete  End Progress  Progress: 2014-03-09 02:02:41.53     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 1% complete  End Progress  Progress: 2014-03-09 02:02:41.80     Source: Reorganize Index Task      Executing query "ALTER INDEX [affiliate_pk] ON [ap].[affiliate_rang...".: 2% complete  End Progress  Progress: 2014-03-09 02:02:41.81     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 4% complete  End Progress  Progress: 2014-03-09 02:02:41.88     Source: Reorganize Index Task      Executing query "ALTER INDEX [approver_backup_pk] ON [ap].[approver...".: 5% complete  End Progress  Progress: 2014-03-09 02:02:41.88     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 7% complete  End Progress  Progress: 2014-03-09 02:02:41.92     Source: Reorganize Index Task      Executing query "ALTER INDEX [approver_expense_map_pk] ON [ap].[app...".: 8% complete  End Progress  Progress: 2014-03-09 02:02:41.92     Source: Reorganize Index Task      Executing query "USE [db2]  ".: 10% complete  End Progress  Progress: 2014-03-09 02:02:41.97     Source: Re...  The package execution fa...  The step failed.

Tells you a lot, doesn't it?

The next worse thing is the DBA who has set up an output file, but just sets it to a fixed path and filename:
At least there is output...for the last run of the job.  May be useful for a daily or weekly job, but for  job that runs multiple times a day all you will see is the output from the last successful run of the job:

Job 'Hourly2' : Step 1, 'SnapManager for SQL Server Backup Step' : Began Executing 2014-01-01 12:55:00
(0 rows(s) affected)
Not really useful when you need to figure out why the job failed twice last night but has run successfully since then, is it?

Is there a better answer?

As you can see in the above screenshot, it is possible to check a box to "Append output to the existing file" but this has some downsides as well.  It results in one big file, which can be difficult to search for any particular job run.  It also makes it nearly impossible to automate the history cleanup since you either have to manually purge the old rows within the file, or automatically delete the whole file which loses *all* of the history, not just that older than X days.

Isn't there a better way?  Anybody?  Bueller?

I was clued into my favorite solution by my work using Ola Hallengren's amazing Maintenance Solution.  Ola uses date-stamped output files in his jobs like this:
CommandLogCleanup_0xE6828CE9E32A244A87B89241D6F20938_1_20140312_132320.txt
So to learn how this worked, I went into Ola's procedure code and found that it is all based on Escape Macros and Tokens:
L:\Log\CommandLogCleanup_$(ESCAPE_SQUOTE(JOBID))_$(ESCAPE_SQUOTE(STEPID))_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
Pretty amazing - it resolves the issue above of overwriting the same file over and over *and* also heads off the problem of appending to the same file and having one giant log file.  Having a format like this results in a unique text file for each run of the job, stamped with the date and time of the job run.

What I don't like about Ola's output file format is the inclusion of the ugly GUID-type JOBID - I don't see the need to know that CommandLogCleanup is really Job 0xE6828CE9E32A244A87B89241D6F20938.

To this end, I have come up with a slightly modified format for the output file name:
L:\Log\JobName_Step1_StepName_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
The biggest downside to this is that there is not apparently an Escape Macro for the job name or the step name, so these have to be hardcoded into the output file parameter as follows:
L:\Log\MyJob_Step1_DoStuff_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
L:\Log\MyJob_Step2_DoMoreStuff_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt
What I do to maintain this format is to keep a text file named OutputFileWithDateStamp.txt with this templated information in it so that when I create a new job or find a job with no proper output file, I can open the txt file and quickly modify the job and step names and apply the file name.

--

SQL Server Maintenance Plans are a little different (if you are still using them - a better option in almost all cases is Ola's solution mentioned above - and if you don't believe me check out Ola's references).  You don't configure output files on each job step, but rather on the maintenance plan itself via the Reporting and Logging button:




If you click this button when creating or designing a maintenance plan, you can configure a path for output files for all jobs created by the maintenance plan, and they are automatically datestamped similar to (but not exactly the same as) the format above.  Again this is not on by default (sigh) so you need to configure this on every maintenance plan.

Something to realize is that even if Reporting and Logging is configured, you will not see anything on the Advanced Properties of the job step where you would normally see an output file - it is managed through the Plan, not by the job.

--

The last catch to this is that the designated directory can fill up with these files over time, making the directory unwieldy to browse (and, depending on the pre-existing free space, even possibly fill the drive).  The final step to enacting a plan like this is to create a job or Windows Task to purge the chose output directory based on the age of the file.  The easiest way to do this is to be using Ola's Solution and to rely on his Output File Cleanup job to purge the file older than a certain age, by default 30 days, from the designated path.  Similarly if you are still using Maintenance Plans there is a Maintenance Cleanup task you can configure to clean up report files older than a selected age.

If you aren't using Ola's code or SQ Server Maintenance Plans, you can use a variant of the same code to create your own cleanup job.  Create a SQL Agent job and in the job step use the following code:
cmd /q /c "For /F "tokens=1 delims=" %v In ('ForFiles /P "L:\Log" /m *_*_*_*.txt /d -30 2^>^&1') do if EXIST "L:\Log"\%v echo del "L:\Log"\%v& del "L:\Log"\%v"
d - 30 tells the code to delete files older than 30 days and L:\Log is the output file path to clean up.

 --
 
Next, I hope to create a script to cycle through all of the jobs on a server and to configure an output file for any job step that doesn't already have one - look for this in an upcoming blog post!

1 comment:

  1. Great post, it helped me out, made my job easy! Thank you!

    ReplyDelete