One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server


Tuesday, February 16, 2016

Configuring a Perfmon Collector for SQL Server

Something I always stress to people when I write and present is the importance of tracking data *before* you need it.  Microsoft does a fair job collecting some data via DMVs, the default trace, and Extended Events, but the data is still very limited in many ways, and in many cases the data doesn't persist past a SQL service restart.

One of the important tools for a SQL Server DBA (or Dev, or Analyst, or...) is Windows Perfmon.  Perfmon can be configured to track a ton of data, but it doesn't collect anything unless you are actively watching via the Perfmon GUI or you have configured a Perfmon Collector.

One of the downsides to Perfmon, like most other monitoring tools, is performance overhead.  The more frequently you measure data, the more impactful it can be.

Because of this, I set up my collector to gather data every five minutes to lessen that impact.  That interval is a happy medium arrived at through past discussions with colleagues - frequent enough to show trends over time, yet infrequent enough to have minimal impact.  Every five minutes may strike you as liable to miss problems, and it can - if something spikes (or valleys) for a moment, or even for a minute or two, you may not see it.  For many Perfmon counters, however, you will see an extreme followed by a gradual change, like this image from my recent "Server Using All of my RAM" blog post:


As you can see, Page Life Expectancy (PLE) on this graph dips, gradually climbs, and then dips again.  With a collection every five minutes you may not catch the exact peak - all you know is that the PLE was 50,000 at 12:55am and then only 100 at 1:00am on 03/13.  It may have climbed higher than that before it dipped, but by 1:00am it had dipped down to around 100 (coincidentally at 1am the CheckDB job had kicked off on a large database).

If you really need to know (in this example) exactly how high PLE gets before it dips, exactly how low it dips, or the specific time at which it peaks or valleys, you need to actively watch or set up a collector with a more frequent interval.  You will find that in most cases the absolute value isn't important - it is sufficient to know that a value peaks/valleys within a certain five-minute interval, or that during a certain five-minute interval ("The server was slow last night at 3am") it was in an acceptable/unacceptable range.
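A toy simulation (all numbers made up for illustration) of why a five-minute interval can miss a short spike entirely while a tighter interval catches it:

```python
# Toy simulation - the counter values and spike timing are made up.
def sample(series, interval):
    """Keep every interval-th reading, emulating a Perfmon collector."""
    return series[::interval]

# One hour of per-second readings: flat at 100 with a 30-second spike to 5000.
readings = [100] * 3600
for sec in range(400, 430):        # spike from 00:06:40 to 00:07:10
    readings[sec] = 5000

every_5_min = sample(readings, 300)   # one sample per 5 minutes
every_15_sec = sample(readings, 15)   # a much more frequent collector

print(max(every_5_min))    # 100  - the spike fell between samples
print(max(every_15_sec))   # 5000 - the tighter interval caught it
```

The five-minute collector never sees the spike at all, which is exactly the trade-off described above: fine for trends, not for transient extremes.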

If you do set up a collector with a more frequent interval, make sure to delete it (or at least turn it off) after you have collected your data.  I am a fan of deleting it outright so that it doesn't accidentally get turned back on and cause impact, but sometimes it does make more sense to leave it in place. #ItDepends

--

My mechanism (not originally created by me, but significantly modified from source material whose original author I unfortunately do not know) uses a folder structure that I can ZIP and then unzip into C:\PerfLogs, with batch files to create the collector and manage it via Windows Scheduled Tasks.

Here are the counters I collect by default:

  • "\Memory\Available MBytes"
  • "\Memory\Pages/sec"
  • "\MSSQLSERVER:Access Methods\Forwarded Records/sec"
  • "\MSSQLSERVER:Access Methods\Full Scans/sec"
  • "\MSSQLSERVER:Access Methods\Index Searches/sec"
  • "\MSSQLSERVER:Buffer Manager\Buffer cache hit ratio"
  • "\MSSQLSERVER:Buffer Manager\Free List Stalls/sec"
  • "\MSSQLSERVER:Buffer Manager\Free pages"
  • "\MSSQLSERVER:Buffer Manager\Lazy writes/sec"
  • "\MSSQLSERVER:Buffer Manager\Page life expectancy"
  • "\MSSQLSERVER:Buffer Manager\Page reads/sec"
  • "\MSSQLSERVER:Buffer Manager\Page writes/sec"
  • "\MSSQLSERVER:General Statistics\User Connections"
  • "\MSSQLSERVER:Latches(*)\Latch Waits/sec"
  • "\MSSQLSERVER:Locks(*)\Lock Waits/sec"
  • "\MSSQLSERVER:Locks(*)\Number of Deadlocks/sec"
  • "\MSSQLSERVER:Memory Manager\Target Server Memory (KB)"
  • "\MSSQLSERVER:Memory Manager\Total Server Memory (KB)"
  • "\MSSQLSERVER:SQL Statistics\Batch Requests/sec"
  • "\MSSQLSERVER:SQL Statistics\SQL Compilations/sec"
  • "\MSSQLSERVER:SQL Statistics\SQL Re-Compilations/sec"
  • "\Paging File(*)\% Usage"
  • "\PhysicalDisk(*)\Avg. Disk sec/Read"
  • "\PhysicalDisk(*)\Avg. Disk sec/Write"
  • "\PhysicalDisk(*)\Disk Reads/sec"
  • "\PhysicalDisk(*)\Disk Writes/sec"
  • "\Process(sqlservr)\% Privileged Time"
  • "\Process(sqlservr)\% Processor Time"
  • "\Processor(*)\% Privileged Time"
  • "\Processor(*)\% Processor Time"
  • "\SQLSERVER:Access Methods\Forwarded Records/sec"
  • "\SQLSERVER:Access Methods\Full Scans/sec"
  • "\SQLSERVER:Access Methods\Index Searches/sec"
  • "\SQLSERVER:Buffer Manager\Buffer cache hit ratio"
  • "\SQLSERVER:Buffer Manager\Free List Stalls/sec"
  • "\SQLSERVER:Buffer Manager\Free pages"
  • "\SQLSERVER:Buffer Manager\Lazy writes/sec"
  • "\SQLSERVER:Buffer Manager\Page life expectancy"
  • "\SQLSERVER:Buffer Manager\Page reads/sec"
  • "\SQLSERVER:Buffer Manager\Page writes/sec"
  • "\SQLSERVER:General Statistics\User Connections"
  • "\SQLSERVER:Latches(*)\Latch Waits/sec"
  • "\SQLSERVER:Locks(*)\Lock Waits/sec"
  • "\SQLSERVER:Locks(*)\Number of Deadlocks/sec"
  • "\SQLSERVER:Memory Manager\Target Server Memory (KB)"
  • "\SQLSERVER:Memory Manager\Total Server Memory (KB)"
  • "\SQLSERVER:SQL Statistics\Batch Requests/sec"
  • "\SQLSERVER:SQL Statistics\SQL Compilations/sec"
  • "\SQLSERVER:SQL Statistics\SQL Re-Compilations/sec"
  • "\System\Processor Queue Length"

As you can see, there is duplication between the MSSQLSERVER counters and the SQLSERVER counters - at some version of SQL Server the hive name changed from MSSQLSERVER to SQLSERVER, and including both in the list covers all the bases regardless of the Windows/SQL Server version being monitored.  If a listed hive doesn't exist, the collector is still created without error, so it doesn't hurt to have both in the list.

This is my default list, curated over time from experiences and input from others - if you have other counters you want to collect just edit the list accordingly.

One note - if you have a named instance, the counter hive will be named differently - something like MSSQL$instancename.  The easiest way to handle this is to edit the counter list: copy and paste the SQLSERVER counter list, then find-and-replace SQLSERVER with MSSQL$instancename in the new entries.
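A minimal sketch of that copy/find-replace step (the instance name MYINSTANCE is a placeholder - substitute your own), generating the extra MSSQL$instancename entries from the existing SQLSERVER ones:

```python
# Sketch: clone the SQLSERVER counter entries for a named instance.
# "MYINSTANCE" is a placeholder instance name, not from the original post.

def add_named_instance(cfg_lines, instance):
    """Append a MSSQL$<instance> copy of every SQLSERVER counter line."""
    extra = [line.replace("SQLSERVER:", f"MSSQL${instance}:")
             for line in cfg_lines
             if line.startswith('"\\SQLSERVER:')]
    return cfg_lines + extra

cfg = ['"\\Memory\\Available MBytes"',
       '"\\SQLSERVER:Buffer Manager\\Page life expectancy"']
for line in add_named_instance(cfg, "MYINSTANCE"):
    print(line)
```

The OS-level counters (Memory, PhysicalDisk, etc.) are left alone; only the SQL Server hive names change for a named instance.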

--

The folder structure starts at the top level with a folder named SQLServerPerf (in my case Ntirety-SQLServerPerf).  Below that are two folders, SQLServerPerf\BatchFiles and SQLServerPerf\Logs.

By default, the Logs folder is empty.  BatchFiles contains five files - a .CFG file that includes the counter list above, and four .BAT files to create/start/stop/cycle the collector itself.

To start, unzip the package (or copy the SQLServerPerf folder) into C:\PerfLogs, resulting in C:\PerfLogs\SQLServerPerf\

Why do I use C:?  By default Windows creates a PerfLogs folder there, and using that path all but guarantees that the batch files will run, since Windows servers in general have C: drives.  Using a different drive requires editing the files to reference that drive - so if you absolutely can't write to C:, that is the fix: edit the files to change the references from C: to the drive of your choice.

--

SQLServerPerf\BatchFiles\SQLServer.CFG is a text file whose contents are exactly the counter list shown above - one quoted counter per line.
--

SQLServerPerf\BatchFiles\SQLPerfmonCollector-Create.bat is a batch (text) file whose contents create the collector and also several Windows Scheduled Tasks that manage the collector:
logman create counter SQLServerPerf -f bin -si 300 -v nnnnnn -o "c:\perflogs\SQLServerPerf\Logs\SQLServerPerf" -cf "c:\perflogs\SQLServerPerf\BatchFiles\SQLServer.cfg"
timeout /T 2
logman start SQLServerPerf
timeout /T 2
schtasks /create /tn "Cycle SQLServerPerf Perfmon Counter Log" /tr C:\PerfLogs\SQLServerPerf\BatchFiles\SQLPerfmonCollector-Cycle.bat /sc daily /st 23:59:58 /ed 01/01/2099 /ru system
timeout /T 2
schtasks /create /tn "Start SQLServerPerf Perfmon Counter Log" /tr C:\PerfLogs\SQLServerPerf\BatchFiles\SQLPerfmonCollector-Start.bat /sc onstart /ru system
timeout /T 2
schtasks /create /tn "Purge SQLServerPerf Perfmon Counter Log" /tr "PowerShell -command {Get-ChildItem -path C:\PerfLogs\SQLServerPerf\Logs -Filter *.blg | where {$_.LastWriteTime -lt (Get-Date).AddMonths(-13)} | Remove-Item}" /sc daily /st 23:59:58 /ed 01/01/2099 /ru system
pause
As you can see, the batch file hard-codes the path C:\PerfLogs\SQLServerPerf - if you move the files to another drive/path, these references need to be changed.

This batch file does the following:

  1. Uses "logman create" to create the actual collector, writing files to the Logs folder and gathering the counters listed in the SQLServer.cfg file
  2. Uses "timeout" to pause for two seconds to allow the user to see any return messages
  3. Uses "logman start" to start the collector
  4. Pauses two more seconds
  5. Uses "schtasks /create" to create a scheduled task to run SQLPerfmonCollector-Cycle.bat nightly at 11:59:58pm (Batch file contents to follow)
  6. Pauses two more seconds
  7. Uses "schtasks /create" to create a scheduled task to run SQLPerfmonCollector-Start.bat on system startup - this makes sure that after a reboot the collector is automatically started for minimal interruption
  8. Pauses two more seconds
  9. Uses "schtasks /create" to create a scheduled task that runs nightly at 11:59:58pm, executing a PowerShell command to delete log (.BLG) files older than 13 months
  10. Uses "pause" to stop progress until the user hits a key - this again prevents the window from closing until the user acknowledges the progress
This is specifically written to be run interactively (right-click on the BAT file and "Run", or if available, "Run As Administrator") - if you want to truly automate the deployment of the collector, the pause command should be removed.

--

**Important Note** - current versions of Windows have settings for some of what I am using Scheduled Tasks to do, such as restarting the collector.  This wasn't always the case, and I wanted a solution I could deploy without worrying about the Windows version.

--

SQLServerPerf\BatchFiles\SQLPerfmonCollector-Cycle.bat is a batch (text) file whose contents stop and then start the collector:
logman stop SQLServerPerf
timeout /T 2
logman start SQLServerPerf
I included the timeout to pause for two seconds because I sometimes had issues with the stop/start process - the start command would complain that the collector was already running, since the stop hadn't completed yet.  Adding the two-second pause gives the collector time to completely stop before the start attempt.

--

SQLServerPerf\BatchFiles\SQLPerfmonCollector-Start.bat is a batch (text) file whose contents start the collector:
logman start SQLServerPerf
--

SQLServerPerf\BatchFiles\SQLPerfmonCollector-Stop.bat is a batch (text) file whose contents stop the collector:
logman stop SQLServerPerf
I included this for completeness when I was creating the batch files although I have yet to need it - if I want to stop the collector for some reason I simply do so interactively in the Perfmon GUI.

--

As noted above, I keep the SQLServerPerf folder as a ZIP; to deploy it I unzip it into C:\PerfLogs and run the Create batch file - it's as simple as that.

The output file from a Perfmon collector is by default a .BLG file.  As noted above, there is a scheduled task to stop and start the collector every night, resulting in a new .BLG file each day:


To look at an individual day's info, you simply open that day's file in Perfmon (by opening Perfmon and then opening the file, or by double-clicking the BLG file itself).  If you wish to string multiple days' data together, use the Relog command as I described in this post from a couple years ago: "Handling Perfmon Logs with Relog."

As you can see in the above screenshot, on my laptop the collector stores about 1MB per day.  Even on a big OLTP production cluster I don't see more than 1.5MB-2.0MB/day.  Stretch that out over 13 months (remember, we purge files older than 13 months) and it comes out to a maximum size of around 800MB.  Hopefully 1GB of Perfmon data isn't going to fill your C: drive (if it is, you have other problems), but if it is going to be a problem, relocate the folders as described above.
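The back-of-the-envelope arithmetic behind that ~800MB ceiling, assuming the worst-case daily file size:

```python
# Rough retention math - 13 months of daily .BLG files at the high end
# of the observed daily sizes.
days_retained = 13 * 30            # ~13 months before the purge task deletes
mb_per_day = 2.0                   # worst-case daily file size observed
total_mb = days_retained * mb_per_day
print(total_mb)                    # 780.0 - in the ballpark of 800MB
```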

--

There are few things more important for *reacting* to problems than to be *proactively* gathering data - and a Perfmon collector helps you to do just that.

Hope this helps!


Friday, January 8, 2016

Why is my SQL Server using all of the RAM on the server?

The TL;DR on this is simple:

"BECAUSE IT'S SUPPOSED TO!"

A frequent complaint we receive comes from a client that has an in-house sysadmin monitoring tool like Microsoft SCOM/SCCM.  They turn the tool on and it starts red-alerting because the Windows server hosting SQL Server is at 90%+ used RAM.  The sysadmin (or local DBA) logs on to the server and finds that there is 5GB free (~4%), and the sqlservr.exe process is using 120GB of the 128GB on the server!
Like almost all things SQL Server, this is a simplification of a situation that requires some serious #ItDepends.

  • What is the baseline - how much RAM is usually free on this server?  This can be seen by tracking the Perfmon counter Memory\Available MBytes.
  • What is the Buffer Manager\Page Life Expectancy number - not right this second, but over time?
  • Is there any actual performance issue on the server?
I have written before about the need for baselines, especially regarding memory and CPU use on the Windows server itself.  As a DBA, I don't care that my Windows server is at 95% RAM *if* both Windows and SQL Server aren't having performance problems *over time*.  

I started my IT career on a support/help desk supporting servers and desktops, and 16 years ago being at 5% free RAM probably meant you had 500MB-1GB free, which was definitely an OMG! (although then we probably didn't actually say OMG.)

Today's servers often have 96GB or more of RAM, and the larger the server, the less relevant a percentage-based measurement is - just because your server has 256GB of RAM doesn't mean Windows is suddenly going to need 25GB+ of RAM to run cleanly.


This brings us back to our three questions above.  Rather than worrying about some artificial guideline like 10%, how much memory is usually free on this server?  Track Memory\Available MBytes over time and see where it tops out - if the regular operation of your server always leaves 2-4GB free, you are probably OK from a Windows point of view.

What is the Page Life Expectancy of the SQL Server instance over time?  As many others have written, PLE is one of the most misused statistics in SQL Server.  A DBA checks Perfmon or runs a DMV query (like this one from the awesome set curated by Glenn Berry (blog/@GlennAlanBerry)):
-- Page Life Expectancy (PLE) value for each NUMA node in current instance  (Query 36) (PLE by NUMA Node)
SELECT @@SERVERNAME AS [Server Name], [object_name], instance_name, cntr_value AS [Page Life Expectancy]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Buffer Node%' -- Handles named instances
AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);
They find that the PLE is some relatively small number and freak out, or worse, they find it is a large number and think "my SQL Server doesn't need this much RAM, I can cut back."

Like Available MBytes, PLE is a number that needs to be measured over time.  On the servers I touch I always recommend setting up a "permanent" Perfmon collector of SQL Server-related counters so that you can look at the numbers over time. For example, a usual PLE chart over a week can look like this:


...or it can look like this:


Even over time, some people look at this second chart and say one of two things:

  • My PLE goes down to 14, what do I do?
  • My PLE maxes out at 50,000, what do I do?
Again, #ItDepends - what is going on in SQL Server during each of these times?  If you look at the above chart you can see that PLE generally climbs during the day (with some minor fluctuations), but then tanks overnight pretty regularly.

Look at when PLE dips to 14 - is that when reindexing or CheckDB runs?  If so, is performance still good enough?  If CheckDB runs in two hours and your window is three to four hours, isn't that acceptable?  Are other things running at the same time that are negatively impacted, or is that window clear?

If you do need to improve the processes running at that moment (when PLE dips), what is the Available MBytes at this same time and over time?  If it is high over time, it shows that there is head room on your server to consider increasing your Max Server Memory and to help your PLE stay higher.  If your Available MBytes is not consistently high, you can't consider increasing Max Server Memory unless you increase actual server RAM.

This is the 60-second version of quick memory troubleshooting, but think about it - isn't this (even just this small piece of it) a lot more complicated than "I have less than 10% free RAM"?


Is anybody or any process complaining (besides SCCM)?  Are any numbers out of line - for example is a virus scan that normally takes 2 hours suddenly taking 4-5?  If nothing else is abnormal, is there a problem?

--

As I mentioned at the start of this post, quite often your SQL Server process(es) are simply *supposed* to be using the lion's share of the resources on the Windows server - especially RAM.  As long as you reserve sufficient head room for the operating system (usually 4GB-16GB, as described here by Jonathan Kehayias (blog/@SQLPoolBoy), but adjusted by monitoring Available MBytes), the percentage of RAM used probably doesn't matter.

...and if you have good sysadmins they will listen to you and realize that SQL Servers (like Oracle, Exchange, and other database servers) are simply different from web servers or file servers.

Hope this helps!





Thursday, May 29, 2014

Handling Perfmon Logs with Relog


In a recent situation, I was tasked by a client to check one of their servers which they believed to be grossly over-provisioned.  As a VM they knew they could easily pull both CPU and RAM from the server if they could determine the true needs of the server.

One of the items I regularly recommend to my clients is to have a "permanent" Perfmon collector log, where a basic set of counters is collected every five minutes, with the log file cycled every day.  This generates 1.5MB-2.0MB of files daily (a trivial amount of storage even if maintained for several years) and daily cycling makes the logs easier to handle when troubleshooting issues.  This is also useful for something most people don't have - a baseline over time - but to use it this way requires some serious finagling as you will see.

--

The way I set up the collector by default, it writes to a file in a designated directory.  I use a couple of batch files that use the logman command to stop and start the collector, plus a pair of Windows Scheduled Tasks - one to stop and start the collector each night (effectively cycling to a new output file) and one to start the collector on server start-up (so that a reboot doesn't kill the collector altogether).  I know there are ways to configure a collector to cycle to new files each day, but they are specific to your version of Windows, while the batch file/scheduled task method works across O/S versions.

--

The net effect of this method is that you end up with a directory with a large stack of blg files like this:


You can see a file for each day plus a new file any time the server restarts.  This is great when you are looking for info on what happened on the night of 05/24/2014, but not so much when you need to know what happened over the last thirty days.

Enter relog!

Relog is a command line utility for manipulating perfmon output files (either .blg or .csv).  It is frequently used to filter existing files by counter or to convert .blg files into .csv files, both of which are described in this blog post and this one by John Pertell (blog/@jayape).  A lesser known use of relog is its ability to combine individual files into a single larger file, like this:

relog day1.blg day2.blg day3.blg -f BIN -o Day1_2_3.blg

...which would combine the three files day1, day2, and day3 into a single large file, Day1_2_3.blg.

This is a necessary evil for many types of analysis, since it can be difficult to wade through multiple Perfmon log files at once.  Most tools claim to support opening multiple files, but most also carry vague disclaimers that amount to YMMV.  The most reliable way to handle the situation is to use relog to combine the files before beginning the analysis.


In the given situation, I had 43 files covering a little more than a month to analyze.  I hadn't combined this many files at once before, but didn't really think of it as any big deal.  I used Notepad to mock up the relog command (not for the faint of heart):


C:\PerfLogs\Blog>relog Ntirety-SQLServerPerf_000066.blg Ntirety-SQLServerPerf_000067.blg Ntirety-SQLServerPerf_000068.blg Ntirety-SQLServerPerf_000069.blg Ntirety-SQLServerPerf_000070.blg Ntirety-SQLServerPerf_000071.blg Ntirety-SQLServerPerf_000072.blg Ntirety-SQLServerPerf_000073.blg Ntirety-SQLServerPerf_000074.blg Ntirety-SQLServerPerf_000075.blg Ntirety-SQLServerPerf_000076.blg Ntirety-SQLServerPerf_000077.blg Ntirety-SQLServerPerf_000078.blg Ntirety-SQLServerPerf_000079.blg Ntirety-SQLServerPerf_000080.blg Ntirety-SQLServerPerf_000081.blg Ntirety-SQLServerPerf_000082.blg Ntirety-SQLServerPerf_000083.blg Ntirety-SQLServerPerf_000084.blg Ntirety-SQLServerPerf_000085.blg Ntirety-SQLServerPerf_000086.blg Ntirety-SQLServerPerf_000087.blg Ntirety-SQLServerPerf_000088.blg Ntirety-SQLServerPerf_000089.blg Ntirety-SQLServerPerf_000090.blg Ntirety-SQLServerPerf_000091.blg Ntirety-SQLServerPerf_000092.blg Ntirety-SQLServerPerf_000093.blg Ntirety-SQLServerPerf_000094.blg Ntirety-SQLServerPerf_000095.blg Ntirety-SQLServerPerf_000096.blg Ntirety-SQLServerPerf_000097.blg Ntirety-SQLServerPerf_000098.blg Ntirety-SQLServerPerf_000099.blg Ntirety-SQLServerPerf_000100.blg Ntirety-SQLServerPerf_000101.blg Ntirety-SQLServerPerf_000102.blg Ntirety-SQLServerPerf_000103.blg Ntirety-SQLServerPerf_000104.blg Ntirety-SQLServerPerf_000105.blg Ntirety-SQLServerPerf_000106.blg Ntirety-SQLServerPerf_000107.blg Ntirety-SQLServerPerf_000108.blg -f BIN -o Combined.blg

...and went to run it from the command prompt.  Much to my surprise I received the following:


At first the error ("The specified log file type has not been installed on this computer") threw me and made me investigate whether one of the files was corrupted in some way.  All of the files were roughly the same size, with the days that had multiple files (due to server restarts) still combining to about the same size.  I combined smaller subsets of 2/5/10 files without issue.  At this point I turned to my Google-Fu and quickly found someone with a similar problem.  Microsoft employee Dustin Metzgar (blog/@DustinMetzgar) had a blog post from August 2013, "Performance Monitor Issues," that described my situation exactly.  Issue #4 in the post is "Combining several BLGs into one is limited to 32 files and has poor error messages" - sound familiar?  Sure enough, when I modified my relog command to include only 32 files, it worked!

The item that Dustin doesn't address is how to handle the excess files.  I found that with a little creativity (very little) I was able to use multiple relog commands to combine groups of files into larger files, and then combine those larger files into one final large file:

C:\PerfLogs\Blog>relog Ntirety-SQLServerPerf_000066.blg Ntirety-SQLServerPerf_000067.blg Ntirety-SQLServerPerf_000068.blg Ntirety-SQLServerPerf_000069.blg Ntirety-SQLServerPerf_000070.blg  Ntirety-SQLServerPerf_000071.blg  Ntirety-SQLServerPerf_000072.blg  Ntirety-SQLServerPerf_000073.blg  Ntirety-SQLServerPerf_000074.blg  Ntirety-SQLServerPerf_000075.blg  Ntirety-SQLServerPerf_000076.blg  Ntirety-SQLServerPerf_000077.blg  Ntirety-SQLServerPerf_000078.blg  Ntirety-SQLServerPerf_000079.blg  Ntirety-SQLServerPerf_000080.blg  Ntirety-SQLServerPerf_000081.blg Ntirety-SQLServerPerf_000082.blg Ntirety-SQLServerPerf_000083.blg Ntirety-SQLServerPerf_000084.blg Ntirety-SQLServerPerf_000085.blg Ntirety-SQLServerPerf_000086.blg Ntirety-SQLServerPerf_000087.blg Ntirety-SQLServerPerf_000088.blg Ntirety-SQLServerPerf_000089.blg Ntirety-SQLServerPerf_000090.blg Ntirety-SQLServerPerf_000091.blg Ntirety-SQLServerPerf_000092.blg Ntirety-SQLServerPerf_000093.blg Ntirety-SQLServerPerf_000094.blg Ntirety-SQLServerPerf_000095.blg Ntirety-SQLServerPerf_000096.blg Ntirety-SQLServerPerf_000097.blg  -f BIN -o Combined1.blg 



C:\PerfLogs\Blog>relog Ntirety-SQLServerPerf_000095.blg Ntirety-SQLServerPerf_000096.blg Ntirety-SQLServerPerf_000097.blg Ntirety-SQLServerPerf_000098.blg Ntirety-SQLServerPerf_000099.blg Ntirety-SQLServerPerf_000100.blg Ntirety-SQLServerPerf_000101.blg Ntirety-SQLServerPerf_000102.blg Ntirety-SQLServerPerf_000103.blg Ntirety-SQLServerPerf_000104.blg Ntirety-SQLServerPerf_000105.blg Ntirety-SQLServerPerf_000106.blg Ntirety-SQLServerPerf_000107.blg Ntirety-SQLServerPerf_000108.blg -f BIN -o Combined2.blg 


C:\Perflogs\Blog>relog Combined1.blg Combined2.blg -f BIN -o Combined.blg
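The two-pass approach above can be sketched generically - group the inputs into batches of at most 32, combine each batch, then merge the intermediate files.  The file names and helper functions here are illustrative, not from the original post:

```python
# Sketch of working around relog's 32-input-file limit with a two-pass
# combine. File names below are illustrative placeholders.

def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def relog_commands(blg_files, final_name="Combined.blg", limit=32):
    """Build first-pass relog commands (<= limit inputs each) plus the
    final merge command. Handles up to limit*limit input files."""
    cmds, intermediates = [], []
    for n, group in enumerate(chunked(blg_files, limit), start=1):
        out = f"Combined{n}.blg"
        intermediates.append(out)
        cmds.append("relog " + " ".join(group) + f" -f BIN -o {out}")
    cmds.append("relog " + " ".join(intermediates) + f" -f BIN -o {final_name}")
    return cmds

# 43 daily files, like the example above
files = [f"SQLServerPerf_{i:06d}.blg" for i in range(66, 109)]
cmds = relog_commands(files)
# cmds[0] combines the first 32 files, cmds[1] the remaining 11,
# and cmds[2] merges the two intermediate files into Combined.blg
```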


I was then able to sift through the combined.blg file for the information covering the entire period easily.  After manual examination, I next loaded the blg file into the Performance Analysis of Logs (PAL) Tool for further analysis, but that is the subject for another blog post. :)

One item I haven't been able to run down yet is why the large output file (combined.blg) only has 11795 samples while the two input files have 12659 samples total.  This would be expected behavior/by design if I had applied some type of filter in the relog command, but in this case I did not.  You can see in the previous screenshots (from the initial relogs) that the input samples match the output samples exactly in those cases.  If I find an answer I will update the post with the information.
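One possible explanation worth checking (an editor's hypothesis, not confirmed in the post): the two first-pass relog commands above both include files _000095 through _000097, and three days of overlap at a five-minute interval accounts exactly for the difference if relog deduplicates samples from overlapping time ranges:

```python
# Hypothesis only: files _000095-_000097 were fed to BOTH first-pass relog
# commands above. If relog deduplicates the overlapping samples in the
# final merge, three days of 5-minute samples would vanish from the total.
samples_per_day = 24 * 60 // 5       # 288 samples/day at a 5-minute interval
overlap = 3 * samples_per_day        # files 95, 96, 97 appear twice
print(overlap)                       # 864
print(12659 - 11795)                 # 864 - the observed discrepancy
```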

Hope this helps!

Wednesday, April 23, 2014

Where are my 32-bit Perfmon Counters?

I have run into this before, but not in this same way, so I decided it would be worth writing about.

On a client server they are running 32-bit SQL Server 2005 (I know, not supported - that's a different rant/post) on a 64-bit Windows 2008 installation.  When I logged onto the server for the first time to try to pull some basic data, I was greeted with one of the most-hated items of dealing with 32-bit SQL Server on a 64-bit operating system:

NO PERFMON COUNTERS!!!!!!

Public Service Announcement - Always *always* install your applications with the same architecture as your O/S - meaning always install only 64-bit applications on your 64-bit Windows Server!

Having seen this before, I went to the first item in my 32-bit toolbox - the WOW64 version of Perfmon, located at C:\Windows\SysWOW64\perfmon.msc.  Many people don't seem to know this exists, but it allows you to access the 32-bit version of Perfmon, which is required to see the counters for 32-bit SQL Server (or any other 32-bit application with performance counters) on a 64-bit Windows machine.

No Good.

I placed a shortcut for the 32-bit Perfmon on the desktop for future use (regardless of how I eventually resolved the issue, I knew I would definitely need the 32-bit Perfmon) and went to Management Studio to see if the counters were in sys.dm_os_performance_counters - they were.  This showed me that the counters did exist; if they were completely missing or broken, they wouldn't be visible in the DMV either.

In the past when I had run into this issue, it had been on SQL Express and the counters didn't exist at all - not even in the DMV.  In those situations the answer was to load the counters via lodctr after copying them from another machine of the same version.  Since the counters existed I didn't believe that would be the answer (although eventually it was close - keep reading) so I turned to my Google-Fu to see where I should go next...

...and didn't find much.  A few forum postings and blog articles with different symptoms and different resolutions, but nothing AHA.  In my frustration I went through the steps to repair or reload the existing counters in case some of them were corrupted somehow.

First, I tried to just rebuild the existing counters via lodctr /R but that did not resolve the issue.

Next I went through the full steps to unload and reload the counters as described here by Alex Pinsker (blog/@alex_pinsker).  The catch is that near the end of the process Alex calls for a server reboot, and this was a production server.  I tried several things that I could do without impacting production, including:
  • Re-syncing the counters with WMI (WINMGMT.EXE /RESYNCPERF)
  • Restarting the "Performance Logs and Alerts" service
  • Restarting "Windows Management Instrumentation" (WMI) service
  • Restarting the "Remote Registry" service (which shouldn't have mattered since I was trying to access the counters locally, but it isn't impactful to try)
...but it didn't work.  I gave up for the afternoon after coordinating a reboot for overnight with the client's server team.

A new day, and BINGO - I could see the SQL Server counters in the 32-bit Perfmon via the interactive ("real-time") Performance Monitor after the reboot. 

With counters firmly in hand, I went to the next step - trying to set up a data collector.  I went through the basic steps to set up the collector and started it, and then a short time later I stopped and re-started the collector to view the contents of the perf file.

ARGH- no SQL Server counters.  The Windows hives of counters (Processor, Logical Disk, etc.) were there, but no SQL Server counters.

I had initially opened the perf file by double-clicking it, so, just to make sure, I opened 32-bit Perfmon manually via my shortcut and changed its data source to my perf file just in case that was the issue, but no good.

More than a little Google-Fu later and I finally found something that described my situation pretty well, from a Microsoft CSS Engineer in Romania: http://blogs.technet.com/b/roplatforms/archive/2010/05/03/creating-a-custom-data-collector-set-with-sql-x86-counters-on-an-x64-os.aspx

The engineer, Bogdan Palos, was writing about a situation on Small Business Server (SBS) compared to Windows 2008 X64, but the basic scenario was similar - trying to create a data collector of 32-bit counters on a Windows 2008 X64 installation.

The suggested fix seemed more than a little bizarre:
...copy the 64bit version of “sqlctr90.dll” from another machine running a x64 SQL instance, to the “c:\windows\system32” directory of the Server 2008 x64 OS machine running SQL Server 2005 x86. We recommend using the DLL from similar SP levels of SQL Server
Apparently the files from the x86 SQL Server installation are not enough - you actually need a file from an X64 install?

But guess what - it worked!  After simply copying the sqlctr90.dll file from another SQL Server 2005 X64 of the same version number into the c:\Windows\system32 directory on this server and re-starting my Perfmon collector, it began collecting the SQL Server hives of counters.

Just one more little thing to file under "Seriously, Microsoft?"

What did I learn?
  • 32-bit is evil (already knew this, but it was definitely reinforced).
  • Microsoft will never cease to amaze me in how some things need to be hacked to work.
  • There is always something else to learn!

Hope this helps!