One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Friday, January 8, 2016

Why is my SQL Server using all of the RAM on the server?

The TL;DR on this is simple:

"BECAUSE IT'S SUPPOSED TO!"

A frequent complaint we receive comes from a client that has an in-house sysadmin monitoring tool like Microsoft SCOM/SCCM.  They turn the tool on and it starts red-alerting because the Windows server hosting SQL Server is at 90%+ used RAM.  The sysadmin (or local DBA) logs on to the server and finds that there is 5GB free (~4%), and the sqlservr.exe process is using 120GB of the 128GB on the server!
Like almost all things SQL Server, this is a simplification of a situation that requires some serious #ItDepends.

  • What is the baseline - how much RAM is usually free on this server?  This can be seen by tracking the Perfmon counter Memory\Available MBytes.
  • What is the Buffer Manager\Page Life Expectancy number - not right this second, but over time?
  • Is there any actual performance issue on the server?
I have written before about the need for baselines, especially regarding memory and CPU use on the Windows server itself.  As a DBA, I don't care that my Windows server is at 95% used RAM *if* neither Windows nor SQL Server is having performance problems *over time*.

I started my IT career on a support/help desk supporting servers and desktops, and 16 years ago being at 5% free RAM probably meant you had 500MB-1GB free, which was definitely an OMG! (although back then we probably didn't actually say OMG.)

Today's servers often have 96GB or more of RAM, and the larger the server, the less relevant a percentage-based measurement is - just because your server has 256GB of RAM doesn't mean Windows is suddenly going to need 25GB+ of RAM to run cleanly.


This brings us back to our three questions above.  Rather than worrying about some artificial guideline like 10% free, how much memory is usually free on this server?  Track Memory\Available MBytes over time and see where it bottoms out - if the regular operation of your server always leaves 2-4GB free, you are probably OK from a Windows point of view.
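If you want to spot-check the same thing from inside SQL Server (a point-in-time number only, not a substitute for tracking the Perfmon counter over time), a quick sketch against the sys.dm_os_sys_memory DMV looks like this:

-- Point-in-time view of what Windows reports about physical memory
-- (a spot check only - the baseline over time is what actually matters)
SELECT total_physical_memory_kb / 1024 AS [Total Physical Memory (MB)],
       available_physical_memory_kb / 1024 AS [Available Physical Memory (MB)],
       system_memory_state_desc AS [Memory State]
FROM sys.dm_os_sys_memory WITH (NOLOCK);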

What is the Page Life Expectancy of the SQL Server instance over time?  As many others have written, PLE is one of the most misused statistics in SQL Server.  A DBA checks Perfmon or runs a DMV query (like this one from the awesome set curated by Glenn Berry (blog/@GlennAlanBerry)):
-- Page Life Expectancy (PLE) value for each NUMA node in current instance  (Query 36) (PLE by NUMA Node)
SELECT @@SERVERNAME AS [Server Name], [object_name], instance_name, cntr_value AS [Page Life Expectancy]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Buffer Node%' -- Handles named instances
AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);
They find that the PLE is some relatively small number and freak out, or worse, they find it is a large number and think "my SQL Server doesn't need this much RAM, I can cut back."

Like Available MBytes, PLE is a number that needs to be measured over time.  On the servers I touch I always recommend setting up a "permanent" Perfmon collector of SQL Server-related counters so that you can look at the numbers over time - see the sketch below for a T-SQL version of the same idea.
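Here is a minimal T-SQL stand-in for that collector - the dbo.MemoryBaseline table is a name I made up for illustration, and you would schedule the INSERT from a SQL Agent job every few minutes:

-- One-time setup: a small history table in a DBA utility database
CREATE TABLE dbo.MemoryBaseline (
    CollectionTime datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
    AvailableMemoryMB bigint NOT NULL,
    PageLifeExpectancy bigint NOT NULL
);
GO
-- Scheduled collection: one row per run, building the baseline over time
INSERT INTO dbo.MemoryBaseline (AvailableMemoryMB, PageLifeExpectancy)
SELECT (SELECT available_physical_memory_kb / 1024 FROM sys.dm_os_sys_memory),
       (SELECT cntr_value FROM sys.dm_os_performance_counters
        WHERE [object_name] LIKE N'%Buffer Manager%'
        AND counter_name = N'Page life expectancy');

With a collection like that in place you can chart the numbers over weeks.  For example, a usual PLE chart over a week can look like this: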
[chart image: PLE over one week]
...or it can look like this:
[chart image: PLE climbing through each day, then dropping sharply overnight]
Even over time, some people look at this second chart and say one of two things:

  • My PLE goes down to 14, what do I do?
  • My PLE maxes out at 50,000, what do I do?
Again, #ItDepends - what is going on in SQL Server during each of these times?  If you look at the above chart you can see that PLE generally climbs during the day (with some minor fluctuations), but then tanks overnight pretty regularly.

Look at when PLE dips to 14 - is that when reindexing or CheckDB runs?  If so, is performance still good enough?  If CheckDB runs in two hours, and your window is three to four hours, isn't that acceptable?  Are other things running at the same time that are negatively impacted, or is that window clear?

If you do need to improve the processes running at that moment (when PLE dips), what is Available MBytes at the same time and over time?  If it is consistently high, there is head room on your server, and you can consider increasing Max Server Memory to help your PLE stay higher.  If Available MBytes is not consistently high, you can't consider increasing Max Server Memory unless you add actual RAM to the server.
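When the baseline does support a change, the change itself is a quick sp_configure call.  A sketch - the 118784 MB value below is purely illustrative for a 128GB server, not a recommendation:

-- See the current setting first ('show advanced options' exposes it)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';
GO
-- Raise it only if Available MBytes shows consistent head room over time
-- (118784 MB = 116GB, an illustrative number for a 128GB server)
EXEC sp_configure 'max server memory (MB)', 118784;
RECONFIGURE;

The setting is dynamic - it takes effect without a restart - which makes it easy to adjust incrementally while you watch the counters.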

This is the 60-second version of quick memory troubleshooting, but think about it - isn't this (even just this small piece of it) a lot more complicated than "I have less than 10% free RAM"?


Is anybody or any process complaining (besides SCCM)?  Are any numbers out of line - for example, is a virus scan that normally takes two hours suddenly taking four or five?  If nothing else is abnormal, is there a problem?

--

As I mentioned at the start of this post, quite often your SQL Server process(es) are simply *supposed* to be using the lion's share of the resources on the Windows server - especially RAM.  As long as you reserve sufficient head room for the operating system (usually 4GB-16GB, as described here by Jonathan Kehayias (blog/@SQLPoolBoy), adjusted by monitoring Available MBytes) then the percentage of RAM used probably doesn't matter.
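If you want a rough starting number in that spirit, here is a sketch that subtracts the conservative 16GB end of that range from total RAM - treat the output as a candidate to validate against Available MBytes, not an answer:

-- Candidate Max Server Memory for a dedicated SQL Server box:
-- total physical RAM minus a 16GB OS reservation (the high end of the
-- 4GB-16GB range above) - then watch Available MBytes and adjust
SELECT (total_physical_memory_kb / 1024) - 16384 AS [Candidate Max Server Memory (MB)]
FROM sys.dm_os_sys_memory;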

...and if you have good sysadmins they will listen to you and realize that SQL Servers (like Oracle, Exchange, and other database servers) are simply different from web servers or file servers.

Hope this helps!




