Interpreting CPU-Related SQL Server Wait Stats
This is just as important in the analysis of SQL Server performance as it is in economic analysis, the study of medical research data, or financial markets. In this post we will look at how to correctly interpret CPU wait stats to see if they really do indicate a problem and if so, what action can be taken.
In the previous instalment on monitoring SQL Server wait stats we saw the output of sys.dm_os_wait_stats. We saw that signal_wait_time_ms represented the number of milliseconds tasks had to wait for a thread after whatever resources they had been waiting for became available. If a substantial amount of time is spent waiting for a thread, then we might easily imagine that the system CPU cores are not keeping up with demand and are a performance bottleneck. This is not necessarily the case.
Some authors have correctly suggested that the total of the signal wait times is less important than the fraction of the total wait time taken up by signal waits. This is only half the picture. If you aggregate signal wait times over all waits, you will see high ratios that do not represent any sort of performance bottleneck whatsoever.
The following query was taken from a blog discussing cpu-related waits.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS signal_cpu_waits
FROM sys.dm_os_wait_stats
If we run this query we might see a result like this:
Overall Signal Waits Percent
According to the source, this would represent moderately high percentage. Let’s see what happens if we look at the individual signal waits.
SELECT *, CAST(100.0 * signal_wait_time_ms / NULLIF(wait_time_ms,0)
AS DECIMAL(5,2)) AS [signal wait percent]
Signal Wait Percent for Individual Waits
We see that the the first three rows include signal wait times that are not only large, they are exactly equal to the total wait times. This is behavior common for those waits that I have called the “nothing to do” waits. Since the wait times for nothing-to-do waits are often huge, their contribution to the calculated ratio seen in the above query is disproportionately large. This might cause someone to believe there is a CPU problem when in fact there is not. Perhaps worse, any actual CPU problem would make a less significant contribution to this inflated ratio and might go unnoticed.
Note that the fourth wait type on the list is SOS_SCHEDULER_YIELD.. We will consider this shortly.
Three More SQL Server Wait Types
The overall ratio between signal wait times and total wait times is only useful when the nothing-to-do waits have all been filtered out.
Three More SQL Server Wait Types
The overall ratio between signal wait times and total wait times is only useful when the nothing-to-do waits have all been filtered out.
SOS_SCHEDULER_YIELD
For our purposes here, the most important fact is that SQL Server’s thread schedulers always dole out a uniform timeslice of 4 milliseconds called a quantum. Therefore the threads of all but the shortest tasks will relinquish the CPU and, if the system is not busy, will almost immediately reacquire it. Therefore a thread yielding and another task being assigned to a thread is a common occurrence, and a large count of SOS_SCHEDULER_YIELD waits tells us nothing.
Since long-running queries will hit the 4 millisecond thread execution limit often, they contribute greatly to the observance of high SOS_SCHEDULER_YIELD wait counts. As long as everything is optimized appropriately there is nothing wrong with having long-running queries; some tasks simply require more time than others.
Note that in the query results above, the ratio of signal wait to total wait is very high, yet the average wait time, i.e. the total wait divided by the number of waits, is exceptionally low. CPU availability was not a problem here, illustrating another potential danger of relying solely on ratios.
SOS_SCHEDULER_YIELD waits are a good illustration of how, ultimately, many performance conditions cannot be diagnosed on the basis of single measurements from individual tools. Imagine that the Windows performance monitor is showing consistently high CPU activity, perhaps 80-90%. A problem? We cannot tell. There may simply be a lot to do and the CPU is humming away performing those tasks in an orderly and efficient fashion. However, the same CPU percentage might be seen in a situation where there were lots of threads ready to go but not getting a CPU slice in a timely manner. In a situation like this, it would be the combination of high CPU usage from the performance monitor ( or sys.dm_os_performance_counters if you prefer) and high signal waits from sys.dm_os_wait_stats that would be a primary clue.
THREADPOOL
A THREADPOOL wait should rarely be seen on a healthy system. A THREADPOOL wait occurs when there a no threads in the thread pool for assignment to a runnable task. This may occur if the configured maximum number of worker threads is tool small for the workload. Before adjusting the configuration value, however, you should examine whether this is a common condition or has only occurred during a rare period of exceptionally high usage. The maintenance of threads comes at a cost, and the thread maximum should not be adjusted for conditional that occur rarely.
CXPACKET
Conclusion
No comments:
Post a Comment