“Our Database Servers are Performing Very Slow”, looks a familiar statement for DBAs, isn’t it?
There are N number of possibilities and ways to find the bottleneck but our focus today would be on WAIT STATS AND QUEUES. Waits tell us where SQL Server is spending most of the time waiting and enables us to focus on most significant performance counters and Queues.
WHAT are QUEUES and WAITS?
QUEUES measure system resources and utilization and it is represented by Performance monitor objects and counters. A look at the highest wait types can give us the perspective to focus on the most relevant performance counters to find the problem area.
WAITS – SQL Server is designed to track the wait information when a query request waits on a resource (IO, Memory, Network, CPU etc.). When a query encounters something that is not available, SQL Server sets wait type for that wait reason. This data is retained in SQL Dynamic Views which we can use to understand why things are slower than expected.
It is important to understand SQL Server process life cycle to interpret wait types.
It’s something like waiting in queue till the person in front finishes, similarly SQL Server SQLOS uses a scheduler to manage execution queue and is mainly divided into 3 states:
Running: A query which is being executed on a CPU is called a running query until it has to wait for a resource, it then moves onto a Suspended state, a query which is waiting for the resources. The goal is to find why and what it is waiting for. Once the resources are available the process moves to a Runnable State,and wait for the CPU time.
WHERE TO FIND?
A combination of following 3 DMV’s can give us the required information.
Sys.dm_os_wait_stats –Primary DMV that aggregates wait stats information at Instance level for all sessions since last SQL Restart or since wait stats last cleared. High numbers in these can indicate potential bottlenecks or hot spots with in the Server Instance. Just click on this link for Query to capture wait stats.
sys.dm_os_waiting_tasks and sys.dm_exec_requests– A query combining these 2 provide us with detailed information at session id level on the wait types, wait time, query that is being executed, objects that are participating etc. Click Here for the query.
WHAT DOES ALL THESE MEAN?
First of all Waits always occur and not all wait types are areas of concern for example the ones related to background tasks. We will discuss some of the most frequent and important ones here. Remember each of these can be discussed individually in greater length but I am going to make it short and simple with some useful links.
SOS_SCHEDULER_YIELD – Occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed. More threads clearly states CPU is under pressure. A simple solution is to add more CPU which may not be feasible all the time. Best is to look for CPU intensive queries and Long Running Queries and fine tune them. Recompiles and related performance counters can be used here as they are CPU intensive. Click this to view detailed discussion.
PAGELATCH_X – Latches are short term lightweight synchronization objects to facilitate the access to In-Memory pages. Some of the memory pages can be in high demand and this mechanism ensures transaction integrity. TempDB and Indexes which are heavily used are the main cause of contention and can be looked upon in PerfMon counter: Average Latch Wait Time (ms), Latch Waits/sec and Total Latch Wait Time (ms)and DMV sys.dm_os_waiting_tasks. Click this to find more.
CXPACKET — indicates that multiple CPUs are working in parallel, dividing up the query in smaller pieces. Ordinarily a well-tuned OLTP application would not parallelize unless an index is missing or there is an incomplete WHERE clause. Check for MAXDOP (Max degree of parallelism) setting, Missing Indexes and the Query Plan to optimize the same. You can limit parallelism by setting MAXDOP to some number less than the total number of CPU’s or MAXDOP = 1. Click this for more.
PAGEIOLATCH_XX – Latency between the disks to memory is the major cause for these wait types. When a query request a page/data that is not in buffer cache, it has to wait for the data to be transferred from disk to buffer. Function Sys.dm_io_virtual_file_stats provides a breakdown of SQL Server reads, writes, and IO_Stalls for a particular database or transaction log file. Related PerfMon Counters are Physical disk: disk seconds/read,Physical disk: disk seconds/write, Buffer Manager: Page Life Expectancy, Buffer Hit Cache Ratio and Memory: Page Faults/sec. Before blaming the disks it’s always better to rule out that IO is not caused by Insufficient Memory. Click Here for detailed discussion.
ASYNC_IO_COMPLETION and IO_COMPLETION – Simply put, tasks are waiting for IO to finish and Disk subsystem is likely a bottleneck here. Counters and DMV discussed in PAGEIOLATCH can be used here as well to further confirm the issue. Placing SQL Log files on different disks, SAN storage lookup and looking at virtual file stats are some of the ways to unearth this issue.
WRITELOG – Usually occurs on a Heavy transactional database while data in the log cache is being flushed to the disk. Checkpoints and Transaction commits are common causes for this wait type. Moving around Database log file to an independent disk really helps here. Look for IO related performance counters to find IO bottlenecks.
LCK_M_XX – When any task is waiting to acquire lock on any resource this wait type occurs. A common reason for the wait is that the resource has already been locked by other tasks. High blocking can lead to time outs and degradation in application performance. Keeping transactions small, using isolation levels and data partitioning are some of possible resolves. Also Cross check if increased blocking is not due to IO or Memory. Sp_Who2 and DMV sys.dm_tran_locks are good places to look at for related information.
In the end
Wait types in correlation with performance counters are a strong tool for debugging performance issues and indicating the possible problem area. There are myriad of wait types and can be discussed in greater detail individually. Putting it all together in a single post is not possible.