Monday, 26 December 2016

SQL Database Performance Challenges

There’s a good chance you’ve come across various performance challenges while writing queries and working within SQL if you use it on a regular basis. You might be dealing with slow-moving or poorly-written queries, inefficient subqueries, indexes with tons of writes to them but little or no reads, or different CPU and memory issues. This article will explain how to find and address multiple SQL performance challenges.

Definition Bank


Let’s look at some definitions before we begin:

Wait States or Queue States: A period of waiting that occurs while executing queries or loading resources related to particular tasks. While SQL Server is executing one or more queries or pulling resources, a certain amount of time must be spent both scanning the storage and data and performing the calculation or task at hand. Two common wait types are latch wait types, such as PAGEIOLATCH_EX, which occurs when a task is waiting on a buffer latch for an outstanding I/O request, and CXPACKET wait types, a typical problem related to high server CPU usage caused by poorly-written parallel queries (queries designed to run concurrently). A third common wait type is the WRITELOG wait, which occurs while a SQL session writes the contents of the log cache to the disk where the log is stored.

Locking: In SQL, there are lock resources and lock modes. Lock resources refer to the places where SQL can place locks, and lock modes refer to the kinds of locks that can be placed on those resources, which determine how concurrent tasks and transactions can access them. There are several resources where locks can be placed, such as a row in a table or a key within an index. There are also several lock mode types, such as shared locks and exclusive locks. Some locks are completely fine, but others can be detrimental to performance.

Disk and Network I/O: SQL data and transactions funneling in and out of the disk, cache or through the network. The more there are, the worse the performance can be. However, fine-tuning your queries and indexing can significantly reduce the input and output on the physical and logical disks and network.

Contention: Typically a term related to contention in locking. Locking in SQL helps to ensure consistency when performing read or write tasks in the database, but contention when locking can happen. Contention can occur, for example, when processes are trying to perform updates concurrently on the same page.

High CPU Usage: High server CPU usage as it relates to SQL is directly connected to the SQL processes being run, poor query execution, system tasks and excessive compilation and recompilation of queries. The CPU can also be strained if there are bad indexes in place.

1. Poorly-Written SQL

Because SQL is declarative, you can write the same query in many ways to receive the same results. These differences in how queries are written can affect performance negatively. There are two ways you can rewrite queries to improve performance:

Rewrite SQL to minimize the query footprint so queries run faster.

Rewrite SQL to remove the need for subqueries.

Rewriting Queries

Let’s look at an example of rewriting a query to improve performance. Say you want to find all overlap for a given range. You can do this in a few different ways, but let’s consider these two:

SELECT *
FROM range
WHERE end_time >= @start
AND start_time <= @end

However, this query would obtain the same results, only faster:

SELECT *
FROM range
WHERE (start_time > @start AND start_time <= @end)
OR (@start BETWEEN start_time AND end_time)

The latter can be faster because each branch pins start_time down to a specific range, which gives the optimizer a better chance of using an index seek on start_time rather than scanning the whole table.

Removing Subqueries

A subquery is a query nested inside another query. Subqueries have some inherent performance pitfalls: they can be slower, and a NOT IN subquery that returns NULL values will silently match no rows at all. However, subqueries can often be rewritten as joins to avoid such problems. Below is an example of a subquery rewritten as a join:

SELECT * FROM employeeTable WHERE id NOT IN (SELECT id FROM employeeTable2);

SELECT * FROM employeeTable WHERE NOT EXISTS (SELECT id FROM employeeTable2 WHERE employeeTable.id=employeeTable2.id);

Can be turned into:

SELECT employeeTable.*
FROM employeeTable LEFT JOIN employeeTable2 ON employeeTable.id=employeeTable2.id
WHERE employeeTable2.id IS NULL;

The join achieves the same results without the use of a subquery, which generally performs better and sidesteps the NULL pitfall of NOT IN.

2. Bad Indexes

If the number of writes to a table and its indexes greatly exceeds the number of reads, there is a good chance the underlying indexes are harming overall performance. Every time a write is performed on an indexed SQL column, a corresponding modification must also be made to that column's indexes. If most of the activity is write activity, it may be worth removing or altering the indexes involved; doing so would likely improve performance by reducing the write overhead. Find bad indexes by using Dynamic Management Views to analyze index usage statistics. After finding the indexes that have many writes but zero or few reads, consider dropping those indexes to improve performance.
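As a sketch of that DMV approach (the write-versus-read threshold here is illustrative), the following query surfaces indexes in the current database whose writes far outnumber their reads:

```sql
-- Indexes with heavy write activity but little or no read activity.
-- Stats in sys.dm_db_index_usage_stats reset when the instance restarts,
-- so make sure the server has been up long enough to be representative.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_updates,                                        -- writes
       s.user_seeks + s.user_scans + s.user_lookups AS user_reads
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND i.is_primary_key = 0          -- don't suggest dropping constraints
  AND i.is_unique_constraint = 0
  AND s.user_updates > (s.user_seeks + s.user_scans + s.user_lookups)
ORDER BY s.user_updates DESC;
```

Indexes near the top of this list, with high user_updates and near-zero reads, are the candidates worth reviewing before dropping.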

3. Locking Contention

Locking contention can occur when processes try to perform lock updates concurrently on the same lock resources. Furthermore, when two or more tasks block one another because each holds a lock on a resource on which the other is attempting to place a lock, a deadlock can occur, because neither task can proceed.

In server terms, this means multiple requests from multiple users. The likelihood of conflicts grows over time, because numerous processes can request access to the same database resources concurrently.

Here are some tips for detecting contention and deadlocks:

Use a database information tool for detection. According to Microsoft, “…the Database Engine provides monitoring tools … trace flags, and the deadlock graph event in SQL Server Profiler.”

Ensure that all database resources are accessed in the same order every time.

Make any required data updates before beginning a task or transaction, rather than midway through it.

Don’t allow the data to be used by other sessions during tasks and transactions.

Limit or avoid the use of cursors when coding.

Keep your transactions small.
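As a concrete starting point for the detection tip above, trace flag 1222 is one of the monitoring options Microsoft mentions, and sys.dm_tran_locks shows lock requests that are currently waiting. A sketch:

```sql
-- Write deadlock details to the SQL Server error log (trace flag 1222);
-- the -1 argument applies the flag globally, to all sessions.
DBCC TRACEON (1222, -1);

-- Inspect lock requests that are currently blocked, a sign of contention.
SELECT request_session_id,
       resource_type,
       resource_database_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT';
```

If the second query regularly returns rows for the same resources, those are the objects to examine for access-order and transaction-length problems.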

4. Memory Problems

As a general rule, SQL Server is a memory hog, even if there is a good deal of memory installed in the server. To improve performance and reduce disk I/O, SQL Server automatically stores data in memory caches; once a page is read from the drive, it stays cached until it is modified or memory pressure forces it out. If your SQL server has enough memory to cache the entire database, SQL Server will cache the whole thing. Redundant code, NULL-heavy results, and queries with far more writes than reads all push extra data through these caches, driving memory usage up and performance down.

Here is what you can do to check and manage memory usage:

Open SQL Server Management Studio and connect to the correct server.

Choose Windows Authentication from the Authentication menu.

Choose View and then Object Explorer, then right-click on the name of the server and choose Properties. Click Memory on the left, and you will see different options, including “Use AWE to allocate memory,” “Minimum Server Memory” and “Maximum Server Memory.”

Check "Use AWE to allocate memory" only if using a 32-bit server with more than 4GB of memory; leave it unchecked on a 64-bit server, where it has no effect. Check that the Minimum Server Memory is set to zero. Don't change that number; if it has been changed, do some research to figure out why, because it is not recommended to change it or to set the minimum and maximum to the same amount. If you need to reduce the Maximum Server Memory, reduce the minimum as well. If you need to increase the maximum, keep the value of the minimum the same.

The last option, Maximum Server Memory, will have a default value. If you are trying to reduce the memory that SQL Server can use, you can set it to anything you want. Keep in mind that if you set it lower, performance will dip lower as well. Sometimes administrators will set this to a lower number, say 6GB, to free up memory for the OS to operate smoothly. You may want to test performance based on changing minimums and maximums to see what works best for your particular setup.
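The same cap can be set with T-SQL instead of the Management Studio dialog. A sketch, using the hypothetical 6GB figure from above:

```sql
-- 'max server memory (MB)' is an advanced option, so expose those first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap SQL Server's memory use at 6 GB (6144 MB, the example figure above).
EXEC sp_configure 'max server memory (MB)', 6144;
RECONFIGURE;
```

The change takes effect without a restart, which makes it convenient for the kind of testing of different maximums suggested above.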

5. High CPU Usage

High CPU usage can occur for a number of reasons, but the following commonly cause it:

Poorly-written queries

High compilations and recompilations

Heavy use of temporary tables

System threads causing high spikes in CPU

Because we’ve already gone over poorly-written queries and touched upon high compilations and recompilations, let’s look at the other possible issues:

Heavy use of temporary tables: As queries become more complex, temporary tables tend to be used more. Where possible, eliminate temporary tables in favor of a single query to reduce CPU load.

System threads: Check this by running select * from sys.sysprocesses where spid < 51 (SPIDs below 51 are reserved for system processes) and look for any system thread issues.
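To illustrate the temporary-table point above (table and column names here are hypothetical), an aggregation staged through a temp table can often be collapsed into one statement with a common table expression:

```sql
-- Staged version: materialize an intermediate result in tempdb ...
SELECT id, SUM(amount) AS total
INTO #order_totals
FROM orders
GROUP BY id;

SELECT c.name, t.total
FROM customers AS c
JOIN #order_totals AS t ON t.id = c.id;

-- ... versus a single query using a CTE, with no tempdb round trip:
WITH order_totals AS (
    SELECT id, SUM(amount) AS total
    FROM orders
    GROUP BY id
)
SELECT c.name, t.total
FROM customers AS c
JOIN order_totals AS t ON t.id = c.id;
```

The single-query form gives the optimizer the whole problem at once, which is often (though not always) cheaper than materializing an intermediate result.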

Resource Monitor can also consume a high amount of CPU when virtual memory runs low. Specific troubleshooting information can be found on Microsoft’s support website.

Monitoring SQL Server Wait Stats: Interpreting CPU Waits




A stockbroker once told me he had made a poor decision because he had been misled by the numbers. A curious choice of words, indeed; it was as if he were an innocent bystander who had been deceived by the nefarious digits on some spreadsheet. In any sort of analysis, the responsibility lies firmly upon the analyst for the correct interpretation of the data represented by the numbers.

This is just as important in the analysis of SQL Server performance as it is in economic analysis, the study of medical research data, or financial markets. In this post we will look at how to correctly interpret CPU wait stats to see if they really do indicate a problem and if so, what action can be taken.

In the previous instalment on monitoring SQL Server wait stats we saw the output of sys.dm_os_wait_stats. We saw that signal_wait_time_ms represented the number of milliseconds tasks had to wait for a thread after whatever resources they had been waiting for became available. If a substantial amount of time is spent waiting for a thread, then we might easily imagine that the system CPU cores are not keeping up with demand and are a performance bottleneck. This is not necessarily the case.

Some authors have correctly suggested that the total of the signal wait times is less important than the fraction of the total wait time taken up by signal waits. This is only half the picture. If you aggregate signal wait times over all waits, you will see high ratios that do not represent any sort of performance bottleneck whatsoever.

The following query was taken from a blog discussing CPU-related waits.

SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS signal_cpu_waits
FROM sys.dm_os_wait_stats

If we run this query we might see a result like this:


[Screenshot: Overall Signal Waits Percent]

According to the source, this would represent a moderately high percentage. Let’s see what happens if we look at the individual signal waits.

SELECT *, CAST(100.0 * signal_wait_time_ms / NULLIF(wait_time_ms,0)
AS DECIMAL(5,2)) AS [signal wait percent]
FROM sys.dm_os_wait_stats




[Screenshot: Signal Wait Percent for Individual Waits]

We see that the first three rows include signal wait times that are not only large but exactly equal to the total wait times. This behavior is common for the waits that I have called the “nothing to do” waits. Since the wait times for nothing-to-do waits are often huge, their contribution to the ratio calculated in the query above is disproportionately large. This might lead someone to believe there is a CPU problem when in fact there is not. Perhaps worse, any actual CPU problem would make a less significant contribution to this inflated ratio and might go unnoticed.

Note that the fourth wait type on the list is SOS_SCHEDULER_YIELD. We will consider this shortly.

Three More SQL Server Wait Types

The overall ratio between signal wait times and total wait times is only useful when the nothing-to-do waits have all been filtered out. 
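A sketch of such a filtered ratio follows; the exclusion list below names some commonly cited nothing-to-do wait types and is illustrative, not exhaustive:

```sql
-- Signal-wait ratio with common "nothing to do" waits filtered out.
-- These wait types accumulate on idle background tasks and say nothing
-- about CPU pressure; extend the list as appropriate for your instance.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / NULLIF(SUM(wait_time_ms), 0)
       AS NUMERIC(20,2)) AS signal_cpu_waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
    'REQUEST_FOR_DEADLOCK_SEARCH', 'LOGMGR_QUEUE',
    'CHECKPOINT_QUEUE', 'BROKER_TO_FLUSH', 'BROKER_TASK_STOP',
    'DIRTY_PAGE_POLL', 'SLEEP_TASK', 'WAITFOR',
    'XE_DISPATCHER_WAIT', 'XE_TIMER_EVENT');
```

With the idle waits removed, a high ratio is far more likely to reflect genuine pressure on the schedulers.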


SOS_SCHEDULER_YIELD

We don’t have time or space right now to discuss the details of user mode scheduling (UMS) of thread execution; too bad, because it is a fascinating topic. UMS is based on the concept that applications know better than the operating system when and how to shift execution to another thread, and can therefore reduce the costly kernel context switches required by the operating system. UMS does this by manipulating the priority of the waiting threads, so that the operating system will always give a time slice to the thread SQL Server wants it to. UMS was introduced in SQL Server 7.0, and for years SQL Server was pretty much the only program to use it. However, the Windows folks at Microsoft took notice, thought it was a great idea, and with the introduction of Windows 7 and Windows Server 2008 R2 the operating system itself included a new set of API functions to provide UMS to any program that wants it.

For our purposes here, the most important fact is that SQL Server’s thread schedulers always dole out a uniform timeslice, called a quantum, of 4 milliseconds. The threads of all but the shortest tasks will therefore relinquish the CPU and, if the system is not busy, almost immediately reacquire it. A thread yielding and another task being assigned to the thread is thus a common occurrence, and a large count of SOS_SCHEDULER_YIELD waits by itself tells us nothing.

Since long-running queries will hit the 4 millisecond thread execution limit often, they contribute greatly to the observance of high SOS_SCHEDULER_YIELD wait counts. As long as everything is optimized appropriately there is nothing wrong with having long-running queries; some tasks simply require more time than others.

Note that in the query results above, the ratio of signal wait to total wait is very high, yet the average wait time, i.e. the total wait divided by the number of waits, is exceptionally low. CPU availability was not a problem here, illustrating another potential danger of relying solely on ratios.

SOS_SCHEDULER_YIELD waits are a good illustration of how, ultimately, many performance conditions cannot be diagnosed on the basis of single measurements from individual tools. Imagine that the Windows performance monitor is showing consistently high CPU activity, perhaps 80-90%. A problem? We cannot tell. There may simply be a lot to do, and the CPU is humming away performing those tasks in an orderly and efficient fashion. However, the same CPU percentage might be seen in a situation where there were lots of threads ready to go but not getting a CPU slice in a timely manner. In a situation like this, it would be the combination of high CPU usage from the performance monitor (or sys.dm_os_performance_counters, if you prefer) and high signal waits from sys.dm_os_wait_stats that would be a primary clue.


THREADPOOL

A THREADPOOL wait should rarely be seen on a healthy system. A THREADPOOL wait occurs when there are no threads in the thread pool available for assignment to a runnable task. This may occur if the configured maximum number of worker threads is too small for the workload. Before adjusting the configuration value, however, you should examine whether this is a common condition or has only occurred during a rare period of exceptionally high usage. The maintenance of threads comes at a cost, and the thread maximum should not be adjusted for conditions that occur rarely.
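One way to see whether the thread ceiling is actually being approached is to compare the configured maximum against the workers currently alive, using two standard DMVs. A sketch:

```sql
-- Configured worker-thread ceiling versus workers currently in existence.
-- If current_workers routinely sits near max_workers, THREADPOOL waits
-- are plausible; if not, look elsewhere (e.g. blocking chains tying up threads).
SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers,
       (SELECT COUNT(*) FROM sys.dm_os_workers)           AS current_workers;
```

Note that THREADPOOL waits often stem from a blocking chain consuming all the workers rather than from a genuinely undersized pool, which is another reason to investigate before raising the maximum.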

CXPACKET

CXPACKET waits can occur when SQL Server is trying to synchronize multiple threads for a query with a parallel execution plan. The response to CXPACKET waits depends on the query itself. Might the query benefit from additional indexes? Are there any flaws in the query that might cause problems, such as inappropriate data types? If the query seems OK and the indexes seem suitable, it may be possible to reduce CXPACKET waits by adjusting the maximum degree of parallelism.
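The maximum degree of parallelism can be adjusted server-wide or per query. A sketch, in which the value 4 and the table in the hint example are hypothetical:

```sql
-- Server-wide: cap parallel plans at 4 threads (0 means "use all available cores").
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Or per query, with a hint (hypothetical table):
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
OPTION (MAXDOP 4);
```

The per-query hint is usually the safer first step, since it confines the change to the statement that is actually producing the CXPACKET waits.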

Conclusion

Wait stats relating to CPU usage rarely provide a direct indicator of a single problem. At best, CPU wait stats can call attention to the possibility of problems, but only further investigation of the queries involved and the server workload can determine whether problems actually exist and, if so, what action is to be taken.