Microsoft KB Archive/71052

INF: Understanding the Microsoft SQL Server Optimizer ID Number: Q71052

1.10 OS/2

Summary:

The following information is reprinted with permission from The Cobb Group’s “Microsoft Networking Journal” ?, January 1991. For a sample issue of the journal, call The Cobb Group at (800) 223-8720.

================================================================ Understanding the Microsoft SQL Server Optimizer ================================================================

One of the most important features of Microsoft’s new SQL Server 1.1 database engine is its ability to automatically optimize data manipulation queries. A data manipulation query is any type of query that supports the WHERE or HAVING key words in SQL (Structured Query Language), such as SELECT, DELETE, and UPDATE. The Microsoft SQL Server query optimizer finds the most cost efficient query execution plan for manipulating the data. The optimizer also must accomplish the optimization process itself within an acceptable length of time. It does this in three phases:

o Query Analysis - breaks the query down into different types of clauses.

o Index Selection - finds which indexes are useful for each of the different clauses, and determines whether they are cost effective enough to use.

o Join Selection - looks at all possible plans, estimates their associated costs, and saves the most cost effective plan for execution.

To have a clear understanding of how the optimization process for SQL Server actually works, we need to look at each of these phases in detail.

Query Analysis

Query analysis breaks down the initial query that has been input by looking for clauses within the query that it can optimize, such as the search clause

column_1 = 45

or the join clause

table1.column_2 > table2.column_4

SQL Server can optimize clauses that limit a scan. You can use several relational operators that produce perfectly valid SQL queries, but they will contain some non-optimizable clauses. In these cases, the optimizer uses indexes for the portions of the query that it can break into optimizable clauses, but SQL Server will execute the non-optimizable portions using table scans.

Operators that are difficult to optimize include NOT = and OR. Because NOT = is an exclusive rather than an inclusive relational operator, it cannot be used to limit a scan. The entire table must be scanned to determine which tuples qualify. The query analyzer cannot optimize any clause that includes NOT =. If you use OR to join conditions for columns in different tables, the optimizer cannot logically use it to limit a scan and therefore cannot accomplish optimization.

Index Selection

If input queries contain any optimizable SQL syntax, the query will be broken into its component clauses, and the optimizer will then perform index selection for the optimizable clauses. Before SQL Server will use a table index with the associated clauses, it must pass three tests during index selection.

Is the index useful?

First, the index must be useful. This implies that there is a description of the index in the database System tables, and that the columns in the clause match a prefix of the columns in the index itself. In order to better understand this, let’s use the example of a phone book. Let’s assume you used SQL syntax to create an index on a phone book table called phone_book that contained, among other columns, last_name and first_name for every table entry:

create index name_index on phone_book (last_name, first_name)

This index would be useful for performing a selection that covered the entire index, i.e., a selection including both last_name and first_name. It also would be useful for a selection that matches a prefix of the index, i.e., a selection including just last_name, such as looking up all of the people with the last name of “Smith”. It would not, however, be useful for a selection that did not match a prefix of the index, such as a selection on first_name. In other words, this index would not help you find all the people in the phone book with the first name of “George”. You would still need to start at the beginning of the book and read each entry to determine if it met the qualifications. The obvious point here is that by analyzing the types of queries that are anticipated on a particular table and providing useful indexes, a developer can help SQL Server improve performance and efficiently optimize more ad hoc queries.

How selective is the clause?

If the optimizer finds a useful index for a query clause, index selection can satisfy the second test: determining the selectivity of the clause. Selectivity is an estimate of the percentage of the tuples in a table that will be returned for the clause. This is one of the most useful, and often underused features of SQL Server. Each index has an associated distribution page that stores index statistics. These statistics can provide a good estimate of the selectivity of the clause, but only if the information contained on the index distribution page is accurate and up-to-date for the table.

If a developer establishes an index when creating a table – before any data has been entered – there will be no associated distribution page created for the index. Simply entering data into the table will not cause SQL Server to create statistical data. Because no statistical information is available, the optimizer will use standardized selectivity values based on the type of relational operator used in the clause. While these estimated selectivity values will help the optimizer decide whether or not to use the associated indexes for a clause, they are average values that may not reflect the actual distribution of the table. In order to provide SQL Server with accurate statistics that reflect the actual tuple distribution of a populated table, the developer or system administrator needs to update the statistical information for the table by executing the SQL Server UPDATE STATISTICS command. For our phone_book example, this would be accomplished by executing either

update statistics phone_book name_index

to update the statistics for the single index name_index, or

update statistics phone_book

to update the statistical information for every index associated with the table phone_book. Once the developer or administrator establishes statistics for a particular table, the optimizer will use them to determine selectivity for clauses.

For dynamic tables where data distribution is changed through insertions, deletions, and updates, the statistical information may become invalid over time. The system administrator is responsible for understanding the uses of the various objects in the database and executing the UPDATE STATISTICS command as appropriate to keep the statistical information current for the table. The frequency and type of usage incurred by each object will determine the frequency with which this needs to be done.

How much does it cost ?

Index selection will use the estimated selectivity of each clause and the type of index that is available to calculate the cost in terms of page accesses for the clause. This is the final test in terms of index selection. If the estimated cost of using an available index for the clause is less than a straight table scan of the data, the optimizer will use the index. This is another area where analysis of the anticipated queries on a table can allow the developer to assist the SQL Server optimizer.

A standard non-clustered index, at the leaf level, contains a series of pointers to the associated data pages in the table that stores the rows. Because there is no relationship between a non-clustered index and the order in which the data is stored in the table, the cost of using a non-clustered index includes an additional page access for each qualifying row. As a result of this associated page access overhead, when the selectivity of a clause reaches 10 percent to 15 percent of the table size of a smaller table, it is less expensive for the optimizer to execute a table scan than to use the index to access the data. There are, however, several index characteristics a developer should consider that can improve the performance of indexed data retrieval and potentially improve throughput for data manipulation queries.

If a non-clustered index covers a query (i.e., both the selection and conditional column information are part of the created index), there will not be an additional page access cost associated with the use of the index. In our phone_book example above, the name_index index would cover the query

select last_name from phone_book where first_name = “George”

and therefore would perform more efficiently for this query and improve the data retrieval performance.

Each table may have one clustered index. Because there is a one-to-one correspondence between a clustered index and the order in which data is stored in the table (in fact, the leaf level of a clustered index maps directly to the stored data pages), there is no additional page access cost associated with this index for each qualifying row. For tables where a clustered index is appropriate, this can again improve data manipulation query performance. A clustered index forces the table to store tuples in sorted order, so there is both sorting and page splitting overhead associated with inserts, deletes, and updates that may make this type of index inappropriate for either extremely dynamic or extremely large tables that are subject to modification.

Expansion overhead for a clustered index can be reduced using SQL Server’s FILLFACTOR feature. Using FILLFACTOR reserves a percentage of free space on each data page of the index to accommodate future expansion. This is a tuning issue that the developer and system administrator will need to address when considering the use of clustered indexes.

Finally, if a unique index is created for the table so that no duplicate index values can be inserted, there is a special case that reduces processing costs. If all of the keys in the WHERE clause use the equality relational operator and are contained in the unique index, the optimizer knows that this query can match only one row in the table. In this case there is a minimal fixed cost involved in using the unique index to process the query.

Once all of the clauses have been processed for index selection and have an associated processing time, the optimizer is ready to perform join selection. The optimizer compares all combinations of the clauses, then selects the join plan with the lowest estimated processing costs. Because the combinations of the clauses can grow exponentially as the complexity of a query increases, SQL Server uses tree pruning techniques to minimize the overhead associated with these comparisons.

Using Tuning Tools

SQL Server includes several tuning tools that allow users to view this internal optimization process. After executing the command

set showplan on

All queries sent to SQL Server will return not only the expected results, but also a description of the query execution plan the optimizer used to produce these results. For each step of the execution plan, this description shows whether the optimizer decided to use an existing index or a table scan to retrieve the results. After SQL Server executes the command

set statistics time on

it displays the time taken to parse and compile each statement, along with the time that it took to execute each step in the query execution plan. These two commands used together can help users modify their queries and index structures to further improve the performance of the SQL Server database engine.

As you can see from this discussion of SQL Server optimization, there are no hard and fast rules – such as “when the number of rows exceeds 100” – that you can use to determine when SQL Server will take advantage of indexes to assist in query processing. However, by understanding the underlying processes of the optimizer and analyzing the types of queries being performed on the tables, you can create specific indexes for your data tables that will help your SQL Server optimizer improve system throughput.