Microsoft KB Archive/285996

= Cost of using non-clustered indexes may be incorrect if data correlation exists =

Article ID: 285996

Article Last Modified on 9/5/2007

-

APPLIES TO


 * Microsoft SQL Server 7.0 Standard Edition
 * Microsoft SQL Server 2000 Standard Edition

-



This article was previously published under Q285996



SYMPTOMS
In scenarios where there is a data correlation between the order of the non-clustered index entries and the rows in the base table, the optimizer may choose a plan that performs a table scan or clustered index scan rather than using a qualifying non-clustered index to filter the data. Data correlation means that there is a relationship between the key values of the indexes such that they provide close to the same ordering of the data, instead of a random distribution.



CAUSE
The SQL Server optimizer is built on the assumption that there is no data correlation between indexes, and therefore it estimates the cost in physical IO of using a non-clustered index to retrieve the data rows to be 1 physical IO operation for every qualifying result row.



For a table scan, the SQL Server optimizer computes the physical IO operations to be equal to the number of pages in the table, regardless of the estimated number of rows that meet the filter conditions. However, this is adjusted by the read-ahead capability, which allows a single physical IO to retrieve many pages (up to 8) and to use sequential IOs instead of random IOs. Based on this, it is easy to see that if the number of rows matching the filter conditions through a non-clustered index exceeds the number of physical IOs needed to scan the table (roughly the number of pages in the table divided by the read-ahead factor), it is cheaper, and ultimately faster, to use a table scan.
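The comparison above can be sketched with a back-of-the-envelope calculation. The page count, read-ahead factor, and qualifying row count below are assumed numbers for illustration, not values from this article:

```sql
-- Hypothetical numbers to illustrate the optimizer's IO estimates:
-- a table of 1,600 pages, read-ahead fetching up to 8 pages per physical IO,
-- and 1,000 rows estimated to qualify against the non-clustered index.
declare @pages int, @readahead int, @qualifying int
select @pages = 1600, @readahead = 8, @qualifying = 1000

select 'table scan' as plan_choice, @pages / @readahead as estimated_physical_IOs   -- 200 IOs
union all
select 'nc index lookup', @qualifying                                               -- 1 IO per qualifying row
```

Because 1,000 estimated lookup IOs cost more than 200 scan IOs, the optimizer chooses the table scan, even though correlated data would make the lookups nearly sequential and therefore much cheaper in practice.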



WORKAROUND
You can use the following suggestions to help the SQL Server optimizer choose the best plan when a data correlation exists.

Note The items are listed from least impact to client applications to the most impact.
 * Reorder index key columns so that the key column with the highest cardinality is the first column of the index.
 * Create indexes to cover the queries. Note that on SQL Server 7.0 and SQL Server 2000, multiple single-column or multi-column indexes can be combined to make a covering index.
 * Add index hints to the queries.
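Against the example table x used later in this article, the last two workarounds might look like the following sketch (the index name x_cover is hypothetical):

```sql
-- Covering index: the query below reads only xid and xrand, so an index on
-- both columns can answer it without any bookmark lookups.
create index x_cover on x (xid, xrand)
go

-- Index hint: force the non-clustered index by name (SQL Server 7.0/2000 syntax).
select xid, xrand
from x (index = x_cover)
where xid between 1000 and 2000
go
```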



MORE INFORMATION
There are two types of data correlation problems that have been discovered which cause the SQL Server optimizer to overestimate the physical IO used by a non-clustered index:
 * A single table where the non-clustered index key is correlated to the clustered index key.
 * A single table that is a heap, where the values entered in the non-clustered index key columns are ever-increasing or ever-decreasing and there are very few update or delete operations on the table.

The following examples help to illustrate this situation and are all based on the following table script:

    use pubs
    go
    create table x (
        xid int identity(1,1),
        xrand int,
        xdate datetime,
        xdata char(100))
    go
    --insert 100k rows
    set nocount on
    go
    begin transaction
    declare @x int
    select @x = 0
    while @x < 100000
    begin
        insert x (xrand,xdate,xdata) values (rand() * 100000, getdate(), 'a')
        if @x % 10000 = 9999
        begin
            commit transaction
            begin transaction
        end
        select @x = @x + 1
    end
    commit transaction
    go
    set nocount off
    go

It is important to note this information about the examples that follow:
 * The database needs to be set offline after the indexes are created in each test to make sure that the pages of the table are not in the SQL Server cache.
 * On each query there is an index hint that is commented out. Run the scenario once with the comment in place to see the optimizer's choice, and then uncomment the index hint to view the other behavior.
 * The SET STATISTICS PROFILE ON statement causes the query to generate a second result set that contains the query plan and costing information.
 * Because the data in some columns is random, the results are slightly different each time new data is generated.
 * For the best results, use the SQL Server Query Analyzer tool and set the results to Grid format instead of Text.
 * To get a more accurate gauge of the actual performance, uncomment the statements that wrap around the query:

       --declare @x datetime
       --select @x = getdate()
       --select datediff(ms,@x,getdate())

   This causes the query execution time, including compilation, to display in milliseconds.
 * All query cost values in the examples that follow are approximate and are rounded to 2 decimal places.


The first example shows that the optimizer is correct in choosing the table scan over the non-clustered index to locate the rows when no data correlation exists; this is the basis for comparison with the examples that follow. The row-locating operation is referred to as a Bookmark Lookup operator in SQL Server 7.0 and SQL Server 2000. In this example, on SQL Server 2000, the table scan had a cost of 1.22 and executed in 2.5 seconds, while using the non-clustered index had a cost of 5 and executed in 8.3 seconds.


    drop index x.xrand
    drop index x.xid
    drop index x.xdate
    go
    create clustered index xid on x(xid)
    go
    create index xrand on x(xrand)
    go
    use master
    go
    exec sp_dboption 'pubs', offline, true
    exec sp_dboption 'pubs', offline, false
    go
    use pubs
    go
    set statistics profile on
    go
    --declare @x datetime
    --select @x = getdate()
    select * from x --(index=2)
    where xrand between 1000 and 2000
    --select datediff(ms,@x,getdate())
    go
    set statistics profile off
    go

The second example shows the optimizer choosing a table scan instead of using the non-clustered index to locate the rows when a data correlation exists between the clustered index and the non-clustered index. In this example, on SQL Server 2000, the cost of the table scan was 1.33 and the cost of using the non-clustered index was 5.25, so the optimizer did choose the plan with the cheapest estimate. However, the table scan took 2.3 seconds to execute and the non-clustered index took 0.2 seconds. So, with the data correlation, using the non-clustered index is faster than performing a table scan.

Despite its higher estimated cost, the non-clustered index seek is faster because the bookmark lookup benefits from the data being arranged sequentially. Typically, a bookmark lookup is an expensive operator because each key read from the non-clustered index causes a random I/O against the base table. If the base table is large and there are many bookmark lookups, the same page may incur multiple physical I/Os because the page is aged out of cache between accesses. When the non-clustered index keys are correlated with the base table, the bookmark lookup behaves more like sequential I/O: each page of the base table never requires more than one physical I/O, and any secondary accesses for additional lookups occur immediately after the first lookup.

    drop index x.xrand
    drop index x.xid
    drop index x.xdate
    go
    create clustered index xdate on x(xdate)
    go
    create unique index xid on x(xid)
    go
    use master
    go
    exec sp_dboption 'pubs', offline, true
    exec sp_dboption 'pubs', offline, false
    go
    use pubs
    go
    set statistics profile on
    go
    --declare @x datetime
    --select @x = getdate()
    select * from x --(index=2)
    where xid between 1000 and 2000
    --select datediff(ms,@x,getdate())
    go
    set statistics profile off
    go

The third example shows the optimizer choosing a table scan instead of using the non-clustered index to locate the rows when a data correlation exists between the non-clustered index and either the data entry order or the row order left over from a dropped clustered index. In this example, on SQL Server 2000, the cost of the table scan was 1.38 and the cost of using the non-clustered index was 5.25, so the optimizer did choose the plan with the cheapest estimate. However, the table scan took 1 second to execute and the non-clustered index took 0.14 seconds. So, once again with the data correlation, using the non-clustered index is faster than performing a table scan.
    drop index x.xrand
    drop index x.xid
    drop index x.xdate
    go
    create index xid on x(xid)
    go
    use master
    go
    exec sp_dboption 'pubs', offline, true
    exec sp_dboption 'pubs', offline, false
    go
    use pubs
    go
    set statistics profile on
    go
    --declare @x datetime
    --select @x = getdate()
    select * from x --(index=2)
    where xid between 1000 and 2000
    --select datediff(ms,@x,getdate())
    go
    set statistics profile off
    go

To detect data correlations, you can check the following:
 * The index is on values that are always increasing or decreasing in the table itself (either the clustered index or the heap). You can also detect this by running a query with an order by clause that matches one of the index key columns and then visually examine the other index key columns in the results to see if they are also in order.
 * Whether hinting the non-clustered index, with no table or index pages in cache, is faster than the same query running with a table scan.
 * For composite indexes, the first few index columns are the same in both indexes.
 * If there is some known relationship where the key of one index affects the domain of another index.
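For the first check, against a table like the one in the second example (clustered on xdate, non-clustered on xid), you can sort by one key column and visually confirm whether the other rises with it. The query below is an illustrative sketch, not part of the original test scripts:

```sql
-- If xid and xdate are correlated, the xid values should come back in
-- (nearly) ascending order when the rows are sorted by xdate.
select top 100 xid, xdate
from x
order by xdate
```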
