Indexing for Dummies

From time to time these days I have to put my DBA hat back on and review database schemas designed by those whose focus is on application development. What I often find is that attempts to improve performance through indexing have been in vain: either the query optimiser doesn’t use the indexes enough to merit their creation, or they have been created on the wrong columns.

Optimisation is frequently referred to as a dark art, but here I am going to attempt to give some basic pointers to help enlighten the novice database designer…


Clustered Indexes:

Clustered indexes are structured around the data pages of the table they are built on (or, more accurately, the data pages are organised according to the sort order of the index). Moreover, the table’s data pages are actually the leaf level nodes of the index itself, with the root and intermediate nodes containing pointers to either the index node below or the data pages of the underlying table. For this reason, when you build a clustered index, the data pages are organised into the order of the clustered index column(s). With this in mind, here are several points about their use:

  1. If a table is going to receive frequent inserts, good candidates are columns whose values increase in the same order as the inserts take place. This means that new data pages can be appended at the end of the table, rather than rows being inserted in the middle, thereby reducing the page splits and fragmentation that can occur. Often this will be an identity column, but not always. There may be a business key that fits this requirement and is also part of frequent range scanning queries. An example of this might be the date key in a calendar dimension.
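  2. Don’t create them on columns whose size is potentially large. This includes GUIDs and large (N)VARCHAR and (N)CHAR columns. The larger the index key, the bigger the maintenance overhead, and because the clustered index key is the row locator in every non clustered index on the table, a wide key makes those indexes larger too.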
  3. Although clustered indexes are created on the primary key by default, sometimes you are better off dropping this index and creating it on other columns instead. An example of this would be where an application often queries data in a certain sort order or from a certain range. In this instance, choosing the sort or the range key may be beneficial.
  4. There is an overhead in maintaining any index. If the column you choose is updated regularly, then there will be an overhead while the index data pages (and therefore the table) are reordered and the row locators in the dependent non clustered indexes are updated for the new position of the data.
  5. Consider creating a clustered index on a view where aggregations and joins are part of the definition. This can be an excellent tool for query-intensive OLAP operations, as the result set is stored in the index. Remember, though, that maintaining this type of index is expensive, so it isn’t suitable where the underlying data changes frequently.
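
To make point 1 a little more concrete, here is a toy Python simulation of why ever-increasing keys are kinder to a clustered index than keys that arrive in random order. It is only a sketch under loud assumptions: the page capacity of 4 rows, the 50/50 split rule and the function name simulate_inserts are all made up for illustration, and none of this is how SQL Server’s storage engine is actually implemented.

```python
import bisect
import random

PAGE_CAPACITY = 4  # rows per page; deliberately tiny, a hypothetical figure


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert keys into ordered, fixed-size pages; return (splits, page_count)."""
    pages = [[]]  # pages are kept in key order; each page is a sorted list
    splits = 0
    for key in keys:
        # Target page: the last page whose first key is <= the new key.
        first_keys = [page[0] for page in pages[1:]]
        idx = bisect.bisect_right(first_keys, key)
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Key sorts after everything stored so far: append a fresh page.
            pages.append([key])
        else:
            # Full page and the key belongs inside it: move half its rows out.
            half = capacity // 2
            new_page = page[half:]
            del page[half:]
            pages.insert(idx + 1, new_page)
            splits += 1
            bisect.insort(new_page if key >= new_page[0] else page, key)
    return splits, len(pages)


sequential_keys = list(range(1, 1001))   # behaves like an identity column
random_keys = sequential_keys[:]
random.Random(42).shuffle(random_keys)   # behaves like randomly generated keys

seq_splits, seq_pages = simulate_inserts(sequential_keys)
rnd_splits, rnd_pages = simulate_inserts(random_keys)
print("sequential:", seq_splits, "splits,", seq_pages, "pages")
print("random:    ", rnd_splits, "splits,", rnd_pages, "pages")
```

In this model the sequential run never splits a page and finishes with 250 completely full pages, while the shuffled run splits pages repeatedly and typically ends up with more, partly empty pages for the same 1,000 rows.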

Non Clustered Indexes:

Non clustered indexes are structured as B-trees in the same way as clustered indexes. However, the data rows are not sorted according to the index, and the leaf level consists of index pages rather than the table’s data pages. Each leaf level entry holds a row locator, which is either the clustered index key or, in the case of a heap (a table without a clustered index), a pointer to the row.

  1. Foreign keys should all have an index to improve join performance to the table they depend on. (This is especially true in a star schema, where queries will often aggregate measures in the fact table by attributes in a dimension.)
  2. Indexes can be considered where a column has a good selectivity ratio (say around 20% or higher). That’s to say, the number of distinct values is high compared to the total number of rows in the table. Using a 1m row table as an example, two possible columns that demonstrate opposite ends of the selectivity spectrum would be:
    • Identity columns – Highly selective, there are as many distinct values as there are rows in the table (1m distinct values / 1m rows in table * 100 = 100%).
    • BIT columns – Unselective, there are only 2 (maybe 3 including NULL) possible values (3 distinct values / 1m rows * 100 = 0.0003%).
  3. Consider indexing columns that regularly feature in JOIN and GROUP BY clauses.
  4. Remember that column order matters in a multi-column index: query performance will improve where the index is created on these columns in the same order as they appear in the GROUP BY clause. An analogy you could use for this might be a world atlas with an index at the back that lists grid references ordered by country, then city. If you were asked to find the grid references for all cities that start with “LON”, you’d have to check the city listings under every country. However, if the index were ordered by city, then country, this task would be much easier.
  5. In bulk modifying operations, performance might be improved by dropping the indexes and then rebuilding them after the batch is complete.
  6. Consider specifying a lower fill factor. Doing this means that when the index is built, more space will be left in the index leaf pages, which can be filled by subsequent inserts and updates, reducing the number of page splits. (This causes its own space and maintenance issues, which will need addressing, but it can improve the immediate performance of the queries that modify the data.)
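
The atlas analogy in point 4 can be sketched in a few lines of Python. This is an illustration only: a sorted Python list stands in for the index, the place names are sample data I have made up, and the function names seek_city_prefix and scan_city_prefix are hypothetical. The idea it tries to show is that when the leading column of the index matches the search, you can jump to the first match and stop at the first miss, whereas otherwise every entry has to be examined.

```python
import bisect

# Sample atlas entries as (country, city); purely illustrative data.
ATLAS = [
    ("Canada", "LONDON"), ("Canada", "TORONTO"),
    ("France", "LYON"), ("France", "PARIS"),
    ("UK", "LEEDS"), ("UK", "LONDON"), ("UK", "LONDONDERRY"),
    ("USA", "BOSTON"), ("USA", "LONG BEACH"), ("USA", "NEW YORK"),
]

index_by_country_city = sorted(ATLAS)
index_by_city_country = sorted((city, country) for country, city in ATLAS)


def seek_city_prefix(index, prefix):
    """Index ordered by (city, country): jump to the prefix, stop at first miss."""
    start = bisect.bisect_left(index, (prefix,))
    matches, examined = [], 0
    for city, country in index[start:]:
        examined += 1
        if not city.startswith(prefix):
            break
        matches.append((city, country))
    return matches, examined


def scan_city_prefix(index, prefix):
    """Index ordered by (country, city): every entry has to be checked."""
    matches, examined = [], 0
    for country, city in index:
        examined += 1
        if city.startswith(prefix):
            matches.append((city, country))
    return matches, examined


seek_matches, seek_examined = seek_city_prefix(index_by_city_country, "LON")
scan_matches, scan_examined = scan_city_prefix(index_by_country_city, "LON")
print("city-first index examined", seek_examined, "entries")
print("country-first index examined", scan_examined, "entries")
```

Both lookups find the same four places, but the city-first ordering examines 5 entries against all 10 for the country-first ordering. On a real table the gap grows with the number of rows.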

General Points:

  1. Database maintenance mustn’t be overlooked. Rebuild your indexes and update your statistics on a regular basis to ensure that index fragmentation is kept under control and that the optimiser is using the best possible query plan.
  2. Remember that for indexes to be effective, they need to be used! If they aren’t being used, then either your statistics aren’t up to date or the optimiser simply doesn’t see a benefit. Remove the index or refactor the query so that it does use it. Index usage can be monitored on your production environment with a trace, and the dynamic management view sys.dm_db_index_usage_stats in SQL Server 2005 will also contain relevant statistics.
  3. Following the above rules doesn’t remove the need to test your work. Run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your queries in development, so that you can see the work performed by different permutations of your index/query. Also, display the query plans to find out what operations the optimiser has performed to retrieve your results.
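
As a small, self-contained illustration of checking a plan before and after adding an index, here is a sketch that uses SQLite through Python’s built-in sqlite3 module. Please treat it as a stand-in only: SQLite’s EXPLAIN QUERY PLAN is not SQL Server’s execution plan or SET STATISTICS output, and the orders table, its columns and the index name ix_orders_customer are hypothetical. The habit it demonstrates is the same, though: look at the plan, change the index, look again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 500, i * 1.5) for i in range(10_000)],
)

QUERY = "SELECT amount FROM orders WHERE customer_id = 42"


def plan_details(connection, sql):
    """Return the readable description of each step in SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the description is the last column


plan_before = plan_details(conn, QUERY)
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
plan_after = plan_details(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search that names ix_orders_customer. The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name than to match the full string.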

These pointers will give you a good start when designing the indexing strategy for your database. Beyond that, there is plenty of in-depth information on MSDN, in Books Online, and on a whole host of database forums and blogs.

I believe that, after upgrading your hardware, a good indexing strategy is the source of the greatest gain in database performance, so don’t overlook it 😉

Cheers

Frank
