Passion led us here

How to use Indexing for SQL Query Optimization

Time to up your querying game!

Ashish Tomar
Towards Data Science
8 min readDec 1, 2020

--

Introduction

I assume that you like SQL and want to refine your querying skills to up your querying game. And you have probably heard that indexing is great for query optimization, but you are not sure about what exactly it is, why is it used, and how to use it.

Welcome! You are at the exact place where you should be. I’ll explain it in a very easy-to-understand manner, and I promise you’ll enjoy learning about it.

Photo by Christian Wiediger on Unsplash

Suppose you are in the e-Commerce Analytics team at Amazon. The data that you’re dealing with is huge. It has millions of rows. I’ll use the following hypothetical table called ‘product’ containing 12 million products for all the demonstrations. (Fun Fact: Amazon sells more than 12 million products, excluding books, media, wine, and services.)

Fig.1 Table ‘product’, with 12 million rows (Image by Author)
Fig.2 4 sample rows of table ‘product’ (Image by Author)

Let’s begin with a simple query.

SELECT COUNT(*)
FROM product
WHERE category = ‘electronics’;

Now to run this query, the database has to scan all 12 million rows to check each record for a match. Let’s say this query takes 4 seconds to execute.

Can you make it faster? Yes, you can.

How? By indexing.

Indexing

What is Indexing?

Image by Michal Jarmoluk from Pixabay

Let me explain the whole concept of indexing intuitively. It’s named ‘indexing’ because of how an index works in a book. If you’re reading a book on statistics and you want to read about ‘linear regression’, you’ll not want to flip hundreds of pages one by one, to get to the chapter which talks about ‘linear regression’.

Instead, you’ll open the index page, look for ‘linear regression’ and go to the page directly.

This is the technique that databases use via indexing. When you create an index, the database somehow rapidly finds the data that the query wants. I’ll talk about this ‘somehow’ later in the article.

How to create indexes

Let’s create an index on the ‘product’ table and include ‘category’ in the index.

Syntax:
CREATE INDEX [index_name]
ON [table_name] ([column_name]);
Query:
CREATE INDEX product_category_index
ON product (category);

When you execute this query, it will take much longer than a normal query. The database scans 12 million rows and builds a ‘category’ index from scratch. Let’s say this takes 4 minutes.

Now, let’s test the performance of the old query with indexing.

SELECT COUNT(*)
FROM product
WHERE category = ‘electronics’;

You’ll see that the query will be much faster this time. It’ll probably only take 400 milliseconds this time.

Even the queries which extend further than using ‘category’ as a condition will benefit from the indexing on ‘category’. Let’s see an example.

SELECT COUNT(*)
FROM product
WHERE category = ‘electronics’
AND product_subcategory = ‘headphone’;

This query will take less time than what it would have normally taken, say 600 milliseconds for this query. The database can quickly find all ‘electronics’ products using the index. And from the smaller set of records, it finds headphones in a normal manner.

Now, let’s change the order of the conditions in the ‘WHERE’ clause.

SELECT COUNT(*)
FROM product
WHERE product_subcategory = ‘headphone’
AND category = ‘electronics’;

Even when the ‘product_category’ is mentioned before ‘category’, the database still picks the column which had an index, i.e. ‘category’, and then scans rows for finding the specified ‘product_subcategory’ from that subset of records.

How does it know that?

Fig.3 Possible query plans for the query optimizer (Image by Author)

The database considers all possible paths to execute the query and then chooses the most optimal path.

It’s time for some database lingo now. Each of the possible paths is called a Query Plan. It is essentially a sequence of steps used to access data in a SQL relational database management system (RDBMS).

And this feature of RDBMS that determines the most efficient way to execute a given query by considering all possible query plans is called Query Optimizer’.

Multi-column Indexing

Now, let’s explore multi-column indexing.

An index can be made for more than one column.

CREATE INDEX product_category_product_subcategory_index
ON product (category, product_subcategory);

Here, we have an index on both ‘category’ and ‘product_subcategory’. The important thing to note here is that the order matters here. It’s like sorting the data first by ‘category’ and then by ‘product_subcategory’.

And the query gets even more fast by using this multi-column index. Let’s say it comes down to 60 milliseconds.

Moreover, a database can have more than one index.

When to use and when not to use Indexing?

Indexes speed up the performance of a database. And as the database gets larger, indexes become even more useful.

But there are two important things you should remember:

  • Indexes require storage
  • When you add data to the database, the original table is updated first, and then all the indexes of that table are updated
Photo by Moritz Kindler on Unsplash

Thus, it’s good to use indexes on databases in data warehouses that get new data updated on a scheduled basis (off-peak hours) and not on production databases that would receive new updates all the time. This is because if the database is constantly receiving updates, then the indexes won’t get updated and hence, will remain unusable.

Types of Indexing

Let me touch upon the two types of database indexes briefly, to give you a comprehensive understanding of the topic:

1. Clustered

2. Non-clustered

Clustered Indexes

A clustered index is the unique index of the table that uses the primary key to organize the data within the table. A clustered index doesn’t have to be explicitly declared but is created by default when the primary key is defined. The primary key sorted in ascending order is used as the clustered index, by default.

Let me demonstrate this with an easy example.

Fig.4 Clustered Index (Image by Author)

This table ‘product’ will have a clustered index ‘product_pkey’ automatically created, organized around the primary key ‘product_id’.

Now, when you run a query to search the table by ‘product_id’ (like in the query below), the clustered index will help the database to perform optimal searches, and return the result faster.

SELECT product_name, category, price
FROM product
WHERE product_id = 3;

You must be wondering how exactly does it do that?

Indexes use an optimal search method known as binary search.

Illustration by Don Suratos on Pinterest

Binary search is an efficient algorithm for finding an entry from a sorted list of entries. It works by repeatedly dividing the data in half and checking if the entry you are searching for via your query comes before or after the entry in the middle of the data. If the value of your search entry is less than the entry in the middle, it narrows the search to the lower half, otherwise, it narrows the search to the upper half. It does this repeatedly until the value is found. This method decreases the number of searches required and thus, makes the execution of queries faster.

The following table helps to understand the impact of binary search in terms of number of searches:

Fig.5 Binary Search Complexity (Image by Author)

Similarly, for our dataset with 12 million rows, only 24 searches at most are required instead of 12 million searches in the worst case scenario, if a binary search is employed. I think now you know the power of indexes.

Non-clustered Index

Now the question is how to extend this power of indexing to columns other than the primary key. The answer is through non-clustered indexes.

All the queries that we learned to write at the beginning of the article to optimize the query performance were all using non-clustered indexes, the indexes which have to be explicitly defined.

A non-clustered index is stored in one place and the physical data in the table is stored in another place. It’s like the index page of a book, that we talked about earlier. The index page of the book is located in one place and the contents of the book are located in another. This allows for more than one non-clustered index per table, as we saw earlier.

And how exactly is that done?

Suppose you write the query that involves searching for an entry in a column you have already created a non-clustered index for. The non-clustered index inherently contains the following:

  • column entries on which you’ve created the index
  • addresses of the corresponding row (in the main table) that the column entry belongs to

You can see this visually in the left mini-table in the figure:

Fig.6 Non-clustered Index (Image by Author)

Let me explain this using a query.

CREATE INDEX product_category_index
ON product (category);
SELECT product_name, category, price
FROM product
WHERE category = ‘electronics’;

The database follows 3 steps:

  • Firstly, it goes to the non-clustered index (product_category_index), finds the column entry that you searched for (category = ‘electronics’), using binary search.
  • Secondly, it looks for the address of the corresponding row in the main table that the column entry belongs to.
  • Finally, it goes to that row in the main table and fetches other column values, as per the requirement of your query (product_name, price).

So, there’s an additional step (of finding the address and going to that row in the main table) involved in the working of a non-clustered index and hence, it’s slower than a clustered index.

Conclusion

This was all about the world of indexing for optimizing the performance of SQL queries, especially when you’re dealing with huge datasets. I’ll write more about the other methods of SQL query optimization very soon.

I hope I was able to deliver on my promise of making the article enjoyable and easy to understand. And I hope you found it useful.

To end with a powerful message by the greatest golfer of all time, Tiger Woods,

“No matter how good you get you can always get better, and that’s the exciting part.”

--

--