Open Source For Geeks: How does database indexing work?

Monday, 15 May 2017

How does database indexing work?

Background

Database indexing is a wide topic. Database indexing plays a important role in your query result performance. But like everything this too has a trade off. In this post we will see what database indexing is and how does it work.

Clustered and Non clustered index

Before we go to how indexing actually works lets see the two types on indexes -

Clustered index
Non clustered index

Data in tables of a database need to be stored on a physical disk at the end of the day. It is important the way data is stored since data lookup is based on it. The way data is stored on physical disk is decided by an index which is known as Clustered index. Since data is physically stored only once , only one clustered index is possible for a DB table. Generally that's the primary key of the table. That's right. Primary key of your table is a Clustered index by default and data is stored physically based on it.

NOTE : You can change this though. You can create a primary key that is not clustered index but a non clustered one. However you need to define one clustered index for your table since physical storage order depends on it.

Non clustered indexes are normal indexes. They order the data based on the column we have created non clustered index on. Note since data is stored only once on the disk and there is just one column(or group of columns ) which can be used to order stored data on disk we cannot store the same data with ordering specified by non clustered index (that's the job of clustered index). So new memory is allocated for non clustered indexes. These have the column on which it is created as the key of the data row and a pointer to the actual data row (which is ordered by clustered index - usually the primary key). So if a search is performed on this table based on the columns that have non clustered indexes then this new data set is searched (which is faster since records are ordered with respect to this column) and then the pointer in it is used to access the actual data record with all columns.

Now that we have knowledge of clustered and non clustered indexes lets see how it actually works.

Why is indexing needed?

When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire table space must be searched at N block accesses.

Whereas with a sorted field, a Binary Search may be used, this has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.

What is indexing?

This is rather a silly question given we already saw clustered and non clustered indexes but lets give it a try.

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it. This index is obviously the non clustered one. There is no need to create separate data structure for Clustered indexes since the original data is stored physically sorted based on it.

The downside to (non clustered) indexing is that these indexes require additional space on the disk, since the indexes are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

Performance

Indexes don't come free. They have their own overhead. Each index creates an new data set ordered by the columns on which it is created. This takes extra space (though not as much as the original data set since this just has single data and pointer to actual row data). Also inserts are slower now since each insert will have to update this new index data set as well. Same goes for deletion.

Data structures used in indexes

Hash index :

Think of this using a HashMap. Key here would be derived from columns that are used to create a index (non clustered index to be precise). Value would be pointer to the actual table row entry. They are good for equality lookups like get rows of data of all customer whose age is 24. But what happens when we need a query like set of data of customer with age greater than 24. Hash index does not work so go in this case. Hash indexes are just good for equality lookups.

Eg.

B-tree Indexes:

These are most common types of indexes. It's kind of self balancing tree. It stores data in an ordered manner so that retrievals are fast. These are useful since they provide insertion, search and deletion in logarithmic time.

Eg.

Consider above picture. If we need rows with data less that 100 all we need are notes to the left of 100.

These are just common ones. Others are R-tree indexes, bitmap indexes etc.

To get a complete idea on how indexes work internally please refer - SQL Indexing and Tuning.

NOTE : There is no clustered index in oracle database. A regular non clustered is automatically created on primary key. As an alternative in Oracle DB you can explore Index organized tables.

Important Points to remember

Avoid using function in where clause which takes column parameter as index. Functions render indexes useless. Functions are like blackbox and database optimized does not know about it's relationship with argument the function takes. So instead of using the index it will perform full table scan. Eg

create index employee_name_idx on employee(name);
select * from employee where name='Aniket'; --uses index
select * from employee where lower(name)='aniket'; --cannot use index

If you do have a concatenated index then choose the column order so that mulitple sql queries (of you usecases) can use it. Eg.

select * from employee where name='Aniket'; --sql1
select * from employee where name='Aniket' and age=26; -- sql2
create index employee_name_idx on employee(name, age); --correct
create index employee_name_idx on employee(age, name); --incorrect (sql1 cannot use this)

Avoid like expressions in your where clause which starts with a wildcard. it will be of no use even the column is indexed. Eg.

create index employee_name_idx on employee(name);
select * from employee where name like '%ket'; -- will be very slow
select * from employee where name like 'Anik%'; --will be fast as it used prefix on index

Always index for equality first (vrs lets say a range)

create index employee_name_idx1 on employee(joining_date, name); --slower
create index employee_name_idx2 on employee( name, joining_date); --faster
select * from employee where name='Aniket' and joining_date >= TO_DATE('2017-08-20')

Always check if covered indexes are possible. This prevents actual table access since you get all the data in index only. So lets say you have a table with columns A-Z. Also lets say you have an index on column B and your query is something like -

select A from table where B='XYZ'
Query looks good and it will use our index defined on column B and speed up the query time in the process but for each hit in btree leaf of index it will need to access actual table row to get A.
Now consider you have index on (B,A). Since your where clause has B this index will be picked up. However in this case once entries are located we already have value of A which we need to find. So this will avoid lookup of actual table row.
This obviously depends on your use case. None the less it's important to consider this use case.