MySQL and Indexes

MySQL is a GPLed database. Check out mysql.com and mysql.org for more information. I'll assume you are familiar with what a database is and how to get MySQL running.

These tests were performed with Perl 5.6.1, and MySQL 3.23.51.

Let's imagine we have a table that has a lot of data (ie, a lot of rows), and we want to try and improve our performance. The oldest trick is to use an index on commonly used columns in your queries. Quite simply, if we have a where clause in our commonly used SQL statements, then the columns we are selecting on may benefit from indexes.

Indexes can cover multiple columns, but in this simple test, we'll cover just one.

Our sample data consists of two columns: a primary key (giving us a row ID), and a varchar column I'll call 'name'. Here is out SQL to create the table:

create table t_0 (ID int primary key auto_increment,
name varchar(255) not null);

I have called this table t_0, since it is our test table with no index. Now, to compare the cost and benefits of using indexes, I will want a couple of other tables with similar attributes, and then I'll want to fill them all with the same data.

We said above that an index can span multiple columns, but it can also span a subset of one column. The index can be on the first 'n' characters of a column, or all of it. Thus, we'll make a selection on tables with differing indexes: I have chosen to index using 1 character, 2, 3, 5, and the entire column. Thus our SQL for creating this is:

Create table t_1 (ID int primary key  auto_increment,
name varchar(255) not null,
index index_name(name(1)));

Create table t_2 (ID int primary key  auto_increment,
name varchar(255) not null,
index index_name(name(2)));

Create table t_3 (ID int primary key  auto_increment,
name varchar(255) not null,
index index_name(name(3)));

Create table t_5 (ID int primary key  auto_increment,
name varchar(255) not null,
index index_name(name(5)));

Create table t_all (ID int primary key  auto_increment,
name varchar(255) not null,
index index_name(name));

Now we have our tables, lets fill them. This is where the cost of indexes comes into play. When an indexed column is updated, then index on that column also needs to be updated. This is handled by MySQL, so you don't have to add any code to do it.

I developed a script called seed.pl, which you can grab. It populates the tables you tell it with random data. In the name column I am just putting 10 random digits. While we insert this data, we time how long it takes to execute using the "Time::HiRes" module. We repeat this for different numbers of inserts: but the trend should be the same:

Time in milliseconds per insert on tables with differing index lengths
No. inserts	Index length
No. inserts	None	1	2	3	5	all
1	0.58000000	0.33800000	0.30400000	0.30100000	0.41000000	0.47100000
10	0.36580000	0.30080000	0.31310000	0.29020000	0.29500000	0.47340000
100	0.31126000	0.24504000	0.24965000	0.24512000	0.24775000	0.42198000
1000	0.24679500	0.25829200	0.24833000	0.24489000	0.25230100	0.40500300
10,000	0.23984620	0.25703830	0.24361310	0.26397630	0.25284590	0.40103270

The smaller insert runs are useless data. Its the 10,000 one that starts to give us some idea of what is going on.

So, if we index the entire name column, then we are seeing almost double the amount of time being taken. Just to be obvious, at 10,000 inserts, the average increase for each index was:

% increase in insert time when using an index
Index length
1	2	3	5	all
7.15%	1.57%	10.1%	5.43%	67.2%

So we know there is an increase, which becomes very large when the entire column is indexed. The aim here is to work out if this extra work load on inserts is worth it when doing selects.

We have another script which does the select on these tables we have created, called test.pl. I have seeded each of the tables with 72,000 records. I'll do 100 iterations:

Time in milliseconds to select from 72,000 rows
No. selects	None	1	2	3	5	all
100	0.154572	0.000555	0.000364	0.000332	0.000331	0.000488
% decrease	-	278.5%	424.6%	465.6%	467.0%	316.7%

What we notice is an immediate massive decrease in time taken to do a select on these 72,000 rows, however, indexing the entire column is not the best solution. We saw earlier that indexing the entire column also gave us the biggest drop in performance, so it looks like we never want to do both. Of course, the gain in selecting is dependent on the length of the data being selected, and the number of rows, but in this example, we can see that an index of length 3 will be sufficient.

Summary

Adding an index to a column is a performance loss during insert, and a gain during select.
Indexing an entire column is not the optimal index, but is better than none
The sweet spot for the index length is dependent on the number of rows and length of data in the indexed column
If you don't do many inserts on this table, then you are safer with a longer index than the sweet spot
If you do many more inserts than selects on this table then you are better off with a smaller index than the sweet spot

That was fun. Try your own numbers. If you write it up, I'll link to your web page...