How an index made rendering slow
I noticed that a view I was working on was rendering much slower than I would expect it to. The view showed a list of events, together with the person that generated the event and the device that the event belongs to. It took nearly half a second to render 25 events, while other similar pages render in a couple of milliseconds.
At first I thought that I had an N+1 - that each rendered event went to fetch its person and device from the database, generating 2 database queries per rendered row. But the view already preloaded both associations. So why did it render so slowly?
I went to look at the development server logs - which, in Rails, show a lot of useful information for debugging problems like these - and it was clearly visible that the preload worked. ActiveRecord made three queries to the database - one to get the events, another to get the users, and a third to get the devices.
But where does the WHERE deleted_at IS NULL come from?
I looked at the User and Device models. Both invoke the acts_as_paranoid method which the paranoia gem adds to ActiveRecord models.
We use that gem at work to soft delete records in the database - the data is kept in the table, it just isn’t returned unless specifically requested.
Paranoia knows which records are deleted by checking a deleted_at column that has to be added to each model that acts as paranoid. When a record is deleted, deleted_at is set to the current timestamp. A record can be restored (un-deleted) by setting deleted_at back to NULL. Paranoia filters out deleted records by adding WHERE deleted_at IS NULL to all queries.
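A minimal plain-Ruby sketch of that idea (it mimics the behaviour described above; it is not the paranoia gem's actual implementation):

```ruby
# Toy soft-delete table: rows are kept, deletion only stamps deleted_at,
# and the default lookup skips stamped rows - the Ruby equivalent of
# appending WHERE deleted_at IS NULL to every query.
Record = Struct.new(:id, :deleted_at)

class Table
  def initialize(records)
    @records = records
  end

  # "Delete" by setting deleted_at to the current timestamp.
  def soft_delete(id)
    @records.find { |r| r.id == id }&.deleted_at = Time.now
  end

  # Restore (un-delete) by setting deleted_at back to nil.
  def restore(id)
    @records.find { |r| r.id == id }&.deleted_at = nil
  end

  # Default scope: only rows whose deleted_at IS NULL.
  def all
    @records.select { |r| r.deleted_at.nil? }
  end
end

table = Table.new([Record.new(1), Record.new(2), Record.new(3)])
table.soft_delete(2)
p table.all.map(&:id) # => [1, 3]
table.restore(2)
p table.all.map(&:id) # => [1, 2, 3]
```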
Why are the queries to fetch the users and devices so slow?
In situations like these I reach for the explain method on an ActiveRecord relation. It tells the database to explain how it will execute the query, showing important information like the indices it plans to use, the order in which it wants to filter rows, and more.
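The same inspection can be done straight from a DB console; the ID list below is illustrative, and the deleted_at condition is the one paranoia appends:

```sql
EXPLAIN SELECT users.* FROM users
WHERE users.id IN (1, 2, 3) AND users.deleted_at IS NULL;
```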
Running explain on one of the slow queries showed that the database sees two indices with which it can filter the table to generate the result - PRIMARY and index_users_on_deleted_at. Out of these two it chose index_users_on_deleted_at, which is a bit odd since the query mostly filters by IDs. Why did it choose that index instead of the primary key? And why is it so slow?
To understand the choice I used SHOW INDEX, an introspection tool in MariaDB that shows statistics about the indices on a table. The query planner uses those same statistics to plan how to filter the requested rows, so understanding what the planner sees can explain why MariaDB picked index_users_on_deleted_at instead of PRIMARY. In the statistics for the users table, the Cardinality column showed that the PRIMARY index has millions of values indexed, while index_users_on_deleted_at only has a few thousand.
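The statement itself is a one-liner; Cardinality in its output is the planner's estimate of how many distinct values an index holds:

```sql
SHOW INDEX FROM users;
```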
But why is that index so slow?
The decision by the query planner to use index_users_on_deleted_at felt off. It seemed like a mistake, so I dug deeper.
MariaDB offers query optimizer tracing, which is a fancy term for a “detailed explanation of why it did what it did”.
I opened a DB console and ran SET optimizer_trace='enabled=on'; to enable tracing. Then I ran the slow query, and right after that I checked the optimizer trace with SELECT * FROM information_schema.optimizer_trace LIMIT 1;.
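Put together, the console session looks roughly like this (the slow query in the middle is abbreviated to an illustrative ID list):

```sql
SET optimizer_trace='enabled=on';
SELECT users.* FROM users
WHERE users.id IN (1, 2, 3) AND users.deleted_at IS NULL;
SELECT * FROM information_schema.optimizer_trace LIMIT 1;
```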
This returned a large JSON object, which showed every decision that the query planner made and explained why it chose the approach it did.
The trace showed that the PRIMARY index had a much lower execution cost (was much faster) than index_users_on_deleted_at, but the planner chose index_users_on_deleted_at for “rowid filtering”. I had no clue what “rowid filtering” was, so I went to look it up in the documentation.
Row ID filtering is an optimization. The basic idea is this: if you have two indices on a table and you are filtering by both, more often than not it's quicker to filter by the more restrictive index first, and then scan the result to filter by the less restrictive one. In other words, it's better to pick the index that eliminates the most rows first so that you have fewer rows to work with, and then scan through that smaller result applying the remaining filters. Makes sense.
MariaDB, unlike PostgreSQL, doesn't support conditional (partial) indexing. This means that each index you add to a table indexes every row of the table. This is great when the indexed column has many different values, but it can backfire when many rows share the same value.
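For comparison, this is what a conditional (partial) index looks like in PostgreSQL - here it would cover only the soft-deleted rows instead of the whole table (illustrative DDL; MariaDB rejects the WHERE clause):

```sql
CREATE INDEX index_users_on_deleted_at
ON users (deleted_at)
WHERE deleted_at IS NOT NULL;
```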
In my case the deleted_at column has around 4k different values in about 3 million rows. But a single value of those 4k - NULL - is associated with nearly all 3 million rows. This means that the database will load and scan through 3 million records to find the ones matching the IN clause - which is extremely slow.
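A toy simulation of that selectivity gap, with made-up and much smaller numbers: almost every row has deleted_at = NULL, so starting from the deleted_at side leaves nearly the whole table to scan, while starting from the primary key leaves only the requested rows:

```ruby
# 10,000 rows, of which only every 1000th is soft deleted -
# mirroring (at a smaller scale) a table where NULL dominates deleted_at.
rows = (1..10_000).map { |id| { id: id, deleted_at: id % 1000 == 0 ? "ts" : nil } }
wanted_ids = [5, 42, 7_000]

# Strategy 1: filter by deleted_at IS NULL first (what the planner picked).
survivors_by_deleted_at = rows.select { |r| r[:deleted_at].nil? }
# Strategy 2: filter by primary key first.
survivors_by_id = rows.select { |r| wanted_ids.include?(r[:id]) }

p survivors_by_deleted_at.size # => 9990 rows left to scan
p survivors_by_id.size         # => 3 rows left to scan
```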
How do I make the query fast?
Luckily MariaDB, unlike PostgreSQL, provides an escape hatch for situations like this. Through the USE INDEX option it allows you to specify the index the query should use when filtering. ActiveRecord doesn't have a method for USE INDEX, but the hint can be added by passing a string to the from method.
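The SQL this needs to produce looks like the following; the comment shows one way to get there from ActiveRecord (the ID list is illustrative):

```sql
-- From ActiveRecord this can be produced by passing the hint in a string,
-- e.g. User.where(id: [1, 2, 3]).from("users USE INDEX (PRIMARY)")
SELECT users.* FROM users USE INDEX (PRIMARY)
WHERE users.id IN (1, 2, 3) AND users.deleted_at IS NULL;
```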
Telling the planner to use the PRIMARY index speeds up the query drastically - the execution time fell from 80 ms to 1 ms.
But this doesn't solve my problem. I can't call the from method in a preload query - that query is built and executed by ActiveRecord internally.
Though, now that I understand the problem, I can trick the query planner into doing what I want. If I replace preload with eager_load, I can force the query planner to use at least one PRIMARY index and thereby speed things up considerably.
And it worked! Both PRIMARY indices were used when filtering, and the query returns in 3 ms.
Why did this work?
I had a hunch that a LEFT OUTER JOIN would force the optimizer to use at least one PRIMARY index on the joined tables. Because eager_load does the same thing as preload, in this case, but using a LEFT OUTER JOIN, it was the perfect solution.
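Assuming an events table with user_id and device_id foreign keys (my guess at the schema; the names are illustrative), the eager_load version collapses the three preload queries into one query shaped roughly like this:

```sql
SELECT events.*, users.*, devices.*
FROM events
LEFT OUTER JOIN users
  ON users.id = events.user_id AND users.deleted_at IS NULL
LEFT OUTER JOIN devices
  ON devices.id = events.device_id AND devices.deleted_at IS NULL;
```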
And I was right. The query optimizer trace for the eager_load query showed that the optimizer tries to apply the same rowid-filtering optimization as before, but its cost is now higher than simply using the PRIMARY index. The Device::EventLog table doesn't have a more restrictive index that is also used in the query, so the optimizer can't apply the optimization there, while the joined tables are implicitly filtered by so many IDs that the cost of using the PRIMARY index drops well below the cost of the optimization.
At least that’s what I think is going on.
Either way, sometimes replacing preload with eager_load can utilize indices better and therefore speed up query execution.
P.S. Because of this problem I took a deep dive into query planners and optimizers in both PostgreSQL and MariaDB. This is a fascinating topic, a science on its own, and a world with very different opinions about what the “correct” solution is. It made me appreciate the “no hints” approach that PostgreSQL takes, but at the same time showed me the value of escape hatches that MariaDB provides. It made me see that there really is no one-size-fits-all solution for SQL databases - as always in software, you stick with what works for you.