Why does Rails put the type column first in an index for a polymorphic association?

#deep-dive #ruby

Last week, I had a discussion with a coworker about how Rails indexes columns used in polymorphic associations. He thought that the order of columns in the index should be flipped - instead of indexing by type and ID, it should index by ID and type - as that way the most restrictive column is first, and therefore the index is more efficient. While I argued that the way that Rails indexes polymorphic associations is very pragmatic while also being efficient.

First off, what are we talking about?

We have an Access Log model in our app. It holds a log of who granted access to which device and when. But the thing granting access can be a Person, another Device or a Schedule.

In Rails, an association that can point to many different types of models is called a "polymorphic associations".

class AccessLog < ApplicationRecord
  belongs_to :device
  belongs_to :grantor, polymorphic: true
end

# Since grantor is polymorphic we can assign any type of model to it
log = AccessLog.create!(device: Device.all.sample, grantor: Person.all.sample)
log.grantor.class # => Person

log.update!(grantor: Device.all.sample)
log.grantor.class # => Device

log.update!(grantor: Schedule.all.sample)
log.grantor.class # => Schedule

# Since the device association isn't polimorphic we can't assign anything
# but a device to it
log.update!(device: Person.all.sample) # => ActiveRecord::AssociationTypeMismatch: Device expected, got Person

A polymorphic and a regular association are implemented differently in the database.

A regular association is just a column that holds an ID that points to a record in the other model's table; usually it also has a foreign key constraint, which ensures that the assigned ID actually exists in the other table.

While a polymorphic association uses two columns, one to hold the type of model it's for and another to hold the ID of the record it's for.

create_table "access_logs", force: :cascade do |t|
  # Regular association - holds the ID of the record from the devices table
  t.bigint "device_id", null: false
  
  # Polymorphic association
  t.string "grantor_type", null: false # holds the model name
  t.bigint "grantor_id", null: false # holds the ID of the record
  
  t.index ["device_id"], name: "index_access_logs_on_device_id"
  t.index ["grantor_type", "grantor_id"], name: "index_access_logs_on_entry"
end

# ensures that every ID in device_id coresponds to an id in the devices table
add_foreign_key "access_logs", "devices"

When you generate an association, Rails will automatically create an index for it. In the case of a polymorphic association, it will create a composite index - that's an index that indexes two or more columns simultaneously - in which it will put the type column first and the ID column last.

Why does it matter which column is first in the index?

This has to do with how databases build indices. By default, MySQL, MariaDB and PostgreSQL build b-tree indices.

An index is similar to a regular book index, but instead of having chapter names and page numbers, it has values from the indexed column and internal record IDs.

Comparison between a book and a naive database index

The problem with this kind of index is that it will grow as the number of distinct values in the indexed column grows.

E.g. if you create an index on a column that holds a person's email address, this kind of index would have one entry for each email, and the database would have to scan through each email until it finds the right one. To speed things up, we can do what dictionaries usually do and create a multi-level index.

In a dictionary, you'd usually have an index that points you to another index. E.g. the main index would tell you that the index for all words starting with R is on page 6. Then the R sub-index would tell you that words starting with RU are on pages 342 to 361. Then you'd go to page 342 and search for the word "Ruby" until you reached page 361.

B-trees are similar to multi-level indices in dictionaries, though they have extra rules and nuances that aren't important now.

What's important is that a b-tree index can have a sub-index, which in term can have a sub-index, and so on. And the more items each sub-index has, the slower it is to find something.

These indices work for single values such as words and emails, but how would you make an index of two words, or an email and a number?

To create a composite index, most databases just concatenate the values together and build a regular b-tree index. This is a gross simplification of what actually happens, but in essence, that's what they do. Because of that, the order of the columns becomes important.

If you tell the database to build an index over grantor_type first and then grantor_id it would take the two and concatenate them together like so "Person|443". But if you tell it to build an index over grantor_id first and then grantor_type it would concatenate them like so "443|Person".

The order of the columns in the index changes the structure of the index.

When the ID column is first, the index points to fewer rows, so the database has to scan through fewer rows, and therefore it can find a match quicker. That's why it's a general recommendation to put the most specific column first in a composite index - it narrows down the search a lot.

Naive composite database index with the ID column first and type column second

When the type column is first, there are a lot more rows to scan through in the end.

Naive composite database index with the type column first and ID column second

Why would anyone create an index and put the type column first then?

You don't have to follow a multi-level index all the way to the last index entry to find something. E.g. if you are searching for all words that start with R, you can go from the main index to find all sub-indices of words starting with R. Take the start page of the first index and the last page of the last index. Now you know that all words between these two pages start with R, and you didn't have to scan through anything.

In a real app, you'd sometimes want to filter all records that are associated with another record of a certain type. E.g. you might want to fetch all Access Logs that were granted by a Schedule, or all People associated with a Property through an Access Log.

class Property < ApplicationRecord
  has_many :access_logs
  
  has_many :schedule_activations,
    -> { where(grantor_type: "Schedule") },
    class_name: "AccessLog"
    
  has_many :active_people,
    through: :access_logs,
    source: :grantor,
    source_type: "Person"
end

# Calling
Propert.find(42).schedule_activations.count
# Executes the following SQL statement
# ```sql
# SELECT COUNT(*)
# FROM access_logs
# WHERE property_id = 42 
#   AND grantor_type = 'Schedule'
# ```

# Calling
Propert.find(42).active_people.count
# Executes the following SQL statement
# ```sql
# SELECT COUNT(tenants.*)
# FROM tenants INNER JOIN access_logs
#   ON tenants.id = access_logs.grantor_id
# WHERE access_logs.property_id = 42
#   AND access_logs.grantor_type = 'Person'

In that case, the composite index over grantor_id first and then grantor_type is basically useless. The database would have to traverse the whole index to aggregate a range that it then has to scan through.

While the index over grantor_type first and then grantor_id can quickly give you the range that you have to scan through by looking at the first and last entry of a sub-index.

We can test this out.

I created a test MariaDB database, with an access_logs table that I've filled with a million rows. Then I created a composite index over grantor_type and grantor_id that I've named test_index.

When I ask MariaDB to explain how it would go about counting all rows where the grantor_type is Person, it says that it would use test_index to generate the count.

MariaDB [test]> CREATE INDEX test_index ON access_logs (grantor_type, grantor_id);
Query OK, 0 rows affected (1.209 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [test]> EXPLAIN  SELECT COUNT(1) FROM access_logs WHERE grantor_type = 'Person';                                                 
+------+-------------+---------------+------+---------------+------------+---------+-------+--------+--------------------------+
| id   | select_type | table         | type | possible_keys | key        | key_len | ref   | rows   | Extra                    |
+------+-------------+---------------+------+---------------+------------+---------+-------+--------+--------------------------+
|    1 | SIMPLE      | access_logs   | ref  | test_index    | test_index | 183     | const | 498290 | Using where; Using index |
+------+-------------+---------------+------+---------------+------------+---------+-------+--------+--------------------------+
1 row in set (0.002 sec)

MariaDB [test]> SELECT COUNT(1) FROM access_logs WHERE grantor_type = 'Person';
+----------+
| COUNT(1) |
+----------+
|   333400 |
+----------+
1 row in set (0.042 sec)

If I ask the same question, but change the order of columns in the index, then MariaDB says that it can't use any index to generate the count but it still tries to use test_index, and it takes about twice as long to count the rows.

MariaDB [test]> DROP INDEX test_index ON access_logs;
Query OK, 0 rows affected (0.045 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [test]> CREATE INDEX test_index ON access_logs (grantor_id, grantor_type);
Query OK, 0 rows affected (1.146 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [test]> EXPLAIN  SELECT COUNT(1) FROM access_logs WHERE grantor_type = 'Person';
+------+-------------+---------------+-------+---------------+------------+---------+------+--------+--------------------------+
| id   | select_type | table         | type  | possible_keys | key        | key_len | ref  | rows   | Extra                    |
+------+-------------+---------------+-------+---------------+------------+---------+------+--------+--------------------------+
|    1 | SIMPLE      | access_logs   | index | NULL          | test_index | 192     | NULL | 996580 | Using where; Using index |
+------+-------------+---------------+-------+---------------+------------+---------+------+--------+--------------------------+
1 row in set (0.000 sec)

MariaDB [test]> SELECT COUNT(1) FROM access_logs WHERE grantor_type = 'Person';
+----------+
| COUNT(1) |
+----------+
|   333400 |
+----------+
1 row in set (0.095 sec)

These results will vary from database to database, but the principle still stands.

For example, PostgreSQL can sometimes utilize an index even when it isn't optimal. In the case of an index for a polymorphic association, changing the order of the columns has much less impact than in MariaDB. In both cases, PostgreSQL uses the index and produces a result in about the same time.

postgres=# CREATE INDEX test_index ON access_logs (grantor_type, grantor_id);
CREATE INDEX

postgres=# EXPLAIN ANALYZE SELECT COUNT(1) FROM access_logs WHERE grantor_type = 'Person';                                                                      
QUERY PLAN                                                                       -----------------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                            Finalize Aggregate  (cost=6457.79..6457.80 rows=1 width=8) (actual time=12.408..13.780 rows=1 loops=1)                                                ->  Gather  (cost=6457.57..6457.78 rows=2 width=8) (actual time=12.403..13.777 rows=3 loops=1)
    Workers Planned: 2
	Workers Launched: 2
	->  Partial Aggregate  (cost=5457.57..5457.58 rows=1 width=8) (actual time=10.314..10.314 rows=1 loops=3)
	  ->  Parallel Index Only Scan using test_index on access_logs  (cost=0.42..5107.64 rows=139972 width=0) (actual time=0.027..6.566 rows=111133 loops=3)
	    Index Cond: (grantor_type = 'Person'::text)                                      Heap Fetches: 0
Planning Time: 0.145 ms
Execution Time: 13.801 ms
(10 rows)

postgres=# DROP INDEX test_index;
DROP INDEX

postgres=# CREATE INDEX test_index ON door_releases (grantor_id, grantor_type);
CREATE INDEX

postgres=# EXPLAIN ANALYZE SELECT COUNT(1) FROM door_releases WHERE grantor_type = 'Person';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=13778.29..13778.30 rows=1 width=8) (actual time=14.570..15.922 rows=1 loops=1)
   ->  Gather  (cost=13778.08..13778.29 rows=2 width=8) (actual time=14.566..15.919 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=12778.08..12778.09 rows=1 width=8) (actual time=12.210..12.210 rows=1 loops=3)
               ->  Parallel Index Only Scan using test_index on door_releases  (cost=0.42..12428.15 rows=139972 width=0) (actual time=0.037..8.808 rows=111133 loops=3)
                     Index Cond: (grantor_type = 'Person'::text)
                     Heap Fetches: 0
 Planning Time: 0.079 ms
 Execution Time: 15.944 ms
(10 rows)

Simply put.

The type column is first in an index for a polymorphic association because then the same index can be used to speed up association look-ups (querying by both type and ID columns) and to filter by the type of the associated record (querying by just the type column).

These are the two most common use cases that you'll encounter in a Rails app. And also the use-cases for which Rails has built-in interfaces for (e.g. the source_type option).

This is more efficient in terms of speed (index creation and updating) and space (the size of the index on disk) than having two or more separate indices, which would be required in some databases if the columns were flipped.