Monday, October 29, 2018

It's just an expression

PostgreSQL has lots of great features for text search.

In particular, there are a bunch of ways to do case-insensitive searches for text:
  • There's standard SQL ILIKE… but that can be expensive — especially if you put %'s at both start and end — and it's overkill if you want an exact string match.
  • The same goes for case-insensitive regexp matching: overkill for simple case-insensitive matches. It does work with indexes, though (a pg_trgm trigram index, for example)!
  • Then there's the citext extension, which is pretty much the perfect answer. It lets you use indexes and still get case-insensitive matching, which is really cool. It Just Works. (A quick sketch of all three options follows this list.)
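For concreteness, here's a rough sketch of all three. The email_addresses_ci table is hypothetical, just to show citext in isolation; the real email_addresses table appears below:

-- ILIKE: simple, but a plain btree index won't help it
select addy from email_addresses where addy ilike 'quinn@example.com';

-- case-insensitive regexp matching with ~*
select addy from email_addresses where addy ~* '^quinn@example\.com$';

-- citext: make the column itself case-insensitive
create extension if not exists citext;
create table email_addresses_ci (addy citext);
select addy from email_addresses_ci where addy = 'Quinn@Example.COM';
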
OK, but what if you didn't have the foresight to use citext? And what if you don't want to go through the pain of changing the data type of that column? In a case like this, an expression index can be really handy.

Without an index, a case-insensitive match like this…

sandbox# explain analyze select addy from email_addresses where lower(addy) = 'quinn@example.com';

… is forced to use a sequential scan, lowercasing the addy column of each row before comparing it to the desired address:

                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on email_addresses  (cost=0.00..1.04 rows=1 width=32) (actual time=0.031..0.032 rows=1 loops=1)
   Filter: (lower(addy) = 'quinn@example.com'::text)
   Rows Removed by Filter: 3
 Planning time: 0.087 ms
 Execution time: 0.051 ms
(5 rows)


For my toy example, a four-row table, that's not too expensive, but for a real table with thousands or millions of rows it can become quite a pain point. And a regular index on email_addresses(addy) won't help: it indexes the raw addy values, not lower(addy), so the planner can't use it for this query and falls back to a sequential scan.
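
If you want to convince yourself, create a plain index and look at the plan (the index name here is just for illustration):

create index email_addresses__addy on email_addresses (addy);
-- the query compares lower(addy), which this index doesn't cover...
explain select addy from email_addresses where lower(addy) = 'quinn@example.com';
-- ...so the plan still shows a Seq Scan with the same Filter line.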

But an expression index will do the trick. An astute commenter noted that, in my toy example, walking the index is actually slower than a sequential scan. For a table with many rows, though (most of which are not the one email address I'm looking for!), an index scan will be dramatically faster.

sandbox# create index email_addresses__lower__addy on email_addresses (lower(addy));
CREATE INDEX
sandbox# explain analyze select addy from email_addresses where lower(addy) = 'quinn@example.com';
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using email_addresses__lower__addy on email_addresses  (cost=0.13..8.15 rows=1 width=32) (actual time=0.093..0.095 rows=1 loops=1)
   Index Cond: (lower(addy) = 'quinn@example.com'::text)
 Planning time: 0.105 ms
 Execution time: 0.115 ms
(4 rows)
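
Incidentally, the same mechanism buys you case-insensitive uniqueness, too. That's a side note beyond the example above, and the index name here is made up, but it's the same trick:

create unique index email_addresses__lower__addy__key on email_addresses (lower(addy));
-- now 'Quinn@example.com' and 'quinn@example.com' can't both be inserted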

In short, expression indexes are one of those neat tricks that can save you a lot of pain.

2 comments:

  1. Probably I'm missing the obvious, but to my eyes the index version is half as fast as plain vanilla?

    Replies
    1. Good catch! The reason for this is that I used a toy example, with just a few rows in the email_addresses table. In a case like this, sequentially scanning the rows is faster than walking an index.

      In fact, to make PostgreSQL use the index at all, I had to do 'SET enable_seqscan = off' in the psql session where I ran the examples. Otherwise the planner would opt for the clearly-better option.

      You can see a hint of this in the first EXPLAIN ANALYZE plan, where it shows 'Rows Removed by Filter: 3': it scanned the entire table just to return a single row. In a real-life example, this could be thousands or millions of rows, which would change the Seq Scan/Index Scan tradeoff.
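
      For anyone who wants to reproduce this, it's one setting in the psql session (and a RESET puts it back):

      SET enable_seqscan = off;  -- strongly discourage seq scans
      RESET enable_seqscan;      -- restore the default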
