Analyzing Scans in PostgreSQL

Introduction & Data Setup

Before we dive in, it is vital to understand the basic elementary blocks of PostgreSQL query plans. This has been covered in a separate blog post, and I highly encourage readers to go through first -

https://arctype.com/blog/postgresql-query-plan-anatomy/

Scans.
Joins.
Sort.
Aggregates etc.,

In this article, we will dive deep into the Scan node type, which is probably the most fundamental piece. Let's use the same fake_data table that was used in the query plan blog. For ease of understanding, we are going to disable parallel scans. We will enable this a bit later and also understand parallel scans in a separate section.

SET max_parallel_workers_per_gather = 0;

Scans

There will be other node types in the examples below, but the focus will be only on the scan nodes and highlighted throughout the examples below.

Sequential Scans

Let's search for an id with value 1000. Sequential Scan

This is the simplest of them all; if there is no index or a lesser number of rows, then the Planner resorts to scanning all of the rows present. You would/should never encounter Sequential Scans in a typical production system since they are very slow and get slower as the data increases. There are exceptions when the table is too small where an index is pointless and Sequential Scans are fast enough that you don't have to worry.

Index Scans

Let's create a simple BTree index on the id column to speed up the above query.

CREATE INDEX id_idx ON fake_data USING BTREE(id)

After index creation, the Planner now uses the index to perform what is called an Index Scan. As you can see, it is significantly faster (around 3500x) than a sequential scan. We should create appropriate indexes to speed up queries based on our access/query patterns. Here is an article that goes in-depth on what kind of indexes to create for different access patterns:

https://arctype.com/blog/database-index/

Index Only Scans

Index Only scans are very similar to index scans except that they scan only the indexes and do not touch the table data. This is possible only if the query contains the indexed column in both the SELECT and WHERE clauses. A slight tweak to the above query demonstrates this, One thing to note is that PostgreSQL might sometimes choose to do Index Scan in place of Index Only Scan if the table is not vacuumed well, i.e., the Planner will realize that there could be some mismatch between the indexes and the actual table data. This can happen if some large delete/insert in/from the table and the indexes have not been updated. It is good practice to time the auto vacuum correctly and vacuum if there is any large operation like a table load. Index Only Scans can be very fast in terms of I/O and time since indexes are practically cached in shared_buffers.

Bitmap Heap Scan & Bitmap Index Scan

Let's compare on a different column name which is text content.

CREATE INDEX name_str_idx on fake_data USING BTREE(name)

EXPLAIN ANALYZE
SELECT
    *
FROM
    fake_data
WHERE
    fake_data.name = 'David Bennett'

Bitmap Heap & Bitmap Index Scan The Planner decided to go with Bitmap heap Scan even though there was a BTree index on the name column. The reason is that index scans cause random I/O if there is no ordering to the rows (name is text content). This is costly in rotational hard drives. To solve this, the Planner takes a two-stage approach. The Bitmap Index Scan constructs a data structure called a bitmap from the index present and represents the approximate location of pages in the disk and feeds it to the parent node, i.e., Bitmap Heap Scan, which then uses that to fetch the pages/rows.

There is a discussion in the PostgreSQL mailing list, which goes into depth on its workings, but on a high level, a Bitmap Heap Scan always works in tandem with Bitmap Index Scan to speed things up. For example, suppose there are a large number of matching rows. In that case, the Planner decides to do Bitmap Heap Scan + Bitmap Index Scan over a traditional Index Scan, in which the executor iterates the Bitmap instead of the Index itself for faster results.

Bitmap And & Bitmap Or

The Bitmap data structure is not only beneficial for single matches but can also be combined if there are two indexes. Bitmap And is also similar, Keep in mind that in certain situations Bitmap And or Bitmap Or won't work, and we will have to create composite indexes. But in many situations, the Planner can combine two individual indexes very efficiently.

Parallel Scans

Sequential Scans are the slowest of all the plans we have seen so far because there is nothing to optimize there. The Planner goes over the data sequentially and tries to find the result. PostgreSQL optimizes this further by adding parallelism in the queries. Let's simulate this by removing the index for the id column and running the query. If you are using the same session as before, restart your database client for max_parallel_workers_per_gather setting to reset to default. Parallel Sequential Scans The default workers configuration is 2 and hence two workers have been used to run the query. Workers are processes behind the scenes, which enables the executor to run the scans in parallel. Going deep into how parallelization works is perhaps a topic of another blog. Still, the general recommendation is to keep the workers equal to how many cores are there in the CPU for the most efficient results.

Conclusion

Hopefully, by now, you have a good idea of scans and why PostgreSQL uses specific types of scans in different places. Understanding these scan types will help users answer questions like,

Why is not PostgreSQL using my index in a direct WHERE comparison?
Why is PostgreSQL not able to combine two indexes using Bitmap Scan?
How to optimize our queries for Index Only Scan?

One thing to understand is that there are certain kinds of queries, such as Count, Avg and other aggregate queries, which always result in Sequential Scan because they have to scan the entirety of the table anyway for the result. Stay tuned for more articles on PostgreSQL node types.