Hans Hasselberg

PostgreSQL: CLUSTER table USING index;

25 Mar 2014

TLDR; CLUSTER table USING index can greatly increase performance but is hard to maintain.

UPDATE 15 January 2016: 2 months ago we clustered our biggest table. We are using 1/20 of the IOPS now.

UPDATE 29 April 2015: We did not end up using Cluster table.

The CLUSTER1 documentation is great, and it covers the technical details very well. You have to read all of it if you intend to use it. This post explains why and how I’m using clustered tables.

Background

I’m currently preparing the migration the migration of our tasks PostgreSQL 9.1 database to a PostgreSQL 9.3 database. We are switching to a hosted database instead of running our own server, because, frankly, we’re not good at operating database servers. The servers we are using to host our database were on huge machines: 2 hi.4xlarge7 instances. We can throw everything at them and I want to stop doing that. My goal is to migrate the database to 1 db.m2.2xlarge8 instance with 1000 provisioned IOPS.

As you probably noticed there is quite a big difference between 2 hi.4xlarge and 1 db.m1.xlarge. For example the latter has 2*120 times less IOPS - the resource we struggle with the most. I set this goal because I believe it is realistic, and that we only need these big machines because we are doing it wrong.

This blog post explains why clustering the table was crucial to archiving my goal!

CLUSTER

For the scope of this blog post I’m going to assume a very simple table schema:

CREATE TABLE tasks (
  id serial PRIMARY KEY, 
  title character varying(255), 
  list_id integer
);
CREATE INDEX index_list_id on tasks(list_id);

I generated 500,000,000 tasks in 500,000 lists and every 1000th tasks belongs to the same list.

The most frequent type of query is SELECT * FROM tasks WHERE list_id IN (?, ?, ?). Unfortunately this query is rather expensive especially when querying for many lists. The query I’m using for demonstration includes every 5000th list—100 in total:

EXPLAIN (ANALYZE, BUFFERS) SELECT * 
FROM tasks 
WHERE list_id IN (5000, 10000, ...);
------------------------------------- Query Plan
Index Scan using index_list_id on tasks
  Index Cond: (list_id = ANY ('{5000, 10000, ...}'::integer[]))
  Buffers: shared hit=5316 read=5120
Total runtime: 2478.740 ms
(4 rows)

Now lets repeat the same query on a clustered table which has the schema and data as tasks:

CLUSTER TABLE tasks_clustered USING index_list_id;
ANALYZE tasks_clustered;
EXPLAIN (ANALYZE, BUFFERS) SELECT * 
FROM tasks 
WHERE list_id IN (5000, 10000, ...);
------------------------------------- Query Plan
Index Scan using index_list_id on tasks_clustered
  Index Cond: (list_id = ANY ('{5000, 10000, ...}'::integer[]))
  Buffers: shared hit=399 read=199
Total runtime: 80.665 ms
(4 rows)

You can look at the full queries and query plans in a separate gist3. As you can see the query was >30 times faster on the clustered table because significantly less buffers were needed by PostgreSQL to respond.

What happened? That was my question exactly when I was experimenting with the data some time ago. What is CLUSTER actually doing?

When a table is clustered, it is physically reordered based on the index information.1

A clustered table doesn’t help when querying rows randomly. It can greatly increase performance when you query a range of index values or a single index value with multiple entries because the queried data is in one place on the disk.

Looking back at our query SELECT * FROM tasks WHERE list_id IN (?, ?, ?), it is clear why the clustered table is so much faster! The tasks are grouped together on the disk according to their list id. PostgreSQL can read every list’s tasks from disk without jumping around. Fast and convenient! For the unclustered table, however, the tasks for each list are spread across the disk.

Maintenance

While the benefits of a clustered table are obvious, there are things you need to consider before using it. Clustering is a one-time operation1, and updates, inserts, or deletes will fragment the table again. Depending on your use case you will probably be forced to cluster your table regularly to maintain the order. Clustering issues an ExclusiveLock1,4, and as a result you can neither read nor write while clustering.

When dealing with clustered tables you should set fillfactor2 appropriately. It will avoid fragmentation by enabling PostgreSQL to put the updated row on the same page as the original one.

I believe another possibility to cluster a table is to use pg_reorg which:

Reorganize tables in PostgreSQL databases without any locks.5

It looks promising, but I haven’t played around with it because there are only a few extensions available on AWS RDS PostgreSQL6.

Fin

It is hard to maintain a clustered table, but I’m still amazed by its impact and benefits. Clustering seems to be the solution for tables which suffer from too many reads from queries on a foreign key.

I would love to hear about your experiences with clustering and the techniques you used to maintain it!

Acknowledgements

I would like to thank Torsten for working with me on the database stuff and helping me write this blog post. I also want to thank Ryan for proofreading this blog post! <3

Sources

  1. PostgreSQL: Cluster
  2. PostgreSQL: Fillfactor
  3. Complete Query and Explain from example
  4. PostgreSQL: Locks
  5. PGFoundary: pg_reorg
  6. AWS RDS PostgreSQL
  7. AWS EC2: hi.4xlarge
  8. AWS RDS: db.m2.2xlarge
  9. How does CLUSTER ON improve index performance
  10. Lock-free clustering of large PostgreSQL data sets

If you read this far you should probably follow me on twitter.