Causes
Each clue points to different causes, so we need to examine the combinations and eliminate the causes that do not apply. In most cases we will need to run scripts on the databases to determine which potential cause is the culprit for a given incident, but in some cases we can figure it out without scripts.
Potential causes
- Lock contention
- Long-running transactions
- Expensive queries
- Idle connections
- VACUUM is not running/running continuously
- Redis falling over (e.g. running out of memory)
- PostgreSQL is in recovery
Lock contention
This issue arises when two or more queries attempt to acquire the same lock or conflicting locks on rows in the DB. It is usually quite easy to rule this one out, as the following query will show if any queries are blocking others due to conflicting locks:
SELECT
activity.pid,
activity.usename,
activity.query,
blocking.pid AS blocking_id,
blocking.query AS blocking_query
FROM
pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON
blocking.pid = ANY(
pg_blocking_pids(activity.pid)
)
ORDER BY
blocking_id;
If a query is blocking other queries, we’ll need to decide whether or not it’s safe to cancel it. Cancelling a query can result in data loss, so it should be a last resort.
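If we do decide to intervene, PostgreSQL’s admin functions can do it. As a rough sketch (the pid here is a placeholder for the blocking pid found with the query above):
-- Gentler option: cancels only the backend's current query
SELECT pg_cancel_backend(12345);  -- replace 12345 with the blocking pid
-- Last resort: terminates the whole connection
SELECT pg_terminate_backend(12345);
pg_cancel_backend is the less disruptive of the two, since the connection itself survives and the client sees a cancellation error rather than a dropped connection.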
Long-running queries
The following query shows which active queries have been running the longest:
SELECT
pid,
datname,
usename,
application_name,
client_hostname,
client_port,
backend_start,
query_start,
query,
state
FROM
pg_stat_activity
WHERE
state = 'active'
ORDER BY
query_start ASC;
To investigate slow queries, it is important to run EXPLAIN ANALYSE on them. IMPORTANT: the ANALYSE keyword actually executes the query, so it is only safe to use on SELECT queries. For queries that UPDATE, INSERT, or DELETE, use plain EXPLAIN. Working through the output of EXPLAIN ANALYSE is beyond the scope of this doc at present; more on this will be added in the future.
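As a quick illustration (the table and column names here are invented for the example):
-- Safe: EXPLAIN ANALYSE executes the SELECT and reports actual row counts and timings
EXPLAIN ANALYSE
SELECT * FROM orders WHERE customer_id = 42;
-- For statements that modify data, stick to plain EXPLAIN, which plans without executing
EXPLAIN
UPDATE orders SET status = 'shipped' WHERE id = 123;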
Expensive queries
Expensive queries are not necessarily slow queries, although they usually end up being slow. The bigger problem with expensive queries is that they consume more resources than the system can spare, which slows down other processes and can lead to the instance running out of memory. In extreme cases, an expensive query can start to fill the instance’s buffer memory, and the effects can snowball. This hit us last year: the BufferIO wait events continued to grow even after all of the queries that seemed to be causing the issue had been cancelled, and we ended up having to restart the reader instance, which is not ideal.
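One way to find expensive queries is the pg_stat_statements view, assuming the extension is enabled on the instance. The column names below are for PostgreSQL 13 and later; older versions use total_time and mean_time instead:
SELECT
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
rows,
query
FROM
pg_stat_statements
ORDER BY
total_exec_time DESC
LIMIT 20;
Sorting by total_exec_time surfaces the queries that consume the most cumulative time, which is often a better signal of overall load than the single slowest execution.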
Idle connections
Idle connections still consume memory, and if enough of them hang around they can slow other processes down. They can be found by looking at pg_stat_activity:
SELECT
pid,
usename,
pg_stat_activity.query_start,
now() - pg_stat_activity.query_start AS query_time,
query,
state,
wait_event_type,
wait_event
FROM
pg_stat_activity
WHERE
(
now() - pg_stat_activity.query_start
) > INTERVAL '5 minutes';
The state will report as idle, which generally means that the connection is being held but not used. This can occur for a number of reasons, including RDS Proxy pinning the connection. In general, there will always be some number of connections in the idle state, but if there are more than usual it can signal a deeper problem.
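For a quick sense of whether the idle count is abnormal, a simple breakdown by state can be compared against what is normal for the instance:
SELECT
state,
count(*) AS connections
FROM
pg_stat_activity
GROUP BY
state
ORDER BY
connections DESC;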
Vacuum not running/running continuously
Tables need to be vacuumed so that their indices function correctly and they do not become bloated. If a table is very busy or very large, the vacuum process might not run often enough to clean it up, and when it does run it can take so long that normal activity on the table either slows the vacuum down or prevents it from completing at all. The following query shows when each table was last vacuumed, as well as some statistics for the table:
SELECT
relname AS TableName,
n_live_tup AS LiveTuples,
n_dead_tup AS DeadTuples,
last_autovacuum AS Autovacuum,
last_autoanalyze AS Autoanalyze
FROM
pg_stat_user_tables
ORDER BY
n_dead_tup DESC;
If you need to see whether any vacuum processes are stuck, you can use the following query:
SELECT
datname,
usename,
pid,
state,
wait_event,
current_timestamp - xact_start AS xact_runtime,
query
FROM
pg_stat_activity
WHERE
upper(query) LIKE '%VACUUM%'
ORDER BY
xact_start;
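On PostgreSQL 9.6 and later, the pg_stat_progress_vacuum view shows how far each running vacuum has progressed, which helps distinguish a slow-but-progressing vacuum from a genuinely stuck one (exact column names vary slightly between major versions):
SELECT
pid,
datname,
relid::regclass AS table_name,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed
FROM
pg_stat_progress_vacuum;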
Redis falling over
If Redis runs out of memory and starts timing out, API calls that depend on it get slower, and cache misses fall back to direct DB calls, hitting the DB harder. Check the Infra dashboard on CloudWatch to see if Redis is running out of memory.
PostgreSQL is in recovery mode
Run the following query to see if PostgreSQL is in recovery mode. In recovery mode, a PostgreSQL server replays write-ahead logs (WAL), either to restore data after a crash or to operate as a standby server in a replication setup. The check is particularly useful for identifying whether the instance is a primary or a standby, or for confirming that the database has fully recovered after a failure.
SELECT pg_is_in_recovery();
Note that this check is only meaningful on the writer instance: a reader instance is permanently in recovery, so the query above will always return true there. Always check that you are on the correct instance before running anything that affects the system settings.