Causes
Each clue points to different causes, so we need to examine the combinations and eliminate the causes that do not apply. In most cases we will need to run scripts on the databases to determine which potential cause is the culprit for a given incident, but in some cases we can figure it out without scripts.
Potential causes
- Lock contention
- Long-running transactions
- Expensive queries
- Idle connections
- VACUUM is not running/running continuously
- Redis falling over (e.g. running out of memory)
- PostgreSQL is in recovery
Lock contention
This issue arises when two or more queries attempt to acquire the same lock or conflicting locks on rows in the DB. It is usually quite easy to rule this one out, as the following query will show if any queries are blocking others due to conflicting locks:
SELECT
activity.pid,
activity.usename,
activity.query,
blocking.pid AS blocking_id,
blocking.query AS blocking_query
FROM
pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON
blocking.pid = ANY(
pg_blocking_pids(activity.pid)
)
ORDER BY
blocking_id;
If a query is blocking other queries, we’ll need to decide whether or not it’s safe to cancel it. Cancelling a query can result in data loss, so it should be a last resort.
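If we do decide to intervene, PostgreSQL’s admin functions can do it. As a rough sketch (the pid here is a placeholder for the blocking pid found with the query above):
-- Gentler option: cancels only the backend's current query
SELECT pg_cancel_backend(12345);  -- replace 12345 with the blocking pid
-- Last resort: terminates the whole connection
SELECT pg_terminate_backend(12345);
pg_cancel_backend is the less disruptive of the two, since the connection itself survives and the client sees a cancellation error rather than a dropped connection.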
Long-running queries
The following query shows which active queries have been running the longest:
SELECT
pid,
datname,
usename,
application_name,
client_hostname,
client_port,
backend_start,
query_start,
query,
state
FROM
pg_stat_activity
WHERE
state = 'active'
ORDER BY
query_start ASC;
To investigate slow queries, it is important to run EXPLAIN ANALYSE on them. IMPORTANT: the ANALYSE keyword actually executes the query, so it is only safe to use on SELECT queries. For queries that UPDATE, INSERT, or DELETE, use plain EXPLAIN. Working through the output of EXPLAIN ANALYSE is beyond the scope of this doc at present; more on this will be added in the future.
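As a quick illustration (the table and column names here are invented for the example):
-- Safe: EXPLAIN ANALYSE executes the SELECT and reports actual row counts and timings
EXPLAIN ANALYSE
SELECT * FROM orders WHERE customer_id = 42;
-- For statements that modify data, stick to plain EXPLAIN, which plans without executing
EXPLAIN
UPDATE orders SET status = 'shipped' WHERE id = 123;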
Expensive queries
Expensive queries are not necessarily slow queries, although they usually end up being slow. The bigger problem with expensive queries is that they consume more resources than the system can spare, which slows down other processes and can lead to the instance running out of memory. In extreme cases, an expensive query can start to fill the instance’s buffer memory, and the effects can snowball. This hit us last year: the BufferIO wait events continued to grow even after all of the queries that seemed to be causing the issue had been cancelled, and we ended up having to restart the reader instance, which is not ideal.
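One way to find expensive queries is the pg_stat_statements view, assuming the extension is enabled on the instance. The column names below are for PostgreSQL 13 and later; older versions use total_time and mean_time instead:
SELECT
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
rows,
query
FROM
pg_stat_statements
ORDER BY
total_exec_time DESC
LIMIT 20;
Sorting by total_exec_time surfaces the queries that consume the most cumulative time, which is often a better signal of overall load than the single slowest execution.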
Idle connections
Idle connections still consume memory, and if enough of them hang around they can slow other processes down. They can be found by looking at pg_stat_activity:
SELECT
pid,
usename,
pg_stat_activity.query_start,
now() - pg_stat_activity.query_start AS query_time,
query,
state,
wait_event_type,
wait_event
FROM
pg_stat_activity
WHERE
(
now() - pg_stat_activity.query_start
) > INTERVAL '5 minutes';
The state will report as idle, which generally means that the connection is being held but not used. This can occur for a number of reasons, including RDS Proxy pinning the connection. In general, there will always be some number of connections in the idle state, but if there are more than usual it can signal a deeper problem.
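For a quick sense of whether the idle count is abnormal, a simple breakdown by state can be compared against what is normal for the instance:
SELECT
state,
count(*) AS connections
FROM
pg_stat_activity
GROUP BY
state
ORDER BY
connections DESC;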
Vacuum not running/running continuously
Tables need to be vacuumed so that their indices function correctly and they do not become bloated. If a table is very busy or very large, the vacuum process might not run often enough to clean it up, and when it does run it can take so long that normal activity on the table either slows the vacuum down or prevents it from completing at all. The following query shows when each table was last vacuumed, as well as some statistics for the table:
SELECT
relname AS TableName,
n_live_tup AS LiveTuples,
n_dead_tup AS DeadTuples,
last_autovacuum AS Autovacuum,
last_autoanalyze AS Autoanalyze
FROM
pg_stat_user_tables
ORDER BY
n_dead_tup DESC;
If you need to see whether any vacuum processes are stuck, you can use the following query:
SELECT
datname,
usename,
pid,
state,
wait_event,
current_timestamp - xact_start AS xact_runtime,
query
FROM
pg_stat_activity
WHERE
upper(query) LIKE '%VACUUM%'
ORDER BY
xact_start;
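On PostgreSQL 9.6 and later, the pg_stat_progress_vacuum view shows how far each running vacuum has progressed, which helps distinguish a slow-but-progressing vacuum from a genuinely stuck one (exact column names vary slightly between major versions):
SELECT
pid,
datname,
relid::regclass AS table_name,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed
FROM
pg_stat_progress_vacuum;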
Redis falling over
If Redis runs out of memory and starts timing out, API calls that depend on it get slower, and cache misses fall back to direct DB calls, hitting the DB harder. Check the Infra dashboard on CloudWatch to see if Redis is running out of memory.
PostgreSQL is in recovery mode
Run the following query to see if PostgreSQL is in recovery mode. In recovery mode, a PostgreSQL server replays write-ahead logs (WAL), either to restore data after a crash or to operate as a standby server in a replication setup. The check is particularly useful for identifying whether the instance is a primary or a standby, or for confirming that the database has fully recovered after a failure.
SELECT pg_is_in_recovery();
Note that this check is only meaningful on the writer instance: a reader instance is permanently in recovery, so the query above will always return true there. Always check that you are on the correct instance before running anything that affects the system settings.