2 minute read

PostgreSQL replication is a critical component of database high availability and disaster recovery. Ensuring that replication is running smoothly and identifying any issues early is essential to maintaining database reliability.

Replication lag is one of the most crucial metrics to monitor. Lag refers to the delay between the primary and the standby server. A significant lag could indicate network issues, resource constraints, or slow I/O or busy standby/master Node, or locks in the standby node.

In this post, I will walk you through different methods and tools to monitor PostgreSQL replication effectively.

Prerequisites

Before you begin monitoring PostgreSQL replication, ensure that:

  • You have a PostgreSQL master-slave (primary-secondary) replication setup.
  • You have appropriate permissions to access both the primary and replica servers.
  • Monitoring tools like psql, pg_stat_replication, and pg_stat_wal_receiver are enabled.

MASTER

  • Slots Info
    SELECT * FROM pg_replication_slots;
    
  • Replication Slots & Lag in Bytes
    SELECT slot_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replicationSlotLag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmedLag,
    active
    FROM pg_replication_slots;
    
  • Check if all walreceiver processes are active or not
    SELECT * FROM pg_stat_replication;
    
  • Important fields in pg_stat_replication to see connected clients
    SELECT pid, application_name, client_addr, client_hostname, state, sync_state, replay_lag FROM pg_stat_replication;
    
  • Check Lag stats
    SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;
    
  • Traffic Source (Based on Application Name)
    SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;
    
  • Check WAL File Numbers (Ensure They Change Over Time)
    SELECT pg_walfile_name(restart_lsn) FROM pg_replication_slots;
    
  • Check WAL Receive & Apply Stats (Check Current Time)
    SELECT application_name, client_addr, client_port, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_state,
    (extract(epoch from (now() - reply_time)))::int FROM pg_stat_replication;
    

STANDBY / SLAVE

Important: Replication Lag can be divided into two parts:

  1. WAL Receive Lag: The time it takes for the standby to receive the WAL file from the primary.
  2. WAL Apply Lag: The time it takes for the standby to apply the received WAL file to the replica.
    • These are the issues are at the Standby side. du -sch * /<data-dir>/main will show increasing disk usage over time. Once the disk usage is saturated, the standby will stop receiving new WAL files from the primary, causing a backlog in primary/Master node and will accumulate WAL files in the WAL log of Master node.
  • Check Hosts Resource Usage
    top -cd1
    free -mh
    grep "out of" /var/log/syslog
    du -sch * /<data-dir>/main     # If Standby is not ab
    
  • WAL Receive & Apply Stats (Check Current Time)
    SELECT pg_is_in_recovery(), pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), now()-pg_last_xact_replay_timestamp() AS ReplicaLag;
    
  • Check Disk Utilization (Performance) Using iostat
    iostat -dx /dev/vda
    iostat -x -d -m 1 /dev/vda /dev/vdb /dev/vdc /dev/vdd
    iostat -x -d -m 1  /dev/vda /dev/vdb /dev/vdc /dev/vdd | awk '{printf "%-12s %-10s %-10s %-10s %-10s %-10s\n", $1, $2, $3, $8, $9, $21}'
    
  • Check Lag in Bytes
    SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive,
    pg_last_wal_replay_lsn() AS replay,
    pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS synced,
    (
     EXTRACT(EPOCH FROM now()) -
     EXTRACT(EPOCH FROM pg_last_xact_replay_timestamp())
    )::int AS lag;
    
  • Check Long-Running Queries
    SELECT
    pid, datname, client_addr, client_port,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
    FROM pg_stat_activity
    WHERE (now() - pg_stat_activity.query_start) > interval '15 minutes'
    AND state = 'active'
    ORDER BY duration DESC;
    
  • Cancel/Terminate a Specific Query Using PID
    SELECT pg_terminate_backend(<PID>);
    
  • Kill All Long-Running Queries Older Than 15 Minutes
    SELECT pid, pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE (now() - pg_stat_activity.query_start) > interval '15 minutes'
    
  • Currently Running Queries
    SELECT query, state, usename, datname, now() - backend_start AS execution_time
    FROM pg_stat_activity
    ORDER BY execution_time DESC;
    
  • WAL Receive Stats
    SELECT pid, slot_name, status, receive_start_lsn, written_lsn, flushed_lsn, last_msg_receipt_time FROM pg_stat_wal_receiver;