End-to-End Monitoring Checklist for PostgreSQL Logical Replication (Publisher + Subscriber)

Logical replication in PostgreSQL is wonderfully flexible — and wonderfully easy to misdiagnose. When something goes wrong, the symptom is almost always the same (“the subscriber is behind”), but the cause can live in any of half a dozen places: WAL generation on the publisher, the replication slot, the network, the apply worker on the subscriber, a conflicting row, a long transaction, or disk pressure on either side.

This article is a practical checklist you can run top-to-bottom when you need to answer three questions, fast:

  1. Is replication actually running?
  2. Is it lagging?
  3. Where in the pipeline is it stuck?

It assumes PostgreSQL 14+ (most queries work on 12+, with a couple of column name differences noted).

The Pipeline You’re Actually Monitoring

Before the queries, get the mental model right. A single logical change travels through this path:

[Client COMMIT]
      │
      ▼
[Publisher WAL]  ──► [Replication Slot] ──► [walsender]
                                                  │
                                            (network)
                                                  │
                                                  ▼
                                        [Subscriber apply worker]
                                                  │
                                                  ▼
                                        [Subscriber tables]

Each arrow is a place lag can accumulate, and each box is a place replication can stall. The checklist below walks the pipeline in order.

Section 1 — Publisher Side

1.1 Are the publication and slot still there?

-- Publications
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete, pubtruncate
FROM pg_publication;

-- Replication slots backing the subscriptions
SELECT slot_name, plugin, slot_type, database, active, active_pid,
       restart_lsn, confirmed_flush_lsn, two_phase, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical';

What to look for:

  • active = false with no active_pid → the subscriber is not connected. Either it’s down, the network is broken, or credentials/pg_hba.conf are wrong.
  • wal_statusreserved (i.e. extended, unreserved, lost) → the slot is at risk or already broken. lost means the slot is dead and replication cannot resume from it — you’ll need to recreate the subscription or use CREATE SUBSCRIPTION ... WITH (copy_data = false) after reseeding.
  • safe_wal_size shrinking toward 0 → you’re approaching max_slot_wal_keep_size and the slot will be invalidated soon.

1.2 Is the walsender alive and what is it doing?

SELECT pid, usename, application_name, client_addr, state, sync_state,
       backend_start, backend_xmin,
       pg_current_wal_lsn() AS current_wal,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

This single view tells you almost everything about the publisher → subscriber link:

ColumnMeaningRed flag
statestreaming is healthycatchup for a long time, or startup stuck
sent_lsnLast WAL location sent over the wireFalls far behind pg_current_wal_lsn()publisher decoding/network bottleneck
write_lsnSubscriber acknowledged write to OSLags sent_lsnnetwork or subscriber I/O slow
flush_lsnSubscriber fsynced WALLags write_lsnsubscriber disk slow
replay_lsnSubscriber actually appliedLags flush_lsnsubscriber apply worker slow / blocked
write_lag, flush_lag, replay_lagSame thing as durationsUse these for alerting; they’re intervals

Quick lag-in-bytes calculation:

SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS pending_send_bytes,
       pg_wal_lsn_diff(sent_lsn,             flush_lsn)  AS pending_flush_bytes,
       pg_wal_lsn_diff(flush_lsn,            replay_lsn) AS pending_apply_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS total_lag_bytes
FROM pg_stat_replication;

The split between pending_send / pending_flush / pending_apply is the single most useful diagnostic. It tells you which arrow in the pipeline is the bottleneck.

1.3 Is logical decoding itself healthy?

pg_stat_replication_slots (PostgreSQL 14+) shows what decoding is actually doing:

See also: Mastering the Linux Command Line — Your Complete Free Training Guide

SELECT slot_name, spill_txns, spill_count, spill_bytes,
       stream_txns, stream_count, stream_bytes,
       total_txns, total_bytes
FROM pg_stat_replication_slots;
  • High and growing spill_* counters → transactions are too large for logical_decoding_work_mem and are spilling to disk on the publisher. This is a major source of replication lag for batch workloads.
  • High stream_* counters with streaming = on (PG14+) → in-progress transactions are being streamed to the subscriber, which is good for large transactions but worth knowing.

Fixes: raise logical_decoding_work_mem, break up huge transactions, or enable streaming = on on the subscription.

1.4 Is anything holding back the slot?

SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)         AS retained_wal_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS unconfirmed_bytes,
       active, wal_status
FROM pg_replication_slots
WHERE slot_type = 'logical';
  • retained_wal_bytes keeps growing → the publisher is hoarding WAL because the subscriber isn’t acknowledging. Watch disk on pg_wal/.
  • A dead or inactive slot is the #1 cause of “publisher disk filled up at 3 AM”. Always alert on active = false for more than a short threshold, and on wal_status not in (reserved, extended).

1.5 Publisher-side resource sanity

These are the failure modes that masquerade as replication issues:

-- Long-running transactions block VACUUM and bloat the slot's xmin horizon
SELECT pid, usename, state, xact_start, query_start,
       now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state <> 'idle' AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- WAL generation rate
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());

Pair these with OS-level checks: pg_wal/ disk usage, CPU on the walsender process, and network egress.

Section 2 — Subscriber Side

2.1 Are the subscription and apply worker running?

-- Configuration
SELECT subname, subenabled, subconninfo, subslotname,
       subsynccommit, substream, subtwophasestate, subdisableonerr
FROM pg_subscription;

-- Live workers
SELECT subname, pid, relid::regclass, received_lsn, last_msg_send_time,
       last_msg_receipt_time, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;

What to check:

  • subenabled = false → someone disabled it (or subdisableonerr did, after a conflict). Re-enable with ALTER SUBSCRIPTION ... ENABLE.
  • One row per subscription with a non-null pid → the apply worker is running.
  • Additional rows with non-null relidtable sync workers, meaning initial copy is still in progress for those tables.
  • now() - last_msg_receipt_time > a few seconds → the subscriber is not even receiving keepalives. Network or publisher problem.
  • latest_end_lsn not advancing while the publisher’s pg_current_wal_lsn() keeps moving → the subscriber is stuck, not just slow.

2.2 Is initial table sync finished?

SELECT srrelid::regclass AS table, srsubstate, srsublsn
FROM pg_subscription_rel;

srsubstate codes:

CodeMeaning
iinitialize
ddata is being copied
ffinished copy
ssynchronized
rready (streaming normally)

You want every row at r. Tables stuck at d for a long time mean the initial COPY is slow — usually disk or a large table without enough parallelism. PG16+ supports parallel apply (max_parallel_apply_workers_per_subscription) which helps once the table is past the copy phase, but the copy itself is still serial per table.

2.3 Has the apply worker errored?

pg_stat_subscription_stats (PG15+) is the single most useful view for “is replication silently broken”:

SELECT subname, apply_error_count, sync_error_count,
       confl_insert_exists, confl_update_origin_differs,
       confl_update_exists, confl_update_missing,
       confl_delete_origin_differs, confl_delete_missing,
       stats_reset
FROM pg_stat_subscription_stats;

Any non-zero error or conflict counter means the apply worker hit something it couldn’t handle. Then go straight to the logs — the details live there, not in the catalog.

2.4 The subscriber log is not optional

Tail it. Logical replication errors only appear in postgresql.log:

LOG:  logical replication apply worker for subscription "sub_orders" has started
ERROR: duplicate key value violates unique constraint "orders_pkey"
DETAIL: Key (id)=(42) already exists.
CONTEXT: processing remote data for replication origin "pg_16401" during "INSERT" for replication target relation "public.orders" in transaction 12345 finished at 0/1A2B3C4D

This is the canonical signal for the most common failure mode: a conflict. The subscriber retries the same transaction forever until you intervene. Options:

  • Fix the offending row on the subscriber so the apply succeeds.
  • Skip the broken transaction: ALTER SUBSCRIPTION sub_orders SKIP (lsn = '0/1A2B3C4D'); (PG15+).
  • On PG15+, set disable_on_error = true so a stuck conflict pages you instead of hammering the log.

2.5 Subscriber resource sanity

Even a healthy pipeline lags when the subscriber can’t keep up:

-- Apply worker is just a regular backend; check what it's waiting on
SELECT pid, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type IN ('logical replication apply worker',
                       'logical replication tablesync worker',
                       'logical replication parallel apply worker');

Common wait events:

  • Lock (especially transactionid or tuple) → another transaction on the subscriber is blocking apply. The apply worker has no special priority.
  • IO → disk bound. Check iostat, WAL on the subscriber, autovacuum activity.
  • ClientRead → waiting on the publisher (network or publisher-side bottleneck).

Also check: every replicated table needs either a primary key or REPLICA IDENTITY FULL, otherwise UPDATE/DELETE will error out on the subscriber. Verify with:

SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema');

relreplident: d = default (PK), f = full, n = nothing (broken for UPDATE/DELETE), i = index.

Section 3 — Where Is It Stuck? A Decision Tree

Run the queries above, then walk this tree. It maps symptoms to the segment of the pipeline at fault.

total_lag_bytes is large
│
├── pending_send_bytes is most of it
│     → Publisher decoding is the bottleneck
│       • Check pg_stat_replication_slots: high spill_bytes?
│       • Long-running transaction on publisher? (xmin horizon)
│       • CPU on walsender process?
│       • Fix: raise logical_decoding_work_mem; enable streaming;
│              shorten long transactions
│
├── pending_flush_bytes is most of it
│     → Network or subscriber I/O
│       • Is sent_lsn advancing but write/flush not?
│       • Network throughput between sites
│       • Subscriber WAL disk latency (it fsyncs received WAL)
│
├── pending_apply_bytes is most of it
│     → Subscriber apply worker
│       • pg_stat_activity wait_event for the apply worker
│       • Lock contention with other subscriber workload?
│       • Missing index on the replicated table (UPDATE/DELETE
│         without an index does a seq scan PER ROW)
│       • Conflict loop? Check log + apply_error_count
│       • Single big transaction? Enable streaming + parallel apply
│
└── No advance at all + slot active=false
      → Connection is broken
        • pg_hba.conf, password, replication role
        • Network ACL / security group
        • Subscriber's pg_subscription.subconninfo correct?

Section 4 — A Minimal Always-On Monitoring Set

If you only have time to wire up a handful of metrics, these are the ones that pay off:

On the publisher

  1. pg_replication_slots.active per logical slot — alert when false for > N minutes.
  2. pg_replication_slots.wal_status — alert when not in (reserved, extended).
  3. pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) per slot — primary lag-in-bytes metric.
  4. pg_stat_replication.replay_lag per subscriber — primary lag-in-time metric.
  5. pg_stat_replication_slots.spill_bytes rate — early warning for decoding pressure.
  6. pg_wal/ disk usage and free space.

On the subscriber

  1. pg_subscription.subenabled — alert when false.
  2. pg_stat_subscription.pid IS NOT NULL — apply worker alive.
  3. now() - pg_stat_subscription.last_msg_receipt_time — alert when > a few seconds.
  4. pg_stat_subscription_stats.apply_error_count (PG15+) — alert on any increase.
  5. Count of pg_subscription_rel rows where srsubstate <> 'r' — should be 0 in steady state.
  6. Apply worker’s wait_event distribution — to know why it’s slow when it is.

Wire these into Prometheus (postgres_exporter exposes most of them out of the box) or into whatever you already use. The combination of (1) lag-in-bytes from the publisher, (2) apply error count from the subscriber, and (3) the slot’s active flag will catch the vast majority of real incidents before users notice.

The One Sentence Version

When logical replication misbehaves, pg_stat_replication on the publisher tells you which segment of the pipeline is slow, pg_stat_subscription_stats on the subscriber tells you whether it’s silently broken, and the subscriber’s log tells you exactly which row killed it — run those three first, in that order, and you’ll diagnose 95% of incidents in minutes instead of hours.

David Cao
David Cao

David is a Cloud & DevOps Enthusiast. He has years of experience as a Linux engineer. He had working experience in AMD, EMC. He likes Linux, Python, bash, and more. He is a technical blogger and a Software Engineer. He enjoys sharing his learning and contributing to open-source.

Articles: 660

Leave a Reply

Your email address will not be published. Required fields are marked *