Logical replication in PostgreSQL is wonderfully flexible — and wonderfully easy to misdiagnose. When something goes wrong, the symptom is almost always the same (“the subscriber is behind”), but the cause can live in any of half a dozen places: WAL generation on the publisher, the replication slot, the network, the apply worker on the subscriber, a conflicting row, a long transaction, or disk pressure on either side.
This article is a practical checklist you can run top-to-bottom when you need to answer three questions, fast:
- Is replication actually running?
- Is it lagging?
- Where in the pipeline is it stuck?
It assumes PostgreSQL 14+ (most queries work on 12+, with a couple of column name differences noted).
Table of Contents
The Pipeline You’re Actually Monitoring
Before the queries, get the mental model right. A single logical change travels through this path:
[Client COMMIT]
│
▼
[Publisher WAL] ──► [Replication Slot] ──► [walsender]
│
(network)
│
▼
[Subscriber apply worker]
│
▼
[Subscriber tables]
Each arrow is a place lag can accumulate, and each box is a place replication can stall. The checklist below walks the pipeline in order.
Section 1 — Publisher Side
1.1 Are the publication and slot still there?
-- Publications
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete, pubtruncate
FROM pg_publication;
-- Replication slots backing the subscriptions
SELECT slot_name, plugin, slot_type, database, active, active_pid,
restart_lsn, confirmed_flush_lsn, two_phase, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical';
What to look for:
active = falsewith noactive_pid→ the subscriber is not connected. Either it’s down, the network is broken, or credentials/pg_hba.confare wrong.wal_status≠reserved(i.e.extended,unreserved,lost) → the slot is at risk or already broken.lostmeans the slot is dead and replication cannot resume from it — you’ll need to recreate the subscription or useCREATE SUBSCRIPTION ... WITH (copy_data = false)after reseeding.safe_wal_sizeshrinking toward 0 → you’re approachingmax_slot_wal_keep_sizeand the slot will be invalidated soon.
1.2 Is the walsender alive and what is it doing?
SELECT pid, usename, application_name, client_addr, state, sync_state,
backend_start, backend_xmin,
pg_current_wal_lsn() AS current_wal,
sent_lsn, write_lsn, flush_lsn, replay_lsn,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
This single view tells you almost everything about the publisher → subscriber link:
| Column | Meaning | Red flag |
|---|---|---|
state | streaming is healthy | catchup for a long time, or startup stuck |
sent_lsn | Last WAL location sent over the wire | Falls far behind pg_current_wal_lsn() → publisher decoding/network bottleneck |
write_lsn | Subscriber acknowledged write to OS | Lags sent_lsn → network or subscriber I/O slow |
flush_lsn | Subscriber fsynced WAL | Lags write_lsn → subscriber disk slow |
replay_lsn | Subscriber actually applied | Lags flush_lsn → subscriber apply worker slow / blocked |
write_lag, flush_lag, replay_lag | Same thing as durations | Use these for alerting; they’re intervals |
Quick lag-in-bytes calculation:
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_send_bytes,
pg_wal_lsn_diff(sent_lsn, flush_lsn) AS pending_flush_bytes,
pg_wal_lsn_diff(flush_lsn, replay_lsn) AS pending_apply_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS total_lag_bytes
FROM pg_stat_replication;
The split between pending_send / pending_flush / pending_apply is the single most useful diagnostic. It tells you which arrow in the pipeline is the bottleneck.
1.3 Is logical decoding itself healthy?
pg_stat_replication_slots (PostgreSQL 14+) shows what decoding is actually doing:
See also: Mastering the Linux Command Line — Your Complete Free Training Guide
SELECT slot_name, spill_txns, spill_count, spill_bytes,
stream_txns, stream_count, stream_bytes,
total_txns, total_bytes
FROM pg_stat_replication_slots;
- High and growing
spill_*counters → transactions are too large forlogical_decoding_work_memand are spilling to disk on the publisher. This is a major source of replication lag for batch workloads. - High
stream_*counters withstreaming = on(PG14+) → in-progress transactions are being streamed to the subscriber, which is good for large transactions but worth knowing.
Fixes: raise logical_decoding_work_mem, break up huge transactions, or enable streaming = on on the subscription.
1.4 Is anything holding back the slot?
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS unconfirmed_bytes,
active, wal_status
FROM pg_replication_slots
WHERE slot_type = 'logical';
retained_wal_byteskeeps growing → the publisher is hoarding WAL because the subscriber isn’t acknowledging. Watch disk onpg_wal/.- A dead or inactive slot is the #1 cause of “publisher disk filled up at 3 AM”. Always alert on
active = falsefor more than a short threshold, and onwal_statusnot in (reserved,extended).
1.5 Publisher-side resource sanity
These are the failure modes that masquerade as replication issues:
-- Long-running transactions block VACUUM and bloat the slot's xmin horizon
SELECT pid, usename, state, xact_start, query_start,
now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state <> 'idle' AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
-- WAL generation rate
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());
Pair these with OS-level checks: pg_wal/ disk usage, CPU on the walsender process, and network egress.
Section 2 — Subscriber Side
2.1 Are the subscription and apply worker running?
-- Configuration
SELECT subname, subenabled, subconninfo, subslotname,
subsynccommit, substream, subtwophasestate, subdisableonerr
FROM pg_subscription;
-- Live workers
SELECT subname, pid, relid::regclass, received_lsn, last_msg_send_time,
last_msg_receipt_time, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;
What to check:
subenabled = false→ someone disabled it (orsubdisableonerrdid, after a conflict). Re-enable withALTER SUBSCRIPTION ... ENABLE.- One row per subscription with a non-null
pid→ the apply worker is running. - Additional rows with non-null
relid→ table sync workers, meaning initial copy is still in progress for those tables. now() - last_msg_receipt_time> a few seconds → the subscriber is not even receiving keepalives. Network or publisher problem.latest_end_lsnnot advancing while the publisher’spg_current_wal_lsn()keeps moving → the subscriber is stuck, not just slow.
2.2 Is initial table sync finished?
SELECT srrelid::regclass AS table, srsubstate, srsublsn
FROM pg_subscription_rel;
srsubstate codes:
| Code | Meaning |
|---|---|
i | initialize |
d | data is being copied |
f | finished copy |
s | synchronized |
r | ready (streaming normally) |
You want every row at r. Tables stuck at d for a long time mean the initial COPY is slow — usually disk or a large table without enough parallelism. PG16+ supports parallel apply (max_parallel_apply_workers_per_subscription) which helps once the table is past the copy phase, but the copy itself is still serial per table.
2.3 Has the apply worker errored?
pg_stat_subscription_stats (PG15+) is the single most useful view for “is replication silently broken”:
SELECT subname, apply_error_count, sync_error_count,
confl_insert_exists, confl_update_origin_differs,
confl_update_exists, confl_update_missing,
confl_delete_origin_differs, confl_delete_missing,
stats_reset
FROM pg_stat_subscription_stats;
Any non-zero error or conflict counter means the apply worker hit something it couldn’t handle. Then go straight to the logs — the details live there, not in the catalog.
2.4 The subscriber log is not optional
Tail it. Logical replication errors only appear in postgresql.log:
LOG: logical replication apply worker for subscription "sub_orders" has started
ERROR: duplicate key value violates unique constraint "orders_pkey"
DETAIL: Key (id)=(42) already exists.
CONTEXT: processing remote data for replication origin "pg_16401" during "INSERT" for replication target relation "public.orders" in transaction 12345 finished at 0/1A2B3C4D
This is the canonical signal for the most common failure mode: a conflict. The subscriber retries the same transaction forever until you intervene. Options:
- Fix the offending row on the subscriber so the apply succeeds.
- Skip the broken transaction:
ALTER SUBSCRIPTION sub_orders SKIP (lsn = '0/1A2B3C4D');(PG15+). - On PG15+, set
disable_on_error = trueso a stuck conflict pages you instead of hammering the log.
2.5 Subscriber resource sanity
Even a healthy pipeline lags when the subscriber can’t keep up:
-- Apply worker is just a regular backend; check what it's waiting on
SELECT pid, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type IN ('logical replication apply worker',
'logical replication tablesync worker',
'logical replication parallel apply worker');
Common wait events:
Lock(especiallytransactionidortuple) → another transaction on the subscriber is blocking apply. The apply worker has no special priority.IO→ disk bound. Checkiostat, WAL on the subscriber, autovacuum activity.ClientRead→ waiting on the publisher (network or publisher-side bottleneck).
Also check: every replicated table needs either a primary key or REPLICA IDENTITY FULL, otherwise UPDATE/DELETE will error out on the subscriber. Verify with:
SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema');
relreplident: d = default (PK), f = full, n = nothing (broken for UPDATE/DELETE), i = index.
Section 3 — Where Is It Stuck? A Decision Tree
Run the queries above, then walk this tree. It maps symptoms to the segment of the pipeline at fault.
total_lag_bytes is large
│
├── pending_send_bytes is most of it
│ → Publisher decoding is the bottleneck
│ • Check pg_stat_replication_slots: high spill_bytes?
│ • Long-running transaction on publisher? (xmin horizon)
│ • CPU on walsender process?
│ • Fix: raise logical_decoding_work_mem; enable streaming;
│ shorten long transactions
│
├── pending_flush_bytes is most of it
│ → Network or subscriber I/O
│ • Is sent_lsn advancing but write/flush not?
│ • Network throughput between sites
│ • Subscriber WAL disk latency (it fsyncs received WAL)
│
├── pending_apply_bytes is most of it
│ → Subscriber apply worker
│ • pg_stat_activity wait_event for the apply worker
│ • Lock contention with other subscriber workload?
│ • Missing index on the replicated table (UPDATE/DELETE
│ without an index does a seq scan PER ROW)
│ • Conflict loop? Check log + apply_error_count
│ • Single big transaction? Enable streaming + parallel apply
│
└── No advance at all + slot active=false
→ Connection is broken
• pg_hba.conf, password, replication role
• Network ACL / security group
• Subscriber's pg_subscription.subconninfo correct?
Section 4 — A Minimal Always-On Monitoring Set
If you only have time to wire up a handful of metrics, these are the ones that pay off:
On the publisher
pg_replication_slots.activeper logical slot — alert when false for > N minutes.pg_replication_slots.wal_status— alert when not in (reserved,extended).pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)per slot — primary lag-in-bytes metric.pg_stat_replication.replay_lagper subscriber — primary lag-in-time metric.pg_stat_replication_slots.spill_bytesrate — early warning for decoding pressure.pg_wal/disk usage and free space.
On the subscriber
pg_subscription.subenabled— alert when false.pg_stat_subscription.pid IS NOT NULL— apply worker alive.now() - pg_stat_subscription.last_msg_receipt_time— alert when > a few seconds.pg_stat_subscription_stats.apply_error_count(PG15+) — alert on any increase.- Count of
pg_subscription_relrows wheresrsubstate <> 'r'— should be 0 in steady state. - Apply worker’s
wait_eventdistribution — to know why it’s slow when it is.
Wire these into Prometheus (postgres_exporter exposes most of them out of the box) or into whatever you already use. The combination of (1) lag-in-bytes from the publisher, (2) apply error count from the subscriber, and (3) the slot’s active flag will catch the vast majority of real incidents before users notice.
The One Sentence Version
When logical replication misbehaves,
pg_stat_replicationon the publisher tells you which segment of the pipeline is slow,pg_stat_subscription_statson the subscriber tells you whether it’s silently broken, and the subscriber’s log tells you exactly which row killed it — run those three first, in that order, and you’ll diagnose 95% of incidents in minutes instead of hours.




