Bogus recovery conflicts due to outdated XLogReceiptTime

  • Jump to comment-1
    Andres Freund<andres@anarazel.de>
    Apr 29, 2026, 3:50 PM UTC
    Hi,
    While testing something independent, I had a replication setup with pgbench
    r/w running on the primary and pgbench r/o on the standby, with a replication
    slot setup and hotstandbyfeedback=on.
    When I saw recovery conflicts, I got worried that we screwed up something in
    the recent changes around pruning horizons. But that wasn't it. All the
    conflicts I was seeing were due to buffer pin conflicts.
    As the replica was keeping up perfectly, with replay lag well below
    maxstandbystreaming_delay, that should never have triggered conflicts.
    Investigating this was not helped by the fact that we don't log anything
    from the startup process when
    LockBufferForCleanup()
    -> ResolveRecoveryConflictWithBufferPin()
    enters the "We're already behind" path to SendRecoveryConflictWithBufferPin().
    I added some logging and found that GetStandbyLimitTime() thought we were
    quite far behind, even though the actual replay lag was < 1s.
    I think the reason for these conflicts is very old and it's that basically the
    basis on which we compute GetStandbyLimitTime() is bogus, when streaming:
    /*
     * Walreceiver is active, so see if new data has arrived.
     *
     * We only advance XLogReceiptTime when we obtain fresh
     * WAL from walreceiver and observe that we had already
     * processed everything before the most recent "chunk"
     * that it flushed to disk.  In steady state where we are
     * keeping up with the incoming data, XLogReceiptTime will
     * be updated on each cycle. When we are behind,
     * XLogReceiptTime will not advance, so the grace time
     * allotted to conflicting queries will decrease.
     */
    if (RecPtr < flushedUpto)
    	havedata = true;
    else
    {
    	XLogRecPtr	latestChunkStart;
    
    	flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
    	if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
    	{
    		havedata = true;
    		if (latestChunkStart <= RecPtr)
    		{
    			XLogReceiptTime = GetCurrentTimestamp();
    			SetCurrentChunkStartTime(XLogReceiptTime);
    		}
    	}
    	else
    		havedata = false;
    I think the logic explained in the comment doesn't make much sense (but it's
    not trivial to come up with something good, see below). If the standby is
    keeping up, the primary is busy and walsender & walreceiver are keeping up,
    it's entirely normal that the walreceiver would have received one or more
    additional "chunks" from the primary. If the primary is committing frequently
    and the standby is keeping up receiving the data, walsender will send out data
    in small chunks. Thereby making it very likely that the latest chunk is newer
    than what we just processed.
    I've seen now-rtime in GetStandbyLimitTime() be many minutes, even though the
    actual replay lag was always under 2s, and most of the time < 0.5s.
    Tracing through the history (a lot of moving around), this seems to have
    originated in:
    commit e76c1a0f4d2127f11c72c02b3d73a5dcb4517173
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date: 2010-07-03 20:43:58 +0000
    Replace max_standby_delay with two parameters, max_standby_archive_delay and
    max_standby_streaming_delay, and revise the implementation to avoid assuming
    that timestamps found in WAL records can meaningfully be compared to clock
    time on the standby server.  Instead, the delay limits are compared to the
    elapsed time since we last obtained a new WAL segment from archive or since
    we were last "caught up" to WAL data arriving via streaming replication.
    This avoids problems with clock skew between primary and standby, as well
    as other corner cases that the original coding would misbehave in, such
    as the primary server having significant idle time between transactions.
    Per my complaint some time ago and considerable ensuing discussion.
    
    Do some desultory editing on the hot standby documentation, too.
    I think I've observed this problem in production setups, i.e. recovery
    conflicts being triggered despite everything seemingly keeping up, with no
    "source" of recovery conflicts apparent in the log. I just never had enough
    data to figure out what's going on.
    It really doesn't help that XLogReceiptTime is basically unobservable in stock
    postgres. Recovery doesn't log the fact that it thinks we are behind (or how
    far), it doesn't even log that it triggers a recovery conflicts in that case
    (the cancellation of is logged though).
    Now, how to fix that:
    It doesn't seem trivial. As the commit message cited above explains, we don't
    want to rely on the timestamps from the primary, so we can't just use
    XLogRecoveryCtl->recoveryLastXTime or such. And we can't just always update
    XLogReceiptTime in WaitForWALToBecomeAvailable() (or even more so in
    walreceiver whenever we receive data), because that'd lead to XLogReceiptTime
    being advanced way too aggressively.
    I wonder if a decent minimal improvement would be to update it whenever we had
    to update the XLogReceiptTime whenever we actually had to refresh flushedUpto
    from the walreceiver and the new position is >= RecPtr. While that doesn't
    entirely avoid updating XLogReceiptTime too aggressively, I think it may bound
    it sufficiently, because if we are behind, we don't need to call
    GetWalRcvFlushRecPtr() very often - we only need to do so again, after we
    replayed all the WAL that was available the last time. It be a lot more
    accurate than what we do today, because it'd avoid the "latestChunkStart <=
    RecPtr" condition which is often unreachable today.
    I guess we could also use recoveryLastXTime as an additional lower bound for
    the apply delay? That way it'd "help" avoid a too stale XLogReceiptTime if the
    clocks are in sync, but not hurt if the clocks are out of sync?
    A better solution would probably be a proper "LSN range -> received time"
    mapping. In walsender we have LagTracker, which keeps a ringbuffer of LSN to
    time mappings and computes lags for write/flush/replay. Perhaps we could
    generalize that?
    There's an easy way to reproduce this issue: Configure the standby to use
    recoveryapplydelay=1s and have workload on the primary that commits more
    than once a second.
    If you then configure walreceiver to log debug2 messages (or change the log
    level in ProcessWalSndrMessage(), which makes the volume easier to deal with),
    you can see that the "replication apply delay" in those messages creep up very
    rapidly (as we never hit the "latest chunk" logic anymore).
    Greetings,
    Andres Freund