Fix the race condition for updating slot minimum LSN

    Zhijie Hou (Fujitsu)<houzj.fnst@fujitsu.com>
    Jan 27, 2026, 6:32 AM UTC
    Hi,
    During a discussion about a similar issue involving the invalidation of newly
    created slots[1], we found another race condition that might lead to the same
    problem.
    The race is as follows: while one backend creates a new slot and is
    initializing slot.restart_lsn during WAL reservation, another backend may
    concurrently invoke ReplicationSlotsComputeRequiredLSN(). The slot minimum LSN
    may first be updated to account for the newly created slot, only to be
    overwritten afterwards by the backend running
    ReplicationSlotsComputeRequiredLSN() with a more recent LSN. A checkpoint could
    then prematurely remove the WAL reserved by the new slot, resulting in the
    newly created slot being invalidated.
    The steps to reproduce are as follows:
    1. Create a slot 'advtest' for later advancement.
        select pg_create_logical_replication_slot('advtest', 'test_decoding');
    2. Start a backend to create a slot (s) but block it before updating the
    restart_lsn in ReplicationSlotReserveWal().
        select pg_create_logical_replication_slot('s', 'test_decoding');
    3. Start another backend to generate some new WAL files and advance the
    slot (advtest) to the latest position, but block it before it updates the LSN
    in XLogSetReplicationSlotMinimumLSN().
        select pg_switch_wal();
        select pg_switch_wal();
        SELECT pg_log_standby_snapshot();
        SELECT pg_log_standby_snapshot();
        select pg_replication_slot_advance('advtest', pg_current_wal_lsn());
        select pg_replication_slot_advance('advtest', pg_current_wal_lsn());
    4. Release the backend that is creating slot (s).
    5. Execute a checkpoint, but block it before it calls XLogGetReplicationSlotMinimumLSN().
    6. Release the advancing backend; the slot minimum LSN is then set to the new position.
    7. Release the checkpoint; the WAL required by slot (s) is removed.
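
    The interleaving in the steps above can be replayed in a small toy model. This
    is only an illustrative sketch in Python, not PostgreSQL code: the Model class,
    replay_race() and the LSN values 100, 200 and 500 are made up for the example,
    and plain attributes stand in for shared memory.

```python
# Toy model of the reported interleaving. Not PostgreSQL code: all names and
# LSN values here are hypothetical and chosen only for illustration.

class Model:
    def __init__(self):
        self.slots = {}      # slot name -> restart_lsn (None until WAL is reserved)
        self.min_lsn = None  # stands in for the shared slot minimum LSN

    def compute_min(self):
        """Minimum restart_lsn over the slots that currently have one."""
        lsns = [lsn for lsn in self.slots.values() if lsn is not None]
        return min(lsns) if lsns else None


def replay_race():
    m = Model()
    m.slots["advtest"] = 100   # step 1: existing slot
    m.slots["s"] = None        # step 2: slot s created, restart_lsn not yet set

    # Step 3: the advancing backend moves advtest forward and computes the
    # minimum, but is blocked before publishing it.
    m.slots["advtest"] = 500
    pending_min = m.compute_min()   # slot s has no restart_lsn yet, so it is ignored

    # Step 4: the creating backend sets restart_lsn for slot s and publishes
    # a minimum that accounts for it.
    m.slots["s"] = 200
    m.min_lsn = m.compute_min()

    # Step 6: the advancing backend resumes and overwrites the minimum with
    # the value it computed before slot s was visible.
    m.min_lsn = pending_min

    # Step 7: the checkpoint reads the published minimum.
    return m.slots["s"], m.min_lsn


if __name__ == "__main__":
    slot_lsn, min_lsn = replay_race()
    print("slot s needs WAL from", slot_lsn, "but published minimum is", min_lsn)
```

    In this model the published minimum ends up above the restart_lsn of slot (s),
    which corresponds to the checkpoint treating WAL still needed by slot (s) as
    removable.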
    This issue is similar to the concurrent slot_xmin update issue fixed in commit
    2a5225b, so I think it's better to apply a similar fix, e.g., we can acquire
    ReplicationSlotControlLock in exclusive mode when updating slot.restart_lsn
    during WAL reservation. Additionally, the call to
    XLogSetReplicationSlotMinimumLSN() is placed under the protection of
    ReplicationSlotControlLock. This serializes the update of slot.restart_lsn
    with the computation of the minimum LSN in other backends, ensuring that a
    more recent minimum LSN cannot be computed and published while an older
    restart_lsn is still being reserved.
    The above fix is implemented in the attached patch 0001.
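
    As a rough illustration of the serialization idea (not the attached patch),
    the toy model below uses a single Python mutex in place of
    ReplicationSlotControlLock. It does not model shared versus exclusive lock
    modes; SlotModel, run(), ORDERS and the LSN values are hypothetical. Setting
    restart_lsn is one critical section, and computing plus publishing the minimum
    is another, so the two can only be ordered as whole operations.

```python
# Toy model of the proposed serialization. Not PostgreSQL code: one mutex
# stands in for ReplicationSlotControlLock, and lock modes are not modelled.
import threading


class SlotModel:
    def __init__(self):
        self.lock = threading.Lock()
        self.slots = {"advtest": 500, "s": None}  # made-up LSN values
        self.min_lsn = None

    def reserve(self, name, lsn):
        # Setting restart_lsn happens while holding the lock.
        with self.lock:
            self.slots[name] = lsn

    def compute_and_publish(self):
        # Computing the minimum and publishing it form one critical section,
        # so no reservation can slip in between the two.
        with self.lock:
            lsns = [v for v in self.slots.values() if v is not None]
            self.min_lsn = min(lsns) if lsns else None


# Every ordering of the whole operations in which the creating backend
# reserves before it recomputes the minimum.
ORDERS = [
    ("advancer_compute", "reserve_s", "creator_compute"),
    ("reserve_s", "advancer_compute", "creator_compute"),
    ("reserve_s", "creator_compute", "advancer_compute"),
]


def run(order):
    m = SlotModel()
    ops = {
        "reserve_s": lambda: m.reserve("s", 200),
        "creator_compute": m.compute_and_publish,
        "advancer_compute": m.compute_and_publish,
    }
    for name in order:
        ops[name]()
    return m.slots["s"], m.min_lsn


if __name__ == "__main__":
    for order in ORDERS:
        print(order, run(order))
```

    In each ordering the published minimum does not exceed the restart_lsn of
    slot (s). The model says nothing about WAL that was already removed before
    the reservation, which is a separate concern.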
    [1] https://www.postgresql.org/message-id/flat/TY4PR01MB16907DCA80DBC3E77CE6B203294C9A%40TY4PR01MB16907.jpnprd01.prod.outlook.com
    Best Regards,
    Hou zj