BUG: Cascading standby fails to reconnect after falling back to archive recovery

    Marco Nenciarini<marco.nenciarini@enterprisedb.com>
    Jan 28, 2026, 5:03 PM UTC
    Hi hackers,
    I've encountered a bug in PostgreSQL's streaming replication where cascading
    standbys fail to reconnect after falling back to archive recovery. The issue
    occurs when the upstream standby uses archive-only recovery.
    The standby requests streaming from the wrong WAL position (the next
    segment boundary instead of the current position), causing connection
    failures with this error:
    ERROR: requested starting point 0/A000000 is ahead of the WAL flush
    position of this server 0/9000000
    Attached are two shell scripts that reliably reproduce the issue on
    PostgreSQL 17.x and 18.x:
    1. reproducerrestartupstream_portable.sh - triggers by restarting the upstream
    2. reproducercascaderestart_portable.sh - triggers by restarting the cascade
    The scripts set up this topology:
    - Primary with archiving enabled
    - Standby using only archive recovery (no streaming from primary)
    - Cascading standby streaming from the archive-only standby
    When the cascade loses its streaming connection and falls back to archive
    recovery, it cannot reconnect. The issue appears to be in xlogrecovery.c
    around line 3880, where the position passed to RequestXLogStreaming()
    determines which segment boundary is requested.
    The cascade restart reproducer shows that even restarting the cascade itself
    triggers the bug, which affects routine maintenance operations.
    Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.
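    The mismatch in the error message can be illustrated with a minimal
    standalone sketch (not PostgreSQL source; segment_start and the variable
    names are hypothetical). With the default 16MB WAL segments, a restart
    position that has already crossed into the next segment is ahead of an
    upstream whose flush position is still in the previous one:

    ```c
    /* Standalone illustration (not PostgreSQL code) of how a restart
     * position aligned to the next 16MB WAL segment boundary ends up
     * ahead of the upstream's flush position. The constants mirror the
     * LSNs in the error message above. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024)  /* default 16MB segments */

    /* Round an LSN down to the start of its containing segment. */
    static uint64_t segment_start(uint64_t lsn)
    {
        return lsn - (lsn % WAL_SEGMENT_SIZE);
    }

    int main(void)
    {
        uint64_t upstream_flush = 0x9000000;  /* 0/9000000 */
        uint64_t cascade_read   = 0xA000000;  /* 0/A000000: next segment */

        /* The cascade restored the next segment from its own archive,
         * so its read position is segment-aligned at 0/A000000... */
        assert(segment_start(cascade_read) == 0xA000000);

        /* ...which is ahead of what the upstream has flushed, so the
         * upstream walsender rejects the streaming request. */
        assert(segment_start(cascade_read) > upstream_flush);
        printf("requested 0/%lX is ahead of flush 0/%lX\n",
               (unsigned long) segment_start(cascade_read),
               (unsigned long) upstream_flush);
        return 0;
    }
    ```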
    Best regards,
    Marco
      Xuneng Zhou<xunengzhou@gmail.com>
      Jan 29, 2026, 12:22 PM UTC
      Hi Marco,
      On Thu, Jan 29, 2026 at 1:03 AM Marco Nenciarini
      <marco.nenciarini@enterprisedb.com> wrote:

      Hi hackers,

      I've encountered a bug in PostgreSQL's streaming replication where cascading
      standbys fail to reconnect after falling back to archive recovery. The issue
      occurs when the upstream standby uses archive-only recovery.

      The standby requests streaming from the wrong WAL position (next segment boundary
      instead of the current position), causing connection failures with this error:

      ERROR: requested starting point 0/A000000 is ahead of the WAL flush
      position of this server 0/9000000

      Attached are two shell scripts that reliably reproduce the issue on PostgreSQL
      17.x and 18.x:

      1. reproducerrestartupstream_portable.sh - triggers by restarting upstream
      2. reproducercascaderestart_portable.sh - triggers by restarting the cascade

      The scripts set up this topology:
      - Primary with archiving enabled
      - Standby using only archive recovery (no streaming from primary)
      - Cascading standby streaming from the archive-only standby

      When the cascade loses its streaming connection and falls back to archive recovery,
      it cannot reconnect. The issue appears to be in xlogrecovery.c around line 3880,
      where the position passed to RequestXLogStreaming() determines which segment
      boundary is requested.

      The cascade restart reproducer shows that even restarting the cascade itself
      triggers the bug, which affects routine maintenance operations.

      Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.

      Best regards,
      Marco
      Thanks for your report. I can reliably reproduce the issue on HEAD
      using your scripts. I’ve analyzed the problem and am proposing a patch
      to fix it.
      --- Analysis
      When a cascading standby streams from an archive-only upstream:
      1. The upstream's GetStandbyFlushRecPtr() returns only replay position
      (no received-but-not-replayed buffer since there's no walreceiver)
      2. When streaming ends and the cascade falls back to archive recovery,
      it can restore WAL segments from its own archive access
      3. The cascade's read position (RecPtr) advances beyond what the
      upstream has replayed
      4. On reconnect, the cascade requests streaming from RecPtr, which the
      upstream rejects as "ahead of flush position"
      --- Proposed Fix
      Track the last confirmed flush position from streaming
      (lastStreamedFlush) and clamp the streaming start request when it
      exceeds that position:
      - Same timeline: clamp to lastStreamedFlush if RecPtr > lastStreamedFlush
      - Timeline switch: fall back to timeline switchpoint as safe boundary
      This ensures the cascade requests from a position the upstream
      definitely has, rather than assuming the upstream can serve whatever
      the cascade restored locally from archive.
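      The clamping rule above might be sketched roughly as follows (a
      standalone illustration of the idea, not the actual patch; the names
      choose_stream_start, StreamState, and switchpoint are invented here,
      and XLogRecPtr is modeled as a plain 64-bit integer):

      ```c
      /* Hypothetical sketch of the proposed clamping logic. */
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      typedef uint64_t XLogRecPtr;

      typedef struct
      {
          XLogRecPtr lastStreamedFlush;  /* last flush position confirmed
                                          * while streaming from upstream */
          XLogRecPtr switchpoint;        /* timeline switch point, if any */
      } StreamState;

      /* Pick the position to hand to RequestXLogStreaming(). */
      static XLogRecPtr
      choose_stream_start(XLogRecPtr RecPtr, const StreamState *st,
                          bool timeline_switch)
      {
          if (timeline_switch)
              return st->switchpoint;       /* safe boundary across timelines */
          if (st->lastStreamedFlush != 0 && RecPtr > st->lastStreamedFlush)
              return st->lastStreamedFlush; /* don't ask for WAL the upstream
                                             * may not have flushed yet */
          return RecPtr;
      }
      ```

      The point of the sketch is only the decision rule: the request is
      capped at the last position the upstream actually confirmed, instead
      of whatever the cascade managed to restore locally from archive.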
      I’m not a fan of using sleep in TAP tests, but I haven’t found a
      better way to reproduce this behavior yet.
      --
      Best,
      Xuneng
        Fujii Masao<masao.fujii@gmail.com>
        Jan 30, 2026, 3:13 AM UTC
        On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
        Thanks for your report. I can reliably reproduce the issue on HEAD
        using your scripts. I’ve analyzed the problem and am proposing a patch
        to fix it.

        --- Analysis
        When a cascading standby streams from an archive-only upstream:

        1. The upstream's GetStandbyFlushRecPtr() returns only replay position
        (no received-but-not-replayed buffer since there's no walreceiver)
        2. When streaming ends and the cascade falls back to archive recovery,
        it can restore WAL segments from its own archive access
        3. The cascade's read position (RecPtr) advances beyond what the
        upstream has replayed
        4. On reconnect, the cascade requests streaming from RecPtr, which the
        upstream rejects as "ahead of flush position"

        --- Proposed Fix

        Track the last confirmed flush position from streaming
        (lastStreamedFlush) and clamp the streaming start request when it
        exceeds that position:
        I haven't read the patch yet, but doesn't lastStreamedFlush represent
        the same LSN as tliRecPtr or replayLSN (the arguments to
        WaitForWALToBecomeAvailable())? If so, we may not need to introduce
        a new variable to track this LSN.
        The choice of which LSN is used as the replication start point has varied
        over time to handle corner cases (for example, commit 06687198018).
        That makes me wonder whether we should first better understand
        why WaitForWALToBecomeAvailable() currently uses RecPtr as
        the starting point.
        BTW, even with the v1 patch applied, I was able to reproduce the issue
        using the following steps:
        --------------------------------------------
        initdb -D data
        mkdir arch
        cat <<EOF >> data/postgresql.conf
        archive_mode = on
        archive_command = 'cp %p ../arch/%f'
        restore_command = 'cp ../arch/%f %p'
        EOF
        pg_ctl -D data start
        pg_basebackup -D sby1 -c fast
        cp -a sby1 sby2
        cat <<EOF >> sby1/postgresql.conf
        port = 5433
        EOF
        touch sby1/standby.signal
        pg_ctl -D sby1 start
        cat <<EOF >> sby2/postgresql.conf
        port = 5434
        primary_conninfo = 'port=5433'
        EOF
        touch sby2/standby.signal
        pg_ctl -D sby2 start
        pgbench -i -s2
        pg_ctl -D sby2 restart
        --------------------------------------------
        In this case, after restarting the standby connecting to another
        (cascading) standby, I observed the following error.
        FATAL: could not receive data from WAL stream: ERROR: requested
        starting point 0/04000000 is ahead of the WAL flush position of this
        server 0/03FFE8D0
        Regards,
        --
        Fujii Masao
          Xuneng Zhou<xunengzhou@gmail.com>
          Jan 30, 2026, 6:01 AM UTC
          Hi Fujii-san,
          Thanks for looking into this.
          On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
          On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
          Thanks for your report. I can reliably reproduce the issue on HEAD
          using your scripts. I’ve analyzed the problem and am proposing a patch
          to fix it.

          --- Analysis
          When a cascading standby streams from an archive-only upstream:

          1. The upstream's GetStandbyFlushRecPtr() returns only replay position
          (no received-but-not-replayed buffer since there's no walreceiver)
          2. When streaming ends and the cascade falls back to archive recovery,
          it can restore WAL segments from its own archive access
          3. The cascade's read position (RecPtr) advances beyond what the
          upstream has replayed
          4. On reconnect, the cascade requests streaming from RecPtr, which the
          upstream rejects as "ahead of flush position"

          --- Proposed Fix

          Track the last confirmed flush position from streaming
          (lastStreamedFlush) and clamp the streaming start request when it
          exceeds that position:

          I haven't read the patch yet, but doesn't lastStreamedFlush represent
          the same LSN as tliRecPtr or replayLSN (the arguments to
          WaitForWALToBecomeAvailable())? If so, we may not need to introduce
          a new variable to track this LSN.
          I think they refer to different types of LSNs. I don’t have access to my
          computer at the moment, but I’ll look into it and get back to you shortly.
          The choice of which LSN is used as the replication start point has varied
          over time to handle corner cases (for example, commit 06687198018).
          That makes me wonder whether we should first better understand
          why WaitForWALToBecomeAvailable() currently uses RecPtr as
          the starting point.

          BTW, with v1 patch, I was able to reproduce the issue using the following
          steps:

          --------------------------------------------
          initdb -D data
          mkdir arch
          cat <<EOF >> data/postgresql.conf
          archive_mode = on
          archive_command = 'cp %p ../arch/%f'
          restore_command = 'cp ../arch/%f %p'
          EOF
          pg_ctl -D data start
          pg_basebackup -D sby1 -c fast
          cp -a sby1 sby2
          cat <<EOF >> sby1/postgresql.conf
          port = 5433
          EOF
          touch sby1/standby.signal
          pg_ctl -D sby1 start
          cat <<EOF >> sby2/postgresql.conf
          port = 5434
          primary_conninfo = 'port=5433'
          EOF
          touch sby2/standby.signal
          pg_ctl -D sby2 start
          pgbench -i -s2
          pg_ctl -D sby2 restart
          --------------------------------------------

          In this case, after restarting the standby connecting to another
          (cascading) standby, I observed the following error.

          FATAL: could not receive data from WAL stream: ERROR: requested
          starting point 0/04000000 is ahead of the WAL flush position of this
          server 0/03FFE8D0

          Regards,

          --
          Fujii Masao
          Best,
          Xuneng
      Fujii Masao<masao.fujii@gmail.com>
      Jan 29, 2026, 11:33 AM UTC
      On Thu, Jan 29, 2026 at 2:03 AM Marco Nenciarini
      <marco.nenciarini@enterprisedb.com> wrote:

      Hi hackers,

      I've encountered a bug in PostgreSQL's streaming replication where cascading
      standbys fail to reconnect after falling back to archive recovery. The issue
      occurs when the upstream standby uses archive-only recovery.

      The standby requests streaming from the wrong WAL position (next segment boundary
      instead of the current position), causing connection failures with this error:

      ERROR: requested starting point 0/A000000 is ahead of the WAL flush
      position of this server 0/9000000
      Thanks for the report!
      I was also able to reproduce this issue on the master branch.
      Interestingly, I couldn't reproduce it on v11 using the same test case.
      This makes me wonder whether the issue was introduced in v12 or later.
      Do you see the same behavior in your environment?
      Regards,
      --
      Fujii Masao