Replication to standby broke with WAL file corruption

    Ishan joshi<ishanjoshi@live.com>
    Mar 13, 2026, 10:41 AM UTC
    Hi Team,
    I found an issue with a PG v16.9 Patroni setup where replication to both our standby node and our disaster recovery (DR) site broke with the error below. It looks like WAL corruption, which later also ended up in the archived WAL file.
    CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off:35, infobits: [LOCKONLY, EXCLLOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
    PANIC: WAL contains references to invalid pages
    CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off:35, infobits: [LOCKONLY, EXCLLOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
    WARNING: page 25329 of relation base/33195/410203483 does not exist
    INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a leader (pg-patroni-node2-0)
    [61]LOG: terminating any other active server processes
    [61]LOG: startup process (PID 72) was terminated by signal 6: Aborted
    [61]LOG: shutting down due to startup process failure
    [61]LOG: database system is shut down
    INFO: establishing a new patroni heartbeat connection to postgres
    INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0
    WARNING: Retry got exception: connection problems
    WARNING: Failed to determine PostgreSQL state from the connection, fallingback to cached role
    INFO: Error communicating with PostgreSQL. Will try again later
    WARNING: Postgresql is not running.
    The primary database was not impacted; however, replication to the standby node and the DR site broke. I tried to reinitialize from the latest pgBackRest backup plus archived WAL, but recovery fails with the same error as soon as the corrupt WAL/archive file is replayed. I had to reinitialize with pg_basebackup, which took about 45 hours for the 40 TB database.
    As I understand it, the transaction created a table, performed DML, and then either dropped the table or was rolled back. That seems to have caused a race condition while the WAL was written, and replaying that WAL then failed on the standby/DR site.
    This looks like a bug. Any suggestions for this scenario?
    Thanks & Regards,
    Ishan Joshi
      Tomas Vondra<tomas@vondra.me>
      Mar 15, 2026, 11:09 PM UTC
      On 3/13/26 11:41, Ishan joshi wrote:
      > As I understand it, the transaction created a table, performed DML, and
      > then either dropped the table or was rolled back. That seems to have
      > caused a race condition while the WAL was written, and replaying that
      > WAL then failed on the standby/DR site.
      It's hard to say what caused this, but it might be interesting to look
      at the WAL using pg_waldump. First at the WAL segment containing the
      record triggering the failure, and then also at WAL segments before that
      containing references to relation 1663/33195/410203483 (and especially
      page 25329).
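
As a rough sketch of how to find the segment to inspect: the WAL file name can be derived from the LSN in the error message, and pg_waldump can then be limited to the relation and block in question. This assumes the default 16 MB `wal_segment_size`, a PostgreSQL 15 or later pg_waldump (for `--relation` and `--block`), and a hypothetical timeline of 1 and WAL directory `/path/to/wal`; none of these are stated in the thread. The helper names are made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size (16 MB)


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '184F3/F248B6F0' into a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def segment_number(lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Return the number of the WAL segment that contains the given LSN."""
    return parse_lsn(lsn) // seg_size


def segment_file(segno: int, timeline: int, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-character WAL file name: timeline, then the two segment parts."""
    segs_per_id = 0x100000000 // seg_size
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


def waldump_command(wal_dir, start_seg, end_seg, relation, block=None):
    """Assemble a pg_waldump call filtered to one relation (and optionally one block)."""
    cmd = ["pg_waldump", f"--path={wal_dir}", f"--relation={relation}"]
    if block is not None:
        cmd.append(f"--block={block}")
    cmd += [start_seg, end_seg]
    return cmd


failing_lsn = "184F3/F248B6F0"   # from the CONTEXT line in the report
timeline = 1                     # hypothetical; use the cluster's real timeline
segno = segment_number(failing_lsn)
end_seg = segment_file(segno, timeline)
start_seg = segment_file(segno - 4, timeline)  # look back 4 segments (arbitrary choice)

print(" ".join(waldump_command("/path/to/wal", start_seg, end_seg,
                               "1663/33195/410203483", block=25329)))
```

How far back to look is a judgment call; widening the segment range and dropping `--block` would show every record that touched the relation, including its creation if that is in the retained WAL.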
      It is interesting this succeeded on a primary, but failed on standby.
      Is there anything special about the relation 1663/33195/410203483? Do
      you know if it's a regular / temporary table, etc?
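
One possible way to answer that question, sketched under assumptions: the `rel` triple in the log line is tablespace OID / database OID / relfilenode (1663 is the OID of `pg_default`), so it can be parsed and turned into catalog lookups. The helper names below are hypothetical, and the queries only return a row if the relation still exists in the database they are run in; for a table that was rolled back or dropped they will come back empty.

```python
import re
from typing import NamedTuple, Optional


class BlockRef(NamedTuple):
    tablespace_oid: int
    database_oid: int
    relfilenode: int
    block: int


# Accepts both "rel 1663/..." and "rel1663/..." as they appear in the report.
_BLKREF = re.compile(r"rel\s*(\d+)/(\d+)/(\d+),\s*blk\s+(\d+)")


def parse_blkref(line: str) -> Optional[BlockRef]:
    """Extract the relation triple and block number from a WAL redo CONTEXT line."""
    match = _BLKREF.search(line)
    return BlockRef(*map(int, match.groups())) if match else None


def lookup_queries(ref: BlockRef) -> list:
    """SQL to identify the database and, if it still exists, the relation."""
    return [
        # Run in any database: which database does the OID belong to?
        f"SELECT datname FROM pg_database WHERE oid = {ref.database_oid};",
        # Run inside that database: relpersistence is p (permanent),
        # u (unlogged) or t (temporary).
        "SELECT oid::regclass AS relation, relkind, relpersistence "
        f"FROM pg_class WHERE relfilenode = {ref.relfilenode};",
    ]


line = ("CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, "
        "off:35, infobits: [LOCKONLY, EXCLLOCK], flags: 0x00; "
        "blkref #0: rel 1663/33195/410203483, blk 25329")
ref = parse_blkref(line)
for query in lookup_queries(ref):
    print(query)
```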
      regards
      --
      Tomas Vondra
        Ishan joshi<ishanjoshi@live.com>
        Mar 16, 2026, 6:05 AM UTC
        Thanks Tomas for reply.
        1663/33195/410203483 is a table that a user created inside a transaction. However, the transaction failed and was rolled back, which removed the table on the primary, so the primary is not impacted. The WAL seems to be corrupt at the point where this transaction (create table -> DML -> rollback) was logged: the DML is logged first, and it is then applied on the standby and DR site, where the table has not been created. It looks like a race condition while writing the WAL file.
        This is a common scenario: if a transaction fails, it should be rolled back, and its operations should be logged in the WAL in order. In this case, the DML operation comes before the table creation in the WAL, which broke replication.
        Thanks & Regards,
        Ishan Joshi