[PATCH] Fix stale relation close in sequence synchronization

  • Jump to comment-1
    Ayush Tiwari<ayushtiwari.slg01@gmail.com>
    Apr 28, 2026, 11:24 AM UTC
    Hi,
    I found a crash in the logical replication sequence synchronization worker
    when the publisher returns NULL sequence data for a sequence after at least
    one sequence in the same sync batch has already been processed.
    One way to reproduce this is to use a subscription that connects to the
    publisher as a replication role that can read one published sequence but
    cannot read another one. pggetsequence_data() returns NULLs for the
    inaccessible sequence. In getandvalidateseqinfo(), that path returns
    COPYSEQ_SKIPPED before assigning a new value to the Relation output
    argument.
    copy_sequences() then still sees the Relation pointer left from the previous
    row and calls table_close() on it again. On a cassert build, this trips:
    TRAP: failed Assert("rel->rd_refcnt > 0");  "relcache.c" file
    The attached patch clears the output Relation pointer at the start of
    getandvalidateseqinfo() and clears the local pointer in copy_sequences()
    after closing it. That keeps early returns from reusing a relation from a
    previous row.
    The patch also adds a TAP test to 036_sequences.pl.
    Thoughts?
    Regards,
    Ayush
    • Jump to comment-1
      Hayato Kuroda (Fujitsu)<kuroda.hayato@fujitsu.com>
      Apr 28, 2026, 12:14 PM UTC
      Dear Ayush,
      I found a crash in the logical replication sequence synchronization worker
      when the publisher returns NULL sequence data for a sequence after at least
      one sequence in the same sync batch has already been processed.
      Good catch. I confirmed the HEAD can crash with your added test.
      The attached patch clears the output Relation pointer at the start of
      getandvalidateseqinfo() and clears the local pointer in copy_sequences()
      after closing it. That keeps early returns from reusing a relation from a
      previous row.
      To confirm; can't we declare the sequence_rel in the inner-loop? My first
      impression was the bug caused by the wrong lifetime. Are there any other
      thoughts around here?
      Added Vignesh in CC because he was a primary author.
      Best regards,
      Hayato Kuroda
      FUJITSU LIMITED
      • Jump to comment-1
        Ayush Tiwari<ayushtiwari.slg01@gmail.com>
        Apr 28, 2026, 12:34 PM UTC
        Hi,
        On Tue, 28 Apr 2026 at 17:44, Hayato Kuroda (Fujitsu) <
        kuroda.hayato@fujitsu.com> wrote:
        Dear Ayush,
        I found a crash in the logical replication sequence synchronization
        worker
        when the publisher returns NULL sequence data for a sequence after at
        least
        one sequence in the same sync batch has already been processed.

        Good catch. I confirmed the HEAD can crash with your added test.
        Thanks for checking and confirming the crash.
        The attached patch clears the output Relation pointer at the start of
        getandvalidateseqinfo() and clears the local pointer in
        copy_sequences()
        after closing it. That keeps early returns from reusing a relation from a
        previous row.

        To confirm; can't we declare the sequence_rel in the inner-loop? My first
        impression was the bug caused by the wrong lifetime. Are there any other
        thoughts around here?
        I agree that declaring sequence_rel in the tuple-processing loop is
        cleaner. The relation belongs only to the current publisher result row,
        so limiting the variable's lifetime makes the intended ownership clearer
        and prevents any value from carrying over to the next row.
        I have kept the initialization of the output argument in
        getandvalidateseqinfo(), so every return path leaves it in a defined
        state. In v2, the caller-side pointer is declared inside the inner loop,
        and the explicit reset after table_close() is no longer needed.
        Attached is v2 with that change.
        Regards,
        Ayush
        • Jump to comment-1
          vignesh C<vignesh21@gmail.com>
          Apr 28, 2026, 1:18 PM UTC
          On Tue, 28 Apr 2026 at 18:04, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

          Hi,

          On Tue, 28 Apr 2026 at 17:44, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:

          Dear Ayush,
          I found a crash in the logical replication sequence synchronization worker
          when the publisher returns NULL sequence data for a sequence after at least
          one sequence in the same sync batch has already been processed.

          Good catch. I confirmed the HEAD can crash with your added test.


          Thanks for checking and confirming the crash.

          The attached patch clears the output Relation pointer at the start of
          getandvalidateseqinfo() and clears the local pointer in copy_sequences()
          after closing it. That keeps early returns from reusing a relation from a
          previous row.

          To confirm; can't we declare the sequence_rel in the inner-loop? My first
          impression was the bug caused by the wrong lifetime. Are there any other
          thoughts around here?


          I agree that declaring sequence_rel in the tuple-processing loop is
          cleaner. The relation belongs only to the current publisher result row,
          so limiting the variable's lifetime makes the intended ownership clearer
          and prevents any value from carrying over to the next row.

          I have kept the initialization of the output argument in
          getandvalidateseqinfo(), so every return path leaves it in a defined
          state. In v2, the caller-side pointer is declared inside the inner loop,
          and the explicit reset after table_close() is no longer needed.

          Attached is v2 with that change.
          Thanks for finding and reporting the issue. I was able to reproduce
          the issue with the steps you provided.
          Few comments:
          1) Here "SELECT nextval('regressnoselect');" and "REVOKE ALL ON
          SEQUENCE regressnoselect FROM PUBLIC;" is not required for this test
          case:
          +$nodepublisher->safepsql(
          + 'postgres', qq(
          + CREATE ROLE regressseqrepl LOGIN REPLICATION;
          + CREATE SEQUENCE regressnoselect;
          + SELECT nextval('regressnoselect');
          + GRANT USAGE ON SCHEMA public TO regressseqrepl;
          + GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO regressseqrepl;
          + REVOKE ALL ON SEQUENCE regressnoselect FROM PUBLIC;
          + REVOKE ALL ON SEQUENCE regressnoselect FROM regressseqrepl;
          +));
          2) Since the comment about “dropped concurrently” has been removed,
          could you merge that context into the new wording:
          /*
          - * last_value can be NULL if the sequence was dropped concurrently (see
          - * pggetsequence_data()).
          + * The sequence data can be NULL if it is not accessible on the publisher
          + * (see pggetsequence_data()).
          */
          Something like:
          /*
          * The sequence data can be NULL due to insufficient privileges or if
          * the sequence was dropped concurrently (see pggetsequence_data()).
          */
          3) Can we change this:
          ##########
          # A NULL sequence data row from the publisher must not make the subscriber
          # close the previously synchronized sequence relation again.
          ##########
          To something like:
          ##########
          # Ensure that insufficient privileges on the publisher for a sequence
          # do not disrupt the subscriber. The subscriber should log a warning
          # and continue retrying.
          ##########
          Regards,
          Vignesh
          • Jump to comment-1
            Ayush Tiwari<ayushtiwari.slg01@gmail.com>
            Apr 28, 2026, 1:35 PM UTC
            Hi,
            Thanks for reviewing and confirming the issue.
            On Tue, 28 Apr 2026 at 18:48, vignesh C <vignesh21@gmail.com> wrote:

            Thanks for finding and reporting the issue. I was able to reproduce
            the issue with the steps you provided.
            Few comments:
            1) Here "SELECT nextval('regressnoselect');" and "REVOKE ALL ON
            SEQUENCE regressnoselect FROM PUBLIC;" is not required for this test
            case:
            +$nodepublisher->safepsql(
            + 'postgres', qq(
            + CREATE ROLE regressseqrepl LOGIN REPLICATION;
            + CREATE SEQUENCE regressnoselect;
            + SELECT nextval('regressnoselect');
            + GRANT USAGE ON SCHEMA public TO regressseqrepl;
            + GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO regressseqrepl;
            + REVOKE ALL ON SEQUENCE regressnoselect FROM PUBLIC;
            + REVOKE ALL ON SEQUENCE regressnoselect FROM regressseqrepl;
            +));
            Agreed. Removed both statements from the test.
            2) Since the comment about “dropped concurrently” has been removed,
            could you merge that context into the new wording:
            Done. The comment now mentions both insufficient privileges and
            concurrent drops.

            3) Can we change this:
            ##########
            # A NULL sequence data row from the publisher must not make the subscriber
            # close the previously synchronized sequence relation again.
            ##########

            To something like:
            ##########
            # Ensure that insufficient privileges on the publisher for a sequence
            # do not disrupt the subscriber. The subscriber should log a warning
            # and continue retrying.
            ##########
            Done. I used that wording for the TAP test block comment.
            Attached is v3 with these changes.
            Regards,
            Ayush
            • Jump to comment-1
              vignesh C<vignesh21@gmail.com>
              Apr 30, 2026, 4:30 AM UTC
              On Tue, 28 Apr 2026 at 19:05, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

              Hi,

              Thanks for reviewing and confirming the issue.

              Attached is v3 with these changes.
              Few comments:
              1) Since we are setting the variable to NULL for every sequence now,
              this is not required:
              @@ -246,6 +246,8 @@ getandvalidateseqinfo(TupleTableSlot *slot,
              Relation *sequence_rel,
                  Form_pg_sequence local_seq;
                  LogicalRepSequenceInfo *seqinfo_local;
              + *sequence_rel = NULL;
              +
                  *seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
                  Assert(!isnull);
              2) Creating a subscription is costly as it has more work to do as it
              has to sync all relations and requires more processes to be started,
              instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
              by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
              +$nodesubscriber->safepsql(
              + 'postgres',
              + "CREATE SUBSCRIPTION regressseqsubnoselect CONNECTION
              '$publisherlimitedconnstr' PUBLICATION regressseqpub WITH
              (disableonerror = true)"
              +);
              3) You can use sequence name as regress_s5 to be consistent with the
              other sequence names nearby or alternatively you can change the privs
              of an already existing sequence:
              + CREATE ROLE regressseqrepl LOGIN REPLICATION;
              + CREATE SEQUENCE regressnoselect;
              + GRANT USAGE ON SCHEMA public TO regressseqrepl;
              4) I feel this is not required:
              +$result = $nodesubscriber->safepsql('postgres', 'SELECT 1');
              +is($result, '1',
              + 'subscriber remains running after publisher returns NULL
              sequence data');
              Regards,
              Vignesh
              • Jump to comment-1
                Ayush Tiwari<ayushtiwari.slg01@gmail.com>
                Apr 30, 2026, 4:53 AM UTC
                Hi,
                Thanks for the review.
                On Thu, 30 Apr 2026 at 10:00, vignesh C <vignesh21@gmail.com> wrote:


                Few comments:
                1) Since we are setting the variable to NULL for every sequence now,
                this is not required:
                @@ -246,6 +246,8 @@ getandvalidateseqinfo(TupleTableSlot *slot,
                Relation *sequence_rel,
                Formpgsequence local_seq;
                LogicalRepSequenceInfo *seqinfo_local;

                + *sequence_rel = NULL;
                +
                *seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
                Assert(!isnull);
                I had added it as defence-in-depth, but yeah can be removed.

                2) Creating a subscription is costly as it has more work to do as it
                has to sync all relations and requires more processes to be started,
                instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
                by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
                +$nodesubscriber->safepsql(
                + 'postgres',
                + "CREATE SUBSCRIPTION regressseqsubnoselect CONNECTION
                '$publisherlimitedconnstr' PUBLICATION regressseqpub WITH
                (disableonerror = true)"
                +);
                Makes sense.

                3) You can use sequence name as regress_s5 to be consistent with the
                other sequence names nearby or alternatively you can change the privs
                of an already existing sequence:
                + CREATE ROLE regressseqrepl LOGIN REPLICATION;
                + CREATE SEQUENCE regressnoselect;
                + GRANT USAGE ON SCHEMA public TO regressseqrepl;
                Sounds good.

                4) I feel this is not required:
                +$result = $nodesubscriber->safepsql('postgres', 'SELECT 1');
                +is($result, '1',
                + 'subscriber remains running after publisher returns NULL
                sequence data');
                Yeah, you are right.
                I've made these changes, attaching patch v4.
                Please review and let me know.
                Regards,
                Ayush
                • Jump to comment-1
                  vignesh C<vignesh21@gmail.com>
                  Apr 30, 2026, 5:57 AM UTC
                  On Thu, 30 Apr 2026 at 10:23, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

                  Hi,

                  Thanks for the review.
                  On Thu, 30 Apr 2026 at 10:00, vignesh C <vignesh21@gmail.com> wrote:



                  Few comments:
                  1) Since we are setting the variable to NULL for every sequence now,
                  this is not required:
                  @@ -246,6 +246,8 @@ getandvalidateseqinfo(TupleTableSlot *slot,
                  Relation *sequence_rel,
                  Formpgsequence local_seq;
                  LogicalRepSequenceInfo *seqinfo_local;

                  + *sequence_rel = NULL;
                  +
                  *seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
                  Assert(!isnull);


                  I had added it as defence-in-depth, but yeah can be removed.


                  2) Creating a subscription is costly as it has more work to do as it
                  has to sync all relations and requires more processes to be started,
                  instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
                  by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
                  +$nodesubscriber->safepsql(
                  + 'postgres',
                  + "CREATE SUBSCRIPTION regressseqsubnoselect CONNECTION
                  '$publisherlimitedconnstr' PUBLICATION regressseqpub WITH
                  (disableonerror = true)"
                  +);


                  Makes sense.


                  3) You can use sequence name as regress_s5 to be consistent with the
                  other sequence names nearby or alternatively you can change the privs
                  of an already existing sequence:
                  + CREATE ROLE regressseqrepl LOGIN REPLICATION;
                  + CREATE SEQUENCE regressnoselect;
                  + GRANT USAGE ON SCHEMA public TO regressseqrepl;


                  Sounds good.


                  4) I feel this is not required:
                  +$result = $nodesubscriber->safepsql('postgres', 'SELECT 1');
                  +is($result, '1',
                  + 'subscriber remains running after publisher returns NULL
                  sequence data');


                  Yeah, you are right.

                  I've made these changes, attaching patch v4.
                  Please review and let me know.
                  Thanks for the updated patch. After this error occurs, it currently
                  emits the message 'missing sequence on publisher
                  ("public.regressnoselect")'. This can be misleading, as it suggests
                  the sequence does not exist, whereas the actual cause may be
                  insufficient privileges. As a result, users may find it difficult to
                  diagnose and resolve the issue.
                  Consider updating the error reported in reportsequenceerrors to
                  convey that the failure could be due to either a missing sequence or
                  insufficient privileges (e.g., "missing sequence or insufficient
                  privilege on publisher"). While making this change, also update the
                  missingseqsidx variable name and the associated comments to
                  accurately reflect the broader scope of the condition.
                  Regards,
                  Vignesh
                  • Jump to comment-1
                    Ayush Tiwari<ayushtiwari.slg01@gmail.com>
                    Apr 30, 2026, 6:46 AM UTC
                    Hi,
                    On Thu, 30 Apr 2026 at 11:27, vignesh C <vignesh21@gmail.com> wrote:

                    Consider updating the error reported in reportsequenceerrors to
                    convey that the failure could be due to either a missing sequence or
                    insufficient privileges (e.g., "missing sequence or insufficient
                    privilege on publisher"). While making this change, also update the
                    missingseqsidx variable name and the associated comments to
                    accurately reflect the broader scope of the condition.
                    Agreed. I have updated the warning to say "missing or inaccessible
                    sequence on publisher", so it covers both cases where the sequence is
                    actually missing and cases where the publisher-side role cannot access
                    its sequence data.
                    I also renamed the related variables from missing* to unavailable*,
                    and updated the associated comments and DEBUG message to avoid implying
                    that the sequence is necessarily absent on the publisher.
                    The TAP expectations for both the dropped-sequence case and the
                    insufficient-privileges case have been updated to match the new wording.
                    Attached is v5 with these changes.
                    Thoughts?
                    Regards,
                    Ayush
                    • Jump to comment-1
                      Amit Kapila<amit.kapila16@gmail.com>
                      Apr 30, 2026, 11:20 AM UTC
                      On Thu, Apr 30, 2026 at 12:16 PM Ayush Tiwari
                      <ayushtiwari.slg01@gmail.com> wrote:
                      On Thu, 30 Apr 2026 at 11:27, vignesh C <vignesh21@gmail.com> wrote:


                      Consider updating the error reported in reportsequenceerrors to
                      convey that the failure could be due to either a missing sequence or
                      insufficient privileges (e.g., "missing sequence or insufficient
                      privilege on publisher"). While making this change, also update the
                      missingseqsidx variable name and the associated comments to
                      accurately reflect the broader scope of the condition.


                      Agreed.
                      Sequence sync can also be skipped if the same sequence is a temporary
                      sequence from some other session on publisher. So, the change in v5
                      was also incomplete if we want to make it more generic than now. I
                      have pushed v4 for now but if you or Vignesh would like to make the
                      message more generic, feel free to propose a new patch.
                      --
                      With Regards,
                      Amit Kapila.