promotion related handling in pg_sync_replication_slots()

  • Jump to comment-1
    shveta malik<shveta.malik@gmail.com>
    Apr 4, 2024, 11:35 AM UTC
    Hello hackers,
    Currently, promotion related handling is missing in the slot sync SQL
    function pgsyncreplication_slots(). Here is the background on how
    it is done in slot sync worker:
    During promotion, the startup process in order to shut down the
    slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
    signal, and waits for slot sync worker to exit. Meanwhile if the
    postmaster has not noticed the promotion yet, it may end up restarting
    slot sync worker. In such a case, the worker exits if 'stopSignaled'
    is set.
    Since there is a chance that the user (or any of his scripts/tools)
    may execute SQL function pgsyncreplication_slots() in parallel to
    promotion, such handling is needed in this SQL function as well, The
    attached patch attempts to implement the same. Changes are:
    1) If pgsyncreplication_slots() is already running when the
    promotion is triggered, ShutDownSlotSync() checks the
    'SlotSyncCtx->syncing' flag as well and waits for it to become false
    i.e. waits till parallel running SQL function is finished.
    2) If pgsyncreplication_slots() is invoked when promotion is
    already in progress, pgsyncreplication_slots() respects the
    'stopSignaled' flag set by the startup process and becomes a no-op.
    thanks
    Shveta
    • Jump to comment-1
      Amit Kapila<amit.kapila16@gmail.com>
      Apr 4, 2024, 11:47 AM UTC
      On Thu, Apr 4, 2024 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote:

      Hello hackers,

      Currently, promotion related handling is missing in the slot sync SQL
      function pgsyncreplication_slots(). Here is the background on how
      it is done in slot sync worker:
      During promotion, the startup process in order to shut down the
      slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
      signal, and waits for slot sync worker to exit. Meanwhile if the
      postmaster has not noticed the promotion yet, it may end up restarting
      slot sync worker. In such a case, the worker exits if 'stopSignaled'
      is set.

      Since there is a chance that the user (or any of his scripts/tools)
      may execute SQL function pgsyncreplication_slots() in parallel to
      promotion, such handling is needed in this SQL function as well, The
      attached patch attempts to implement the same.
      Thanks for the report and patch. I'll look into it.
      --
      With Regards,
      Amit Kapila.
      • Jump to comment-1
        shveta malik<shveta.malik@gmail.com>
        Apr 5, 2024, 5:01 AM UTC
        On Thu, Apr 4, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
        On Thu, Apr 4, 2024 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote:

        Hello hackers,

        Currently, promotion related handling is missing in the slot sync SQL
        function pgsyncreplication_slots(). Here is the background on how
        it is done in slot sync worker:
        During promotion, the startup process in order to shut down the
        slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
        signal, and waits for slot sync worker to exit. Meanwhile if the
        postmaster has not noticed the promotion yet, it may end up restarting
        slot sync worker. In such a case, the worker exits if 'stopSignaled'
        is set.

        Since there is a chance that the user (or any of his scripts/tools)
        may execute SQL function pgsyncreplication_slots() in parallel to
        promotion, such handling is needed in this SQL function as well, The
        attached patch attempts to implement the same.

        Thanks for the report and patch. I'll look into it.
        Please find v2. Changes are:
        1) Rebased the patch as there was conflict due to recent commit 6f132ed.
        2) Added an Assert in updatesyncedslotsinactivesince() to ensure
        that the slot does not have active_pid.
        3) Improved commit msg and comments.
        thanks
        Shveta
        • Jump to comment-1
          Amit Kapila<amit.kapila16@gmail.com>
          Apr 6, 2024, 6:19 AM UTC
          On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

          Please find v2. Changes are:
          1) Rebased the patch as there was conflict due to recent commit 6f132ed.
          2) Added an Assert in updatesyncedslotsinactivesince() to ensure
          that the slot does not have active_pid.
          3) Improved commit msg and comments.
          Few comments:
          ==============
          1.
          void
          SyncReplicationSlots(WalReceiverConn *wrconn)
          {
          + /*
          + * Startup process signaled the slot sync to stop, so if meanwhile user
          + * has invoked slot sync SQL function, simply return.
          + */
          + SpinLockAcquire(&SlotSyncCtx->mutex);
          + if (SlotSyncCtx->stopSignaled)
          + {
          + ereport(LOG,
          + errmsg("skipping slot synchronization as slot sync shutdown is
          signaled during promotion"));
          +
          + SpinLockRelease(&SlotSyncCtx->mutex);
          + return;
          + }
          + SpinLockRelease(&SlotSyncCtx->mutex);
          There is a race condition with this code. Say during promotion
          ShutDownSlotSync() is just before setting this flag and the user has
          invoked pgsyncreplication_slots() and passed this check but still
          didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
          recognize that there is slot sync going on in parallel, and slot sync
          wouldn't know that the promotion is in progress.
          The current coding for slot sync worker ensures there is no such race
          by checking the stopSignaled and setting the pid together under
          spinlock. I think we need to move the setting of the syncing flag
          similarly. Once we do that probably checking SlotSyncCtx->syncing
          should be sufficient in ShutDownSlotSync(). If we change the location
          of setting the 'syncing' flag then please ensure its cleanup as we
          currently do in slotsyncfailurecallback().
          2.
          @@ -1395,6 +1395,7 @@ updatesyncedslotsinactivesince(void)
          if (s->in_use && s->data.synced)
          {
          Assert(SlotIsLogical(s));
          + Assert(s->active_pid == 0);
          We can add a comment like: "The slot must not be acquired by any process"
          --
          With Regards,
          Amit Kapila.
          • Jump to comment-1
            shveta malik<shveta.malik@gmail.com>
            Apr 12, 2024, 2:17 AM UTC
            On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
            On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

            Please find v2. Changes are:
            1) Rebased the patch as there was conflict due to recent commit 6f132ed.
            2) Added an Assert in updatesyncedslotsinactivesince() to ensure
            that the slot does not have active_pid.
            3) Improved commit msg and comments.

            Few comments:
            ==============
            1.
            void
            SyncReplicationSlots(WalReceiverConn *wrconn)
            {
            + /*
            + * Startup process signaled the slot sync to stop, so if meanwhile user
            + * has invoked slot sync SQL function, simply return.
            + */
            + SpinLockAcquire(&SlotSyncCtx->mutex);
            + if (SlotSyncCtx->stopSignaled)
            + {
            + ereport(LOG,
            + errmsg("skipping slot synchronization as slot sync shutdown is
            signaled during promotion"));
            +
            + SpinLockRelease(&SlotSyncCtx->mutex);
            + return;
            + }
            + SpinLockRelease(&SlotSyncCtx->mutex);

            There is a race condition with this code. Say during promotion
            ShutDownSlotSync() is just before setting this flag and the user has
            invoked pgsyncreplication_slots() and passed this check but still
            didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
            recognize that there is slot sync going on in parallel, and slot sync
            wouldn't know that the promotion is in progress.
            Did you mean that now, the promotion would not recognize...
            I see, I will fix this.
            The current coding for slot sync worker ensures there is no such race
            by checking the stopSignaled and setting the pid together under
            spinlock. I think we need to move the setting of the syncing flag
            similarly. Once we do that probably checking SlotSyncCtx->syncing
            should be sufficient in ShutDownSlotSync(). If we change the location
            of setting the 'syncing' flag then please ensure its cleanup as we
            currently do in slotsyncfailurecallback().
            Sure, let me review.
            2.
            @@ -1395,6 +1395,7 @@ updatesyncedslotsinactivesince(void)
            if (s->in_use && s->data.synced)
            {
            Assert(SlotIsLogical(s));
            + Assert(s->active_pid == 0);

            We can add a comment like: "The slot must not be acquired by any process"

            --
            With Regards,
            Amit Kapila.
            • Jump to comment-1
              Amit Kapila<amit.kapila16@gmail.com>
              Apr 12, 2024, 10:37 AM UTC
              On Fri, Apr 12, 2024 at 7:47 AM shveta malik <shveta.malik@gmail.com> wrote:
              On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:


              Few comments:
              ==============
              1.
              void
              SyncReplicationSlots(WalReceiverConn *wrconn)
              {
              + /*
              + * Startup process signaled the slot sync to stop, so if meanwhile user
              + * has invoked slot sync SQL function, simply return.
              + */
              + SpinLockAcquire(&SlotSyncCtx->mutex);
              + if (SlotSyncCtx->stopSignaled)
              + {
              + ereport(LOG,
              + errmsg("skipping slot synchronization as slot sync shutdown is
              signaled during promotion"));
              +
              + SpinLockRelease(&SlotSyncCtx->mutex);
              + return;
              + }
              + SpinLockRelease(&SlotSyncCtx->mutex);

              There is a race condition with this code. Say during promotion
              ShutDownSlotSync() is just before setting this flag and the user has
              invoked pgsyncreplication_slots() and passed this check but still
              didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
              recognize that there is slot sync going on in parallel, and slot sync
              wouldn't know that the promotion is in progress.

              Did you mean that now, the promotion would not recognize...
              Right.
              I see, I will fix this.
              Thanks.
              --
              With Regards,
              Amit Kapila.
        • Jump to comment-1
          Bharath Rupireddy<bharath.rupireddyforpostgres@gmail.com>
          Apr 6, 2024, 6:55 AM UTC
          On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

          Please find v2. Changes are:
          Thanks for the patch. Here are some comments.
          1. Can we have a clear saying in the shmem variable who's syncing at
          the moment? Is it a slot sync worker or a backend via SQL function?
          Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
          typedef enum SlotSyncSource
          {
          SLOT_SYNC_NONE,
          SLOT_SYNC_WORKER,
          SLOT_SYNC_BACKEND,
          } SlotSyncSource;
          Then, the check in ShutDownSlotSync can be:
          + /*
          + * Return if neither the slot sync worker is running nor the function
          + * pgsyncreplication_slots() is executing.
          + */
          + if ((SlotSyncCtx->pid == InvalidPid) &&
          SlotSyncCtx->syncsource != SLOTSYNC_BACKEND)
              {
          2.
          SyncReplicationSlots(WalReceiverConn *wrconn)
          {
          + /*
          + * Startup process signaled the slot sync to stop, so if meanwhile user
          + * has invoked slot sync SQL function, simply return.
          + */
          + SpinLockAcquire(&SlotSyncCtx->mutex);
          + if (SlotSyncCtx->stopSignaled)
          + {
          + ereport(LOG,
          + errmsg("skipping slot synchronization as slot sync
          shutdown is signaled during promotion"));
          +
          Unless I'm missing something, I think this can't detect if the backend
          via SQL function is already half-way through syncing in
          synchronizeoneslot. So, better move this check to (or also have it
          there) slot sync loop that calls synchronizeoneslot. To avoid
          spinlock acquisitions, we can perhaps do this check in when we acquire
          the spinlock for synced flag.
          /* Search for the named slot */
          if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
          {
              bool        synced;
          
              SpinLockAcquire(&slot->mutex);
              synced = slot->data.synced;
              << get SlotSyncCtx->stopSignaled here >>
              SpinLockRelease(&slot->mutex);
          
              << do the slot sync skip check here if (stopSignaled) >>
          3. Can we have a test or steps at least to check the consequences
          manually one can get if slot syncing via SQL function is happening
          during the promotion?
          IIUC, we need to ensure there is no backend acquiring it and
          performing sync while the slot sync worker is shutting down/standby
          promotion is occuring. Otherwise, some of the slots can get resynced
          and some are not while we are shutting down the slot sync worker as
          part of the standby promotion which might leave the slots in an
          inconsistent state.
          --
          Bharath Rupireddy
          PostgreSQL Contributors Team
          RDS Open Source Databases
          Amazon Web Services: https://aws.amazon.com
          • Jump to comment-1
            shveta malik<shveta.malik@gmail.com>
            Apr 12, 2024, 2:28 AM UTC
            On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
            <bharath.rupireddyforpostgres@gmail.com> wrote:
            On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

            Please find v2. Changes are:

            Thanks for the patch. Here are some comments.
            Thanks for reviewing.

            1. Can we have a clear saying in the shmem variable who's syncing at
            the moment? Is it a slot sync worker or a backend via SQL function?
            Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"

            typedef enum SlotSyncSource
            {
            SLOTSYNCNONE,
            SLOTSYNCWORKER,
            SLOTSYNCBACKEND,
            } SlotSyncSource;

            Then, the check in ShutDownSlotSync can be:

            + /*
            + * Return if neither the slot sync worker is running nor the function
            + * pgsyncreplication_slots() is executing.
            + */
            + if ((SlotSyncCtx->pid == InvalidPid) &&
            SlotSyncCtx->syncsource != SLOTSYNC_BACKEND)
            {

            2.
            SyncReplicationSlots(WalReceiverConn *wrconn)
            {
            + /*
            + * Startup process signaled the slot sync to stop, so if meanwhile user
            + * has invoked slot sync SQL function, simply return.
            + */
            + SpinLockAcquire(&SlotSyncCtx->mutex);
            + if (SlotSyncCtx->stopSignaled)
            + {
            + ereport(LOG,
            + errmsg("skipping slot synchronization as slot sync
            shutdown is signaled during promotion"));
            +

            Unless I'm missing something, I think this can't detect if the backend
            via SQL function is already half-way through syncing in
            synchronizeoneslot. So, better move this check to (or also have it
            there) slot sync loop that calls synchronizeoneslot. To avoid
            spinlock acquisitions, we can perhaps do this check in when we acquire
            the spinlock for synced flag.
            If the sync via SQL function is already half-way, then promotion
            should wait for it to finish. I don't think it is a good idea to move
            the check to synchronizeoneslot(). The sync-call should either not
            start (if it noticed the promotion) or finish the sync and then let
            promotion proceed. But I would like to know others' opinion on this.

            / Search for the named slot /
            if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
            {
            bool synced;
            SpinLockAcquire(&slot->mutex);
            synced = slot->data.synced;
            << get SlotSyncCtx->stopSignaled here >>
            SpinLockRelease(&slot->mutex);
            << do the slot sync skip check here if (stopSignaled) >>

            3. Can we have a test or steps at least to check the consequences
            manually one can get if slot syncing via SQL function is happening
            during the promotion?

            IIUC, we need to ensure there is no backend acquiring it and
            performing sync while the slot sync worker is shutting down/standby
            promotion is occuring. Otherwise, some of the slots can get resynced
            and some are not while we are shutting down the slot sync worker as
            part of the standby promotion which might leave the slots in an
            inconsistent state.
            I do not think that we can reach a state (exception is some error
            scenario) where some of the slots are synced while the rest are not
            during a particular sync-cycle only because promotion is going in
            parallel. (And yes we need to fix the race-condition stated by Amit
            up-thread for this statement to be true.)
            thanks
            Shveta
            • Jump to comment-1
              Amit Kapila<amit.kapila16@gmail.com>
              Apr 12, 2024, 10:48 AM UTC
              On Fri, Apr 12, 2024 at 7:57 AM shveta malik <shveta.malik@gmail.com> wrote:

              On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
              <bharath.rupireddyforpostgres@gmail.com> wrote:
              On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

              Please find v2. Changes are:

              Thanks for the patch. Here are some comments.

              Thanks for reviewing.

              1. Can we have a clear saying in the shmem variable who's syncing at
              the moment? Is it a slot sync worker or a backend via SQL function?
              Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"

              typedef enum SlotSyncSource
              {
              SLOTSYNCNONE,
              SLOTSYNCWORKER,
              SLOTSYNCBACKEND,
              } SlotSyncSource;

              Then, the check in ShutDownSlotSync can be:

              + /*
              + * Return if neither the slot sync worker is running nor the function
              + * pgsyncreplication_slots() is executing.
              + */
              + if ((SlotSyncCtx->pid == InvalidPid) &&
              SlotSyncCtx->syncsource != SLOTSYNC_BACKEND)
              {
              I don't know if this will be help, especially after fixing the race
              condition I mentioned. But otherwise, also, at this stage it doesn't
              seem helpful to add the source of sync explicitly.
              2.
              SyncReplicationSlots(WalReceiverConn *wrconn)
              {
              + /*
              + * Startup process signaled the slot sync to stop, so if meanwhile user
              + * has invoked slot sync SQL function, simply return.
              + */
              + SpinLockAcquire(&SlotSyncCtx->mutex);
              + if (SlotSyncCtx->stopSignaled)
              + {
              + ereport(LOG,
              + errmsg("skipping slot synchronization as slot sync
              shutdown is signaled during promotion"));
              +

              Unless I'm missing something, I think this can't detect if the backend
              via SQL function is already half-way through syncing in
              synchronizeoneslot. So, better move this check to (or also have it
              there) slot sync loop that calls synchronizeoneslot. To avoid
              spinlock acquisitions, we can perhaps do this check in when we acquire
              the spinlock for synced flag.

              If the sync via SQL function is already half-way, then promotion
              should wait for it to finish. I don't think it is a good idea to move
              the check to synchronizeoneslot(). The sync-call should either not
              start (if it noticed the promotion) or finish the sync and then let
              promotion proceed. But I would like to know others' opinion on this.
              Agreed.
              --
              With Regards,
              Amit Kapila.
              • Jump to comment-1
                shveta malik<shveta.malik@gmail.com>
                Apr 12, 2024, 11:56 AM UTC
                On Fri, Apr 12, 2024 at 4:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                On Fri, Apr 12, 2024 at 7:57 AM shveta malik <shveta.malik@gmail.com> wrote:

                On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
                <bharath.rupireddyforpostgres@gmail.com> wrote:
                On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:

                Please find v2. Changes are:

                Thanks for the patch. Here are some comments.

                Thanks for reviewing.

                1. Can we have a clear saying in the shmem variable who's syncing at
                the moment? Is it a slot sync worker or a backend via SQL function?
                Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"

                typedef enum SlotSyncSource
                {
                SLOTSYNCNONE,
                SLOTSYNCWORKER,
                SLOTSYNCBACKEND,
                } SlotSyncSource;

                Then, the check in ShutDownSlotSync can be:

                + /*
                + * Return if neither the slot sync worker is running nor the function
                + * pgsyncreplication_slots() is executing.
                + */
                + if ((SlotSyncCtx->pid == InvalidPid) &&
                SlotSyncCtx->syncsource != SLOTSYNC_BACKEND)
                {

                I don't know if this will be help, especially after fixing the race
                condition I mentioned. But otherwise, also, at this stage it doesn't
                seem helpful to add the source of sync explicitly.
                Agreed.
                Please find v3 addressing race-condition and one other comment.
                Up-thread it was suggested that, probably, checking
                SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
                re-thinking, it might not be. Slot sync worker sets and resets
                'syncing' with each sync-cycle, and thus we need to rely on worker's
                pid in ShutDownSlotSync(), as there could be a window where promotion
                is triggered and 'syncing' is not set for worker, while the worker is
                still running. This implementation of setting and resetting syncing
                with each sync-cycle looks better as compared to setting syncing
                during the entire life-cycle of the worker. So, I did not change it.
                To fix the race condition, I moved the setting of the 'syncing' flag
                together with the 'stopSignaled' check under the same spinLock for the
                SQL function. OTOH, for worker, I feel it is good to check
                'stopSignaled' at the beginning itself, while retaining the
                setting/resetting of 'syncing' at a later stage during the actual sync
                cycle. This makes handling for SQL function and worker slightly
                different. And thus to achieve this, I had to take the 'syncing' flag
                handling out of synchronize_slots() and move it to both worker and SQL
                function by introducing 2 new functions checkandsetsyncingflag()
                and resetsyncingflag().
                I am analyzing if there are better ways to achieve this, any
                suggestions are welcome.
                thanks
                Shveta
                • Jump to comment-1
                  Amit Kapila<amit.kapila16@gmail.com>
                  Apr 15, 2024, 8:59 AM UTC
                  On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:

                  Please find v3 addressing race-condition and one other comment.

                  Up-thread it was suggested that, probably, checking
                  SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
                  re-thinking, it might not be. Slot sync worker sets and resets
                  'syncing' with each sync-cycle, and thus we need to rely on worker's
                  pid in ShutDownSlotSync(), as there could be a window where promotion
                  is triggered and 'syncing' is not set for worker, while the worker is
                  still running. This implementation of setting and resetting syncing
                  with each sync-cycle looks better as compared to setting syncing
                  during the entire life-cycle of the worker. So, I did not change it.
                  To retain this we need to have different handling for 'syncing' for
                  workers and function which seems like more maintenance burden than the
                  value it provides. Moreover, in SyncReplicationSlots(), we are calling
                  a function after acquiring spinlock which is not our usual coding
                  practice.
                  One minor comment:
                  * All the fields except 'syncing' are used only by slotsync worker.
                  * 'syncing' is used both by worker and SQL function pgsyncreplication_slots.
                  */
                  typedef struct SlotSyncCtxStruct
                  {
                  pid_t pid;
                  bool stopSignaled;
                  bool syncing;
                  timet laststart_time;
                  slock_t mutex;
                  } SlotSyncCtxStruct;
                  I feel the above comment is no longer valid after this patch. We can
                  probably remove this altogether.
                  --
                  With Regards,
                  Amit Kapila.
                  • Jump to comment-1
                    shveta malik<shveta.malik@gmail.com>
                    Apr 15, 2024, 10:09 AM UTC
                    On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                    On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:

                    Please find v3 addressing race-condition and one other comment.

                    Up-thread it was suggested that, probably, checking
                    SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
                    re-thinking, it might not be. Slot sync worker sets and resets
                    'syncing' with each sync-cycle, and thus we need to rely on worker's
                    pid in ShutDownSlotSync(), as there could be a window where promotion
                    is triggered and 'syncing' is not set for worker, while the worker is
                    still running. This implementation of setting and resetting syncing
                    with each sync-cycle looks better as compared to setting syncing
                    during the entire life-cycle of the worker. So, I did not change it.

                    To retain this we need to have different handling for 'syncing' for
                    workers and function which seems like more maintenance burden than the
                    value it provides. Moreover, in SyncReplicationSlots(), we are calling
                    a function after acquiring spinlock which is not our usual coding
                    practice.
                    Okay. Changed it to consistent handling. Now both worker and SQL
                    function set 'syncing' when they start and reset it when they exit.
                    One minor comment:
                    * All the fields except 'syncing' are used only by slotsync worker.
                    * 'syncing' is used both by worker and SQL function pgsyncreplication_slots.
                    */
                    typedef struct SlotSyncCtxStruct
                    {
                    pid_t pid;
                    bool stopSignaled;
                    bool syncing;
                    timet laststart_time;
                    slock_t mutex;
                    } SlotSyncCtxStruct;

                    I feel the above comment is no longer valid after this patch. We can
                    probably remove this altogether.
                    Yes, changed.
                    Please find v4 addressing the above comments.
                    thanks
                    Shveta
                    • Jump to comment-1
                      Zhijie Hou (Fujitsu)<houzj.fnst@fujitsu.com>
                      Apr 16, 2024, 3:57 AM UTC
                      On Monday, April 15, 2024 6:09 PM shveta malik <shveta.malik@gmail.com> wrote:

                      Please find v4 addressing the above comments.
                      Thanks for the patch.
                      Here are few comments:
                      1.
                      + ereport(ERROR,
                      + errmsg("promotion in progress, can not synchronize replication slots"));
                      + }
                      I think an errcode is needed.
                      The style of the error message seems a bit unnatural to me. I suggest:
                      "cannot synchronize replication slots when standby promotion is ongoing"
                      2.
                      + if (worker_pid != InvalidPid)
                      + Assert(SlotSyncCtx->pid == InvalidPid);
                      We could merge the checks into one Assert().
                      Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);
                      3.
                      - pqsignal(SIGINT, SignalHandlerForShutdownRequest);
                      I realized that we should register this before setting SlotSyncCtx->pid,
                      otherwise if the standby is promoted after setting pid and before registering
                      signal handle function, the slotsync worker could miss to handle SIGINT sent by
                      startup process(ShutDownSlotSync). This is an existing issue for slotsync
                      worker, but maybe we could fix it together with the patch.
                      Best Regards,
                      Hou zj
                      • Jump to comment-1
                        shveta malik<shveta.malik@gmail.com>
                        Apr 16, 2024, 4:30 AM UTC
                        On Tue, Apr 16, 2024 at 9:27 AM Zhijie Hou (Fujitsu)
                        <houzj.fnst@fujitsu.com> wrote:
                        On Monday, April 15, 2024 6:09 PM shveta malik <shveta.malik@gmail.com> wrote:

                        Please find v4 addressing the above comments.

                        Thanks for the patch.

                        Here are few comments:
                        Thanks for reviewing the patch.

                        1.

                        + ereport(ERROR,
                        + errmsg("promotion in progress, can not synchronize replication slots"));
                        + }

                        I think an errcode is needed.

                        The style of the error message seems a bit unnatural to me. I suggest:
                        "cannot synchronize replication slots when standby promotion is ongoing"
                        Modified.

                        2.

                        + if (worker_pid != InvalidPid)
                        + Assert(SlotSyncCtx->pid == InvalidPid);

                        We could merge the checks into one Assert().
                        Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);
                        Modified.

                        3.

                        - pqsignal(SIGINT, SignalHandlerForShutdownRequest);
                        I realized that we should register this before setting SlotSyncCtx->pid,
                        otherwise if the standby is promoted after setting pid and before registering
                        signal handle function, the slotsync worker could miss to handle SIGINT sent by
                        startup process(ShutDownSlotSync). This is an existing issue for slotsync
                        worker, but maybe we could fix it together with the patch.
                        Yes, it seems like a problem. Fixed it. Also to be consistent, moved
                        other signal handlers' registration as well before we set pid.
                        Please find v5 addressing above comments.
                        thanks
                        Shveta
                        • Jump to comment-1
                          Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                          Apr 16, 2024, 6:52 AM UTC
                          Hi,
                          On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
                          Please find v5 addressing above comments.
                          Thanks!
                          @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
                          {
                              PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
                              {
                          + checkflagsandsetsync_info(InvalidPid);
                          +
                          Given the fact that if the sync worker is running it won't be possible to trigger
                          a manual sync with pgsyncreplication_slots(), what about also checking the
                          "syncreplicationslots" value at the start of SyncReplicationSlots() and
                          emmit an error if syncreplicationslots is set to on? (The message could explicitly
                          states that it's not possible to use the function if syncreplicationslots is
                          set to on).
                          Regards,
                          --
                          Bertrand Drouvot
                          PostgreSQL Contributors Team
                          RDS Open Source Databases
                          Amazon Web Services: https://aws.amazon.com
                          • Jump to comment-1
                            Zhijie Hou (Fujitsu)<houzj.fnst@fujitsu.com>
                            Apr 16, 2024, 8:25 AM UTC
                            On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
                            Hi,
                            On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
                            Please find v5 addressing above comments.

                            Thanks!

                            @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn) {
                            PGENSUREERRORCLEANUP(slotsyncfailure_callback,
                            PointerGetDatum(wrconn));
                            {
                            + checkflagsandsetsync_info(InvalidPid);
                            +

                            Given the fact that if the sync worker is running it won't be possible to trigger a
                            manual sync with pgsyncreplication_slots(), what about also checking the
                            "syncreplicationslots" value at the start of SyncReplicationSlots() and emmit
                            an error if syncreplicationslots is set to on? (The message could explicitly
                            states that it's not possible to use the function if syncreplicationslots is set to
                            on).
                            I personally feel adding the additional check for syncreplicationslots may
                            not improve the situation here. Because the GUC syncreplicationslots can
                            change at any point, the GUC could be false when performing this addition check
                            and is set to true immediately after the check, so It could not simplify the logic
                            anyway.
                            Best Regards,
                            Hou zj
                            • Jump to comment-1
                              shveta malik<shveta.malik@gmail.com>
                              Apr 16, 2024, 8:37 AM UTC
                              On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
                              <houzj.fnst@fujitsu.com> wrote:
                              On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:


                              Hi,
                              On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
                              Please find v5 addressing above comments.

                              Thanks!

                              @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn) {
                              PGENSUREERRORCLEANUP(slotsyncfailure_callback,
                              PointerGetDatum(wrconn));
                              {
                              + checkflagsandsetsync_info(InvalidPid);
                              +

                              Given the fact that if the sync worker is running it won't be possible to trigger a
                              manual sync with pgsyncreplication_slots(), what about also checking the
                              "syncreplicationslots" value at the start of SyncReplicationSlots() and emmit
                              an error if syncreplicationslots is set to on? (The message could explicitly
                              states that it's not possible to use the function if syncreplicationslots is set to
                              on).

                              I personally feel adding the additional check for syncreplicationslots may
                              not improve the situation here. Because the GUC syncreplicationslots can
                              change at any point, the GUC could be false when performing this addition check
                              and is set to true immediately after the check, so It could not simplify the logic
                              anyway.
                              +1.
                              I feel doc and "cannot synchronize replication slots concurrently"
                              check should suffice.
                              In the scenario which Hou-San pointed out, if after performing the
                              GUC check in SQL function, this GUC is enabled immediately and say
                              worker is started sooner than the function could get chance to sync,
                              in that case as well, SQL function will ultimately get error "cannot
                              synchronize replication slots concurrently", even though GUC is
                              enabled. Thus, I feel we should stick with samer error in all
                              scenarios.
                              thanks
                              Shveta
                              • Jump to comment-1
                                Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                                Apr 16, 2024, 9:22 AM UTC
                                Hi,
                                On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
                                On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
                                <houzj.fnst@fujitsu.com> wrote:
                                I personally feel adding the additional check for syncreplicationslots may
                                not improve the situation here. Because the GUC syncreplicationslots can
                                change at any point, the GUC could be false when performing this addition check
                                and is set to true immediately after the check, so It could not simplify the logic
                                anyway.

                                +1.
                                I feel doc and "cannot synchronize replication slots concurrently"
                                check should suffice.

                                In the scenario which Hou-San pointed out, if after performing the
                                GUC check in SQL function, this GUC is enabled immediately and say
                                worker is started sooner than the function could get chance to sync,
                                in that case as well, SQL function will ultimately get error "cannot
                                synchronize replication slots concurrently", even though GUC is
                                enabled. Thus, I feel we should stick with samer error in all
                                scenarios.
                                Okay, fine by me, let's forget about checking syncreplicationslots then.
                                Regards,
                                --
                                Bertrand Drouvot
                                PostgreSQL Contributors Team
                                RDS Open Source Databases
                                Amazon Web Services: https://aws.amazon.com
                                • Jump to comment-1
                                  shveta malik<shveta.malik@gmail.com>
                                  Apr 18, 2024, 7:05 AM UTC
                                  On Tue, Apr 16, 2024 at 2:52 PM Bertrand Drouvot
                                  <bertranddrouvot.pg@gmail.com> wrote:

                                  Hi,

                                  On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
                                  On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
                                  <houzj.fnst@fujitsu.com> wrote:
                                  I personally feel adding the additional check for syncreplicationslots may
                                  not improve the situation here. Because the GUC syncreplicationslots can
                                  change at any point, the GUC could be false when performing this addition check
                                  and is set to true immediately after the check, so It could not simplify the logic
                                  anyway.

                                  +1.
                                  I feel doc and "cannot synchronize replication slots concurrently"
                                  check should suffice.

                                  In the scenario which Hou-San pointed out, if after performing the
                                  GUC check in SQL function, this GUC is enabled immediately and say
                                  worker is started sooner than the function could get chance to sync,
                                  in that case as well, SQL function will ultimately get error "cannot
                                  synchronize replication slots concurrently", even though GUC is
                                  enabled. Thus, I feel we should stick with samer error in all
                                  scenarios.

                                  Okay, fine by me, let's forget about checking syncreplicationslots then.
                                  Thanks.
                                  While reviewing and testing this patch further, we encountered 2
                                  race-conditions which needs to be handled:
                                  1) For slot sync worker, the order of cleanup execution was a) first
                                  reset 'syncing' flag (slotsyncfailurecallback) b) then reset pid and
                                  syncing (slotsyncworkeronexit). But in ShutDownSlotSync(), we rely
                                  only on the 'syncing' flag for wait-exit logic. So it may so happen
                                  that in the window between these two callbacks, ShutDownSlotSync()
                                  proceeds and calls updatesyncedslotsinactivesince() which may then
                                  hit assert Assert((SlotSyncCtx->pid == InvalidPid).
                                  2) Another problem as described by Hou-San off-list:
                                  When the slotsync worker error out after acquiring a slot, it will
                                  first call slotsyncworkeronexit() and then
                                  ReplicationSlotShmemExit(), so in the window between these two
                                  callbacks, it's possible that the SlotSyncCtx->syncing
                                  SlotSyncCtx->pid has been reset but the slot->active_pid is still
                                  valid. The Assert will be broken in this.
                                  @@ -1471,6 +1503,9 @@ updatesyncedslotsinactivesince(void)
                                      {
                                          Assert(SlotIsLogical(s));
                                  + / The slot must not be acquired by any process /
                                  + Assert(s->active_pid == 0);
                                  +
                                  To fix above issues, these changes have been made in v7:
                                  1) For worker, replaced slotsyncfailurecallback() with
                                  slotsyncworkerdisconnect() so that the latter only disconnects and
                                  thus slotsyncworkeronexit() does pid cleanup followed by syncing
                                  flag cleanup. This will make ShutDownSlotSync()'s wait exit reliably.
                                  2) To fix second problem, changes are:
                                  2.1) For worker, moved slotsyncworkeronexit() registration before
                                  BaseInit() (BaseInit is the one doing ReplicationSlotShmemExit
                                  registration). If we do this change in order of registration, then
                                  order of cleanup for worker will be a) slotsyncworkerdisconnect() b)
                                  ReplicationSlotShmemExit() c) slotsyncworkeronexit(). This order
                                  ensures that the worker is actually done with slots release and
                                  cleanup before it marks itself as done syncing.
                                  2.2) For SQL function, did ReplicationSlotRelease() and
                                  ReplicationSlotCleanup() as first step in slotsyncfailurecallback().
                                  While doing change 2.2, it occurred to us, that it would be a clean
                                  solution to do ReplicationSlotCleanup() even on successful execution
                                  of SQL function. It seems better that the temporary slots are
                                  cleaned-up when SQL function exists, as we do not know when the user
                                  will run this SQL function again and thus leaving temp slots for
                                  longer does not seem a good idea. Thoughts?
                                  thanks
                                  Shveta
                                  • Jump to comment-1
                                    shveta malik<shveta.malik@gmail.com>
                                    Apr 18, 2024, 12:06 PM UTC
                                    On Thu, Apr 18, 2024 at 12:35 PM shveta malik <shveta.malik@gmail.com> wrote:

                                    To fix above issues, these changes have been made in v7:
                                    Please find v8 attached. Changes are:
                                    1) It fixes ShutDownSlotSync() issue, where we perform
                                    kill(SlotSyncCtx->pid). There are chances that after we release
                                    spin-lock and before we perform kill, slot-sync worker has error-ed
                                    out and has set SlotSyncCtx->pid to InvalidPid (-1) already. And thus
                                    kill(-1) could result in abnormal process kills on some platforms.
                                    Now, we get pid under spin-lock and then use it to perform kill to
                                    avoid pid=-1 kill. This is on a similar line of how ShutdownWalRcv()
                                    does it.
                                    2) Improved comments in code.
                                    3) Updated commit message with new fixes. I had missed to update it in
                                    the previous version.
                                    thanks
                                    Shveta
                                    • Jump to comment-1
                                      Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                                      Apr 19, 2024, 5:23 AM UTC
                                      Hi,
                                      On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
                                      Please find v8 attached. Changes are:
                                      Thanks!
                                      A few comments:
                                      1 ===
                                      @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startupdata, sizet startupdatalen)
                                           * slotsync_worker_onexit() but that will need the connection to be made
                                           * global and we want to avoid introducing global for this purpose.
                                           */
                                      - beforeshmemexit(slotsyncfailurecallback, PointerGetDatum(wrconn));
                                      + beforeshmemexit(slotsyncworkerdisconnect, PointerGetDatum(wrconn));
                                      The comment above this change still states "Register the failure callback once
                                      we have the connection", I think it has to be reworded a bit now that v8 is
                                      making use of slotsyncworkerdisconnect().
                                      2 ===
                                      + * Register slotsyncworkeronexit() before we register
                                      + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit of
                                      + * slot sync worker, ReplicationSlotShmemExit() is called first, followed
                                      + * by slotsyncworkeronexit(). Startup process during promotion waits for
                                      Worth to mention in shmemexit() (where it "while (--beforeshmemexitindex >= 0)"
                                      or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies on
                                      this LIFO behavior? (not sure if there is other "strong" LIFO requirement in
                                      other part of the code).
                                      3 ===
                                      + * Startup process during promotion waits for slot sync to finish and it
                                      + * does that by checking the 'syncing' flag.
                                      worth to mention ShutDownSlotSync()?
                                      4 ===
                                      I did a few tests manually (launching ShutDownSlotSync() through gdb / with and
                                      without sync worker and with / without pgsyncreplication_slots() running
                                      concurrently) and it looks like it works as designed.
                                      Having said that, the logic that is in place to take care of the corner cases
                                      described up-thread seems reasonable to me.
                                      Regards,
                                      --
                                      Bertrand Drouvot
                                      PostgreSQL Contributors Team
                                      RDS Open Source Databases
                                      Amazon Web Services: https://aws.amazon.com
                                      • Jump to comment-1
                                        shveta malik<shveta.malik@gmail.com>
                                        Apr 19, 2024, 6:07 AM UTC
                                        On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
                                        <bertranddrouvot.pg@gmail.com> wrote:

                                        Hi,

                                        On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
                                        Please find v8 attached. Changes are:

                                        Thanks!

                                        A few comments:
                                        Thanks for reviewing.
                                        1 ===

                                        @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startupdata, sizet startupdatalen)
                                        * slotsyncworkeronexit() but that will need the connection to be made
                                        * global and we want to avoid introducing global for this purpose.
                                        */
                                        - beforeshmemexit(slotsyncfailurecallback, PointerGetDatum(wrconn));
                                        + beforeshmemexit(slotsyncworkerdisconnect, PointerGetDatum(wrconn));

                                        The comment above this change still states "Register the failure callback once
                                        we have the connection", I think it has to be reworded a bit now that v8 is
                                        making use of slotsyncworkerdisconnect().

                                        2 ===

                                        + * Register slotsyncworkeronexit() before we register
                                        + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit of
                                        + * slot sync worker, ReplicationSlotShmemExit() is called first, followed
                                        + * by slotsyncworkeronexit(). Startup process during promotion waits for
                                        Worth to mention in shmemexit() (where it "while (--beforeshmemexitindex >= 0)"
                                        or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies on
                                        this LIFO behavior? (not sure if there is other "strong" LIFO requirement in
                                        other part of the code).
                                        I see other modules as well relying on LIFO behavior.
                                        Please see applyparallelworker.c where
                                        'beforeshmemexit(pa_shutdown)' is needed to be done after
                                        'beforeshmemexit(logicalrepworkeronexit)' (commit id 3d144c6).
                                        Also in postinit.c, I see such comments atop
                                        'beforeshmemexit(ShutdownPostgres, 0)'.
                                        I feel we can skip adding this specific comment about
                                        ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules has
                                        also not added any. I will address the rest of your comments in the
                                        next version.
                                        3 ===

                                        + * Startup process during promotion waits for slot sync to finish and it
                                        + * does that by checking the 'syncing' flag.

                                        worth to mention ShutDownSlotSync()?

                                        4 ===

                                        I did a few tests manually (launching ShutDownSlotSync() through gdb / with and
                                        without sync worker and with / without pgsyncreplication_slots() running
                                        concurrently) and it looks like it works as designed.
                                        Thanks for testing it.
                                        Having said that, the logic that is in place to take care of the corner cases
                                        described up-thread seems reasonable to me.
                                        thanks
                                        Shveta
                                        • Jump to comment-1
                                          shveta malik<shveta.malik@gmail.com>
                                          Apr 19, 2024, 8:22 AM UTC
                                          On Fri, Apr 19, 2024 at 11:37 AM shveta malik <shveta.malik@gmail.com> wrote:

                                          On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
                                          <bertranddrouvot.pg@gmail.com> wrote:

                                          Hi,

                                          On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
                                          Please find v8 attached. Changes are:

                                          Thanks!

                                          A few comments:

                                          Thanks for reviewing.
                                          1 ===

                                          @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startupdata, sizet startupdatalen)
                                          * slotsyncworkeronexit() but that will need the connection to be made
                                          * global and we want to avoid introducing global for this purpose.
                                          */
                                          - beforeshmemexit(slotsyncfailurecallback, PointerGetDatum(wrconn));
                                          + beforeshmemexit(slotsyncworkerdisconnect, PointerGetDatum(wrconn));

                                          The comment above this change still states "Register the failure callback once
                                          we have the connection", I think it has to be reworded a bit now that v8 is
                                          making use of slotsyncworkerdisconnect().

                                          2 ===

                                          + * Register slotsyncworkeronexit() before we register
                                          + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit of
                                          + * slot sync worker, ReplicationSlotShmemExit() is called first, followed
                                          + * by slotsyncworkeronexit(). Startup process during promotion waits for
                                          Worth to mention in shmemexit() (where it "while (--beforeshmemexitindex >= 0)"
                                          or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies on
                                          this LIFO behavior? (not sure if there is other "strong" LIFO requirement in
                                          other part of the code).

                                          I see other modules as well relying on LIFO behavior.
                                          Please see applyparallelworker.c where
                                          'beforeshmemexit(pa_shutdown)' is needed to be done after
                                          'beforeshmemexit(logicalrepworkeronexit)' (commit id 3d144c6).
                                          Also in postinit.c, I see such comments atop
                                          'beforeshmemexit(ShutdownPostgres, 0)'.
                                          I feel we can skip adding this specific comment about
                                          ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules has
                                          also not added any. I will address the rest of your comments in the
                                          next version.
                                          3 ===

                                          + * Startup process during promotion waits for slot sync to finish and it
                                          + * does that by checking the 'syncing' flag.

                                          worth to mention ShutDownSlotSync()?

                                          4 ===

                                          I did a few tests manually (launching ShutDownSlotSync() through gdb / with and
                                          without sync worker and with / without pgsyncreplication_slots() running
                                          concurrently) and it looks like it works as designed.

                                          Thanks for testing it.
                                          Having said that, the logic that is in place to take care of the corner cases
                                          described up-thread seems reasonable to me.
                                          Please find v9 with the above comments addressed.
                                          thanks
                                          Shveta
                                          • Jump to comment-1
                                            Amit Kapila<amit.kapila16@gmail.com>
                                            Apr 22, 2024, 11:41 AM UTC
                                            On Fri, Apr 19, 2024 at 1:52 PM shveta malik <shveta.malik@gmail.com> wrote:

                                            Please find v9 with the above comments addressed.
                                            I have made minor modifications in the comments and a function name.
                                            Please see the attached top-up patch. Apart from this, the patch looks
                                            good to me.
                                            --
                                            With Regards,
                                            Amit Kapila.
                                            • Jump to comment-1
                                              shveta malik<shveta.malik@gmail.com>
                                              Apr 22, 2024, 12:02 PM UTC
                                              On Mon, Apr 22, 2024 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                                              On Fri, Apr 19, 2024 at 1:52 PM shveta malik <shveta.malik@gmail.com> wrote:

                                              Please find v9 with the above comments addressed.

                                              I have made minor modifications in the comments and a function name.
                                              Please see the attached top-up patch. Apart from this, the patch looks
                                              good to me.
                                              Thanks for the patch, the changes look good Amit. Please find the merged patch.
                                              thanks
                                              Shveta
                                              • Jump to comment-1
                                                Masahiko Sawada<sawada.mshk@gmail.com>
                                                Apr 22, 2024, 1:34 PM UTC
                                                On Mon, Apr 22, 2024 at 9:02 PM shveta malik <shveta.malik@gmail.com> wrote:
                                                On Mon, Apr 22, 2024 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                                                On Fri, Apr 19, 2024 at 1:52 PM shveta malik <shveta.malik@gmail.com> wrote:

                                                Please find v9 with the above comments addressed.

                                                I have made minor modifications in the comments and a function name.
                                                Please see the attached top-up patch. Apart from this, the patch looks
                                                good to me.

                                                Thanks for the patch, the changes look good Amit. Please find the merged patch.
                                                I've reviewed the patch and have some comments:
                                                ---
                                                /*
                                                - * Early initialization.
                                                + * Register slotsyncworkeronexit() before we register
                                                + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
                                                + * exit of the slot sync worker, ReplicationSlotShmemExit() is called
                                                + * first, followed by slotsyncworkeronexit(). The startup process during
                                                + * promotion invokes ShutDownSlotSync() which waits for slot sync to
                                                + * finish and it does that by checking the 'syncing' flag. Thus worker
                                                + * must be done with the slots' release and cleanup before it marks itself
                                                + * as finished syncing.
                                                 */
                                                I'm slightly worried that we register the slotsyncworkeronexit()
                                                callback before BaseInit(), because it could be a blocker when we want
                                                to add more work in the callback, for example sending the stats.
                                                ---
                                                    synchronize_slots(wrconn);
                                                +
                                                + / Cleanup the temporary slots /
                                                + ReplicationSlotCleanup();
                                                +
                                                + / We are done with sync, so reset sync flag /
                                                + resetsyncingflag();
                                                I think it ends up removing other temp slots that are created by the
                                                same backend process using
                                                pgcreate{physical,logicalreplicationslots() function, which could
                                                be a large side effect of this function for users. Also, if users want
                                                to have a process periodically calling pgsyncreplication_slots()
                                                instead of the slotsync worker, it doesn't support a case where we
                                                create a temp not-ready slot and turn it into a persistent slot if
                                                it's ready for sync.
                                                Regards,
                                                --
                                                Masahiko Sawada
                                                Amazon Web Services: https://aws.amazon.com
                                                • Jump to comment-1
                                                  Amit Kapila<amit.kapila16@gmail.com>
                                                  Apr 23, 2024, 3:37 AM UTC
                                                  On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
                                                  On Mon, Apr 22, 2024 at 9:02 PM shveta malik <shveta.malik@gmail.com> wrote:

                                                  Thanks for the patch, the changes look good Amit. Please find the merged patch.

                                                  I've reviewed the patch and have some comments:

                                                  ---
                                                  /*
                                                  - * Early initialization.
                                                  + * Register slotsyncworkeronexit() before we register
                                                  + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
                                                  + * exit of the slot sync worker, ReplicationSlotShmemExit() is called
                                                  + * first, followed by slotsyncworkeronexit(). The startup process during
                                                  + * promotion invokes ShutDownSlotSync() which waits for slot sync to
                                                  + * finish and it does that by checking the 'syncing' flag. Thus worker
                                                  + * must be done with the slots' release and cleanup before it marks itself
                                                  + * as finished syncing.
                                                  */

                                                  I'm slightly worried that we register the slotsyncworkeronexit()
                                                  callback before BaseInit(), because it could be a blocker when we want
                                                  to add more work in the callback, for example sending the stats.
                                                  The other possibility is that we do slot release/clean up in the
                                                  slotsyncworkeronexit() call itself and then we can do it after
                                                  BaseInit(). Do you have any other/better idea for this?
                                                  ---
                                                  synchronize_slots(wrconn);
                                                  +
                                                  + / Cleanup the temporary slots /
                                                  + ReplicationSlotCleanup();
                                                  +
                                                  + / We are done with sync, so reset sync flag /
                                                  + resetsyncingflag();

                                                  I think it ends up removing other temp slots that are created by the
                                                  same backend process using
                                                  pgcreate{physical,logicalreplicationslots() function, which could
                                                  be a large side effect of this function for users.
                                                  True, I think here we should either remove only temporary and synced
                                                  marked slots. The other possibility is to create slots as RS_EPHEMERAL
                                                  initially when called from the SQL function but that doesn't sound
                                                  like a neat approach.
                                                  Also, if users want
                                                  to have a process periodically calling pgsyncreplication_slots()
                                                  instead of the slotsync worker, it doesn't support a case where we
                                                  create a temp not-ready slot and turn it into a persistent slot if
                                                  it's ready for sync.
                                                  True, but eventually the API should be able to directly create the
                                                  persistent slots and anyway this can happen only for the first time
                                                  (till the slots are created and marked persistent) and one who wants
                                                  to use this function periodically should be able to see regular syncs.
                                                  OTOH, leaving temp slots created via this API could remain as-is after
                                                  promotion and we need to document for users to remove such slots. Now,
                                                  we can do that if we want but I think it is better to clean up such
                                                  slots rather than putting the onus on users to remove them after
                                                  promotion.
                                                  --
                                                  With Regards,
                                                  Amit Kapila.
                                                  • Jump to comment-1
                                                    Masahiko Sawada<sawada.mshk@gmail.com>
                                                    Apr 24, 2024, 5:44 AM UTC
                                                    On Tue, Apr 23, 2024 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                                                    On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
                                                    On Mon, Apr 22, 2024 at 9:02 PM shveta malik <shveta.malik@gmail.com> wrote:

                                                    Thanks for the patch, the changes look good Amit. Please find the merged patch.

                                                    I've reviewed the patch and have some comments:

                                                    ---
                                                    /*
                                                    - * Early initialization.
                                                    + * Register slotsyncworkeronexit() before we register
                                                    + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
                                                    + * exit of the slot sync worker, ReplicationSlotShmemExit() is called
                                                    + * first, followed by slotsyncworkeronexit(). The startup process during
                                                    + * promotion invokes ShutDownSlotSync() which waits for slot sync to
                                                    + * finish and it does that by checking the 'syncing' flag. Thus worker
                                                    + * must be done with the slots' release and cleanup before it marks itself
                                                    + * as finished syncing.
                                                    */

                                                    I'm slightly worried that we register the slotsyncworkeronexit()
                                                    callback before BaseInit(), because it could be a blocker when we want
                                                    to add more work in the callback, for example sending the stats.

                                                    The other possibility is that we do slot release/clean up in the
                                                    slotsyncworkeronexit() call itself and then we can do it after
                                                    BaseInit().
                                                    This approach sounds clearer and safer to me. The current approach
                                                    relies on the callback registration order of
                                                    ReplicationSlotShmemExit(). If it changes in the future, we will
                                                    silently have the same problem. Every slot sync related work should be
                                                    done before allowing someone to touch synced slots by clearing the
                                                    'syncing' flag.
                                                    ---
                                                    synchronize_slots(wrconn);
                                                    +
                                                    + / Cleanup the temporary slots /
                                                    + ReplicationSlotCleanup();
                                                    +
                                                    + / We are done with sync, so reset sync flag /
                                                    + resetsyncingflag();

                                                    I think it ends up removing other temp slots that are created by the
                                                    same backend process using
                                                    pgcreate{physical,logicalreplicationslots() function, which could
                                                    be a large side effect of this function for users.

                                                    True, I think here we should either remove only temporary and synced
                                                    marked slots. The other possibility is to create slots as RS_EPHEMERAL
                                                    initially when called from the SQL function but that doesn't sound
                                                    like a neat approach.
                                                    Also, if users want
                                                    to have a process periodically calling pgsyncreplication_slots()
                                                    instead of the slotsync worker, it doesn't support a case where we
                                                    create a temp not-ready slot and turn it into a persistent slot if
                                                    it's ready for sync.

                                                    True, but eventually the API should be able to directly create the
                                                    persistent slots and anyway this can happen only for the first time
                                                    (till the slots are created and marked persistent) and one who wants
                                                    to use this function periodically should be able to see regular syncs.
                                                    I agree that we remove temp-and-synced slots created via the API at
                                                    the end of the API . We end up creating and dropping slots in every
                                                    API call but since the pgsyncreplication_slots() function is a kind
                                                    of debug-purpose function and it will not be common to call this
                                                    function regularly instead of using the slot sync worker, we can live
                                                    with such overhead.
                                                    Regards,
                                                    --
                                                    Masahiko Sawada
                                                    Amazon Web Services: https://aws.amazon.com
                                                  • Jump to comment-1
                                                    shveta malik<shveta.malik@gmail.com>
                                                    Apr 24, 2024, 4:58 AM UTC
                                                    On Tue, Apr 23, 2024 at 9:07 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
                                                    On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
                                                    On Mon, Apr 22, 2024 at 9:02 PM shveta malik <shveta.malik@gmail.com> wrote:

                                                    Thanks for the patch, the changes look good Amit. Please find the merged patch.

                                                    I've reviewed the patch and have some comments:
                                                    Thanks for the comments.
                                                    ---
                                                    /*
                                                    - * Early initialization.
                                                    + * Register slotsyncworkeronexit() before we register
                                                    + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
                                                    + * exit of the slot sync worker, ReplicationSlotShmemExit() is called
                                                    + * first, followed by slotsyncworkeronexit(). The startup process during
                                                    + * promotion invokes ShutDownSlotSync() which waits for slot sync to
                                                    + * finish and it does that by checking the 'syncing' flag. Thus worker
                                                    + * must be done with the slots' release and cleanup before it marks itself
                                                    + * as finished syncing.
                                                    */

                                                    I'm slightly worried that we register the slotsyncworkeronexit()
                                                    callback before BaseInit(), because it could be a blocker when we want
                                                    to add more work in the callback, for example sending the stats.

                                                    The other possibility is that we do slot release/clean up in the
                                                    slotsyncworkeronexit() call itself and then we can do it after
                                                    BaseInit(). Do you have any other/better idea for this?
                                                    I have currently implemented it this way in v11.
                                                    ---
                                                    synchronize_slots(wrconn);
                                                    +
                                                    + / Cleanup the temporary slots /
                                                    + ReplicationSlotCleanup();
                                                    +
                                                    + / We are done with sync, so reset sync flag /
                                                    + resetsyncingflag();

                                                    I think it ends up removing other temp slots that are created by the
                                                    same backend process using
                                                    pgcreate{physical,logicalreplicationslots() function, which could
                                                    be a large side effect of this function for users.
                                                    Yes, this is a problem. Thanks for catching it.

                                                    True, I think here we should either remove only temporary and synced
                                                    marked slots. The other possibility is to create slots as RS_EPHEMERAL
                                                    initially when called from the SQL function but that doesn't sound
                                                    like a neat approach.
                                                    Modified the logic to remove only synced temporary slots during
                                                    SQL-function exit.
                                                    Please find v11 with above changes.
                                                    thanks
                                                    Shveta
                                                    • Jump to comment-1
                                                      Amit Kapila<amit.kapila16@gmail.com>
                                                      Apr 25, 2024, 10:52 AM UTC
                                                      On Wed, Apr 24, 2024 at 10:28 AM shveta malik <shveta.malik@gmail.com> wrote:

                                                      Modified the logic to remove only synced temporary slots during
                                                      SQL-function exit.

                                                      Please find v11 with above changes.
                                                      LGTM, so pushed!
                                                      --
                                                      With Regards,
                                                      Amit Kapila.
                                          • Jump to comment-1
                                            Zhijie Hou (Fujitsu)<houzj.fnst@fujitsu.com>
                                            Apr 22, 2024, 12:31 AM UTC
                                            On Friday, April 19, 2024 4:22 PM shveta malik <shveta.malik@gmail.com> wrote:
                                            On Fri, Apr 19, 2024 at 11:37 AM shveta malik <shveta.malik@gmail.com> wrote:

                                            On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
                                            <bertranddrouvot.pg@gmail.com> wrote:

                                            Hi,

                                            On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
                                            Please find v8 attached. Changes are:

                                            Thanks!

                                            A few comments:

                                            Thanks for reviewing.
                                            1 ===

                                            @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data,
                                            sizet startupdata_len)
                                            * slotsyncworkeronexit() but that will need the connection to be
                                            made
                                            * global and we want to avoid introducing global for this purpose.
                                            */
                                            - beforeshmemexit(slotsyncfailurecallback,
                                            PointerGetDatum(wrconn));
                                            + beforeshmemexit(slotsyncworkerdisconnect,
                                            + PointerGetDatum(wrconn));

                                            The comment above this change still states "Register the failure
                                            callback once we have the connection", I think it has to be reworded
                                            a bit now that v8 is making use of slotsyncworkerdisconnect().

                                            2 ===

                                            + * Register slotsyncworkeronexit() before we register
                                            + * ReplicationSlotShmemExit() in BaseInit(), to ensure that during
                                            exit of
                                            + * slot sync worker, ReplicationSlotShmemExit() is called first,
                                            followed
                                            + * by slotsyncworkeronexit(). Startup process during
                                            + promotion waits for

                                            Worth to mention in shmem_exit() (where it "while
                                            (--beforeshmemexit_index >= 0)"
                                            or before the shmem_exit() definition) that ReplSlotSyncWorkerMain()
                                            relies on this LIFO behavior? (not sure if there is other "strong"
                                            LIFO requirement in other part of the code).

                                            I see other modules as well relying on LIFO behavior.
                                            Please see applyparallelworker.c where
                                            'beforeshmemexit(pa_shutdown)' is needed to be done after
                                            'beforeshmemexit(logicalrepworkeronexit)' (commit id 3d144c6).
                                            Also in postinit.c, I see such comments atop
                                            'beforeshmemexit(ShutdownPostgres, 0)'.
                                            I feel we can skip adding this specific comment about
                                            ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules has
                                            also not added any. I will address the rest of your comments in the
                                            next version.
                                            3 ===

                                            + * Startup process during promotion waits for slot sync to finish
                                            and it
                                            + * does that by checking the 'syncing' flag.

                                            worth to mention ShutDownSlotSync()?

                                            4 ===

                                            I did a few tests manually (launching ShutDownSlotSync() through gdb
                                            / with and without sync worker and with / without
                                            pgsyncreplication_slots() running
                                            concurrently) and it looks like it works as designed.

                                            Thanks for testing it.
                                            Having said that, the logic that is in place to take care of the
                                            corner cases described up-thread seems reasonable to me.

                                            Please find v9 with the above comments addressed.
                                            Thanks, the patch looks good to me. I also tested a few concurrent
                                            promotion/function execution cases and didn't find issues.
                                            Best Regards,
                                            Hou zj
                    • Jump to comment-1
                      Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                      Apr 15, 2024, 12:51 PM UTC
                      Hi,
                      On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
                      On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                      On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:

                      Please find v3 addressing race-condition and one other comment.

                      Up-thread it was suggested that, probably, checking
                      SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
                      re-thinking, it might not be. Slot sync worker sets and resets
                      'syncing' with each sync-cycle, and thus we need to rely on worker's
                      pid in ShutDownSlotSync(), as there could be a window where promotion
                      is triggered and 'syncing' is not set for worker, while the worker is
                      still running. This implementation of setting and resetting syncing
                      with each sync-cycle looks better as compared to setting syncing
                      during the entire life-cycle of the worker. So, I did not change it.

                      To retain this we need to have different handling for 'syncing' for
                      workers and function which seems like more maintenance burden than the
                      value it provides. Moreover, in SyncReplicationSlots(), we are calling
                      a function after acquiring spinlock which is not our usual coding
                      practice.

                      Okay. Changed it to consistent handling.
                      Thanks for the patch!
                      Now both worker and SQL
                      function set 'syncing' when they start and reset it when they exit.
                      It means that it's not possible anymore to trigger a manual sync if
                      syncreplicationslots is on. Indeed that would trigger:
                      postgres=# select pgsyncreplication_slots();
                      ERROR: cannot synchronize replication slots concurrently
                      That looks like an issue to me, thoughts?
                      Regards,
                      --
                      Bertrand Drouvot
                      PostgreSQL Contributors Team
                      RDS Open Source Databases
                      Amazon Web Services: https://aws.amazon.com
                      • Jump to comment-1
                        Amit Kapila<amit.kapila16@gmail.com>
                        Apr 15, 2024, 1:00 PM UTC
                        On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
                        <bertranddrouvot.pg@gmail.com> wrote:

                        On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
                        On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
                        On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:

                        Please find v3 addressing race-condition and one other comment.

                        Up-thread it was suggested that, probably, checking
                        SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
                        re-thinking, it might not be. Slot sync worker sets and resets
                        'syncing' with each sync-cycle, and thus we need to rely on worker's
                        pid in ShutDownSlotSync(), as there could be a window where promotion
                        is triggered and 'syncing' is not set for worker, while the worker is
                        still running. This implementation of setting and resetting syncing
                        with each sync-cycle looks better as compared to setting syncing
                        during the entire life-cycle of the worker. So, I did not change it.

                        To retain this we need to have different handling for 'syncing' for
                        workers and function which seems like more maintenance burden than the
                        value it provides. Moreover, in SyncReplicationSlots(), we are calling
                        a function after acquiring spinlock which is not our usual coding
                        practice.

                        Okay. Changed it to consistent handling.

                        Thanks for the patch!
                        Now both worker and SQL
                        function set 'syncing' when they start and reset it when they exit.

                        It means that it's not possible anymore to trigger a manual sync if
                        syncreplicationslots is on. Indeed that would trigger:

                        postgres=# select pgsyncreplication_slots();
                        ERROR: cannot synchronize replication slots concurrently

                        That looks like an issue to me, thoughts?
                        This is intentional as of now for the sake of keeping
                        implementation/code simple. It is not difficult to allow them but I am
                        not sure whether we want to add another set of conditions allowing
                        them in parallel. And that too in an unpredictable way as the API will
                        work only for the time slot sync worker is not performing the sync.
                        --
                        With Regards,
                        Amit Kapila.
                        • Jump to comment-1
                          Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                          Apr 15, 2024, 2:17 PM UTC
                          Hi,
                          On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
                          On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
                          <bertranddrouvot.pg@gmail.com> wrote:

                          On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
                          Now both worker and SQL
                          function set 'syncing' when they start and reset it when they exit.

                          It means that it's not possible anymore to trigger a manual sync if
                          syncreplicationslots is on. Indeed that would trigger:

                          postgres=# select pgsyncreplication_slots();
                          ERROR: cannot synchronize replication slots concurrently

                          That looks like an issue to me, thoughts?

                          This is intentional as of now for the sake of keeping
                          implementation/code simple. It is not difficult to allow them but I am
                          not sure whether we want to add another set of conditions allowing
                          them in parallel.
                          I think that the ability to launch a manual sync before a switchover would be
                          missed. Except for this case I don't think that's an issue to prevent them to
                          run in parallel.
                          And that too in an unpredictable way as the API will
                          work only for the time slot sync worker is not performing the sync.
                          Yeah but then at least you would know that there is "really" a sync in progress
                          (which is not the case currently with v4, as the sync worker being started is
                          enough to prevent a manual sync even if a sync is not in progress).
                          Regards,
                          --
                          Bertrand Drouvot
                          PostgreSQL Contributors Team
                          RDS Open Source Databases
                          Amazon Web Services: https://aws.amazon.com
                          • Jump to comment-1
                            Amit Kapila<amit.kapila16@gmail.com>
                            Apr 16, 2024, 2:51 AM UTC
                            On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
                            <bertranddrouvot.pg@gmail.com> wrote:

                            On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
                            On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
                            <bertranddrouvot.pg@gmail.com> wrote:

                            On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
                            Now both worker and SQL
                            function set 'syncing' when they start and reset it when they exit.

                            It means that it's not possible anymore to trigger a manual sync if
                            syncreplicationslots is on. Indeed that would trigger:

                            postgres=# select pgsyncreplication_slots();
                            ERROR: cannot synchronize replication slots concurrently

                            That looks like an issue to me, thoughts?

                            This is intentional as of now for the sake of keeping
                            implementation/code simple. It is not difficult to allow them but I am
                            not sure whether we want to add another set of conditions allowing
                            them in parallel.

                            I think that the ability to launch a manual sync before a switchover would be
                            missed. Except for this case I don't think that's an issue to prevent them to
                            run in parallel.
                            I think if the slotsync worker is available, it can do that as well.
                            There is no clear use case for allowing them in parallel and I feel it
                            would add more confusion when it can work sometimes but not other
                            times. However, if we receive some report from the field where there
                            is a real demand for such a thing, it should be easy to achieve. For
                            example, I can imagine that we can have sync_state that has values
                            'started', 'in_progress' , and 'finished'. This should allow us to
                            achieve what the current proposed patch is doing along with allowing
                            the API to work in parallel when the syncstate is not 'inprogress'.
                            I think for now let's restrict their usage in parallel and make the
                            promotion behavior consistent both for worker and API.
                            --
                            With Regards,
                            Amit Kapila.
                            • Jump to comment-1
                              Bertrand Drouvot<bertranddrouvot.pg@gmail.com>
                              Apr 16, 2024, 6:33 AM UTC
                              Hi,
                              On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
                              On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
                              <bertranddrouvot.pg@gmail.com> wrote:

                              On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
                              On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
                              <bertranddrouvot.pg@gmail.com> wrote:

                              On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
                              Now both worker and SQL
                              function set 'syncing' when they start and reset it when they exit.

                              It means that it's not possible anymore to trigger a manual sync if
                              syncreplicationslots is on. Indeed that would trigger:

                              postgres=# select pgsyncreplication_slots();
                              ERROR: cannot synchronize replication slots concurrently

                              That looks like an issue to me, thoughts?

                              This is intentional as of now for the sake of keeping
                              implementation/code simple. It is not difficult to allow them but I am
                              not sure whether we want to add another set of conditions allowing
                              them in parallel.

                              I think that the ability to launch a manual sync before a switchover would be
                              missed. Except for this case I don't think that's an issue to prevent them to
                              run in parallel.

                              I think if the slotsync worker is available, it can do that as well.
                              Right, but one has no control as to when the sync is triggered.
                              There is no clear use case for allowing them in parallel and I feel it
                              would add more confusion when it can work sometimes but not other
                              times. However, if we receive some report from the field where there
                              is a real demand for such a thing, it should be easy to achieve. For
                              example, I can imagine that we can have sync_state that has values
                              'started', 'in_progress' , and 'finished'. This should allow us to
                              achieve what the current proposed patch is doing along with allowing
                              the API to work in parallel when the syncstate is not 'inprogress'.

                              I think for now let's restrict their usage in parallel and make the
                              promotion behavior consistent both for worker and API.
                              Okay, let's do it that way. Is it worth to add a few words in the doc related to
                              pgsyncreplication_slots() though? (to mention it can not be used if the sync
                              slot worker is running).
                              Regards,
                              --
                              Bertrand Drouvot
                              PostgreSQL Contributors Team
                              RDS Open Source Databases
                              Amazon Web Services: https://aws.amazon.com
                              • Jump to comment-1
                                shveta malik<shveta.malik@gmail.com>
                                Apr 16, 2024, 8:22 AM UTC
                                On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
                                <bertranddrouvot.pg@gmail.com> wrote:
                                I think for now let's restrict their usage in parallel and make the
                                promotion behavior consistent both for worker and API.

                                Okay, let's do it that way. Is it worth to add a few words in the doc related to
                                pgsyncreplication_slots() though? (to mention it can not be used if the sync
                                slot worker is running).
                                +1. Please find v6 having the suggested doc changes.
                                thanks
                                Shveta
                              • Jump to comment-1
                                Amit Kapila<amit.kapila16@gmail.com>
                                Apr 16, 2024, 6:38 AM UTC
                                On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
                                <bertranddrouvot.pg@gmail.com> wrote:

                                On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
                                There is no clear use case for allowing them in parallel and I feel it
                                would add more confusion when it can work sometimes but not other
                                times. However, if we receive some report from the field where there
                                is a real demand for such a thing, it should be easy to achieve. For
                                example, I can imagine that we can have sync_state that has values
                                'started', 'in_progress' , and 'finished'. This should allow us to
                                achieve what the current proposed patch is doing along with allowing
                                the API to work in parallel when the syncstate is not 'inprogress'.

                                I think for now let's restrict their usage in parallel and make the
                                promotion behavior consistent both for worker and API.

                                Okay, let's do it that way. Is it worth to add a few words in the doc related to
                                pgsyncreplication_slots() though? (to mention it can not be used if the sync
                                slot worker is running).
                                Yes, this makes sense to me.
                                --
                                With Regards,
                                Amit Kapila.