[PATCH] Batched clock sweep to reduce cross-socket atomic contention

  • Jump to comment-1
    Greg Burd<greg@burd.me>
    Apr 25, 2026, 8:09 PM UTC
    Hello hackers,
    A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior on some of the newer AWS bare-metal instance types (r8i in particular, which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the process landed on a very small change to freelist.c that I think is worth showing around. His patch is attached with some tweaks of my own.
    Full disclosure: the exploration that led Jim to this patch idea was done with help from an AI assistant (Kiro); the idea, the benchmarking, and the final shape of the patch are human-driven, but I wanted to be up front about how his investigation started. Happy to discuss that separately if people want to.
    The one-line summary: instead of advancing nextVictimBuffer one buffer at a time via pgatomicfetchaddu32, each backend claims a batch of 64 consecutive buffer IDs from the shared hand and then iterates them privately. Global sweep order is preserved -- every buffer is still
    visited exactly once per complete pass -- but the atomic contention on that one cache line drops by roughly the batch size.
    Why this matters
    ----------------
    On multi-socket boxes under eviction pressure, every backend that needs a victim buffer ends up CAS'ing the same cache line. On a single socket, a locked RMW on that cache line stays warm in L1/L2 and completes in ~20ns. On 2+ sockets, the line bounces over QPI/UPI at ~100-200ns per op, and with hundreds of backends running StrategyGetBuffer() concurrently, the line ping-pongs constantly. It's a textbook NUMA scalability bottleneck, and once shared_buffers is smaller than the working set and the sweep is running continuously, that single atomic is what you hit in a perf profile (elevated bus-cycles, cache-misses on the cache line holding nextVictimBuffer).
    Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas called it out in the "Adding basic NUMA awareness" thread [1] -- so this isn't news to anyone who's been looking at this area. What I think is new is a fix that's just this, without any of the surrounding architectural change.
    The framing (credit to Jim): the clock hand is doing two jobs. It coordinates backends so they don't redundantly decrement usage_count on the same buffers and so they eventually visit every buffer in the pool exactly once per pass. It also serializes access to the counter. Coordination is the part we want. Serialization is the part that's killing us on bigger NUMA boxes. Batching keeps the coordination and thins out the serialization.
    How it works
    ------------
    Two per-backend statics, MyBatchPos and MyBatchEnd. When a backend calls ClockSweepTick() and its local batch is exhausted, it does a single fetch-add of CLOCKSWEEPBATCH_SIZE (64) against nextVictimBuffer and now owns that range. Subsequent ticks just bump the local counter.
    Wraparound got a small rewrite. The original code had the backend that crossed NBuffers drive completePasses++ under the spinlock via a CAS loop. With batching, multiple backends can each land a fetch-add that returns a value >= NBuffers in the same pass, so the logic now is: whoever sees a start >= NBuffers takes the spinlock, re-reads the counter, and if it's still out of range does a single CAS to wrap it and bumps completePasses. If somebody else already wrapped, we just release and move on. StrategySyncStart() still sees a consistent (nextVictimBuffer, completePasses) pair.
    The batch size is gated on whether we actually have multiple NUMA nodes. On a single-socket box the atomic is already socket-local, batching just makes backends skip further ahead than they need to, so we fall back to batch size 1 -- which is bit-for-bit the original behavior. The guard:
    if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
        ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
    else
        ClockSweepBatchSize = 1;
    Min() against NBuffers covers the small-shared_buffers corner so a batch never wraps the pool multiple times in one claim.
    Does batching mess up the meaning of usage_count?
    --------------------------------------------------
    Short answer: no. I want to walk through this because it was my first concern too, and I think it's the question that will come up most on review.
    The clock sweep's usagecount is an access-frequency approximation measured in units of *complete passes*. A buffer with usagecount = N survives N passes without a re-pin. The semantic meaning lives at pass granularity, not at individual-buffer granularity.
    What batching changes: intra-pass temporal ordering. Without batching, with N backends sweeping, decrements are interleaved -- backend A hits B[0], backend B hits B[1], backend C hits B[2]. With batching, backend A hits B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The 64-buffer chunks are decremented in bursts rather than individually.
    Why it doesn't matter:
    1. Every buffer still gets decremented exactly once per complete
     pass.  The invariant the algorithm actually depends on is
     untouched.
    
    2. A buffer's survival window is the time between consecutive
     passes.  That's milliseconds to seconds under load.  Whether
     B[0] gets decremented 50us before or 50us after B[63] within
     the same pass is below the resolution of anything usage_count
     is trying to measure.
    
    3. The bgwriter's feedback loop reads (nextVictimBuffer,
     completePasses, numBufferAllocs) via StrategySyncStart() every
     ~200ms.  nextVictimBuffer still advances at the same *total*
     rate (64 per atomic op, but atomic ops happen 1/64 as often).
     The position it reports can jitter by up to 64 buffers relative
     to the one-at-a-time case, but BgBufferSync()'s smoothed
     estimates operate over thousands of buffers per cycle, so the
     jitter disappears into the averaging.  numBufferAllocs still
     increments once per allocation.  strategy_delta,
     smoothed_alloc, smoothed_density, reusable_buffers_est -- all
     unaffected in any way I can see.
    Table form, because it's easier to argue with:
    PropertyUnpatchedBatched
    Buffers visited per passNBuffersNBuffers
    Decrements per buffer per pass11
    Eviction thresholdusage_count==0usage_count==0
    Max survival (passes)66
    Decrement ordering within a passinterleavedchunked
    bgwriter allocation rate signalaccurateaccurate
    Cross-socket atomic traffic1 per buffer1 per 64
    There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
    doesn't show up where it would hurt.
    There's also a small positive side-effect: cache locality. The backend that just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in L1/L2. Walking B[0..63] locally is cheaper than walking a striped interleaving where each descriptor was last touched by a different core. I haven't tried to isolate this in perf, but it falls out naturally.
    Benchmarks
    ----------
    Jim ran these; I'm still working on reproducing them locally and will post independent numbers in a follow-up. All bare metal, Linux, huge pages enabled throughout (more on that below), postmaster pinned to node 0 with `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K depending on which node the postmaster happened to land on at launch -- worth flagging for anyone trying to reproduce.
    Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the working set always spills and the sweep is hot.
    r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):
    
    pgbench RO:
      Clients   Stock    Patched   Delta
      64        31,457   36,353    +16%
      128       31,678   37,864    +20%
      256       31,510   37,558    +19%
      384       31,431   37,464    +19%
      512       31,329   37,040    +18%
    
    pgbench RW:
      Clients   Stock    Patched   Delta
      64         7,685    7,713     0%
      128       10,420   10,541    +1%
      256       12,393   12,463    +1%
      384       15,317   15,197    -1%
      512       17,930   17,978     0%
    
    m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
    RO +19-20%, RW within noise.
    
    c8i.metal-48xl (192 vCPUs, 1 socket):
    Single-socket -> batch_size=1 -> original code path.  No
    behavioral change.  (I double-checked this one specifically
    because it's the sanity test for the gate.)
    
    HammerDB TPC-C on m6i.metal (1000 warehouses):
    VUs   Stock     Patched   Delta
    128   358,518   349,787    -2%
    256   332,098   330,272    -1%
    384   365,782   377,519    +3%
    512   370,663   386,526    +4%
    No TPC-C regression, which was the thing we were most worried about. An earlier attempt (per-socket partitioned sweep, see below) was -13% on this same workload.
    The general shape is: the scaling curve flattens later. Unpatched, TPS tops out around 128 clients and stays flat up to 512 because backends are spending cycles waiting on the cache line rather than
    doing work. Patched, the curve keeps rising past the point where unpatched plateaus.
    Huge pages caveat: all of the above was run with huge pages on, on large-memory instances (the r8i.96xl has 3TB, so Jim never considered running without them). We have not characterized the non-huge-pages case. That's on my list; I don't expect it to change the conclusion, but I shouldn't speak for data I haven't collected.
    Relationship to Tomas's NUMA series
    -----------------------------------
    Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.
    One thing I found striking as I re-read the thread: in the benchmarks Tomas posted later in the series, most of the benefit comes from partitioning the clock sweep, and the NUMA memory-placement layer on top sometimes runs slower than partitioning alone. His own conclusion, quoted roughly: the benefit mostly comes from just partitioning the clock sweep, and it's largely independent of the NUMA stuff; the NUMA partitioning is often slower.
    That observation is the thing that makes me think batching is worth considering on its own. It's going after the same bottleneck Tomas's partitioning addresses, but:
    - without splitting global eviction visibility (which is where
    cross-partition stealing gets complicated),
    - without requiring NUMA-aware buffer placement (which has huge
    page alignment, descriptor-partition-mid-page, and resize
    complications that are still being worked out in that thread),
    - without touching PGPROC or bgwriter.
    What this patch does not do:
    - place buffers on specific NUMA nodes
    - partition the freelist
    - touch PGPROC
    - add new GUCs
    - change bgwriter
    What this patch does do:
    - target exactly the clock-sweep contention that Tomas's
    partitioning targets, and reduce it by ~64x, in ~30 lines.
    If Tomas's series lands in full, this patch becomes redundant for its primary use case (though even within a partitioned sweep, the per-partition atomic still benefits from batching, so it's arguably a useful primitive either way). If Tomas's series lands incrementally over several cycles -- which the open items in that thread suggest is the realistic path -- this gets us a real chunk of the multi-socket win now.
    This patch is also orthogonal to my earlier thread about removing the freelist entirely [2], but given the proximity to that code Jim agreed that I could propose/steward it here on the list for consideration.
    Open questions / things I'd like feedback on
    --------------------------------------------
    - Batch size. 64 is a round number that worked well in testing, but
    Nathan raised the reasonable point that on small shared_buffers
    with high concurrency, a fixed 64 could be unfortunate.  Options:
    scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
    with max_connections, keep it fixed but let operators tune it, or
    make it a function of NUMA node count.  I don't have a strong
    opinion yet; the Min(batch, NBuffers) cap covers the "obviously
    wrong" corner but doesn't speak to the "several hundred backends
    on a few-MB shared_buffers" shape.  Numbers/ideas/proposals welcome.
    - NUMA detection. The gate uses pgnumainit() /
    pg_numa_get_max_node().  On systems where libnuma isn't available,
    or where get_mempolicy is blocked (some container configurations),
    we fall back to batch size 1.  That's safe but it misses the
    "single socket, many cores, still benefits from fewer atomics"
    case.  Might be worth a way to force-enable, or batching on all
    systems with a smaller batch size when single-socket.  I'd like to
    measure before deciding.
    - Eviction pattern on reads. Nathan also flagged that with batching,
    the buffers a backend ends up pinning in one StrategyGetBuffer()
    call will tend to be contiguous in buffer-id space rather than
    scattered, which is a different allocation pattern than today.
    The usage_count analysis above says this is benign, but if anyone
    has an intuition for a workload where this would be observable
    (e.g., something that cares about the mapping between buffer-id
    and relation locality), I'd like to hear it.
    - nextVictimBuffer wraparound. The current code has a mild overflow
    concern papered over with "highly unlikely and wouldn't be
    particularly harmful".  With batching this is no worse than before,
    but if we're already touching this function, it might be worth
    thinking about whether to tighten it up in the same patch or a
    follow-up.
    - Should the non-NUMA value for this be derived from core counts that
    imply L1/L2 cache layouts or simply default to 8 rather than 1 to
    realize some benefit?
    - Should there be a postgresql.conf setting for this that takes
    precedence?
    I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at the small-shared_buffers corner, and post perf stat output showing the atomic/cache-miss deltas over the next few days. In the meantime, eyeballs and skepticism welcome -- I would especially welcome comments from Andres, who's been in this code recently, and from Tomas, whose series has the most overlap.
    I realize that we're past feature freeze and working on release notes for v19, so the chances of merging this are slim to none. I think this could be considered a "performance bug fix for NUMA systems" in this release, but that is stretching it a bit. It is a big ask at this stage to land a change like this.
    best.
    -greg
    [1] https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
    [2] https://www.postgresql.org/message-id/f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me
    • Jump to comment-1
      Greg Burd<greg@burd.me>
      Apr 27, 2026, 1:27 PM UTC
      On Sat, Apr 25, 2026, at 4:08 PM, Greg Burd wrote:
      Hello hackers,
      Hi again, attached is v2:
      0001 - unchanged, batches clock-sweep to reduce contention
      0002 - changed ComputeClockBatchSize() such that non-NUMA multi-core systems use batches as well and no longer default to batch size 1
      Details below...
      A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior
      on some of the newer AWS bare-metal instance types (r8i in particular,
      which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the
      process landed on a very small change to freelist.c that I think is
      worth showing around. His patch is attached with some tweaks of my own.

      Full disclosure: the exploration that led Jim to this patch idea was
      done with help from an AI assistant (Kiro); the idea, the benchmarking,
      and the final shape of the patch are human-driven, but I wanted to be
      up front about how his investigation started. Happy to discuss that
      separately if people want to.

      The one-line summary: instead of advancing nextVictimBuffer one buffer
      at a time via pgatomicfetchaddu32, each backend claims a batch of
      64 consecutive buffer IDs from the shared hand and then iterates them
      privately. Global sweep order is preserved -- every buffer is still
      visited exactly once per complete pass -- but the atomic contention on
      that one cache line drops by roughly the batch size.


      Why this matters
      ----------------

      On multi-socket boxes under eviction pressure, every backend that needs
      a victim buffer ends up CAS'ing the same cache line. On a single
      socket, a locked RMW on that cache line stays warm in L1/L2 and
      completes in ~20ns. On 2+ sockets, the line bounces over QPI/UPI at
      ~100-200ns per op, and with hundreds of backends running
      StrategyGetBuffer() concurrently, the line ping-pongs constantly. It's
      a textbook NUMA scalability bottleneck, and once shared_buffers is
      smaller than the working set and the sweep is running continuously,
      that single atomic is what you hit in a perf profile (elevated
      bus-cycles, cache-misses on the cache line holding nextVictimBuffer).

      Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas
      called it out in the "Adding basic NUMA awareness" thread [1] -- so
      this isn't news to anyone who's been looking at this area. What I
      think is new is a fix that's just this, without any of the surrounding
      architectural change.

      The framing (credit to Jim): the clock hand is doing two jobs. It
      coordinates backends so they don't redundantly decrement usage_count
      on the same buffers and so they eventually visit every buffer in the
      pool exactly once per pass. It also serializes access to the
      counter. Coordination is the part we want. Serialization is the part
      that's killing us on bigger NUMA boxes. Batching keeps the
      coordination and thins out the serialization.


      How it works
      ------------

      Two per-backend statics, MyBatchPos and MyBatchEnd. When a backend
      calls ClockSweepTick() and its local batch is exhausted, it does a
      single fetch-add of CLOCKSWEEPBATCH_SIZE (64) against
      nextVictimBuffer and now owns that range. Subsequent ticks just bump
      the local counter.

      Wraparound got a small rewrite. The original code had the backend that
      crossed NBuffers drive completePasses++ under the spinlock via a CAS
      loop. With batching, multiple backends can each land a fetch-add that
      returns a value >= NBuffers in the same pass, so the logic now is:
      whoever sees a start >= NBuffers takes the spinlock, re-reads the
      counter, and if it's still out of range does a single CAS to wrap it
      and bumps completePasses. If somebody else already wrapped, we just
      release and move on. StrategySyncStart() still sees a consistent
      (nextVictimBuffer, completePasses) pair.

      The batch size is gated on whether we actually have multiple NUMA
      nodes. On a single-socket box the atomic is already socket-local,
      batching just makes backends skip further ahead than they need to, so
      we fall back to batch size 1 -- which is bit-for-bit the original
      behavior. The guard:
      if (pgnumainit() != -1 && pgnumagetmaxnode() >= 1)
      ClockSweepBatchSize = Min(CLOCKSWEEPBATCH_SIZE, (uint32) NBuffers);
      else
      ClockSweepBatchSize = 1;

      Min() against NBuffers covers the small-shared_buffers corner so a
      batch never wraps the pool multiple times in one claim.
      Thinking more about this approach led me to believe that this non-NUMA default is wrong and induces overhead for a very common case.
      Does batching mess up the meaning of usage_count?
      --------------------------------------------------

      Short answer: no. I want to walk through this because it was my first
      concern too, and I think it's the question that will come up most on
      review.

      The clock sweep's usage_count is an access-frequency approximation
      measured in units of complete passes. A buffer with usage_count = N
      survives N passes without a re-pin. The semantic meaning lives at pass
      granularity, not at individual-buffer granularity.

      What batching changes: intra-pass temporal ordering. Without batching,
      with N backends sweeping, decrements are interleaved -- backend A hits
      B[0], backend B hits B[1], backend C hits B[2]. With batching, backend
      A hits B[0..63] in a tight local burst, then backend B hits B[64..127],
      etc. The 64-buffer chunks are decremented in bursts rather than
      individually.

      Why it doesn't matter:

      1. Every buffer still gets decremented exactly once per complete
      pass. The invariant the algorithm actually depends on is
      untouched.

      2. A buffer's survival window is the time between consecutive
      passes. That's milliseconds to seconds under load. Whether
      B[0] gets decremented 50us before or 50us after B[63] within
      the same pass is below the resolution of anything usage_count
      is trying to measure.

      3. The bgwriter's feedback loop reads (nextVictimBuffer,
      completePasses, numBufferAllocs) via StrategySyncStart() every
      ~200ms. nextVictimBuffer still advances at the same total
      rate (64 per atomic op, but atomic ops happen 1/64 as often).
      The position it reports can jitter by up to 64 buffers relative
      to the one-at-a-time case, but BgBufferSync()'s smoothed
      estimates operate over thousands of buffers per cycle, so the
      jitter disappears into the averaging. numBufferAllocs still
      increments once per allocation. strategy_delta,
      smoothedalloc, smootheddensity, reusablebuffersest -- all
      unaffected in any way I can see.

      Table form, because it's easier to argue with:

      Property | Unpatched | Batched
      ----------------------------------+----------------+----------------
      Buffers visited per pass | NBuffers | NBuffers
      Decrements per buffer per pass | 1 | 1
      Eviction threshold | usagecount==0 | usagecount==0
      Max survival (passes) | 6 | 6
      Decrement ordering within a pass | interleaved | chunked
      bgwriter allocation rate signal | accurate | accurate
      Cross-socket atomic traffic | 1 per buffer | 1 per 64

      There is one subtle difference worth naming. When a backend finds a
      victim at B[5] of its batch, it returns with MyBatchEnd still sitting
      at B[63]. The next time that backend needs a victim it resumes at
      B[6], not at wherever the global hand now points. So the backend
      drains its batch over multiple StrategyGetBuffer() calls rather than
      all at once. Under heavy load, where batches are consumed in
      microseconds, this is invisible. Under light load, the implication is
      that some buffers can sit with slightly stale usage_count for longer
      than they would have before. But "light load" means "the sweep is
      barely moving and nothing wants to evict anyway" -- so the effect
      doesn't show up where it would hurt.

      There's also a small positive side-effect: cache locality. The backend
      that just touched BufferDescriptor[B[0]] has the adjacent descriptors
      warm in L1/L2. Walking B[0..63] locally is cheaper than walking a
      striped interleaving where each descriptor was last touched by a
      different core. I haven't tried to isolate this in perf, but it falls
      out naturally.


      Benchmarks
      ----------

      Jim ran these; I'm still working on reproducing them locally and will
      post independent numbers in a follow-up. All bare metal, Linux, huge
      pages enabled throughout (more on that below), postmaster pinned to
      node 0 with `numactl --cpunodebind=0` because otherwise stock TPS
      varied from 31K to 40K depending on which node the postmaster happened
      to land on at launch -- worth flagging for anyone trying to reproduce.

      Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
      working set always spills and the sweep is hot.

      r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):

      pgbench RO:
      Clients Stock Patched Delta
      64 31,457 36,353 +16%
      128 31,678 37,864 +20%
      256 31,510 37,558 +19%
      384 31,431 37,464 +19%
      512 31,329 37,040 +18%

      pgbench RW:
      Clients Stock Patched Delta
      64 7,685 7,713 0%
      128 10,420 10,541 +1%
      256 12,393 12,463 +1%
      384 15,317 15,197 -1%
      512 17,930 17,978 0%

      m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
      RO +19-20%, RW within noise.

      c8i.metal-48xl (192 vCPUs, 1 socket):
      Single-socket -> batch_size=1 -> original code path. No
      behavioral change. (I double-checked this one specifically
      because it's the sanity test for the gate.)

      HammerDB TPC-C on m6i.metal (1000 warehouses):
      VUs Stock Patched Delta
      128 358,518 349,787 -2%
      256 332,098 330,272 -1%
      384 365,782 377,519 +3%
      512 370,663 386,526 +4%

      No TPC-C regression, which was the thing we were most worried about. An
      earlier attempt (per-socket partitioned sweep, see below) was -13% on
      this same workload.

      The general shape is: the scaling curve flattens later. Unpatched, TPS
      tops out around 128 clients and stays flat up to 512 because backends
      are spending cycles waiting on the cache line rather than
      doing work. Patched, the curve keeps rising past the point where
      unpatched plateaus.

      Huge pages caveat: all of the above was run with huge pages on, on
      large-memory instances (the r8i.96xl has 3TB, so Jim never considered
      running without them). We have not characterized the non-huge-pages
      case. That's on my list; I don't expect it to change the conclusion,
      but I shouldn't speak for data I haven't collected.


      Relationship to Tomas's NUMA series
      -----------------------------------

      Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer
      interleaving across nodes, partitioned freelists, partitioned clock
      sweep, PGPROC interleaving, and related pieces. I want to be careful
      here because I don't think we should frame this patch as competing with
      that work.

      One thing I found striking as I re-read the thread: in the benchmarks
      Tomas posted later in the series, *most of the benefit comes from
      partitioning the clock sweep*, and the NUMA memory-placement layer on
      top sometimes runs slower than partitioning alone. His own conclusion,
      quoted roughly: the benefit mostly comes from just partitioning the
      clock sweep, and it's largely independent of the NUMA stuff; the NUMA
      partitioning is often slower.

      That observation is the thing that makes me think batching is worth
      considering on its own. It's going after the same bottleneck Tomas's
      partitioning addresses, but:

      - without splitting global eviction visibility (which is where
      cross-partition stealing gets complicated),
      - without requiring NUMA-aware buffer placement (which has huge
      page alignment, descriptor-partition-mid-page, and resize
      complications that are still being worked out in that thread),
      - without touching PGPROC or bgwriter.

      What this patch does not do:
      - place buffers on specific NUMA nodes
      - partition the freelist
      - touch PGPROC
      - add new GUCs
      - change bgwriter

      What this patch does do:
      - target exactly the clock-sweep contention that Tomas's
      partitioning targets, and reduce it by ~64x, in ~30 lines.

      If Tomas's series lands in full, this patch becomes redundant for its
      primary use case (though even within a partitioned sweep, the
      per-partition atomic still benefits from batching, so it's arguably a
      useful primitive either way). If Tomas's series lands incrementally
      over several cycles -- which the open items in that thread suggest is
      the realistic path -- this gets us a real chunk of the multi-socket win
      now.

      This patch is also orthogonal to my earlier thread about removing the
      freelist entirely [2], but given the proximity to that code Jim agreed
      that I could propose/steward it here on the list for consideration.


      Open questions / things I'd like feedback on
      --------------------------------------------

      - Batch size. 64 is a round number that worked well in testing, but
      Nathan raised the reasonable point that on small shared_buffers
      with high concurrency, a fixed 64 could be unfortunate. Options:
      scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
      with max_connections, keep it fixed but let operators tune it, or
      make it a function of NUMA node count. I don't have a strong
      opinion yet; the Min(batch, NBuffers) cap covers the "obviously
      wrong" corner but doesn't speak to the "several hundred backends
      on a few-MB shared_buffers" shape. Numbers/ideas/proposals welcome.

      - NUMA detection. The gate uses pgnumainit() /
      pgnumagetmaxnode(). On systems where libnuma isn't available,
      or where get_mempolicy is blocked (some container configurations),
      we fall back to batch size 1. That's safe but it misses the
      "single socket, many cores, still benefits from fewer atomics"
      case. Might be worth a way to force-enable, or batching on all
      systems with a smaller batch size when single-socket. I'd like to
      measure before deciding.

      - Eviction pattern on reads. Nathan also flagged that with batching,
      the buffers a backend ends up pinning in one StrategyGetBuffer()
      call will tend to be contiguous in buffer-id space rather than
      scattered, which is a different allocation pattern than today.
      The usage_count analysis above says this is benign, but if anyone
      has an intuition for a workload where this would be observable
      (e.g., something that cares about the mapping between buffer-id
      and relation locality), I'd like to hear it.

      - nextVictimBuffer wraparound. The current code has a mild overflow
      concern papered over with "highly unlikely and wouldn't be
      particularly harmful". With batching this is no worse than before,
      but if we're already touching this function, it might be worth
      thinking about whether to tighten it up in the same patch or a
      follow-up.

      - Should the non-NUMA value for this be derived from core counts that
      imply L1/L2 cache layouts or simply default to 8 rather than 1 to
      realize some benefit?
      So, I'm answering my own question here. Yes, it should. Ideas below.
      - Should there be a postgresql.conf setting for this that takes
      precedence?


      I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at
      the small-shared_buffers corner, and post perf stat output showing the
      atomic/cache-miss deltas over the next few days. In the meantime,
      eyeballs and skepticism welcome -- I would especially welcome comments
      from Andres, who's been in this code recently, and from Tomas, whose
      series has the most overlap.

      I realize that we're past feature freeze and working on release notes
      for v19, so the chances of merging this are slim to none. I think this
      could be considered a "performance bug fix for NUMA systems" in this
      release, but that is stretching it a bit. It is a big ask at this
      stage to land a change like this.

      best.

      -greg

      [1]
      https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
      [2]
      https://www.postgresql.org/message-id/f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me
      Attachments:
      * v1-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch
      ComputeClockBatchSize() has two phases: select a base batch from hardware topology, then cap it to prevent over-claiming.
      Phase 1: Base batch from topology
      int ncpus = pg_get_online_cpus();
      int numa_nodes = (pg_numa_init() != -1) ? pg_numa_get_max_node() + 1 : 1;
      
      if (numa_nodes > 1)        base_batch = 64;
      else if (ncpus > 16)       base_batch = 32;
      else if (ncpus > 8)        base_batch = 16;
      else if (ncpus > 4)        base_batch = 8;
      else                       base_batch = 1;
      The reasoning at each tier:
      - NUMA (multi-socket): Atomic ops cross the interconnect (QPI/UPI/Infinity
      Fabric). Round-trip latency is ~100-300ns vs ~10-40ns intra-socket.
      Batch=64 amortizes that heavily.
      
      - >16 cores, single socket: Still significant L3 contention, many cores
      competing for the same cache line. Batch=32 cuts atomic ops by 32x.
      
      - 9-16 cores: Moderate contention. Batch=16.
      
      - 5-8 cores: Light contention. Batch=8.
      
      - <=4 cores: Almost no contention. Batch=1 (no batching). The overhead of 
      batching logic isn't worth it, and there's a fairness tradeoff - batching
      means one backend "owns" a range of buffers temporarily, which matters 
      more when there are few buffers per backend.
      Phase 2: Cap to prevent over-claiming
      max_batch = (MaxBackends > 0)
        ? pool_nbuffers / (2 * MaxBackends)
        : pool_nbuffers / 200;
      if (max_batch < 1)
        max_batch = 1;
      
      return Min(base_batch, Min(max_batch, pool_nbuffers));
      The cap ensures that if every backend simultaneously claims a batch, the total claimed doesn't exceed half the pool:
      batch_size * MaxBackends <= pool_nbuffers / 2
      Why half? If all backends claimed the entire pool simultaneously, they'd each be sweeping overlapping ranges, thus wasting work and defeating the purpose. Keeping total claims under 50% of the pool means at any instant, at most half the buffers are "in flight" being evaluated by backends, and the other half are available for normal operation.
      For a small dynamic pool (say 4096 buffers with MaxBackends=200), the cap computes to 4096 / 400 = 10, which overrides any larger basebatch. For the default pool with sharedbuffers = 8GB (1M buffers) and MaxBackends=200, the cap is 1000000 / 400 = 2500 which is well above the max basebatch of 64, so the basebatch wins.
      The pool_nbuffers floor at the end handles the degenerate case of a pool smaller than the batch size.
      The Tradeoff
      Larger batches reduce atomic contention but increase sweep unevenness, one backend might sweep through "cold" buffers while another's batch happens to land on "hot" ones. The tiered approach balances this: batch aggressively only when the hardware topology makes contention the dominant cost (NUMA, many-core), and stay conservative on small systems where fairness matters more.
      I think this is better because:
      1. The original patch only batched on multi-socket NUMA systems. The new algorithm also provides atomic contention benefits on large single-socket systems (>16 cores) where L3 cache contention matters.
      2. Conservative on small systems: Systems with ≤4 cores get batch_size=1 (original behavior) since batching overhead outweighs contention benefits and fairness matters more.
      3. Prevents pathological over-claiming: The cap mechanism prevents scenarios where many backends claim huge batches relative to a small buffer pool.
      Based on the algorithm, here's what different systems would get:
          System         CPUs   NUMA      Total RAM   Shr Buf   Batch Size  Atomic Reduction                                                                          
      ==================  ====  ========  ===========  ========  ==========  ================                                                                           
      r8i.metal-96xl       384     multi      3072GB   2457.6GB          64    64x                                                                                      
      m6i.metal            128     multi       512GB    409.6GB          64    64x                                                                                      
      c8i.metal-48xl       192  1 socket       192GB    153.6GB          32    32x                                                                                      
      Large server          64     multi       256GB    204.8GB          64    64x
      Medium server         32  1 socket        64GB     51.2GB          32    32x                                                                                      
      Small server          16  1 socket        32GB     25.6GB          16    16x                                                                                      
      Developer machine      8  1 socket        16GB     12.8GB           8    8x
      Small VM               4  1 socket         4GB      3.2GB           1    no change
      Overloaded VM          8  1 socket         4GB      3.2GB           8    8x
      best.
      -greg
    • Jump to comment-1
      Andres Freund<andres@anarazel.de>
      Apr 27, 2026, 2:15 PM UTC
      Hi,
      Thanks for looking into this.
      On 2026-04-25 16:08:02 -0400, Greg Burd wrote:
      Does batching mess up the meaning of usage_count?
      --------------------------------------------------

      Short answer: no. I want to walk through this because it was my first
      concern too, and I think it's the question that will come up most on review.

      The clock sweep's usage_count is an access-frequency approximation measured
      in units of complete passes. A buffer with usage_count = N survives N
      passes without a re-pin. The semantic meaning lives at pass granularity,
      not at individual-buffer granularity.

      What batching changes: intra-pass temporal ordering. Without batching, with
      N backends sweeping, decrements are interleaved -- backend A hits B[0],
      backend B hits B[1], backend C hits B[2]. With batching, backend A hits
      B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
      64-buffer chunks are decremented in bursts rather than individually.

      Why it doesn't matter:

      1. Every buffer still gets decremented exactly once per complete
      pass. The invariant the algorithm actually depends on is
      untouched.

      2. A buffer's survival window is the time between consecutive
      passes. That's milliseconds to seconds under load. Whether
      B[0] gets decremented 50us before or 50us after B[63] within
      the same pass is below the resolution of anything usage_count
      is trying to measure.
      I don't think this is true, unfortunately. Sure, if you have a completely
      uniform, IO intensive, workload it is, but that's not all there is. If you
      have a bunch of connections that replace buffers at a low rate and a bunch of
      connections that do so at a high rate, the batches "checked out" by the "low
      rate" connections won't be processed soon. Thus the buffers in that batch
      won't have their usagecount decremented and thus have a stronger "protection"
      against replacement.
      You can argue that may be OK, because it'd be unlikely that the next sweep
      would assign the same buffers to an "low rate" backend again. But that'd be an
      argument you'd have to make and validate.
      I'm somewhat doubtful that batching that's independent of contention and
      independent of the usage rate will work out all that well. If you instead
      went for a partitioned sweep architecture, with balancing between the
      different partitions, you don't have that issue. And you have a building block
      for more numa awareness etc.
      There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
      doesn't show up where it would hurt.
      As mentioned above, this assumes that the replacement rate is uniform between
      backends, which I think is not uniformly true outside of benchmarks.
      There's also a small positive side-effect: cache locality. The backend that
      just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
      L1/L2.
      A BufferDesc is 64bytes. With common cacheline sizes and stuff like adjacent
      cacheline prefetching you'll have maybe 2 consecutive BufferDescs in L1/L2.
      Where it might help more is the TLB.
      Benchmarks
      ----------

      Jim ran these; I'm still working on reproducing them locally and will post
      independent numbers in a follow-up. All bare metal, Linux, huge pages
      enabled throughout (more on that below), postmaster pinned to node 0 with
      `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
      depending on which node the postmaster happened to land on at launch --
      worth flagging for anyone trying to reproduce.
      That's an odd one that I think you need to investigate separately.
      Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
      working set always spills and the sweep is hot.
      Uhm, is this something worth optimizing substantially for? What you're
      measuring here is basically the worst possible way of configuring a database,
      with full double buffering and a lot of memory bandwidth dedicated to copying
      buffers from one place to another. That's maybe a sane setup if you have a lot
      of small databases that you can't configure individually, but that's not the
      case when you run a reasonably large workload on a 384vCPU setup.
      I think to be really convincing you'd have to do this with actual IO involved
      somewhere.
      Relationship to Tomas's NUMA series
      -----------------------------------

      Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.

      One thing I found striking as I re-read the thread: in the benchmarks Tomas
      posted later in the series, *most of the benefit comes from partitioning the
      clock sweep*, and the NUMA memory-placement layer on top sometimes runs
      slower than partitioning alone. His own conclusion, quoted roughly: the
      benefit mostly comes from just partitioning the clock sweep, and it's
      largely independent of the NUMA stuff; the NUMA partitioning is often
      slower.
      That was partially because he measured on something that didn't really have
      significant NUMA effects though...
      That observation is the thing that makes me think batching is worth
      considering on its own. It's going after the same bottleneck Tomas's
      partitioning addresses, but:

      - without splitting global eviction visibility (which is where
      cross-partition stealing gets complicated),
      You are doing that tho.
      - without requiring NUMA-aware buffer placement (which has huge
      page alignment, descriptor-partition-mid-page, and resize
      complications that are still being worked out in that thread),
      You can do the partitioned clock sweep without any of that.
      Greetings,
      Andres Freund
      • Jump to comment-1
        Greg Burd<greg@burd.me>
        Apr 28, 2026, 2:30 PM UTC
        On Mon, Apr 27, 2026, at 10:15 AM, Andres Freund wrote:
        Hi,

        Thanks for looking into this.

        On 2026-04-25 16:08:02 -0400, Greg Burd wrote:
        Does batching mess up the meaning of usage_count?
        --------------------------------------------------

        Short answer: no. I want to walk through this because it was my first
        concern too, and I think it's the question that will come up most on review.

        The clock sweep's usage_count is an access-frequency approximation measured
        in units of complete passes. A buffer with usage_count = N survives N
        passes without a re-pin. The semantic meaning lives at pass granularity,
        not at individual-buffer granularity.

        What batching changes: intra-pass temporal ordering. Without batching, with
        N backends sweeping, decrements are interleaved -- backend A hits B[0],
        backend B hits B[1], backend C hits B[2]. With batching, backend A hits
        B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
        64-buffer chunks are decremented in bursts rather than individually.

        Why it doesn't matter:

        1. Every buffer still gets decremented exactly once per complete
        pass. The invariant the algorithm actually depends on is
        untouched.

        2. A buffer's survival window is the time between consecutive
        passes. That's milliseconds to seconds under load. Whether
        B[0] gets decremented 50us before or 50us after B[63] within
        the same pass is below the resolution of anything usage_count
        is trying to measure.

        I don't think this is true, unfortunately. Sure, if you have a completely
        uniform, IO intensive, workload it is, but that's not all there is. If you
        have a bunch of connections that replace buffers at a low rate and a bunch of
        connections that do so at a high rate, the batches "checked out" by the "low
        rate" connections won't be processed soon. Thus the buffers in that batch
        won't have their usagecount decremented and thus have a stronger "protection"
        against replacement.

        You can argue that may be OK, because it'd be unlikely that the next sweep
        would assign the same buffers to an "low rate" backend again. But that'd be an
        argument you'd have to make and validate.
        You're right, and my "why it doesn't matter" section overstated things. The uniform-workload assumption was sloppy. Let me try again with the mixed case in mind.
        The scenario I think you're describing: a low-rate backend claims B[0..63], finds a victim at B[5], and then doesn't call StrategyGetBuffer() again for a while -- maybe seconds. During that
        time B[6..63] sit with their current usage_count, undecremented, while high-rate backends are sweeping the rest of the pool at full speed. Those 58 buffers get a free pass they wouldn't have gotten in the interleaved case.
        I can bound the effect but not dismiss it. Each slow backend can hold at most 64 undecremented buffers. With 32GB sharedbuffers (~4.2M buffers), 100 slow backends each holding a full batch means ~6,400 buffers delayed -- 0.15% of the pool. The delay lasts until the backend's next StrategyGetBuffer() call. So the question is whether 0.15% of buffers with temporarily stale usagecount produces a measurable eviction quality difference.
        Two observations that bound it further:
        1. In the current code, a backend only calls ClockSweepTick() when
         it needs a victim.  A low-rate backend barely moves the global
         hand at all.  Without batching, buffers at positions beyond the
         current hand are also "undecremented" -- they just haven't been
         reached yet.  Batching changes *which* specific 64 buffers are
         pending, not the total count of undecremented buffers in the
         pool at any instant.
        
        2. The buffers in a held batch are contiguous in buffer-ID space.
         Since buffer-ID assignment to relation blocks is effectively
         random (driven by eviction order), those 64 buffers are
         scattered across relations.  There's no systematic bias toward
         protecting hot or cold data -- it's a random sample.
        That said, "bounded and random" isn't the same as "zero." One mitigation that's simple to implement: abandon the remaining batch at transaction boundaries. Something like resetting MyBatchEnd = MyBatchPos in AtEOXact or equivalent, so a backend that goes idle between transactions doesn't hold a stale batch across an idle period. That limits the staleness window to the duration of a single transaction, which is when the backend is actively doing work and likely to consume the batch quickly anyway.
        I'd like to measure the mixed-workload case directly. A benchmark with e.g. 50 backends doing heavy sequential scans and 50 doing single-row OLTP, and compare eviction hit rates with and without the patch. Would that be the kind of validation you'd want to see?
        I'm somewhat doubtful that batching that's independent of contention and
        independent of the usage rate will work out all that well. If you instead
        went for a partitioned sweep architecture, with balancing between the
        different partitions, you don't have that issue. And you have a building block
        for more numa awareness etc.
        I agree that partitioned sweep is architecturally more principled and gives you a foundation for deeper NUMA work. I'm not arguing that batching is a better long-term architecture.
        The pragmatic case for batching is: it's ~30 lines, it addresses the identified bottleneck, and it doesn't foreclose on partitioned sweep later. If partitioned sweep lands, batching becomes redundant for its primary use case. If partitioned sweep lands incrementally -- which the open items in that thread suggest is the realistic path -- this gets a chunk of the multi-socket win into users' hands sooner.
        One concrete difference: partitioned sweep needs a stealing mechanism for correctness when partitions are unevenly loaded. Batching avoids that because the "partitions" are ephemeral (one batch cycle) and sequential (global order preserved), so there's no long-lived imbalance to steal from. Whether that simplicity is worth the tradeoff you identified above is a judgment call, and I take your point that the building-block argument favors partitioning.
        I'm also not attached to "batching instead of partitioning." If you think the right move is to focus effort on partitioned sweep, I'm happy to help with that. But if there's appetite for a smaller change that ships sooner, this is what I've got.
        There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
        doesn't show up where it would hurt.

        As mentioned above, this assumes that the replacement rate is uniform between
        backends, which I think is not uniformly true outside of benchmarks.

        There's also a small positive side-effect: cache locality. The backend that
        just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
        L1/L2.

        A BufferDesc is 64bytes. With common cacheline sizes and stuff like adjacent
        cacheline prefetching you'll have maybe 2 consecutive BufferDescs in L1/L2.
        Where it might help more is the TLB.
        You're right, I overstated the L1/L2 argument. At 64 bytes per descriptor, adjacent cacheline prefetch gets you at most 2 consecutive descriptors, not 64. TLB is the more plausible benefit -- the batch walks a contiguous virtual address range, which should reduce TLB misses when the descriptor array spans multiple pages. I haven't tried to isolate this in perf and won't claim it until I have numbers.
        Benchmarks
        ----------

        Jim ran these; I'm still working on reproducing them locally and will post
        independent numbers in a follow-up. All bare metal, Linux, huge pages
        enabled throughout (more on that below), postmaster pinned to node 0 with
        `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
        depending on which node the postmaster happened to land on at launch --
        worth flagging for anyone trying to reproduce.

        That's an odd one that I think you need to investigate separately.
        Agreed. I'll investigate and report separately. My working hypothesis is that it's related to where shared memory gets physically allocated relative to the postmaster's NUMA node, which then affects all child backends. That's interesting regardless of this patch.
        Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
        working set always spills and the sweep is hot.

        Uhm, is this something worth optimizing substantially for? What you're
        measuring here is basically the worst possible way of configuring a database,
        with full double buffering and a lot of memory bandwidth dedicated to copying
        buffers from one place to another. That's maybe a sane setup if you have a lot
        of small databases that you can't configure individually, but that's not the
        case when you run a reasonably large workload on a 384vCPU setup.

        I think to be really convincing you'd have to do this with actual IO involved
        somewhere.
        Fair criticism. The pgbench setup was designed to isolate the clock sweep bottleneck by keeping everything in the OS page cache, but you're right that it doesn't represent how someone would actually run a database on a 384-vCPU box. In production you'd either size shared_buffers to hold the working set (no sweep pressure) or have real storage I/O (where I/O latency dilutes sweep contention).
        The HammerDB TPC-C numbers (which involve I/O and realistic contention patterns) show flat-to-slightly-positive -- no regression, small win at higher concurrency. I think that's the more honest picture of what production looks like. And perhaps "flat to slightly-positive" delta might not be enough juice for the squeeze, especially this late in a cycle.
        For the follow-up I'll run:
        - Working set 2-3x shared_buffers on NVMe, so StrategyGetBuffer()
        calls actually hit storage on some fraction of evictions.
        - A mixed OLTP workload (not just pgbench -S) with varied access
        patterns, to address the uniform-workload concern above.
        - perf stat showing bus-cycles, cache-misses, and L3 contention
        deltas, so the mechanism is visible independent of TPS.
        I should have led with the TPC-C results and framed the pgbench numbers as "here's where the ceiling is under maximum sweep pressure" rather than presenting them as the headline result.
        Relationship to Tomas's NUMA series
        -----------------------------------

        Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.

        One thing I found striking as I re-read the thread: in the benchmarks Tomas
        posted later in the series, *most of the benefit comes from partitioning the
        clock sweep*, and the NUMA memory-placement layer on top sometimes runs
        slower than partitioning alone. His own conclusion, quoted roughly: the
        benefit mostly comes from just partitioning the clock sweep, and it's
        largely independent of the NUMA stuff; the NUMA partitioning is often
        slower.

        That was partially because he measured on something that didn't really have
        significant NUMA effects though...
        Fair point. I shouldn't over-generalize from benchmarks run on hardware that wasn't exercising the NUMA dimension. Retracted.
        That observation is the thing that makes me think batching is worth
        considering on its own. It's going after the same bottleneck Tomas's
        partitioning addresses, but:

        - without splitting global eviction visibility (which is where
        cross-partition stealing gets complicated),

        You are doing that tho.
        You're right. When a backend holds B[0..63], those buffers are effectively invisible to other backends for eviction consideration until the batch is consumed. That is a form of split visibility.
        The difference from a permanent partition is that the split is short-lived (one batch consumption cycle, microseconds under load) and sequential (the next batch picks up where this one left off in the global order). There's no long-lived assignment of buffer ranges to backends, so the kind of structural imbalance that drives the need for cross-partition stealing doesn't arise. But I shouldn't have claimed "without splitting." The honest framing is: "with much more limited
        and transient splitting."
        - without requiring NUMA-aware buffer placement (which has huge
        page alignment, descriptor-partition-mid-page, and resize
        complications that are still being worked out in that thread),

        You can do the partitioned clock sweep without any of that.
        Also correct. The complications I listed are from Tomas's patches 0001 and 0006 (memory interleaving and NUMA-aware buffer-to-node mapping), not from the clock-sweep partitioning patches 0002-0005. Partitioned clock sweep alone doesn't require NUMA-aware buffer placement. I
        conflated the two; apologies.
        Greetings,

        Andres Freund
        To summarize the open items I'm taking away:
        1. Mixed-workload benchmark (high-rate + low-rate backends) to
         measure eviction quality impact of held batches.
        2. I/O-inclusive benchmarks on NVMe with working set > shared_buffers.
        3. Investigate the postmaster NUMA placement variance separately.
        4. Consider batch-abandonment at transaction boundaries as a
         mitigation for the staleness concern.
        5. perf stat data showing the mechanism (bus-cycles, cache-misses).
        I'll post results as I have them. I greatly appreciate your time and thoughtful review.
        best.
        -greg