Randomize B-Tree page split location to avoid oscillating patterns

  • Jump to comment-1
    Dmitry Dolgov<9erthalion6@gmail.com>
    Apr 27, 2026, 4:24 PM UTC
    TL;DR There seems to be a known phenomenon, where a data ingestion into a
    B-Tree produces page splits following an oscillating pattern, which in turn
    affects IO and buffer contention, impacting the performance. It turns out that
    PostgreSQL is not an exception, but it should be possible to randomize a split
    location a bit to mitigate the issue.
    Hi,
    Some time ago while working on models for PostgreSQL performance [1] I've
    stumbled upon an interesting oscillating patters around various B-Tree metrics
    (number of page splits, index size, etc). This turned out to be a known thing
    described in a catchy way as "Waves of misery" [2], and boils down to the fact
    that a fixed split location is usually chosen for a page split -- in this case
    probabilities of page split lead to oscillating solutions under certain
    workloads. Looks like the only pre-condition is that the data to be ingested
    has the same distribution as the data already existing in the tree, in
    particular UUIDs are in a bad position for this. And of course it's possible to
    reproduce this with PostgreSQL.
    Such an oscillation can lead to variability in IO and buffer contention,
    negatively impacting performance. The fillfactor only shifts the waves, but do
    not cancel them. One of the proposed remediation for this is to do suffix
    truncation, which will "spread" split locations across some range. While we do
    column suffix truncation, it turns out to be not enough for many workloads.
    Another option is to randomize the split location, e.g. pick the actual
    location from a range of 20% around the best one. In our case it's easy to do
    randomization based on the split state, as all the possible split locations are
    sorted by delta -- and all what's needed is to add a shift to the lowsplit from
    a range, based on the number of split locations.
    I've done few experiments with this, here is how it looks like:
    * The first one is synthetic: a single column table with integer values, a
    B-Tree index over it with the fillfactor=100, inserting new values one by one
    from a uniform distribution via PGBench. In the graph "split.png" you can see
    the number of page splits over time for the main branch and the patch (named
    "Main" and "Rand" correspondingly), and the oscillating is clear for the
    former one.
    * Another one is following a data schema from one real projects out there, with
    UUIDs as index values and very large records, with the default fillfactor=90.
    The data ingestion is happening in large batches. The graph "split_batch.png"
    represent the data for this case, with the main branch oscillating much more
    than the randomized.
    The unfortunate part is that I couldn't get clear numbers for the performance
    impact. Turns out the disk in my experimental setup is not good enough to get a
    sufficient number of inserts to trigger the issue, and to get nice graphs I was
    running everything either on a RAM disk or on an unlogged table -- in both
    cases it's easy to observe oscillations of page splits, but their impact is not
    large enough since only so much IO is happening.
    But anyway, any thoughts / commentaries on that?
    [1]: https://zenodo.org/records/15786156
    [2]: Glombiewski N., Seeger B., Graefe G. (2019). Waves of Misery After Index
    Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06
    • Jump to comment-1
      Peter Geoghegan<pg@bowt.ie>
      Apr 27, 2026, 6:07 PM UTC
      On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
      Some time ago while working on models for PostgreSQL performance [1] I've
      stumbled upon an interesting oscillating patters around various B-Tree metrics
      (number of page splits, index size, etc). This turned out to be a known thing
      described in a catchy way as "Waves of misery" [2], and boils down to the fact
      that a fixed split location is usually chosen for a page split -- in this case
      probabilities of page split lead to oscillating solutions under certain
      workloads.
      I'm quite familiar with the paper, and find its argument convincing.
      I'm not sure of the overall importance of the issues that they
      describe, but the phenomenon is obviously real.
      Such an oscillation can lead to variability in IO and buffer contention,
      negatively impacting performance. The fillfactor only shifts the waves, but do
      not cancel them. One of the proposed remediation for this is to do suffix
      truncation, which will "spread" split locations across some range. While we do
      column suffix truncation, it turns out to be not enough for many workloads.
      Another option is to randomize the split location, e.g. pick the actual
      location from a range of 20% around the best one. In our case it's easy to do
      randomization based on the split state, as all the possible split locations are
      sorted by delta -- and all what's needed is to add a shift to the lowsplit from
      a range, based on the number of split locations.
      My interpretation is that randomization can be combined with the usual
      suffix truncation split point choosing heuristics; the paper even
      recommends this at one point. That is, you randomly pick from a small
      number of candidate split points that are all approximately equally
      good according to the traditional criteria. It looks like your patch
      doesn't account for suffix truncation at all, though.
      Your patch should initially look for several split points that are
      approximately equal in quality according to the current criteria -- a
      separate, initial pass. You'd then randomly pick a final split point
      from those gathered during this initial pass -- not from the original
      fillfactormult-sorted list of split points. What you have in v1 will
      make suffix truncation much less effective, which seems unacceptable.
      Another issue is that nbtsort.c doesn't have any ability to pick among
      split points; it focuses entirely on keeping free space balanced. To
      get much benefit from this, I think you'd have to teach nbtsort.c to
      also pick split points using approximately the same algorithm as
      nbtsplitloc.c would (assuming retail inserts in ascending key space
      order). I've been meaning to add proper suffix truncation to the
      CREATE INDEX case.
      I've done few experiments with this, here is how it looks like:

      * The first one is synthetic: a single column table with integer values, a
      B-Tree index over it with the fillfactor=100, inserting new values one by one
      from a uniform distribution via PGBench. In the graph "split.png" you can see
      the number of page splits over time for the main branch and the patch (named
      "Main" and "Rand" correspondingly), and the oscillating is clear for the
      former one.
      I don't think that fillfactor=100 is very interesting in general.
      The unfortunate part is that I couldn't get clear numbers for the performance
      impact. Turns out the disk in my experimental setup is not good enough to get a
      sufficient number of inserts to trigger the issue, and to get nice graphs I was
      running everything either on a RAM disk or on an unlogged table -- in both
      cases it's easy to observe oscillations of page splits, but their impact is not
      large enough since only so much IO is happening.

      But anyway, any thoughts / commentaries on that?
      I wouldn't expect the patch to increase absolute throughput
      significantly, if at all. Its value comes from making the rate of
      splits over time more consistent, for a given fixed workload. You
      might notice a more interesting effect if you look at latency,
      particularly worst case latency.
      --
      Peter Geoghegan
      • Jump to comment-1
        Andres Freund<andres@anarazel.de>
        Apr 27, 2026, 7:50 PM UTC
        Hi,
        On 2026-04-27 14:07:14 -0400, Peter Geoghegan wrote:
        On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
        The unfortunate part is that I couldn't get clear numbers for the performance
        impact. Turns out the disk in my experimental setup is not good enough to get a
        sufficient number of inserts to trigger the issue, and to get nice graphs I was
        running everything either on a RAM disk or on an unlogged table -- in both
        cases it's easy to observe oscillations of page splits, but their impact is not
        large enough since only so much IO is happening.

        But anyway, any thoughts / commentaries on that?

        I wouldn't expect the patch to increase absolute throughput
        significantly, if at all. Its value comes from making the rate of
        splits over time more consistent, for a given fixed workload. You
        might notice a more interesting effect if you look at latency,
        particularly worst case latency.
        Possibly somewhat orthogonal, but the worst case latency point reminded me of
        something around this:
        It seems like it's not great for contention avoidance / concurrency that right
        now btree splits will often do IO - to write out a victim buffer and then to
        extend the relation - while holding an exclusive lock on a page.
        Couldn't we instead release the lock on the page, acquire an empty page,
        reacquire the lock, recheck that the split is still needed and, if so, split
        the page without needing to do IO while holding the lock?
        Greetings,
        Andres Freund