Postgres Email Lists

pgsql-hackers

❮

Randomize B-Tree page split location to avoid oscillating patterns

Jump to comment-1
Dmitry Dolgov<9erthalion6@gmail.com>
Apr 27, 2026, 4:24 PM UTC
TL;DR There seems to be a known phenomenon, where a data ingestion into a
B-Tree produces page splits following an oscillating pattern, which in turn
affects IO and buffer contention, impacting the performance. It turns out that
PostgreSQL is not an exception, but it should be possible to randomize a split
location a bit to mitigate the issue.
Hi,
Some time ago while working on models for PostgreSQL performance [1] I've
stumbled upon an interesting oscillating patters around various B-Tree metrics
(number of page splits, index size, etc). This turned out to be a known thing
described in a catchy way as "Waves of misery" [2], and boils down to the fact
that a fixed split location is usually chosen for a page split -- in this case
probabilities of page split lead to oscillating solutions under certain
workloads. Looks like the only pre-condition is that the data to be ingested
has the same distribution as the data already existing in the tree, in
particular UUIDs are in a bad position for this. And of course it's possible to
reproduce this with PostgreSQL.
Such an oscillation can lead to variability in IO and buffer contention,
negatively impacting performance. The fillfactor only shifts the waves, but do
not cancel them. One of the proposed remediation for this is to do suffix
truncation, which will "spread" split locations across some range. While we do
column suffix truncation, it turns out to be not enough for many workloads.
Another option is to randomize the split location, e.g. pick the actual
location from a range of 20% around the best one. In our case it's easy to do
randomization based on the split state, as all the possible split locations are
sorted by delta -- and all what's needed is to add a shift to the lowsplit from
a range, based on the number of split locations.
I've done few experiments with this, here is how it looks like:
* The first one is synthetic: a single column table with integer values, a
```
B-Tree index over it with the fillfactor=100, inserting new values one by one
from a uniform distribution via PGBench. In the graph "split.png" you can see
the number of page splits over time for the main branch and the patch (named
"Main" and "Rand" correspondingly), and the oscillating is clear for the
former one.
```
* Another one is following a data schema from one real projects out there, with
```
UUIDs as index values and very large records, with the default fillfactor=90.
The data ingestion is happening in large batches. The graph "split_batch.png"
represent the data for this case, with the main branch oscillating much more
than the randomized.
```
The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.
But anyway, any thoughts / commentaries on that?
[1]: https://zenodo.org/records/15786156
[2]: Glombiewski N., Seeger B., Graefe G. (2019). Waves of Misery After Index
Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06
View in PostgreSQL Archives →
- Jump to comment-1
  Peter Geoghegan<pg@bowt.ie>
  Apr 27, 2026, 6:07 PM UTC
  On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
  Some time ago while working on models for PostgreSQL performance [1] I've
  stumbled upon an interesting oscillating patters around various B-Tree metrics
  (number of page splits, index size, etc). This turned out to be a known thing
  described in a catchy way as "Waves of misery" [2], and boils down to the fact
  that a fixed split location is usually chosen for a page split -- in this case
  probabilities of page split lead to oscillating solutions under certain
  workloads.
  I'm quite familiar with the paper, and find its argument convincing.
  I'm not sure of the overall importance of the issues that they
  describe, but the phenomenon is obviously real.
  Such an oscillation can lead to variability in IO and buffer contention,
  negatively impacting performance. The fillfactor only shifts the waves, but do
  not cancel them. One of the proposed remediation for this is to do suffix
  truncation, which will "spread" split locations across some range. While we do
  column suffix truncation, it turns out to be not enough for many workloads.
  Another option is to randomize the split location, e.g. pick the actual
  location from a range of 20% around the best one. In our case it's easy to do
  randomization based on the split state, as all the possible split locations are
  sorted by delta -- and all what's needed is to add a shift to the lowsplit from
  a range, based on the number of split locations.
  My interpretation is that randomization can be combined with the usual
  suffix truncation split point choosing heuristics; the paper even
  recommends this at one point. That is, you randomly pick from a small
  number of candidate split points that are all approximately equally
  good according to the traditional criteria. It looks like your patch
  doesn't account for suffix truncation at all, though.
  Your patch should initially look for several split points that are
  approximately equal in quality according to the current criteria -- a
  separate, initial pass. You'd then randomly pick a final split point
  from those gathered during this initial pass -- not from the original
  fillfactormult-sorted list of split points. What you have in v1 will
  make suffix truncation much less effective, which seems unacceptable.
  Another issue is that nbtsort.c doesn't have any ability to pick among
  split points; it focuses entirely on keeping free space balanced. To
  get much benefit from this, I think you'd have to teach nbtsort.c to
  also pick split points using approximately the same algorithm as
  nbtsplitloc.c would (assuming retail inserts in ascending key space
  order). I've been meaning to add proper suffix truncation to the
  CREATE INDEX case.
  I've done few experiments with this, here is how it looks like:
  
  * The first one is synthetic: a single column table with integer values, a
  B-Tree index over it with the fillfactor=100, inserting new values one by one
  from a uniform distribution via PGBench. In the graph "split.png" you can see
  the number of page splits over time for the main branch and the patch (named
  "Main" and "Rand" correspondingly), and the oscillating is clear for the
  former one.
  I don't think that fillfactor=100 is very interesting in general.
  The unfortunate part is that I couldn't get clear numbers for the performance
  impact. Turns out the disk in my experimental setup is not good enough to get a
  sufficient number of inserts to trigger the issue, and to get nice graphs I was
  running everything either on a RAM disk or on an unlogged table -- in both
  cases it's easy to observe oscillations of page splits, but their impact is not
  large enough since only so much IO is happening.
  
  But anyway, any thoughts / commentaries on that?
  I wouldn't expect the patch to increase absolute throughput
  significantly, if at all. Its value comes from making the rate of
  splits over time more consistent, for a given fixed workload. You
  might notice a more interesting effect if you look at latency,
  particularly worst case latency.
  --
  Peter Geoghegan
  View in PostgreSQL Archives →
  Jump to comment-1
  Andres Freund<andres@anarazel.de>
  Apr 27, 2026, 7:50 PM UTC
  Hi,
  On 2026-04-27 14:07:14 -0400, Peter Geoghegan wrote:
  On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
  The unfortunate part is that I couldn't get clear numbers for the performance
  impact. Turns out the disk in my experimental setup is not good enough to get a
  sufficient number of inserts to trigger the issue, and to get nice graphs I was
  running everything either on a RAM disk or on an unlogged table -- in both
  cases it's easy to observe oscillations of page splits, but their impact is not
  large enough since only so much IO is happening.
  
  But anyway, any thoughts / commentaries on that?
  
  I wouldn't expect the patch to increase absolute throughput
  significantly, if at all. Its value comes from making the rate of
  splits over time more consistent, for a given fixed workload. You
  might notice a more interesting effect if you look at latency,
  particularly worst case latency.
  Possibly somewhat orthogonal, but the worst case latency point reminded me of
  something around this:
  It seems like it's not great for contention avoidance / concurrency that right
  now btree splits will often do IO - to write out a victim buffer and then to
  extend the relation - while holding an exclusive lock on a page.
  Couldn't we instead release the lock on the page, acquire an empty page,
  reacquire the lock, recheck that the split is still needed and, if so, split
  the page without needing to do IO while holding the lock?
  Greetings,
  Andres Freund
  View in PostgreSQL Archives →