pg_stat_io_histogram

    Jakub Wartak<jakub.wartak@enterprisedb.com>
    Jan 26, 2026, 9:41 AM UTC
    I'm proposing that we add pg_stat_io_histogram, which would track/show an I/O
    latency profile, so we could quickly identify I/O outliers. From time to
    time users complain that 'PostgreSQL is slow or stuck' (usually COMMIT is
    slow), when it is quite apparent that the cause is somewhere in the I/O
    stack. It is quite easy to prove once one has proper measurement tools in
    place and is able to correlate, but IMHO it takes way too much time and
    energy to cross-correlate all of that information (iostat -x 1s,
    wait events every 1s, and so on), especially if one would like to provide a
    rapid response.
    Right now the patch does not include per-backend/PID tracking; if there is
    interest in this I'll hopefully add it, but I would first like to hear
    whether that's a good idea. The current implementation uses a fast bucket
    calculation to avoid overhead and tries to cover the most useful range of
    devices via its buckets (128us..256ms, which covers NVMe/SSD/HDD as well as
    abnormally high latency, since from time to time I try to help with I/O
    stuck for seconds - usually a sign
    of some I/O multipath issue, a device resetting, or hypervisor woes).
    postgres=# select
    substring(backend_type,1,8) as backend, object, context, io_type,
    bucket_latency_us as lat_us,
    round(bucket_latency_us/1000.0, 3) as lat_ms,
    bucket_count as count
    from pg_stat_get_io_histogram()
    where
    bucket_count > 0
    order by 1,2,3,4,5;
     backend  |  object  |  context  |  io_type  | lat_us | lat_ms | count
    ----------+----------+-----------+-----------+--------+--------+-------
     autovacu | relation | normal    | read      |    128 |  0.128 |    54
     autovacu | relation | normal    | read      |    256 |  0.256 |     7
     autovacu | relation | normal    | read      |    512 |  0.512 |     1
     autovacu | relation | vacuum    | read      |    128 |  0.128 |     8
     autovacu | relation | vacuum    | read      |    256 |  0.256 |     5
     backgrou | relation | bulkread  | read      |    128 |  0.128 |   658
     backgrou | relation | normal    | read      |    128 |  0.128 |     5
     checkpoi | relation | normal    | fsync     |   2048 |  2.048 |    37
     checkpoi | relation | normal    | fsync     |   4096 |  4.096 |     7
     checkpoi | relation | normal    | fsync     |  16384 | 16.384 |     4
     checkpoi | relation | normal    | fsync     |  32768 | 32.768 |     1
     checkpoi | relation | normal    | fsync     |  65536 | 65.536 |     1
     checkpoi | relation | normal    | write     |    128 |  0.128 |  2059
     checkpoi | relation | normal    | write     |    256 |  0.256 |     2
     checkpoi | relation | normal    | write     |    512 |  0.512 |     1
     checkpoi | relation | normal    | writeback |    128 |  0.128 |    64
     checkpoi | relation | normal    | writeback |    256 |  0.256 |     1
     client b | relation | bulkread  | read      |    128 |  0.128 |   675
     client b | relation | bulkread  | read      |    256 |  0.256 |     1
     client b | relation | bulkwrite | extend    |    128 |  0.128 |   260
     client b | relation | bulkwrite | extend    |    512 |  0.512 |     1
     client b | relation | bulkwrite | write     |    128 |  0.128 | 14404
     client b | relation | normal    | extend    |    128 |  0.128 |     6
     client b | relation | normal    | read      |    128 |  0.128 |   273
     client b | relation | normal    | read      |    256 |  0.256 |     6
     client b | relation | vacuum    | read      |    128 |  0.128 |   907
     client b | relation | vacuum    | read      |    256 |  0.256 |     3
     client b | relation | vacuum    | read      |    512 |  0.512 |     2
    Of course most I/O calls today hit the page cache, so one would expect
    them to be < 128us most of the time, but above you can also see degraded
    fsync/fdatasync (BTW, that was achieved via a device-mapper delayed
    device). My hope is that the above would help tremendously when dealing
    with flaky storage, I/O path issues, or even hypervisors being paused.
    An alternative idea I had was simply to log slow I/O outliers, but meh..
    then one would have to answer all those questions: what should the
    threshold be (=> GUC?), the risk of spamming the log, and so on
    (and I wouldn't be fond of proposing yet another log_* GUC ;))
    Any hints, co-authors, or help are more than welcome!
      Andres Freund<andres@anarazel.de>
      Jan 26, 2026, 3:08 PM UTC
      Hi,
      On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
      I'm proposing that we add pg_stat_io_histogram that would track/show I/O
      latencies profile, so we could quickly identify I/O outliers.
      I think that's an interesting feature!
      From time to time users complain that 'PostgreSQL is slow or stuck' (usually
      COMMIT is slow), when it is quite apparent that it is down to somewhere in
      the I/O stack. It is quite easy to prove once one has proper measurement
      tools in place and is able to correlate, but it takes IMHO way too much time
      and energy to cross-correlate all of that information (iostat -x 1s, wait
      events 1s, and so on), especially if one would like to provide rapid
      response.
      For measuring particularly stuck things, I've been wondering about having a
      regular timer that starts to collect more information if stuck in a place for
      a while. That would probably end up being lower overhead than constantly
      measuring... But would also be a lot more work.
      Right now the patch does not include per-backend/PID tracking, hopefully if
      there will be interest in this, I'll add it, but I would like to first hear
      if that's a good idea. The current implementation uses fast bucket
      calculation to avoid overheads and tries to cover most useful range of
      devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
      abnormally high latency too as from time to time I try to help with I/O
      stuck for seconds, usually a sign of some I/O multipath issues, device
      resetting, or hypervisor woes).
      Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
      NVMe disk you'll get < 10us, after all.
      I see a few problems with the source of the latency measurements though:
      - The latency gathered here is quite heavily affected by scheduler
      noise. If your process isn't scheduled because other processes are busy
      doing stuff on the CPU, it's quite possible to get results many orders of
      magnitude wrong.
      - With AIO, you're measuring wait time, and that time can be affected by other
      IOs in the queue. That will often *drastically* overestimate IO latency
      measured this way.
      I don't see how we can do better absent cooperation from the kernel (by
      putting lower-level measurements into io_uring completions, for example)
      though. So maybe this is just how it has to be and we ought to just document
      it.
      postgres=# select
      substring(backend_type,1,8) as backend, object, context, io_type,
      bucket_latency_us as lat_us,
      round(bucket_latency_us/1000.0, 3) as lat_ms,
      bucket_count as count
      from pg_stat_get_io_histogram()
      where
      bucket_count > 0
      order by 1,2,3,4,5;
       backend  |  object  | context | io_type | lat_us | lat_ms | count
      ----------+----------+---------+---------+--------+--------+-------
       autovacu | relation | normal  | read    |    128 |  0.128 |    54
      Perhaps the latency should be represented as a range?
      Of course most of the I/O calls today are hitting page cache, so one would
      expect they'll be < 128us most of the time
      Have you measured whether the overhead is measurable when hitting the page
      cache? I'd hope that it isn't, due to IO combining amortizing the cost
      somewhat. But it seems worth measuring.
      I assume you made pgstat_get_io_op_name() return "hit?" because you don't
      expect that to ever be hit?
      +static inline int get_bucket_index(uint32_t val) {
      +#define MIN_PG_STAT_IO_HIST_LATENCY 127
      +	const uint32_t max_index = PG_STAT_IO_HIST_BUCKETS - 1;
      +	/*
      +	 * hopefully calculated to be 25 by the compiler:
      +	 * clz(127) = clz(01111111b on uint32) = 25
      +	 */
      +	const uint32_t min_latency_leading_zeros =
      +		pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
      +
      +	/*
      +	 * make sure the tmp value is at least 127 (our minimum bucket size)
      +	 * as __builtin_clz might return undefined behavior when operating on 0
      +	 */
      +	uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;
      +	/* count leading zeros */
      +	int leading_zeros = pg_leading_zero_bits32(tmp);
      +
      +	/* normalize the index */
      +	uint32_t index = min_latency_leading_zeros - leading_zeros;
      +
      +	/* clamp it to the maximum */
      +	return (index > max_index) ? max_index : index;
      +}
      Wouldn't it be easier to handle the minimum latency by shifting right?
      I think we may also need to handle inputs that don't fit a uint32. For things
      like a stopped VM or such we could see IOs that don't fit into a uint32
      when measured in microseconds. So perhaps I'd make the input to the bucket
      calc 64 bits, then shift to the minimum precision and mask to implement
      clamping.
      @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
      		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
      					   io_time);

      +		/* calculate the bucket_index based on latency in us */
      +		bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
      +		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
      +
      		/* Add the per-backend count */
      		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
      It's annoying to have to convert to microseconds here, that's not free :(.
      @@ -1356,6 +1356,24 @@ typedef enum io_stat_col
      	IO_NUM_COLUMNS,
      } io_stat_col;

      +/*
      + * When adding a new column to the pg_stat_io view and the
      + * pg_stat_get_backend_io() function, add a new enum value here above
      + * IO_NUM_COLUMNS.
      + */
      +typedef enum hist_io_stat_col
      +{
      +	HIST_IO_COL_INVALID = -1,
      +	HIST_IO_COL_BACKEND_TYPE,
      +	HIST_IO_COL_OBJECT,
      +	HIST_IO_COL_CONTEXT,
      +	HIST_IO_COL_IO_TYPE,
      +	HIST_IO_COL_BUCKET_US,
      +	HIST_IO_COL_COUNT,
      +	HIST_IO_COL_RESET_TIME,
      +	HIST_IO_NUM_COLUMNS
      +} history_get_history_state;
      Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't
      think this should be introduced in the middle of the pg_stat_io implementation.
      +/*
      + * pg_leading_zero_bits32
      + *		Returns the number of leading 0-bits in x, starting at the most significant bit position.
      + *		Word must not be 0 (as it is undefined behavior).
      + */
      +static inline int
      +pg_leading_zero_bits32(uint32 word)
      Do we really need this in addition to the already existing
      pg_leftmost_one_pos32()? Particularly because that already has an msvc
      implementation...
      Greetings,
      Andres Freund
        Jakub Wartak<jakub.wartak@enterprisedb.com>
        Jan 27, 2026, 12:06 PM UTC
        On Mon, Jan 26, 2026 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:

        Hi,

        On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
        I'm proposing that we add pg_stat_io_histogram that would track/show I/O
        latencies profile, so we could quickly identify I/O outliers.

        I think that's an interesting feature!
        Hi Andres, thanks for kind words and review!
        [..]

        For measuring particularly stuck things, I've been wondering about having
        a
        regular timer that starts to collect more information if stuck in a place
        for
        a while. That would probably end up being lower overhead than constantly
        measuring... But it would also be a lot more work.
        Well, if something is really stuck, I think the wait events cover us on
        that, don't they? One can argue whether they carry enough information
        (for me they mostly do, but I'm trying to squeeze some more stuff into
        them in a nearby thread [1];
        BTW it's kind of "blocked" due to that 56-bit relfilenode idea/question,
        any thoughts on that?)
        Right now the patch does not include per-backend/PID tracking,
        hopefully if
        there will be interest in this, I'll add it, but I would like to first
        hear
        if that's a good idea. The current implementation uses fast bucket
        calculation to avoid overheads and tries to cover most useful range of
        devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
        abnormally high latency too as from time to time I try to help with
        I/O
        stuck for seconds, usually a sign of some I/O multipath issues, device
        resetting, or hypervisor woes).

        Hm. Isn't 128us a pretty high floor for at least reads and writes? On a
        good
        NVMe disk you'll get < 10us, after all.
        I was blind and concentrated way too much on badly-behaving I/O rather
        than good I/O - let's call it I/O negativity bias 8)
        Now v2 lowers the min bucket to 8us (but the max is then just ~131ms; I
        didn't want the counters to use more than 64 bytes total, as 16*4B
        (uint32) = 64B while 16*8B (uint64) would already be 128B, which is why
        it's capped at 131072us right now).
        I see a few problems with the source of the latency measurements though:

        - The latency gathered is that it's quite heavily affected by scheduler
        noise. If your process isn't scheduled because other processes are busy
        doing stuff on the CPU, it's quite possible to get results many orders
        of
        magnitude wrong.

        - With AIO, you're measuring wait time, and that time can be affected by
        other
        IOs in the queue. That will often drastically overestimate IO latency
        measured this way.

        I don't see how we can do better absent cooperation from the kernel (by
        putting lower-level measurements into io_uring completions, for example)
        though. So maybe this is just how it has to be and we ought to just
        document
        it.
        Right, I think this is a complex topic on its own; I've added a small
        section about it to the docs. I didn't want to start the thread by
        undermining my own results, but indeed I'm getting "bad" numbers - bad
        in the sense that the perceived latency numbers do not match other
        measurements:
        E.g. checkpointer's fsync/fdatasync latency is awful, even though I've
        been using this to simulate latency (just 2ms!, but it ended up adding
        way more):
        dd if=/dev/zero of=/fs bs=1M count=1024
        losetup /dev/loop15 /fs
        echo "0 $(blockdev --getsz /dev/loop15) delay /dev/loop15 0 2" | \
            dmsetup create delay
        mkfs.ext4 /dev/mapper/delay
        mount /dev/mapper/delay /mnt
        ... and it has e.g. quite interesting effects:
        - without "noatime", the impact is clearly visible on fsync
        - even with noatime I was getting spikes of latencies above 131ms
        (sic!) with this:
        select pg_stat_reset_shared();
        pgbench -i -s 10 -p 1234 -h /tmp postgres
        checkpoint;
        - I've created the attached bpftrace script to see the gap between
        kernel and uprobes, but it's not that high; a sample of the view:
        
         backend  |  object  | context | io_type | lat_us | lat_ms  | count
        ----------+----------+---------+---------+--------+---------+-------
         checkpoi | relation | normal  | fsync   |  32768 |  32.768 |    42
         checkpoi | relation | normal  | fsync   |  65536 |  65.536 |     3
         checkpoi | relation | normal  | fsync   | 262144 | 262.144 |     1
        vs eBPF, which does not seem to see that; the worst case seems to be
        ~20us wasted (the gap is between user and kernel)
        
            [fdatasync] PID 54197  | Kernel: 12943  us | User: 12964  us | Lag: 21us
        [..but usually it's just:]
            [    fsync] PID 52266  | Kernel: 1711   us | User: 1714   us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 19913  us | User: 19916  us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 1993   us | User: 1996   us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 1994   us | User: 1995   us | Lag: 1us
            [    fsync] PID 52266  | Kernel: 53734  us | User: 53736  us | Lag: 2us
            [    fsync] PID 52266  | Kernel: 8066   us | User: 8072   us | Lag: 6us
            [    fsync] PID 52266  | Kernel: 2107   us | User: 2109   us | Lag: 2us
            [    fsync] PID 52266  | Kernel: 1972   us | User: 1974   us | Lag: 2us
        (this is on 2ms delayed + noatime fs + with CONFIG_HZ=1000 + laptop's
        NVMe
        that's idle).
        - in mdsyncfiletag() we seem to call pgstat_count_io_op_time() after a
        potential FileClose(), but I haven't witnessed a long close(); it's
        still a context switch
        - I've spotted that power management might be influencing it even
        further (but that's not in the docs yet, dunno if I should add it there
        as the next item on the list)
         backend  |  object  | context | io_type | lat_us | lat_ms | count
        ----------+----------+---------+---------+--------+--------+-------
         autovacu | relation | normal  | read    |    128 |  0.128 |    54

        Perhaps the latency should be represented as a range?
        Cool idea, I hadn't even thought about this one! From now on v2 shows:
        postgres=# select
        substring(backend_type,1,8) as btype,
        object, context, io_type, bucket_latency_us as lat_us, bucket_count as cnt
        from pg_stat_get_io_histogram()
        where
        bucket_count > 0
        order by 1,2,3,4,5;
          btype   |  object  | context |  io_type  |   lat_us    | cnt
        [..]
         checkpoi | relation | normal  | write     | [0,9)       |  33
         checkpoi | relation | normal  | write     | [8,17)      |   8
         checkpoi | relation | normal  | write     | [16,33)     |   1
        Of course most of the I/O calls today are hitting page cache, so one
        would
        expect they'll be < 128us most of the time

        Have you measured whether overhead is measurable when hitting the page
        cache?
        I'd hope that it doesn't, due to io combing amortizing the cost somewhat.
        But
        it seems worth measuring.
        Not yet; I first wanted to hear whether I'm sailing in some plainly
        stupid direction with the idea or implementation (e.g. that
        INSTR_TIME_GET_MICROSEC() was a really stupid omission on my side).
        I'll try to perform this overhead measurement hopefully with v3, once
        we settle on how to do the bit shifting/clz().
        I assume you made pgstat_get_io_op_name() return "hit?" because you don't
        expect that to ever be hit?
        Yes, my patch seems to always return 0 for "hit?". I'll need to
        investigate that further.
        +static inline int get_bucket_index(uint32_t val) {
        +#define MIN_PG_STAT_IO_HIST_LATENCY 127
        +	const uint32_t max_index = PG_STAT_IO_HIST_BUCKETS - 1;
        +	/*
        +	 * hopefully calculated to be 25 by the compiler:
        +	 * clz(127) = clz(01111111b on uint32) = 25
        +	 */
        +	const uint32_t min_latency_leading_zeros =
        +		pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
        +
        +	/*
        +	 * make sure the tmp value is at least 127 (our minimum bucket size)
        +	 * as __builtin_clz might return undefined behavior when operating on 0
        +	 */
        +	uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;
        +	/* count leading zeros */
        +	int leading_zeros = pg_leading_zero_bits32(tmp);
        +
        +	/* normalize the index */
        +	uint32_t index = min_latency_leading_zeros - leading_zeros;
        +
        +	/* clamp it to the maximum */
        +	return (index > max_index) ? max_index : index;
        +}

        Wouldn't it be easier to handle the minimum latency by shifting right?
        What you seem to suggest sounds like equal-width buckets {1,2,3,4..ms}
        rather than logarithmic buckets {1,2,4,8..ms}, or am I missing
        something? The patch as it stands now has two #ifdef'd implementations,
        with the bitwise-shifting one working, but even in objdump at -O2 there
        are plenty of those jumps because of the current code.
        I think we may also need to handle inputs that don't fit a uint32.
        Fixed.
        [..] For things
        like a stopped VM or such we could see IOs that that don't fit into a
        uint32
        when measured in microseconds. So perhaps I'd make the input to the
        bucket
        calc 64 bits, then shift to the minimum precision and mask to implement
        clamping.
        @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
        		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
        					   io_time);

        +		/* calculate the bucket_index based on latency in us */
        +		bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
        +		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
        +
        		/* Add the per-backend count */
        		pgstat_count_backend_io_op_time(io_object, io_context, io_op,

        It's annoying to have to convert to microseconds here, that's not free :(.
        Oooops, fixed; get_bucket_index() now operates directly on nanos/int64.
        [..]
        +	HIST_IO_COL_COUNT,
        +	HIST_IO_COL_RESET_TIME,
        +	HIST_IO_NUM_COLUMNS
        +} history_get_history_state;
        Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I
        don't think this should be introduced in the middle of the pg_stat_io
        implementation.
        Right, I've moved it to just before pg_stat_io_histogram_build_tuples().
        +/*
        + * pg_leading_zero_bits32
        + *		Returns the number of leading 0-bits in x, starting at
        + *		the most significant bit position.
        + *		Word must not be 0 (as it is undefined behavior).
        + */
        +static inline int
        +pg_leading_zero_bits32(uint32 word)

        Do we really need this in addition to the already existing
        pg_leftmost_one_pos32()? Particularly because that already has an msvc
        implementation...
        Well, I would be all in for removal, but please see the
        get_bucket_index() discussion above.
        I've tried to get rid of it (but maybe I misunderstood something); in
        the end I think it is more elegant/faster to have it there so that the
        code in get_bucket_index() stays more readable, rather than throwing
        more bitwise voodoo in there(?)
        -J.
        [1] -
        https://www.postgresql.org/message-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ%40mail.gmail.com
          Jakub Wartak<jakub.wartak@enterprisedb.com>
          Jan 28, 2026, 11:12 AM UTC
          On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
          wrote:
          [..]
          Of course most of the I/O calls today are hitting page cache, so one
          would
          expect they'll be < 128us most of the time

          Have you measured whether overhead is measurable when hitting the page
          cache?
          I'd hope that it doesn't, due to io combing amortizing the cost
          somewhat. But
          it seems worth measuring.

          Not yet; I first wanted to hear whether I'm sailing in some plainly
          stupid direction with the idea or implementation (e.g. that
          INSTR_TIME_GET_MICROSEC() was a really stupid omission on my side).

          I'll try to perform this overhead measurement hopefully with v3, once
          we settle on how to do the bit shifting/clz().
          [..]
          Here's the answer: on a properly isolated perf test run (my
          old&legacy&predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB,
          hugepages, no turbo boost, proper warmup, no THP, cpupower D0, no
          physical I/O, ~22k pread64() calls/sec combined to the VFS cache),
          pinned to a single NUMA node: numactl --membind=0 --cpunodebind=0
          measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
          master + track_io_timing=on, 60s warmup and then 3x runs:
          tps = 44615.603668
          tps = 44556.191492
          tps = 44813.793981
          avg = 44662
          master + track_io_timing=on + patch, 60s warmup and then 3x runs:
          tps = 44441.879384
          tps = 44403.101737
          tps = 45036.747418
          avg = 44627
          so that's like 99.921% (so literally no overhead) and yields a picture like:
          postgres=# select bucket_latency_us, bucket_count
          from pg_stat_io_histogram
          where
          bucket_count > 0 and
          backend_type = 'client backend' and
          io_type = 'read' and
          context = 'normal'
          order by bucket_latency_us;
           bucket_latency_us | bucket_count
          -------------------+--------------
           [0,9)             |       273455
           [8,17)            |      3820379
           [16,33)           |        29359
           [32,65)           |          585
           [64,129)          |          467
           [128,257)         |          419
           [256,513)         |        15828
           [512,1025)        |           89
          So one can also see the 0.25..0.5ms bucket being larger there (the
          initial read of the data from the physical device) - that's hardware
          RAID-1 with 2x Intel D3-S4510. And if I do pg_buffercache_evict_all()
          + vm.drop_caches + pg_stat_reset_shared(), I get this picture (note
          the bulkreads):
          postgres=# select bucket_latency_us, bucket_count
          from pg_stat_io_histogram
          where
          bucket_count > 0 and
          backend_type = 'client backend' and context = 'bulkread';
           bucket_latency_us | bucket_count
          -------------------+--------------
           [0,9)             |       102555
           [8,17)            |           70
           [16,33)           |         3938
           [32,65)           |         6763
           [64,129)          |         5206
           [128,257)         |         9392
           [256,513)         |        10959
           [512,1025)        |        21372
           [1024,2049)       |          502
           [2048,4097)       |           34
           [4096,8193)       |            2
           [8192,16385)      |            2
           [16384,32769)     |            7
          So clearly there's a distinction between reading from the VFS cache
          and hitting physical I/O.
          Now we'll probably just need to settle on the proper
          get_bucket_index() implementation.
          -J.
            Andres Freund<andres@anarazel.de>
            Jan 29, 2026, 4:27 PM UTC
            Hi,
            On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
            On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
            Not yet, I first wanted to hear if I'm not sailing into some plain stupid
            direction somewhere with this idea or implementation (e.g.
            that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

            I'll try to perform this test overhead measurement hopefully with v3 once
            we settle on how to do that bit shifting/clz().

            [..]
            Here's the answer: on properly isolated perf test run (my
            old&legacy&predictable
            4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
            warmup,
            no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
            VFS
            cache)
            and started on just using single NUMA: numactl --membind=0
            --cpunodebind=0
            measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

            master + track_io_timing=on, 60s warmup and then 3x runs:
            tps = 44615.603668
            tps = 44556.191492
            tps = 44813.793981
            avg = 44662

            master + track_io_timing=on + patch, 60s warmup and then 3x runs:
            tps = 44441.879384
            tps = 44403.101737
            tps = 45036.747418
            avg = 44627

            so that's like 99.921% (so literally no overhead) and yields picture like:
            I don't think that's a particularly useful assurance, unfortunately:
            1) Using pgbench with an in-memory readonly workload is typically limited by
            context switch overhead and per-statement overhead. After a short while you
            have at most one IO per statement (the heap page), which obviously isn't
            going to be affected by a small per-IO overhead.
            2) The per-core memory bandwidth on that old machine, if it's the quite old
            EDB machine I think it is, is so low, that you'd be bottlenecked by memory
            bandwidth well before you're going to be bottlenecked by actual CPU stuff
            (which the bucket computation is).
            I think you'd have to test something like pg_prewarm(), with
            io_combine_limit=1, on a modern client CPU (client CPUs typically have much
            higher per-core memory bandwidth than the more scalable server CPUs).
            Greetings,
            Andres Freund
              Jakub Wartak<jakub.wartak@enterprisedb.com>
              Jan 30, 2026, 1:44 PM UTC
              On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:

              Hi,

              On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
              On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
              Not yet, I first wanted to hear if I'm not sailing into some plain stupid
              direction somewhere with this idea or implementation (e.g.
              that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

              I'll try to perform this test overhead measurement hopefully with v3 once
              we settle on how to do that bit shifting/clz().

              [..]
              Here's the answer: on properly isolated perf test run (my
              old&legacy&predictable
              4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
              warmup,
              no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
              VFS
              cache)
              and started on just using single NUMA: numactl --membind=0
              --cpunodebind=0
              measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

              master + track_io_timing=on, 60s warmup and then 3x runs:
              tps = 44615.603668
              tps = 44556.191492
              tps = 44813.793981
              avg = 44662

              master + track_io_timing=on + patch, 60s warmup and then 3x runs:
              tps = 44441.879384
              tps = 44403.101737
              tps = 45036.747418
              avg = 44627

              so that's like 99.921% (so literally no overhead) and yields picture like:

              I don't think that's a particularly useful assurance, unfortunately:

              1) Using pgbench with an in-memory readonly workload is typically limited by
              context switch overhead and per-statement overhead. After a short while you
              have at most one IO per statement (the heap page), which obviously isn't
              going to be affected by a small per-IO overhead.

              2) The per-core memory bandwidth on that old machine, if it's the quite old
              EDB machine I think it is, is so low, that you'd be bottlenecked by memory
              bandwidth well before you're going to be bottlenecked by actual CPU stuff
              (which the bucket computation is).
              Hi, thanks for having a look!
              That legacy server is mine, but yeah, even within the same NUMA
              node it can only do ~16GB/s AFAIR, and just ~4GB/s between nodes.
              I forgot to reply in that old NUMA thread back then; maybe it's
              not relevant, but I often find that box valuable, as the
              bottlenecks are easier to hit/notice and there aren't that many
              of the traps that modern CPUs have (+ finding 4 NUMA
              nodes/socket in 2U is not that easy today; after all it's a
              physical box nearby, so PMCs are there too - unlike in the
              "cloud" - and with 4 nodes the imbalances between nodes/zones are
              much more clearly visible than with just 1 local vs 1 remote).
              Somehow I built trust in the results from that machine (but I can
              still lift it! so I probably still shouldn't trust it fully - a
              pun on "never trust a machine you can lift" :D)
              I think you'd have to test something like pg_prewarm(), with
              io_combine_limit=1, on a modern client CPU (client CPUs typically have much
              higher per-core memory bandwidth than the more scalable server CPUs).
              Fair point, thanks for explaining that. So I moved to a much more modern
              Intel(R) Ultra 7 (1s16c32t, which - according to mlc(1) - can do up to
              85-90GB/s of bandwidth, and has those ugly P/E-cores, so I've pinned
              (taskset) the backend doing pg_prewarm to a P-core that usually runs
              @ 5GHz, but with no_turbo that's just @ 1.4GHz). That was a normal build
              (not debug, so -O2, no casserts, gcc 13, kernel 6.14.x, no_turbo,
              shared_buffers=8GB with huge pages, DB scale 600 (pgbench_accounts @
              7.7GB), performance governor, no THP, ...)
              I was kind of surprised, but here goes full disclosure of my results. With
              the patch and track_io_timing=on, io_combine_limit=1, io_method='sync'
              just to care about a single PID, here are timings after
              pg_buffercache_evict_all() followed by the measured duration of
              select pg_prewarm('pgbench_accounts');
              So initially I got this picture (I eliminated some worst/best outliers too):
              Without patch:
              Time: 4644.346 ms (00:04.644)
              Time: 4612.610 ms (00:04.613)
              Time: 4639.133 ms (00:04.639)
              Time: 4625.020 ms (00:04.625)
              Time: 4636.652 ms (00:04.637)
              Avg: 4631ms
              With the patch:
              Time: 4765.780 ms (00:04.766)
              Time: 4784.308 ms (00:04.784)
              Time: 4754.661 ms (00:04.755)
              Time: 4770.772 ms (00:04.771)
              Time: 4768.232 ms (00:04.768)
              Avg: 4768ms (102.95%)
              With the patch and __builtin_clzl():
              Time: 4750.293 ms (00:04.750)
              Time: 4729.288 ms (00:04.729)
              Time: 4727.820 ms (00:04.728)
              Time: 4729.760 ms (00:04.730)
              Time: 4727.146 ms (00:04.727)
              Avg: 4732ms (102.18%)
              So clearly there was some overhead (I started getting worried), and
              __builtin_clzl() was slightly cheaper - or it was just jitter; yes, I got
              plenty of jitter with this (not shown; the above are ~5 results out of 15).
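              For context, the O(1) bucket computation being benchmarked here can be
              sketched roughly like this (a minimal standalone sketch; the 128us..256ms
              bucket range is from the proposal, while the helper name, constants, and
              exact edge handling are my assumptions, not the patch's actual code):

              ```c
              #include <stdint.h>
              #include <stdio.h>

              /* Assumed layout: 12 power-of-two buckets, 128us (2^7) .. 256ms (2^18) */
              #define HIST_MIN_SHIFT 7
              #define HIST_NBUCKETS  12

              /*
               * Map an I/O latency in microseconds to a histogram bucket in O(1)
               * using the position of the highest set bit (no loop, no division).
               */
              static inline int
              io_latency_bucket(uint64_t elapsed_us)
              {
                  int bucket;

                  if (elapsed_us < ((uint64_t) 1 << HIST_MIN_SHIFT))
                      return 0;       /* everything below 128us lands in bucket 0 */

                  /* highest set bit position = 63 - number of leading zeros */
                  bucket = (63 - __builtin_clzll(elapsed_us)) - HIST_MIN_SHIFT;

                  if (bucket >= HIST_NBUCKETS)
                      bucket = HIST_NBUCKETS - 1; /* clamp >= 256ms to last bucket */
                  return bucket;
              }

              int
              main(void)
              {
                  /* 100us -> 0, 300us -> 1, 5ms -> 5, 1s -> 11 (clamped) */
                  printf("%d %d %d %d\n",
                         io_latency_bucket(100),
                         io_latency_bucket(300),
                         io_latency_bucket(5000),
                         io_latency_bucket(1000000));
                  return 0;
              }
              ```

              Since the mapping is just one count-leading-zeros instruction plus a
              subtract and a compare, it should be hard to measure against the cost of
              an actual I/O, which matches the "overhead almost gone" results below.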
              With the v2 patch, __builtin_clzl(), and track_io_timing=off (the default),
              it got back to ~4660ms as expected.
              With the v2 patch, __builtin_clzl(), the default io_combine_limit=128kB,
              and track_io_timing=off, it went back to ~4150ms:
              Time: 4151.942 ms (00:04.152)
              Time: 4133.747 ms (00:04.134)
              Time: 4153.103 ms (00:04.153)
              Time: 4135.199 ms (00:04.135)
              With the patch, __builtin_clzl(), the default io_combine_limit=128kB, and
              track_io_timing=on, it was also @ ~4150ms:
              Time: 4152.941 ms (00:04.153)
              Time: 4154.096 ms (00:04.154)
              Time: 4155.119 ms (00:04.155)
              So with "batching" the IOs, the overhead is almost gone. BTW that's with
              current_clocksource = "tsc". After dozens of runs, I noticed thermals
              starting to play a bigger role than this patch, so I did idle-set -D0
              and it degraded even further.
              Master, but still lots of fluctuations; unfiltered picture:
              Time: 5518.546 ms (00:05.519)
              Time: 5587.675 ms (00:05.588)
              Time: 5512.828 ms (00:05.513)
              Time: 5534.023 ms (00:05.534)
              Time: 5728.125 ms (00:05.728)
              Time: 5731.543 ms (00:05.732)
              Time: 5762.687 ms (00:05.763)
              Time: 5565.607 ms (00:05.566)
              Time: 5498.496 ms (00:05.498)
              Time: 5637.870 ms (00:05.638)
              but if I leave it idle for a couple of minutes then I get:
              Time: 5577.879 ms (00:05.578)
              Time: 5575.648 ms (00:05.576)
              Time: 5548.146 ms (00:05.548)
              After a break, with the patch and __builtin_clzl() (sometimes it gets
              lower than master - how can I trust this?!):
              Time: 5504.415 ms (00:05.504)
              Time: 5531.827 ms (00:05.532)
              Time: 5733.146 ms (00:05.733)
              Time: 5511.549 ms (00:05.512)
              So something more is happening there, probably thermals/scheduler rather
              than the patch. Of course I've done some homework [1][2]; I even found
              your rant on some of this [1], and truth be told I'm unable to stabilize
              those deviations on this modern client CPU. So I tried another, much more
              predictable (and non-modern :P) client CPU: an Intel Core i5 7600K
              (1s4c4t), and got much more consistent numbers there (those are
              unfiltered, with almost identical variables from the same setup (also
              6.14.x, same tweaks, also tasksetted to 1 core)):
              Master:
              Time: 2592.351 ms (00:02.592)
              Time: 2574.612 ms (00:02.575)
              Time: 2592.530 ms (00:02.593)
              Time: 2575.356 ms (00:02.575)
              Time: 2594.687 ms (00:02.595)
              Avg=2585ms
              Master+patch:
              Time: 2577.610 ms (00:02.578)
              Time: 2585.796 ms (00:02.586)
              Time: 2568.559 ms (00:02.569)
              Time: 2586.199 ms (00:02.586)
              Time: 2567.872 ms (00:02.568)
              Avg=2576ms (below master?!)
              Master+patch+__builtin_clzl():
              Time: 2578.083 ms (00:02.578)
              Time: 2586.732 ms (00:02.587)
              Time: 2573.176 ms (00:02.573)
              Time: 2592.048 ms (00:02.592)
              Time: 2575.731 ms (00:02.576)
              Time: 2575.570 ms (00:02.576)
              Avg=2579ms (below master?!)
              Just Master again:
              Time: 2578.838 ms (00:02.579)
              Time: 2588.531 ms (00:02.589)
              Time: 2572.165 ms (00:02.572)
              Time: 2591.528 ms (00:02.592)
              Time: 2572.015 ms (00:02.572)
              Time: 2589.921 ms (00:02.590)
              Time: 2572.124 ms (00:02.572)
              Avg=2580ms
              So to sum up:
              - it still looks OK to me
              - thermals on modern-day CPUs can have a bigger impact than the patch
              itself (?)
              - the older/legacy (desktop) CPU seems to jitter less than a modern client
              laptop CPU, even with the strictest perf settings (?)
              - worst case: to spot that ~2% regression one would have to disable the
              IO batching and enable track_io_timing (which is not the default)
              I'm attaching v3, which now defaults to __builtin_clzl(), which works OK
              for uint64 (not sure if I need to care about __builtin_clzll()
              on Windows?).
              Open questions:
              0. Should I pursue more benchmarking, or are the above results enough?
              1. __builtin_clzl() or not __builtin_clzl(), that is the question...?
              2. Should I add per-PID backend stats too, or do you have something
              against that?
              3. Shouldn't we fix mdsyncfiletag() as mentioned earlier? We seem to have
              pgstat_count_io_op_time() *after* a potential FileClose() (as per my
              earlier question).
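              On the Windows concern in question 1: __builtin_clzl() operates on
              unsigned long, which is only 32 bits on LLP64 platforms (Windows), and
              MSVC has no such builtin at all. A hedged sketch of a portable helper
              (the name and #ifdef structure are mine; PostgreSQL's
              src/include/port/pg_bitutils.h already solves this with
              pg_leftmost_one_pos64(), which may be the simpler route):

              ```c
              #include <stdint.h>
              #include <stdio.h>
              #ifdef _MSC_VER
              #include <intrin.h>
              #endif

              /*
               * Hypothetical portable "highest set bit" helper; word must be non-zero.
               * Mirrors the approach of pg_leftmost_one_pos64() in port/pg_bitutils.h.
               */
              static inline int
              leftmost_one_pos64(uint64_t word)
              {
              #ifdef _MSC_VER
                  unsigned long pos;
                  _BitScanReverse64(&pos, word);  /* MSVC intrinsic, x64 targets */
                  return (int) pos;
              #else
                  /*
                   * __builtin_clzll takes unsigned long long, which is at least
                   * 64 bits everywhere, unlike unsigned long (32-bit on LLP64).
                   */
                  return 63 - __builtin_clzll(word);
              #endif
              }

              int
              main(void)
              {
                  printf("%d %d %d\n",
                         leftmost_one_pos64(1),
                         leftmost_one_pos64(128),
                         leftmost_one_pos64((uint64_t) 1 << 40));
                  return 0;
              }
              ```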
              -J.
              [1] - https://www.postgresql.org/message-id/20231115180433.p3eeaczbam5zxdz5%40awork3.anarazel.de
              [2] - https://vondra.me/posts/benchmarking-is-hard-sometimes/