pgsql-hackers

pg_stat_io_histogram
Jakub Wartak <jakub.wartak@enterprisedb.com>
Jan 26, 2026, 9:41 AM UTC

I'm proposing that we add pg_stat_io_histogram that would track/show an I/O
latency profile, so we could quickly identify I/O outliers. From time to
time users complain that 'PostgreSQL is slow or stuck' (usually COMMIT is
slow), when it is quite apparent that it is down to somewhere in the I/O
stack. It is quite easy to prove once one has proper measurement tools in
place and is able to correlate, but it takes IMHO way too much time and
energy to cross-correlate all of that information (iostat -x 1s, wait
events 1s, and so on), especially if one would like to provide a rapid
response.

Right now the patch does not include per-backend/PID tracking; hopefully if
there is interest in this, I'll add it, but I would like to first hear
if that's a good idea. The current implementation uses fast bucket
calculation to avoid overheads and tries to cover the most useful range of
devices via buckets (128us..256ms, so that covers NVMe/SSD/HDD and
abnormally high latency too, as from time to time I try to help with I/O
stuck for seconds, usually a sign of some I/O multipath issues, a device
resetting, or hypervisor woes).
postgres=# select
    substring(backend_type,1,8) as backend, object, context, io_type,
    bucket_latency_us as lat_us,
    round(bucket_latency_us/1000.0, 3) as lat_ms,
    bucket_count as count
from pg_stat_get_io_histogram()
where bucket_count > 0
order by 1,2,3,4,5;

 backend  |  object  |  context  |  io_type  | lat_us | lat_ms | count
----------+----------+-----------+-----------+--------+--------+-------
 autovacu | relation | normal    | read      |    128 |  0.128 |    54
 autovacu | relation | normal    | read      |    256 |  0.256 |     7
 autovacu | relation | normal    | read      |    512 |  0.512 |     1
 autovacu | relation | vacuum    | read      |    128 |  0.128 |     8
 autovacu | relation | vacuum    | read      |    256 |  0.256 |     5
 backgrou | relation | bulkread  | read      |    128 |  0.128 |   658
 backgrou | relation | normal    | read      |    128 |  0.128 |     5
 checkpoi | relation | normal    | fsync     |   2048 |  2.048 |    37
 checkpoi | relation | normal    | fsync     |   4096 |  4.096 |     7
 checkpoi | relation | normal    | fsync     |  16384 | 16.384 |     4
 checkpoi | relation | normal    | fsync     |  32768 | 32.768 |     1
 checkpoi | relation | normal    | fsync     |  65536 | 65.536 |     1
 checkpoi | relation | normal    | write     |    128 |  0.128 |  2059
 checkpoi | relation | normal    | write     |    256 |  0.256 |     2
 checkpoi | relation | normal    | write     |    512 |  0.512 |     1
 checkpoi | relation | normal    | writeback |    128 |  0.128 |    64
 checkpoi | relation | normal    | writeback |    256 |  0.256 |     1
 client b | relation | bulkread  | read      |    128 |  0.128 |   675
 client b | relation | bulkread  | read      |    256 |  0.256 |     1
 client b | relation | bulkwrite | extend    |    128 |  0.128 |   260
 client b | relation | bulkwrite | extend    |    512 |  0.512 |     1
 client b | relation | bulkwrite | write     |    128 |  0.128 | 14404
 client b | relation | normal    | extend    |    128 |  0.128 |     6
 client b | relation | normal    | read      |    128 |  0.128 |   273
 client b | relation | normal    | read      |    256 |  0.256 |     6
 client b | relation | vacuum    | read      |    128 |  0.128 |   907
 client b | relation | vacuum    | read      |    256 |  0.256 |     3
 client b | relation | vacuum    | read      |    512 |  0.512 |     2

Of course most of the I/O calls today are hitting the page cache, so one
would expect they'll be < 128us most of the time, but above you can see
degraded fsync/fdatasync as well (BTW that was achieved via a device-mapper
delayed device). My hope is that the above would help tremendously when
dealing with flaky storage, I/O path issues, or even hypervisors being
paused.
An alternative idea I was having would be simply to add logging of slow I/O
outliers, but meh.. then one would have to answer all those questions:
what should be the threshold (=> GUC?), the risk of spamming the log and so
on (and I wouldn't be fond of proposing yet another log_* GUC ;))

Any hints, co-authors, or help are more than welcome!

-J.

Andres Freund <andres@anarazel.de>
Jan 26, 2026, 3:08 PM UTC

Hi,
On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
> I'm proposing that we add pg_stat_io_histogram that would track/show an I/O
> latency profile, so we could quickly identify I/O outliers.

I think that's an interesting feature!

> From time to time users complain that 'PostgreSQL is slow or stuck'
> (usually COMMIT is slow), when it is quite apparent that it is down to
> somewhere in the I/O stack. It is quite easy to prove once one has proper
> measurement tools in place and is able to correlate, but it takes IMHO way
> too much time and energy to cross-correlate all of that information
> (iostat -x 1s, wait events 1s, and so on), especially if one would like to
> provide a rapid response.

For measuring particularly stuck things, I've been wondering about having a
regular timer that starts to collect more information if stuck in a place for
a while. That would probably end up being lower overhead than constantly
measuring... But would also be a lot more work.

> Right now the patch does not include per-backend/PID tracking; hopefully if
> there is interest in this, I'll add it, but I would like to first hear
> if that's a good idea. The current implementation uses fast bucket
> calculation to avoid overheads and tries to cover the most useful range of
> devices via buckets (128us..256ms, so that covers NVMe/SSD/HDD and
> abnormally high latency too, as from time to time I try to help with I/O
> stuck for seconds, usually a sign of some I/O multipath issues, a device
> resetting, or hypervisor woes).

Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
NVMe disk you'll get < 10us, after all.
I see a few problems with the source of the latency measurements though:

- The latency gathered is quite heavily affected by scheduler noise. If your
  process isn't scheduled because other processes are busy doing stuff on the
  CPU, it's quite possible to get results many orders of magnitude wrong.

- With AIO, you're measuring wait time, and that time can be affected by
  other IOs in the queue. That will often *drastically* overestimate IO
  latency measured this way.

I don't see how we can do better absent cooperation from the kernel (by
putting lower-level measurements into io_uring completions, for example)
though. So maybe this is just how it has to be and we ought to just document
it.

> postgres=# select
>     substring(backend_type,1,8) as backend, object, context, io_type,
>     bucket_latency_us as lat_us,
>     round(bucket_latency_us/1000.0, 3) as lat_ms,
>     bucket_count as count
> from pg_stat_get_io_histogram()
> where bucket_count > 0
> order by 1,2,3,4,5;
>
>  backend  |  object  |  context  | io_type | lat_us | lat_ms | count
> ----------+----------+-----------+---------+--------+--------+-------
>  autovacu | relation | normal    | read    |    128 |  0.128 |    54

Perhaps the latency should be represented as a range?

> Of course most of the I/O calls today are hitting the page cache, so one
> would expect they'll be < 128us most of the time

Have you measured whether overhead is measurable when hitting the page cache?
I'd hope that it doesn't, due to io combining amortizing the cost somewhat.
But it seems worth measuring.

I assume you made pgstat_get_io_op_name() return "hit?" because you don't
expect that to ever be hit?

> +static inline int get_bucket_index(uint32_t val) {
> +#define MIN_PGSTAT_IO_HIST_LATENCY 127
> +	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
> +	/*
> +	 * hopefully calculated to be 25 by the compiler:
> +	 * clz(127) = clz(01111111b on uint32) = 25
> +	 */
> +	const uint32_t min_latency_leading_zeros =
> +		pg_leading_zero_bits32(MIN_PGSTAT_IO_HIST_LATENCY);
> +
> +	/*
> +	 * make sure the tmp value is at least 127 (our minimum bucket size)
> +	 * as __builtin_clz might return undefined behavior when operating on 0
> +	 */
> +	uint32_t tmp = val | MIN_PGSTAT_IO_HIST_LATENCY;
> +	/* count leading zeros */
> +	int leading_zeros = pg_leading_zero_bits32(tmp);
> +
> +	/* normalize the index */
> +	uint32_t index = min_latency_leading_zeros - leading_zeros;
> +
> +	/* clamp it to the maximum */
> +	return (index > max_index) ? max_index : index;
> +}
Wouldn't it be easier to handle the minimum latency by shifting right?

I think we may also need to handle inputs that don't fit a uint32. For things
like a stopped VM or such we could see IOs that don't fit into a uint32
when measured in microseconds. So perhaps I'd make the input to the bucket
calc 64 bits, then shift to the minimum precision and mask to implement
clamping.

> @@ -152,6 +189,10 @@ pg_stat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>  	INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
>  				   io_time);
> +	/* calculate the bucket_index based on latency in us */
> +	bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
> +	PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
> +
>  	/* Add the per-backend count */
>  	pg_stat_count_backend_io_op_time(io_object, io_context, io_op,

It's annoying to have to convert to microseconds here, that's not free :(.

> @@ -1356,6 +1356,24 @@ typedef enum io_stat_col
>  	IO_NUM_COLUMNS,
>  } io_stat_col;
>
> +/*
> + * When adding a new column to the pg_stat_io view and the
> + * pg_stat_get_backend_io() function, add a new enum value here above
> + * IO_NUM_COLUMNS.
> + */
> +typedef enum hist_io_stat_col
> +{
> +	HIST_IO_COL_INVALID = -1,
> +	HIST_IO_COL_BACKEND_TYPE,
> +	HIST_IO_COL_OBJECT,
> +	HIST_IO_COL_CONTEXT,
> +	HIST_IO_COL_IO_TYPE,
> +	HIST_IO_COL_BUCKET_US,
> +	HIST_IO_COL_COUNT,
> +	HIST_IO_COL_RESET_TIME,
> +	HIST_IO_NUM_COLUMNS
> +} history_get_history_state;

Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't
think this should be introduced in the middle of the pg_stat_io
implementation.

> +/*
> + * pg_leading_zero_bits32
> + *		Returns the number of leading 0-bits in x, starting at the most
> + *		significant bit position.
> + *		Word must not be 0 (as it is undefined behavior).
> + */
> +static inline int
> +pg_leading_zero_bits32(uint32 word)

Do we really need this in addition to the already existing
pg_leftmost_one_pos32()? Particularly because that already has an msvc
implementation...
Greetings,

Andres Freund

Jakub Wartak <jakub.wartak@enterprisedb.com>
Jan 27, 2026, 12:06 PM UTC

On Mon, Jan 26, 2026 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
> > I'm proposing that we add pg_stat_io_histogram that would track/show an
> > I/O latency profile, so we could quickly identify I/O outliers.

Hi Andres, thanks for the kind words and review!

> I think that's an interesting feature!

[..]

> For measuring particularly stuck things, I've been wondering about having a
> regular timer that starts to collect more information if stuck in a place
> for a while. That would probably end up being lower overhead than
> constantly measuring... But it would also be a lot more work.

Well, if something is really stuck, I think the wait events are covering us
on that, aren't they? One can argue whether they carry enough information
(for me they mostly do, but I'm trying to squeeze some more stuff into them
in a nearby thread [1], BTW: it's kind of "blocked" due to that 56-bit
relfilenode idea/question, any thoughts on that?)

> > Right now the patch does not include per-backend/PID tracking; hopefully
> > if there is interest in this, I'll add it, but I would like to first
> > hear if that's a good idea. The current implementation uses fast bucket
> > calculation to avoid overheads and tries to cover the most useful range
> > of devices via buckets (128us..256ms, so that covers NVMe/SSD/HDD and
> > abnormally high latency too, as from time to time I try to help with I/O
> > stuck for seconds, usually a sign of some I/O multipath issues, a device
> > resetting, or hypervisor woes).
>
> Hm. Isn't 128us a pretty high floor for at least reads and writes? On a
> good NVMe disk you'll get < 10us, after all.

I was blind and concentrated way too much on the bad-behaving I/O rather than
the good I/O - let's call it I/O negativity bias 8)

Now v2 contains the min bucket lowered to 8us (but the max then is just
~131ms; I didn't want it to use more than 64 bytes total, 16*4B (uint32)=64B
and well 16*8B (uint64)=128B already, so that's why it's capped at 131072us
right now).

> I see a few problems with the source of the latency measurements though:
> - The latency gathered is quite heavily affected by scheduler noise. If
>   your process isn't scheduled because other processes are busy doing stuff
>   on the CPU, it's quite possible to get results many orders of magnitude
>   wrong.
> - With AIO, you're measuring wait time, and that time can be affected by
>   other IOs in the queue. That will often drastically overestimate IO
>   latency measured this way.
>
> I don't see how we can do better absent cooperation from the kernel (by
> putting lower-level measurements into io_uring completions, for example)
> though. So maybe this is just how it has to be and we ought to just
> document it.
Right, I think this is a complex topic on its own; I've added a small section
to the docs. I didn't want to start the thread by undermining my own results,
but indeed I'm getting "bad" numbers. Bad in the sense that the perceived
latency numbers do not match other measurements:

E.g. checkpointer's fsync/fdatasync latency is awful, although I've been
using this to simulate latency (just 2ms!, but it ended up adding way more):

    dd if=/dev/zero of=/fs bs=1M count=1024
    losetup /dev/loop15 /fs
    echo "0 $(blockdev --getsz /dev/loop15) delay /dev/loop15 0 2" | \
        dmsetup create delay
    mkfs.ext4 /dev/mapper/delay
    mount /dev/mapper/delay /mnt

... and it has e.g. quite interesting effects:
- without "noatime" the impact is clearly visible on fsync
- even with noatime I was getting spikes of latencies above 131ms (sic!) with
  this:

      select pg_stat_reset_shared();
      pgbench -i -s 10 -p 1234 -h /tmp postgres
      checkpoint;

  sample of the view:

      backend  | object   | context | io_type | lat_us | lat_ms  | count
      checkpoi | relation | normal  | fsync   |  32768 |  32.768 |    42
      checkpoi | relation | normal  | fsync   |  65536 |  65.536 |     3
      checkpoi | relation | normal  | fsync   | 262144 | 262.144 |     1

- I've created the attached bpftrace script to see the gap between kernel and
  uprobes, but it's not that high; the worst case seems to be like ~20us
  wasted (the gap is between user and kernel) vs eBPF, which does not seem to
  see that:

      [fdatasync] PID 54197 | Kernel: 12943 us | User: 12964 us | Lag: 21us
      [..but usually it's just:]
      [ fsync] PID 52266 | Kernel:  1711 us | User:  1714 us | Lag: 3us
      [ fsync] PID 52266 | Kernel: 19913 us | User: 19916 us | Lag: 3us
      [ fsync] PID 52266 | Kernel:  1993 us | User:  1996 us | Lag: 3us
      [ fsync] PID 52266 | Kernel:  1994 us | User:  1995 us | Lag: 1us
      [ fsync] PID 52266 | Kernel: 53734 us | User: 53736 us | Lag: 2us
      [ fsync] PID 52266 | Kernel:  8066 us | User:  8072 us | Lag: 6us
      [ fsync] PID 52266 | Kernel:  2107 us | User:  2109 us | Lag: 2us
      [ fsync] PID 52266 | Kernel:  1972 us | User:  1974 us | Lag: 2us

  (this is on the 2ms delayed + noatime fs + CONFIG_HZ=1000 + laptop's NVMe
  that's idle).

- in mdsyncfiletag() we seem to have pgstat_count_io_op_time() after a
  potential FileClose(); I haven't witnessed a long close(), but it's still a
  potential context-switch
- I've spotted that power mgmt might be influencing it even further (but
  that's not in the docs yet, dunno if I should add it there like the next
  item on the list)

> >  backend  | object   | context | io_type | lat_us | lat_ms | count
> > ----------+----------+---------+---------+--------+--------+-------
> >  autovacu | relation | normal  | read    |    128 |  0.128 |    54
>
> Perhaps the latency should be represented as a range?

Cool idea, I haven't even thought about this one! From now on v2 shows:

postgres=# select
    substring(backend_type,1,8) as btype, object, context, io_type,
    bucket_latency_us as lat_us,
    bucket_count as cnt
from pg_stat_get_io_histogram()
where bucket_count > 0
order by 1,2,3,4,5;
[..]
 btype    | object   | context | io_type | lat_us  | cnt
 checkpoi | relation | normal  | write   | [0,9)   |  33
 checkpoi | relation | normal  | write   | [8,17)  |   8
 checkpoi | relation | normal  | write   | [16,33) |   1

> > Of course most of the I/O calls today are hitting the page cache, so one
> > would expect they'll be < 128us most of the time
>
> Have you measured whether overhead is measurable when hitting the page
> cache? I'd hope that it doesn't, due to io combining amortizing the cost
> somewhat. But it seems worth measuring.
Not yet, I first wanted to hear whether I'm not sailing in some plain stupid
direction somewhere with this idea or implementation (e.g. that
INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side). I'll
try to perform this overhead measurement hopefully with v3 once we settle on
how to do that bit shifting/clz().

> I assume you made pgstat_get_io_op_name() return "hit?" because you don't
> expect that to ever be hit?

Yes, my patch seems to always return 0 for "hit?". I'll need to investigate
that further.

> > +static inline int get_bucket_index(uint32_t val) {
> > +#define MIN_PGSTAT_IO_HIST_LATENCY 127
> > +	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
> > +	/*
> > +	 * hopefully calculated to be 25 by the compiler:
> > +	 * clz(127) = clz(01111111b on uint32) = 25
> > +	 */
> > +	const uint32_t min_latency_leading_zeros =
> > +		pg_leading_zero_bits32(MIN_PGSTAT_IO_HIST_LATENCY);
> > +
> > +	/*
> > +	 * make sure the tmp value is at least 127 (our minimum bucket size)
> > +	 * as __builtin_clz might return undefined behavior when operating on 0
> > +	 */
> > +	uint32_t tmp = val | MIN_PGSTAT_IO_HIST_LATENCY;
> > +	/* count leading zeros */
> > +	int leading_zeros = pg_leading_zero_bits32(tmp);
> > +
> > +	/* normalize the index */
> > +	uint32_t index = min_latency_leading_zeros - leading_zeros;
> > +
> > +	/* clamp it to the maximum */
> > +	return (index > max_index) ? max_index : index;
> > +}
>
> Wouldn't it be easier to handle the minimum latency by shifting right?
What you seem to suggest seems to be equal-width buckets {1,2,3,4..ms} and
not logarithmic buckets {1,2,4,8..ms}, or am I missing something? The patch
as it stands has two #ifdef implementations now, with bitwise shifting
working now, but even in objdump on -O2 there's plenty of those jumps because
of the current code.

> I think we may also need to handle inputs that don't fit a uint32.
> [..] For things like a stopped VM or such we could see IOs that don't fit
> into a uint32 when measured in microseconds. So perhaps I'd make the input
> to the bucket calc 64 bits, then shift to the minimum precision and mask to
> implement clamping.

Fixed.

> > @@ -152,6 +189,10 @@ pg_stat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >  	INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
> >  				   io_time);
> > +	/* calculate the bucket_index based on latency in us */
> > +	bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
> > +	PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
> > +
> >  	/* Add the per-backend count */
> >  	pg_stat_count_backend_io_op_time(io_object, io_context, io_op,
>
> It's annoying to have to convert to microseconds here, that's not free :(.

Oooops, fixed, get_bucket_index() now operates directly on nanos/int64.
[..]

> > +	HIST_IO_COL_COUNT,
> > +	HIST_IO_COL_RESET_TIME,
> > +	HIST_IO_NUM_COLUMNS
> > +} history_get_history_state;
>
> Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't
> think this should be introduced in the middle of the pg_stat_io
> implementation.

Right, I've moved it to just before pg_stat_io_histogram_build_tuples().
> > +/*
> > + * pg_leading_zero_bits32
> > + *		Returns the number of leading 0-bits in x, starting at the most
> > + *		significant bit position.
> > + *		Word must not be 0 (as it is undefined behavior).
> > + */
> > +static inline int
> > +pg_leading_zero_bits32(uint32 word)
>
> Do we really need this in addition to the already existing
> pg_leftmost_one_pos32()? Particularly because that already has an msvc
> implementation...

Well, I would be all in for removal, but please see the get_bucket_index()
discussion above. I've tried to get rid of it (but maybe I misunderstood
something), but in the end I think it is more elegant/faster to have it there
so that the code in get_bucket_index() stays more readable, rather than throw
more bitwise voodoo in there(?)
-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ%40mail.gmail.com

Jakub Wartak <jakub.wartak@enterprisedb.com>
Jan 28, 2026, 11:12 AM UTC

On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
wrote:

[..]
> > > Of course most of the I/O calls today are hitting the page cache, so
> > > one would expect they'll be < 128us most of the time
> >
> > Have you measured whether overhead is measurable when hitting the page
> > cache? I'd hope that it doesn't, due to io combining amortizing the cost
> > somewhat. But it seems worth measuring.
>
> [..]
>
> Not yet, I first wanted to hear whether I'm not sailing in some plain
> stupid direction somewhere with this idea or implementation (e.g. that
> INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side). I'll
> try to perform this overhead measurement hopefully with v3 once we settle
> on how to do that bit shifting/clz().
Here's the answer: on a properly isolated perf test run (my
old&legacy&predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages,
no turboboost, proper warmup, no THP, cpupower D0, no physical I/O, ~22k
pread64() calls/sec combined to the VFS cache) and started on just a single
NUMA node: numactl --membind=0 --cpunodebind=0

measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

master+track_io_timing=on, 60s warmup and then 3x runs:
tps = 44615.603668
tps = 44556.191492
tps = 44813.793981
avg = 44662

master+track_io_timing=on+patch, 60s warmup and then 3x runs:
tps = 44441.879384
tps = 44403.101737
tps = 45036.747418
avg = 44627

so that's like 99.921% (so literally no overhead) and yields a picture like:

postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where bucket_count > 0 and backend_type = 'client backend'
  and io_type = 'read' and context = 'normal'
order by bucket_latency_us;

 bucket_latency_us | bucket_count
 [0,9)             |       273455
 [8,17)            |      3820379
 [16,33)           |        29359
 [32,65)           |          585
 [64,129)          |          467
 [128,257)         |          419
 [256,513)         |        15828
 [512,1025)        |           89

So one can also see the 0.25..0.5ms bucket being larger there (the initial
reading of the data from the physical device) and that's hardware RAID-1 with
2x Intel D3-S4510. And if I do pg_buffercache_evict_all() + vm drop_caches +
pg_stat_reset_shared(), I get this picture (note: for bulkreads):
postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where bucket_count > 0 and backend_type = 'client backend'
  and context = 'bulkread';

 bucket_latency_us | bucket_count
 [0,9)             |       102555
 [8,17)            |           70
 [16,33)           |         3938
 [32,65)           |         6763
 [64,129)          |         5206
 [128,257)         |         9392
 [256,513)         |        10959
 [512,1025)        |        21372
 [1024,2049)       |          502
 [2048,4097)       |           34
 [4096,8193)       |            2
 [8192,16385)      |            2
 [16384,32769)     |            7

So clearly there's a distinction between reading the VFS cache and hitting
physical I/O.

Now we'll just probably settle on the proper get_bucket_index() impl.
-J.

Andres Freund <andres@anarazel.de>
Jan 29, 2026, 4:27 PM UTC

Hi,

On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
> On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
> wrote:
> > Not yet, I first wanted to hear whether I'm not sailing in some plain
> > stupid direction somewhere with this idea or implementation (e.g. that
> > INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).
> > I'll try to perform this overhead measurement hopefully with v3 once we
> > settle on how to do that bit shifting/clz().
> [..]
> Here's the answer: on a properly isolated perf test run (my
> old&legacy&predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages,
> no turboboost, proper warmup, no THP, cpupower D0, no physical I/O, ~22k
> pread64() calls/sec combined to the VFS cache)
> and started on just a single NUMA node: numactl --membind=0 --cpunodebind=0
>
> measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
>
> master+track_io_timing=on, 60s warmup and then 3x runs:
> tps = 44615.603668
> tps = 44556.191492
> tps = 44813.793981
> avg = 44662
>
> master+track_io_timing=on+patch, 60s warmup and then 3x runs:
> tps = 44441.879384
> tps = 44403.101737
> tps = 45036.747418
> avg = 44627
>
> so that's like 99.921% (so literally no overhead) and yields a picture like:

I don't think that's a particularly useful assurance, unfortunately:

1) Using pgbench with an in-memory readonly workload is typically limited by
   context switch overhead and per-statement overhead. After a short while
   you have at most one IO per statement (the heap page), which obviously
   isn't going to be affected by a small per-IO overhead.

2) The per-core memory bandwidth on that old machine, if it's the quite old
   EDB machine I think it is, is so low that you'd be bottlenecked by memory
   bandwidth well before you're going to be bottlenecked by actual CPU stuff
   (which the bucket computation is).

I think you'd have to test something like pg_prewarm(), with
io_combine_limit=1, on a modern client CPU (client CPUs typically have much
higher per-core memory bandwidth than the more scalable server CPUs).

Greetings,

Andres Freund

Jakub Wartak <jakub.wartak@enterprisedb.com>
Jan 30, 2026, 1:44 PM UTC

On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
> > On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
> > > Not yet, I first wanted to hear whether I'm not sailing in some plain
> > > stupid direction somewhere with this idea or implementation (e.g. that
> > > INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).
> > > I'll try to perform this overhead measurement hopefully with v3 once we
> > > settle on how to do that bit shifting/clz().
> > [..]
> > Here's the answer: on a properly isolated perf test run (my
> > old&legacy&predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB,
> > hugepages, no turboboost, proper warmup, no THP, cpupower D0, no physical
> > I/O, ~22k pread64() calls/sec combined to the VFS cache)
> > and started on just a single NUMA node: numactl --membind=0 --cpunodebind=0
> > measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
> > master+track_io_timing=on, 60s warmup and then 3x runs:
> > tps = 44615.603668
> > tps = 44556.191492
> > tps = 44813.793981
> > avg = 44662
> > master+track_io_timing=on+patch, 60s warmup and then 3x runs:
> > tps = 44441.879384
> > tps = 44403.101737
> > tps = 45036.747418
> > avg = 44627
> > so that's like 99.921% (so literally no overhead) and yields a picture
> > like:

Hi, thanks for having a look!
> I don't think that's a particularly useful assurance, unfortunately:
>
> 1) Using pgbench with an in-memory readonly workload is typically limited
>    by context switch overhead and per-statement overhead. After a short
>    while you have at most one IO per statement (the heap page), which
>    obviously isn't going to be affected by a small per-IO overhead.
>
> 2) The per-core memory bandwidth on that old machine, if it's the quite old
>    EDB machine I think it is, is so low that you'd be bottlenecked by
>    memory bandwidth well before you're going to be bottlenecked by actual
>    CPU stuff (which the bucket computation is).

That legacy server is mine, but yeah, even on the same NUMA node it can just
get ~16GB/s AFAIR and just ~4GB/s between nodes. I forgot to reply in that
old NUMA thread back then; maybe it's not relevant, but I often find it
valuable as the bottlenecks are easier to hit/notice and there aren't that
many traps that modern CPUs have (+ I find having 4 NUMA nodes/socket in 2U
is not that easy today; after all it's a physical box nearby so PMCs are
there too - unlike in the "cloud" - and by having 4 nodes the disbalances
between nodes/zones are much more cleanly visible than just 1 local vs 1
remote). Somehow I built trust in the results from that machine (but I still
can lift it! so probably I still shouldn't trust it fully - pun on the "never
trust a machine you can lift" :D)

> I think you'd have to test something like pg_prewarm(), with
> io_combine_limit=1, on a modern client CPU (client CPUs typically have much
> higher per-core memory bandwidth than the more scalable server CPUs).

Fair point, thanks for explaining in the above chapter. So for a much more
modern
Intel(R) Ultra 7 (1s16c32t, which - according to mlc(1) - can do up to
85-90GB/s of bandwidth and has those ugly P/E-cores), I've pinned (taskset)
the backend doing pg_prewarm to a P-core that usually runs @ 5GHz, but with
noturbo that's just @ 1.4GHz. That was on a normal build (not debug, so -O2,
no casserts, gcc 13, kernel 6.14.x, noturbo, s_b=8GB with HP, DB scale 600
(pgbench_accounts @ 7.7GB), performance governor, no THP, ...)

I was kind of surprised, but here it goes, full disclosure of my results.
With the patch and track_io_timing=on, io_combine_limit=1, io_method='sync'
just to care about 1 PID, here are the timings of pg_buffercache_evict_all()
followed by the measured duration of:

select pg_prewarm('pgbench_accounts');
So initially I got this picture (some worst/best results eliminated too):
Without patch:
Time: 4644.346 ms (00:04.644)
Time: 4612.610 ms (00:04.613)
Time: 4639.133 ms (00:04.639)
Time: 4625.020 ms (00:04.625)
Time: 4636.652 ms (00:04.637)
Avg: 4631ms
With the patch:
Time: 4765.780 ms (00:04.766)
Time: 4784.308 ms (00:04.784)
Time: 4754.661 ms (00:04.755)
Time: 4770.772 ms (00:04.771)
Time: 4768.232 ms (00:04.768)
Avg: 4768ms (102.95%)
With the patch and __builtin_clzl()
Time: 4750.293 ms (00:04.750)
Time: 4729.288 ms (00:04.729)
Time: 4727.820 ms (00:04.728)
Time: 4729.760 ms (00:04.730)
Time: 4727.146 ms (00:04.727)
Avg: 4732ms (102.18%)
So clearly there was some overhead (I started getting worried), and
__builtin_clz() was slightly cheaper - or there's just too much jitter; yes,
I've got plenty of jitter with this (not shown, so the above are like 5
results out of 15).
With the v2 patch and __builtin_clzl() and track_io_timing=off (the default),
it got back to ~4660ms as expected.

With the v2 patch and __builtin_clzl() and the default io_combine_limit=128kB
and track_io_timing=off, it went back to ~4150ms:
Time: 4151.942 ms (00:04.152)
Time: 4133.747 ms (00:04.134)
Time: 4153.103 ms (00:04.153)
Time: 4135.199 ms (00:04.135)

With the patch and __builtin_clzl() and the default io_combine_limit=128kB
and track_io_timing=on, it was also @ ~4150ms:
Time: 4152.941 ms (00:04.153)
Time: 4154.096 ms (00:04.154)
Time: 4155.119 ms (00:04.155)

So with "batching" the IOs, the overhead is almost gone. BTW that's with
current_clocksource saying "tsc". After dozens of runs, I noticed thermals
starting to play a bigger role than this patch, so I did idle-set -D0
and it degraded even further.
Master, but still with lots of fluctuations, non-filtered picture:
Time: 5518.546 ms (00:05.519)
Time: 5587.675 ms (00:05.588)
Time: 5512.828 ms (00:05.513)
Time: 5534.023 ms (00:05.534)
Time: 5728.125 ms (00:05.728)
Time: 5731.543 ms (00:05.732)
Time: 5762.687 ms (00:05.763)
Time: 5565.607 ms (00:05.566)
Time: 5498.496 ms (00:05.498)
Time: 5637.870 ms (00:05.638)
but if I leave it idle for a couple of minutes then I get:
Time: 5577.879 ms (00:05.578)
Time: 5575.648 ms (00:05.576)
Time: 5548.146 ms (00:05.548)
After some break, with the patch and __builtin_clzl() (it sometimes gets
lower than master, how can I trust this?!):
Time: 5504.415 ms (00:05.504)
Time: 5531.827 ms (00:05.532)
Time: 5733.146 ms (00:05.733)
Time: 5511.549 ms (00:05.512)
So something more is happening there, probably with thermals/scheduler rather
than with the patch. So of course I've done some homework [1][2]; I even
found your rant on some of this [1] and, truth be told, I'm unable to
stabilize those deviations on this modern client CPU. So I've tried another,
much more predictable (and non-modern :P) client CPU: Intel Core i5 7600K
(1s4c4t) and got much more consistent numbers there (those are non-filtered,
almost identical variables from the same setup (also 6.14.x, same tweaks,
also with taskset to 1 core)):
Master:
Time: 2592.351 ms (00:02.592)
Time: 2574.612 ms (00:02.575)
Time: 2592.530 ms (00:02.593)
Time: 2575.356 ms (00:02.575)
Time: 2594.687 ms (00:02.595)
Avg=2585ms
Master+patch:
Time: 2577.610 ms (00:02.578)
Time: 2585.796 ms (00:02.586)
Time: 2568.559 ms (00:02.569)
Time: 2586.199 ms (00:02.586)
Time: 2567.872 ms (00:02.568)
Avg=2576ms (below master?!)
Master+patch+__builtin_clzl():
Time: 2578.083 ms (00:02.578)
Time: 2586.732 ms (00:02.587)
Time: 2573.176 ms (00:02.573)
Time: 2592.048 ms (00:02.592)
Time: 2575.731 ms (00:02.576)
Time: 2575.570 ms (00:02.576)
Avg=2579ms (below master?!)
Just Master again:
Time: 2578.838 ms (00:02.579)
Time: 2588.531 ms (00:02.589)
Time: 2572.165 ms (00:02.572)
Time: 2591.528 ms (00:02.592)
Time: 2572.015 ms (00:02.572)
Time: 2589.921 ms (00:02.590)
Time: 2572.124 ms (00:02.572)
Avg=2580ms
So to sum up:
- it still looks OK to me
- thermals on modern-day CPUs can have a bigger impact than the patch
  itself (?)
- an older/legacy CPU (a desktop one) seems to jitter less than a modern
  client laptop CPU, even with the most strict perf settings (?)
- worst-case: to spot that ~2% regression one would have to disable the io
  batching and enable track_io_timing (that's not the default)

I'm attaching v3 which has the default switched to __builtin_clzl(), which
works ok for uint64 (not sure if I need to care about __builtin_clzll on
Windows?).
Open questions:
0. Should I pursue more benchmarking or are the above results enough?
1. __builtin_clzl() or not to __builtin_clzl(), that is the question... ?
2. Should I add per-PID backend stats too, or do you have something against
   it?
3. Shouldn't we fix that mdsyncfiletag() issue mentioned earlier - we seem to
   have pgstat_count_io_op_time() *after* a potential FileClose() (as per my
   earlier question)?

-J.

[1] - https://www.postgresql.org/message-id/20231115180433.p3eeaczbam5zxdz5%40awork3.anarazel.de
[2] - https://vondra.me/posts/benchmarking-is-hard-sometimes/