pg_stat_io_histogram

    Jakub Wartak<jakub.wartak@enterprisedb.com>
    Jan 26, 2026, 9:41 AM UTC
    I'm proposing that we add pg_stat_io_histogram, which would track/show an I/O
    latency profile, so we could quickly identify I/O outliers. From time to
    time users complain that 'PostgreSQL is slow or stuck' (usually COMMIT is
    slow), when it is quite apparent that the cause is somewhere in the I/O
    stack. It is quite easy to prove once one has proper measurement tools in
    place and is able to correlate, but IMHO it takes way too much time and
    energy to cross-correlate all of that information (iostat -x 1s,
    wait events every 1s, and so on), especially if one would like to provide a
    rapid response.
    Right now the patch does not include per-backend/PID tracking; if there is
    interest in this I'll hopefully add it, but I would first like to hear
    whether that's a good idea. The current implementation uses a fast bucket
    calculation to avoid overhead and tries to cover the most useful range of
    devices via its buckets (128us..256ms, which covers NVMe/SSD/HDD as well as
    abnormally high latency, since from time to time I try to help with I/O
    stuck for seconds - usually a sign
    of some I/O multipath issue, a device resetting, or hypervisor woes).
    postgres=# select
    substring(backend_type,1,8) as backend, object, context, io_type,
    bucket_latency_us as lat_us,
    round(bucket_latency_us/1000.0, 3) as lat_ms,
    bucket_count as count
    from pg_stat_get_io_histogram()
    where
    bucket_count > 0
    order by 1,2,3,4,5;
     backend  |  object  |  context  |  io_type  | lat_us | lat_ms | count
    ----------+----------+-----------+-----------+--------+--------+-------
     autovacu | relation | normal    | read      |    128 |  0.128 |    54
     autovacu | relation | normal    | read      |    256 |  0.256 |     7
     autovacu | relation | normal    | read      |    512 |  0.512 |     1
     autovacu | relation | vacuum    | read      |    128 |  0.128 |     8
     autovacu | relation | vacuum    | read      |    256 |  0.256 |     5
     backgrou | relation | bulkread  | read      |    128 |  0.128 |   658
     backgrou | relation | normal    | read      |    128 |  0.128 |     5
     checkpoi | relation | normal    | fsync     |   2048 |  2.048 |    37
     checkpoi | relation | normal    | fsync     |   4096 |  4.096 |     7
     checkpoi | relation | normal    | fsync     |  16384 | 16.384 |     4
     checkpoi | relation | normal    | fsync     |  32768 | 32.768 |     1
     checkpoi | relation | normal    | fsync     |  65536 | 65.536 |     1
     checkpoi | relation | normal    | write     |    128 |  0.128 |  2059
     checkpoi | relation | normal    | write     |    256 |  0.256 |     2
     checkpoi | relation | normal    | write     |    512 |  0.512 |     1
     checkpoi | relation | normal    | writeback |    128 |  0.128 |    64
     checkpoi | relation | normal    | writeback |    256 |  0.256 |     1
     client b | relation | bulkread  | read      |    128 |  0.128 |   675
     client b | relation | bulkread  | read      |    256 |  0.256 |     1
     client b | relation | bulkwrite | extend    |    128 |  0.128 |   260
     client b | relation | bulkwrite | extend    |    512 |  0.512 |     1
     client b | relation | bulkwrite | write     |    128 |  0.128 | 14404
     client b | relation | normal    | extend    |    128 |  0.128 |     6
     client b | relation | normal    | read      |    128 |  0.128 |   273
     client b | relation | normal    | read      |    256 |  0.256 |     6
     client b | relation | vacuum    | read      |    128 |  0.128 |   907
     client b | relation | vacuum    | read      |    256 |  0.256 |     3
     client b | relation | vacuum    | read      |    512 |  0.512 |     2
    Of course most I/O calls today hit the page cache, so one would expect
    them to be < 128us most of the time, but above you can also see degraded
    fsync/fdatasync (BTW, that was achieved via a device-mapper delayed
    device). My hope is that the above would help tremendously when dealing
    with flaky storage, I/O path issues, or even hypervisors being paused.
    An alternative idea I had was simply to log slow I/O outliers, but meh..
    then one would have to answer all those questions: what should the
    threshold be (=> GUC?), the risk of spamming the log, and so on
    (and I wouldn't be fond of proposing yet another log_* GUC ;))
    Any hints, co-authors, or help are more than welcome!
      Andres Freund<andres@anarazel.de>
      Jan 26, 2026, 3:08 PM UTC
      Hi,
      On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
      I'm proposing that we add pg_stat_io_histogram that would track/show I/O
      latencies profile, so we could quickly identify I/O outliers.
      I think that's an interesting feature!
      From time to time users complain that 'PostgreSQL is slow or stuck' (usually
      COMMIT is slow), when it is quite apparent that it is down to somewhere in
      the I/O stack. It is quite easy to prove once one has proper measurement
      tools in place and is able to correlate, but it takes IMHO way too much time
      and energy to cross-correlate all of that information (iostat -x 1s, wait
      events 1s, and so on), especially if one would like to provide rapid
      response.
      For measuring particularly stuck things, I've been wondering about having a
      regular timer that starts to collect more information if stuck in a place for
      a while. That would probably end up being lower overhead than constantly
      measuring... But would also be a lot more work.
      Right now the patch does not include per-backend/PID tracking, hopefully if
      there will be interest in this, I'll add it, but I would like to first hear
      if that's a good idea. The current implementation uses fast bucket
      calculation to avoid overheads and tries to cover most useful range of
      devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
      abnormally high latency too as from time to time I try to help with I/O
      stuck for seconds, usually a sign of some I/O multipath issues, device
      resetting, or hypervisor woes).
      Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
      NVMe disk you'll get < 10us, after all.
      I see a few problems with the source of the latency measurements though:
      - The latency gathered here is quite heavily affected by scheduler
      noise. If your process isn't scheduled because other processes are busy
      doing stuff on the CPU, it's quite possible to get results many orders of
      magnitude wrong.
      - With AIO, you're measuring wait time, and that time can be affected by other
      IOs in the queue. That will often *drastically* overestimate IO latency
      measured this way.
      I don't see how we can do better absent cooperation from the kernel (by
      putting lower-level measurements into io_uring completions, for example)
      though. So maybe this is just how it has to be and we ought to just document
      it.
      postgres=# select
      substring(backend_type,1,8) as backend, object, context, io_type,
      bucket_latency_us as lat_us,
      round(bucket_latency_us/1000.0, 3) as lat_ms,
      bucket_count as count
      from pg_stat_get_io_histogram()
      where
      bucket_count > 0
      order by 1,2,3,4,5;
       backend  |  object  | context | io_type | lat_us | lat_ms | count
      ----------+----------+---------+---------+--------+--------+-------
       autovacu | relation | normal  | read    |    128 |  0.128 |    54
      Perhaps the latency should be represented as a range?
      Of course most of the I/O calls today are hitting page cache, so one would
      expect they'll be < 128us most of the time
      Have you measured whether the overhead is measurable when hitting the page
      cache? I'd hope that it isn't, due to IO combining amortizing the cost
      somewhat. But it seems worth measuring.
      I assume you made pgstat_get_io_op_name() return "hit?" because you don't
      expect that to ever be hit?
      +static inline int get_bucket_index(uint32_t val) {
      +#define MIN_PG_STAT_IO_HIST_LATENCY 127
      +	const uint32_t max_index = PG_STAT_IO_HIST_BUCKETS - 1;
      +	/*
      +	 * hopefully calculated to be 25 by the compiler:
      +	 * clz(127) = clz(01111111b on uint32) = 25
      +	 */
      +	const uint32_t min_latency_leading_zeros =
      +		pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
      +
      +	/*
      +	 * make sure the tmp value is at least 127 (our minimum bucket size)
      +	 * as __builtin_clz might return undefined behavior when operating on 0
      +	 */
      +	uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;
      +	/* count leading zeros */
      +	int leading_zeros = pg_leading_zero_bits32(tmp);
      +
      +	/* normalize the index */
      +	uint32_t index = min_latency_leading_zeros - leading_zeros;
      +
      +	/* clamp it to the maximum */
      +	return (index > max_index) ? max_index : index;
      +}
      Wouldn't it be easier to handle the minimum latency by shifting right?
      I think we may also need to handle inputs that don't fit a uint32. For things
      like a stopped VM or such we could see IOs that don't fit into a uint32
      when measured in microseconds. So perhaps I'd make the input to the bucket
      calc 64 bits, then shift to the minimum precision and mask to implement
      clamping.
      @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
      		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
      					   io_time);

      +		/* calculate the bucket_index based on latency in us */
      +		bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
      +		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
      +
      		/* Add the per-backend count */
      		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
      It's annoying to have to convert to microseconds here, that's not free :(.
      @@ -1356,6 +1356,24 @@ typedef enum io_stat_col
      	IO_NUM_COLUMNS,
      } io_stat_col;

      +/*
      + * When adding a new column to the pg_stat_io view and the
      + * pg_stat_get_backend_io() function, add a new enum value here above
      + * IO_NUM_COLUMNS.
      + */
      +typedef enum hist_io_stat_col
      +{
      +	HIST_IO_COL_INVALID = -1,
      +	HIST_IO_COL_BACKEND_TYPE,
      +	HIST_IO_COL_OBJECT,
      +	HIST_IO_COL_CONTEXT,
      +	HIST_IO_COL_IO_TYPE,
      +	HIST_IO_COL_BUCKET_US,
      +	HIST_IO_COL_COUNT,
      +	HIST_IO_COL_RESET_TIME,
      +	HIST_IO_NUM_COLUMNS
      +} history_get_history_state;
      Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't
      think this should be introduced in the middle of the pg_stat_io implementation.
      +/*
      + * pg_leading_zero_bits32
      + *		Returns the number of leading 0-bits in x, starting at the most significant bit position.
      + *		Word must not be 0 (as it is undefined behavior).
      + */
      +static inline int
      +pg_leading_zero_bits32(uint32 word)
      Do we really need this in addition to the already existing
      pg_leftmost_one_pos32()? Particularly because that already has an msvc
      implementation...
      Greetings,
      Andres Freund
        Jakub Wartak<jakub.wartak@enterprisedb.com>
        Jan 27, 2026, 12:06 PM UTC
        On Mon, Jan 26, 2026 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:

        Hi,

        On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
        I'm proposing that we add pg_stat_io_histogram that would track/show I/O
        latencies profile, so we could quickly identify I/O outliers.

        I think that's an interesting feature!
        Hi Andres, thanks for kind words and review!
        [..]

        For measuring particularly stuck things, I've been wondering about having
        a
        regular timer that starts to collect more information if stuck in a place
        for
        a while. That would probably end up being lower overhead than constantly
        measuring... But it would also be a lot more work.
        Well, if something is really stuck, I think the wait events cover us on
        that, don't they? One can argue whether they carry enough information
        (for me they mostly do, but I'm trying to squeeze some more stuff into
        them in a nearby thread [1];
        BTW it's kind of "blocked" due to that 56-bit relfilenode idea/question,
        any thoughts on that?)
        Right now the patch does not include per-backend/PID tracking,
        hopefully if
        there will be interest in this, I'll add it, but I would like to first
        hear
        if that's a good idea. The current implementation uses fast bucket
        calculation to avoid overheads and tries to cover most useful range of
        devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
        abnormally high latency too as from time to time I try to help with
        I/O
        stuck for seconds, usually a sign of some I/O multipath issues, device
        resetting, or hypervisor woes).

        Hm. Isn't 128us a pretty high floor for at least reads and writes? On a
        good
        NVMe disk you'll get < 10us, after all.
        I was blind and concentrated way too much on badly-behaving I/O rather
        than good I/O - let's call it I/O negativity bias 8)
        Now v2 lowers the min bucket to 8us (but the max is then just ~131ms; I
        didn't want the counters to use more than 64 bytes total, as 16*4B
        (uint32) = 64B while 16*8B (uint64) would already be 128B, which is why
        it's capped at 131072us right now).
        I see a few problems with the source of the latency measurements though:

        - The latency gathered is that it's quite heavily affected by scheduler
        noise. If your process isn't scheduled because other processes are busy
        doing stuff on the CPU, it's quite possible to get results many orders
        of
        magnitude wrong.

        - With AIO, you're measuring wait time, and that time can be affected by
        other
        IOs in the queue. That will often drastically overestimate IO latency
        measured this way.

        I don't see how we can do better absent cooperation from the kernel (by
        putting lower-level measurements into io_uring completions, for example)
        though. So maybe this is just how it has to be and we ought to just
        document
        it.
        Right, I think this is a complex topic on its own; I've added a small
        section about it to the docs. I didn't want to start the thread by
        undermining my own results, but indeed I'm getting "bad" numbers - bad
        in the sense that the perceived latency numbers do not match other
        measurements:
        E.g. checkpointer's fsync/fdatasync latency is awful, even though I've
        been using this to simulate latency (just 2ms!, but it ended up adding
        way more):
        dd if=/dev/zero of=/fs bs=1M count=1024
        losetup /dev/loop15 /fs
        echo "0 $(blockdev --getsz /dev/loop15) delay /dev/loop15 0 2" | \
            dmsetup create delay
        mkfs.ext4 /dev/mapper/delay
        mount /dev/mapper/delay /mnt
        ... and it has e.g. quite interesting effects:
        - without "noatime", the impact is clearly visible on fsync
        - even with noatime I was getting spikes of latencies above 131ms
        (sic!) with this:
        select pg_stat_reset_shared();
        pgbench -i -s 10 -p 1234 -h /tmp postgres
        checkpoint;
        - I've created the attached bpftrace script to see the gap between
        kernel and uprobes, but it's not that high; a sample of the view:
        
         backend  |  object  | context | io_type | lat_us | lat_ms  | count
        ----------+----------+---------+---------+--------+---------+-------
         checkpoi | relation | normal  | fsync   |  32768 |  32.768 |    42
         checkpoi | relation | normal  | fsync   |  65536 |  65.536 |     3
         checkpoi | relation | normal  | fsync   | 262144 | 262.144 |     1
        vs eBPF, which does not seem to see that; the worst case seems to be
        ~20us wasted (the gap is between user and kernel)
        
            [fdatasync] PID 54197  | Kernel: 12943  us | User: 12964  us | Lag: 21us
        [..but usually it's just:]
            [    fsync] PID 52266  | Kernel: 1711   us | User: 1714   us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 19913  us | User: 19916  us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 1993   us | User: 1996   us | Lag: 3us
            [    fsync] PID 52266  | Kernel: 1994   us | User: 1995   us | Lag: 1us
            [    fsync] PID 52266  | Kernel: 53734  us | User: 53736  us | Lag: 2us
            [    fsync] PID 52266  | Kernel: 8066   us | User: 8072   us | Lag: 6us
            [    fsync] PID 52266  | Kernel: 2107   us | User: 2109   us | Lag: 2us
            [    fsync] PID 52266  | Kernel: 1972   us | User: 1974   us | Lag: 2us
        (this is on 2ms delayed + noatime fs + with CONFIG_HZ=1000 + laptop's
        NVMe
        that's idle).
        - in mdsyncfiletag() we seem to call pgstat_count_io_op_time() after a
        potential FileClose(), but I haven't witnessed a long close(); it's
        still a context switch
        - I've spotted that power management might be influencing it even
        further (but that's not in the docs yet, dunno if I should add it there
        as the next item on the list)
         backend  |  object  | context | io_type | lat_us | lat_ms | count
        ----------+----------+---------+---------+--------+--------+-------
         autovacu | relation | normal  | read    |    128 |  0.128 |    54

        Perhaps the latency should be represented as a range?
        Cool idea, I hadn't even thought about this one! From now on v2 shows:
        postgres=# select
        substring(backend_type,1,8) as btype,
        object, context, io_type, bucket_latency_us as lat_us, bucket_count as cnt
        from pg_stat_get_io_histogram()
        where
        bucket_count > 0
        order by 1,2,3,4,5;
          btype   |  object  | context |  io_type  |   lat_us    | cnt
        [..]
         checkpoi | relation | normal  | write     | [0,9)       |  33
         checkpoi | relation | normal  | write     | [8,17)      |   8
         checkpoi | relation | normal  | write     | [16,33)     |   1
        Of course most of the I/O calls today are hitting page cache, so one
        would
        expect they'll be < 128us most of the time

        Have you measured whether overhead is measurable when hitting the page
        cache?
        I'd hope that it doesn't, due to io combing amortizing the cost somewhat.
        But
        it seems worth measuring.
        Not yet; I first wanted to hear whether I'm sailing in some plainly
        stupid direction with the idea or implementation (e.g. that
        INSTR_TIME_GET_MICROSEC() was a really stupid omission on my side).
        I'll try to perform this overhead measurement hopefully with v3, once
        we settle on how to do the bit shifting/clz().
        I assume you made pgstat_get_io_op_name() return "hit?" because you don't
        expect that to ever be hit?
        Yes, my patch seems to always return 0 for "hit?". I'll need to
        investigate that further.
        +static inline int get_bucket_index(uint32_t val) {
        +#define MIN_PG_STAT_IO_HIST_LATENCY 127
        +	const uint32_t max_index = PG_STAT_IO_HIST_BUCKETS - 1;
        +	/*
        +	 * hopefully calculated to be 25 by the compiler:
        +	 * clz(127) = clz(01111111b on uint32) = 25
        +	 */
        +	const uint32_t min_latency_leading_zeros =
        +		pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
        +
        +	/*
        +	 * make sure the tmp value is at least 127 (our minimum bucket size)
        +	 * as __builtin_clz might return undefined behavior when operating on 0
        +	 */
        +	uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;
        +	/* count leading zeros */
        +	int leading_zeros = pg_leading_zero_bits32(tmp);
        +
        +	/* normalize the index */
        +	uint32_t index = min_latency_leading_zeros - leading_zeros;
        +
        +	/* clamp it to the maximum */
        +	return (index > max_index) ? max_index : index;
        +}

        Wouldn't it be easier to handle the minimum latency by shifting right?
        What you seem to suggest sounds like equal-width buckets {1,2,3,4..ms}
        rather than logarithmic buckets {1,2,4,8..ms}, or am I missing
        something? The patch as it stands now has two #ifdef'd implementations,
        with the bitwise-shifting one working, but even in objdump at -O2 there
        are plenty of those jumps because of the current code.
        I think we may also need to handle inputs that don't fit a uint32.
        Fixed.
        [..] For things
        like a stopped VM or such we could see IOs that that don't fit into a
        uint32
        when measured in microseconds. So perhaps I'd make the input to the
        bucket
        calc 64 bits, then shift to the minimum precision and mask to implement
        clamping.
        @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
        		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
        					   io_time);

        +		/* calculate the bucket_index based on latency in us */
        +		bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
        +		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
        +
        		/* Add the per-backend count */
        		pgstat_count_backend_io_op_time(io_object, io_context, io_op,

        It's annoying to have to convert to microseconds here, that's not free :(.
        Oooops, fixed; get_bucket_index() now operates directly on nanos/int64.
        [..]
        +	HIST_IO_COL_COUNT,
        +	HIST_IO_COL_RESET_TIME,
        +	HIST_IO_NUM_COLUMNS
        +} history_get_history_state;
        Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I
        don't think this should be introduced in the middle of the pg_stat_io
        implementation.
        Right, I've moved it to just before pg_stat_io_histogram_build_tuples().
        +/*
        + * pg_leading_zero_bits32
        + *		Returns the number of leading 0-bits in x, starting at
        + *		the most significant bit position.
        + *		Word must not be 0 (as it is undefined behavior).
        + */
        +static inline int
        +pg_leading_zero_bits32(uint32 word)

        Do we really need this in addition to the already existing
        pg_leftmost_one_pos32()? Particularly because that already has an msvc
        implementation...
        Well, I would be all in for removal, but please see the
        get_bucket_index() discussion above.
        I've tried to get rid of it (but maybe I misunderstood something); in
        the end I think it is more elegant/faster to have it there so that the
        code in get_bucket_index() stays more readable, rather than throwing
        more bitwise voodoo in there(?)
        -J.
        [1] -
        https://www.postgresql.org/message-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ%40mail.gmail.com
          Jakub Wartak<jakub.wartak@enterprisedb.com>
          Jan 28, 2026, 11:12 AM UTC
          On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
          wrote:
          [..]
          Of course most of the I/O calls today are hitting page cache, so one
          would
          expect they'll be < 128us most of the time

          Have you measured whether overhead is measurable when hitting the page
          cache?
          I'd hope that it doesn't, due to io combing amortizing the cost
          somewhat. But
          it seems worth measuring.

          Not yet; I first wanted to hear whether I'm sailing in some plainly
          stupid direction with the idea or implementation (e.g. that
          INSTR_TIME_GET_MICROSEC() was a really stupid omission on my side).

          I'll try to perform this overhead measurement hopefully with v3, once
          we settle on how to do the bit shifting/clz().
          [..]
          Here's the answer: on a properly isolated perf test run (my
          old&legacy&predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB,
          hugepages, no turbo boost, proper warmup, no THP, cpupower D0, no
          physical I/O, ~22k pread64() calls/sec combined to the VFS cache),
          pinned to a single NUMA node: numactl --membind=0 --cpunodebind=0
          measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
          master + track_io_timing=on, 60s warmup and then 3x runs:
          tps = 44615.603668
          tps = 44556.191492
          tps = 44813.793981
          avg = 44662
          master + track_io_timing=on + patch, 60s warmup and then 3x runs:
          tps = 44441.879384
          tps = 44403.101737
          tps = 45036.747418
          avg = 44627
          so that's like 99.921% (so literally no overhead) and yields a picture like:
          postgres=# select bucket_latency_us, bucket_count
          from pg_stat_io_histogram
          where
          bucket_count > 0 and
          backend_type = 'client backend' and
          io_type = 'read' and
          context = 'normal'
          order by bucket_latency_us;
           bucket_latency_us | bucket_count
          -------------------+--------------
           [0,9)             |       273455
           [8,17)            |      3820379
           [16,33)           |        29359
           [32,65)           |          585
           [64,129)          |          467
           [128,257)         |          419
           [256,513)         |        15828
           [512,1025)        |           89
          So one can also see the 0.25..0.5ms bucket being larger there (the
          initial read of the data from the physical device) - that's hardware
          RAID-1 with 2x Intel D3-S4510. And if I do pg_buffercache_evict_all()
          + vm.drop_caches + pg_stat_reset_shared(), I get this picture (note
          the bulkreads):
          postgres=# select bucket_latency_us, bucket_count
          from pg_stat_io_histogram
          where
          bucket_count > 0 and
          backend_type = 'client backend' and context = 'bulkread';
           bucket_latency_us | bucket_count
          -------------------+--------------
           [0,9)             |       102555
           [8,17)            |           70
           [16,33)           |         3938
           [32,65)           |         6763
           [64,129)          |         5206
           [128,257)         |         9392
           [256,513)         |        10959
           [512,1025)        |        21372
           [1024,2049)       |          502
           [2048,4097)       |           34
           [4096,8193)       |            2
           [8192,16385)      |            2
           [16384,32769)     |            7
          So clearly there's a distinction between reading from the VFS cache
          and hitting physical I/O.
          Now we'll probably just need to settle on the proper
          get_bucket_index() implementation.
          -J.
            Andres Freund<andres@anarazel.de>
            Jan 29, 2026, 4:27 PM UTC
            Hi,
            On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
            On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
            Not yet, I first wanted to hear if I'm not sailing into some plain stupid
            direction somewhere with this idea or implementation (e.g.
            that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

            I'll try to perform this test overhead measurement hopefully with v3 once
            we settle on how to do that bit shifting/clz().

            [..]
            Here's the answer: on properly isolated perf test run (my
            old&legacy&predictable
            4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
            warmup,
            no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
            VFS
            cache)
            and started on just using single NUMA: numactl --membind=0
            --cpunodebind=0
            measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

            master + track_io_timing=on, 60s warmup and then 3x runs:
            tps = 44615.603668
            tps = 44556.191492
            tps = 44813.793981
            avg = 44662

            master + track_io_timing=on + patch, 60s warmup and then 3x runs:
            tps = 44441.879384
            tps = 44403.101737
            tps = 45036.747418
            avg = 44627

            so that's like 99.921% (so literally no overhead) and yields picture like:
            I don't think that's a particularly useful assurance, unfortunately:
            1) Using pgbench with an in-memory readonly workload is typically limited by
            context switch overhead and per-statement overhead. After a short while you
            have at most one IO per statement (the heap page), which obviously isn't
            going to be affected by a small per-IO overhead.
            2) The per-core memory bandwidth on that old machine, if it's the quite old
            EDB machine I think it is, is so low, that you'd be bottlenecked by memory
            bandwidth well before you're going to be bottlenecked by actual CPU stuff
            (which the bucket computation is).
            I think you'd have to test something like pg_prewarm(), with
            io_combine_limit=1, on a modern client CPU (client CPUs typically have much
            higher per-core memory bandwidth than the more scalable server CPUs).
            Greetings,
            Andres Freund
              Jakub Wartak<jakub.wartak@enterprisedb.com>
              Jan 30, 2026, 1:44 PM UTC
              On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:

              Hi,

              On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
              On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
              Not yet, I first wanted to hear if I'm not sailing into some plain stupid
              direction somewhere with this idea or implementation (e.g.
              that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

              I'll try to perform this test overhead measurement hopefully with v3 once
              we settle on how to do that bit shifting/clz().

              [..]
              Here's the answer: on properly isolated perf test run (my
              old&legacy&predictable
              4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
              warmup,
              no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
              VFS
              cache)
              and started on just using single NUMA: numactl --membind=0
              --cpunodebind=0
              measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

              master + track_io_timing=on, 60s warmup and then 3x runs:
              tps = 44615.603668
              tps = 44556.191492
              tps = 44813.793981
              avg = 44662

              master + track_io_timing=on + patch, 60s warmup and then 3x runs:
              tps = 44441.879384
              tps = 44403.101737
              tps = 45036.747418
              avg = 44627

              so that's like 99.921% (so literally no overhead) and yields picture like:

              I don't think that's a particularly useful assurance, unfortunately:

              1) Using pgbench with an in-memory readonly workload is typically limited by
              context switch overhead and per-statement overhead. After a short while you
              have at most one IO per statement (the heap page), which obviously isn't
              going to be affected by a small per-IO overhead.

              2) The per-core memory bandwidth on that old machine, if it's the quite old
              EDB machine I think it is, is so low, that you'd be bottlenecked by memory
              bandwidth well before you're going to be bottlenecked by actual CPU stuff
              (which the bucket computation is).
              Hi, thanks for having a look!
              That legacy server is mine, but yeah, even within the same NUMA
              node it can only do ~16GB/s AFAIR, and just ~4GB/s between nodes.
              I forgot to reply in that old NUMA thread back then; maybe it's
              not relevant, but I often find that box valuable, as the
              bottlenecks are easier to hit/notice and there aren't that many
              of the traps that modern CPUs have (+ finding 4 NUMA
              nodes/socket in 2U is not that easy today; after all it's a
              physical box nearby, so PMCs are there too - unlike in the
              "cloud" - and with 4 nodes the imbalances between nodes/zones are
              much more clearly visible than with just 1 local vs 1 remote).
              Somehow I built trust in the results from that machine (but I can
              still lift it! so I probably still shouldn't trust it fully - a
              pun on "never trust a machine you can lift" :D)
              I think you'd have to test something like pg_prewarm(), with
              io_combine_limit=1, on a modern client CPU (client CPUs typically have much
              higher per-core memory bandwidth than the more scalable server CPUs).
              Fair point, thanks for explaining that. So I moved to a much more modern
              Intel(R) Ultra 7 (1s16c32t, which - according to mlc(1) - can do up to
              85-90GB/s of bandwidth, and has those ugly P/E-cores, so I've pinned
              (taskset) the backend doing pg_prewarm to a P-core that usually runs
              @ 5GHz, but with no_turbo that's just @ 1.4GHz). That was a normal build
              (not debug, so -O2, no casserts, gcc 13, kernel 6.14.x, no_turbo,
              shared_buffers=8GB with huge pages, DB scale 600 (pgbench_accounts @
              7.7GB), performance governor, no THP, ...)
              I was kind of surprised, but here goes full disclosure of my results. With
              the patch and track_io_timing=on, io_combine_limit=1, io_method='sync'
              just to care about a single PID, here are timings after
              pg_buffercache_evict_all() followed by the measured duration of
              select pg_prewarm('pgbench_accounts');
              So initially I got this picture (I eliminated some worst/best outliers too):
              Without patch:
              Time: 4644.346 ms (00:04.644)
              Time: 4612.610 ms (00:04.613)
              Time: 4639.133 ms (00:04.639)
              Time: 4625.020 ms (00:04.625)
              Time: 4636.652 ms (00:04.637)
              Avg: 4631ms
              With the patch:
              Time: 4765.780 ms (00:04.766)
              Time: 4784.308 ms (00:04.784)
              Time: 4754.661 ms (00:04.755)
              Time: 4770.772 ms (00:04.771)
              Time: 4768.232 ms (00:04.768)
              Avg: 4768ms (102.95%)
              With the patch and __builtin_clzl():
              Time: 4750.293 ms (00:04.750)
              Time: 4729.288 ms (00:04.729)
              Time: 4727.820 ms (00:04.728)
              Time: 4729.760 ms (00:04.730)
              Time: 4727.146 ms (00:04.727)
              Avg: 4732ms (102.18%)
              So clearly there was some overhead (I started getting worried), and
              __builtin_clzl() was slightly cheaper - or it was just jitter; yes, I got
              plenty of jitter with this (not shown; the above are ~5 results out of 15).
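              For context, the O(1) bucket computation being benchmarked here can be
              sketched roughly like this (a minimal standalone sketch; the 128us..256ms
              bucket range is from the proposal, while the helper name, constants, and
              exact edge handling are my assumptions, not the patch's actual code):

              ```c
              #include <stdint.h>
              #include <stdio.h>

              /* Assumed layout: 12 power-of-two buckets, 128us (2^7) .. 256ms (2^18) */
              #define HIST_MIN_SHIFT 7
              #define HIST_NBUCKETS  12

              /*
               * Map an I/O latency in microseconds to a histogram bucket in O(1)
               * using the position of the highest set bit (no loop, no division).
               */
              static inline int
              io_latency_bucket(uint64_t elapsed_us)
              {
                  int bucket;

                  if (elapsed_us < ((uint64_t) 1 << HIST_MIN_SHIFT))
                      return 0;       /* everything below 128us lands in bucket 0 */

                  /* highest set bit position = 63 - number of leading zeros */
                  bucket = (63 - __builtin_clzll(elapsed_us)) - HIST_MIN_SHIFT;

                  if (bucket >= HIST_NBUCKETS)
                      bucket = HIST_NBUCKETS - 1; /* clamp >= 256ms to last bucket */
                  return bucket;
              }

              int
              main(void)
              {
                  /* 100us -> 0, 300us -> 1, 5ms -> 5, 1s -> 11 (clamped) */
                  printf("%d %d %d %d\n",
                         io_latency_bucket(100),
                         io_latency_bucket(300),
                         io_latency_bucket(5000),
                         io_latency_bucket(1000000));
                  return 0;
              }
              ```

              Since the mapping is just one count-leading-zeros instruction plus a
              subtract and a compare, it should be hard to measure against the cost of
              an actual I/O, which matches the "overhead almost gone" results below.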
              With the v2 patch, __builtin_clzl(), and track_io_timing=off (the default),
              it got back to ~4660ms as expected.
              With the v2 patch, __builtin_clzl(), the default io_combine_limit=128kB,
              and track_io_timing=off, it went back to ~4150ms:
              Time: 4151.942 ms (00:04.152)
              Time: 4133.747 ms (00:04.134)
              Time: 4153.103 ms (00:04.153)
              Time: 4135.199 ms (00:04.135)
              With the patch, __builtin_clzl(), the default io_combine_limit=128kB, and
              track_io_timing=on, it was also @ ~4150ms:
              Time: 4152.941 ms (00:04.153)
              Time: 4154.096 ms (00:04.154)
              Time: 4155.119 ms (00:04.155)
              So with "batching" the IOs, the overhead is almost gone. BTW that's with
              current_clocksource = "tsc". After dozens of runs, I noticed thermals
              starting to play a bigger role than this patch, so I did idle-set -D0
              and it degraded even further.
              Master, but still lots of fluctuations; unfiltered picture:
              Time: 5518.546 ms (00:05.519)
              Time: 5587.675 ms (00:05.588)
              Time: 5512.828 ms (00:05.513)
              Time: 5534.023 ms (00:05.534)
              Time: 5728.125 ms (00:05.728)
              Time: 5731.543 ms (00:05.732)
              Time: 5762.687 ms (00:05.763)
              Time: 5565.607 ms (00:05.566)
              Time: 5498.496 ms (00:05.498)
              Time: 5637.870 ms (00:05.638)
              but if I leave it idle for a couple of minutes then I get:
              Time: 5577.879 ms (00:05.578)
              Time: 5575.648 ms (00:05.576)
              Time: 5548.146 ms (00:05.548)
              After a break, with the patch and __builtin_clzl() (sometimes it gets
              lower than master - how can I trust this?!):
              Time: 5504.415 ms (00:05.504)
              Time: 5531.827 ms (00:05.532)
              Time: 5733.146 ms (00:05.733)
              Time: 5511.549 ms (00:05.512)
              So something more is happening there, probably thermals/scheduler rather
              than the patch. Of course I've done some homework [1][2]; I even found
              your rant on some of this [1], and truth be told I'm unable to stabilize
              those deviations on this modern client CPU. So I tried another, much more
              predictable (and non-modern :P) client CPU: an Intel Core i5 7600K
              (1s4c4t), and got much more consistent numbers there (those are
              unfiltered, with almost identical variables from the same setup (also
              6.14.x, same tweaks, also tasksetted to 1 core)):
              Master:
              Time: 2592.351 ms (00:02.592)
              Time: 2574.612 ms (00:02.575)
              Time: 2592.530 ms (00:02.593)
              Time: 2575.356 ms (00:02.575)
              Time: 2594.687 ms (00:02.595)
              Avg=2585ms
              Master+patch:
              Time: 2577.610 ms (00:02.578)
              Time: 2585.796 ms (00:02.586)
              Time: 2568.559 ms (00:02.569)
              Time: 2586.199 ms (00:02.586)
              Time: 2567.872 ms (00:02.568)
              Avg=2576ms (below master?!)
              Master+patch+__builtin_clzl():
              Time: 2578.083 ms (00:02.578)
              Time: 2586.732 ms (00:02.587)
              Time: 2573.176 ms (00:02.573)
              Time: 2592.048 ms (00:02.592)
              Time: 2575.731 ms (00:02.576)
              Time: 2575.570 ms (00:02.576)
              Avg=2579ms (below master?!)
              Just Master again:
              Time: 2578.838 ms (00:02.579)
              Time: 2588.531 ms (00:02.589)
              Time: 2572.165 ms (00:02.572)
              Time: 2591.528 ms (00:02.592)
              Time: 2572.015 ms (00:02.572)
              Time: 2589.921 ms (00:02.590)
              Time: 2572.124 ms (00:02.572)
              Avg=2580ms
              So to sum up:
              - it still looks OK to me
              - thermals on modern-day CPUs can have a bigger impact than the patch
              itself (?)
              - the older/legacy (desktop) CPU seems to jitter less than a modern client
              laptop CPU, even with the strictest perf settings (?)
              - worst case: to spot that ~2% regression one would have to disable the
              IO batching and enable track_io_timing (which is not the default)
              I'm attaching v3, which now defaults to __builtin_clzl(), which works OK
              for uint64 (not sure if I need to care about __builtin_clzll()
              on Windows?).
              Open questions:
              0. Should I pursue more benchmarking, or are the above results enough?
              1. __builtin_clzl() or not __builtin_clzl(), that is the question...?
              2. Should I add per-PID backend stats too, or do you have something
              against that?
              3. Shouldn't we fix mdsyncfiletag() as mentioned earlier? We seem to have
              pgstat_count_io_op_time() *after* a potential FileClose() (as per my
              earlier question).
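              On the Windows concern in question 1: __builtin_clzl() operates on
              unsigned long, which is only 32 bits on LLP64 platforms (Windows), and
              MSVC has no such builtin at all. A hedged sketch of a portable helper
              (the name and #ifdef structure are mine; PostgreSQL's
              src/include/port/pg_bitutils.h already solves this with
              pg_leftmost_one_pos64(), which may be the simpler route):

              ```c
              #include <stdint.h>
              #include <stdio.h>
              #ifdef _MSC_VER
              #include <intrin.h>
              #endif

              /*
               * Hypothetical portable "highest set bit" helper; word must be non-zero.
               * Mirrors the approach of pg_leftmost_one_pos64() in port/pg_bitutils.h.
               */
              static inline int
              leftmost_one_pos64(uint64_t word)
              {
              #ifdef _MSC_VER
                  unsigned long pos;
                  _BitScanReverse64(&pos, word);  /* MSVC intrinsic, x64 targets */
                  return (int) pos;
              #else
                  /*
                   * __builtin_clzll takes unsigned long long, which is at least
                   * 64 bits everywhere, unlike unsigned long (32-bit on LLP64).
                   */
                  return 63 - __builtin_clzll(word);
              #endif
              }

              int
              main(void)
              {
                  printf("%d %d %d\n",
                         leftmost_one_pos64(1),
                         leftmost_one_pos64(128),
                         leftmost_one_pos64((uint64_t) 1 << 40));
                  return 0;
              }
              ```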
              -J.
              [1] - https://www.postgresql.org/message-id/20231115180433.p3eeaczbam5zxdz5%40awork3.anarazel.de
              [2] - https://vondra.me/posts/benchmarking-is-hard-sometimes/