Add expressions to pg_restore_extended_stats()

    Corey Huinker<corey.huinker@gmail.com>
    Jan 30, 2026, 5:09 AM UTC
This is a new thread continuing the work [1] of the Extended Statistics
set/restore/clear functions thread [2], which itself was a continuation of
the work [3] of the Statistics Export and Import thread [4]. All of that
is too much for anyone to review, so I'll give a recap:
A longstanding complaint about postgres has been the amount of time
required to rebuild stats after a major version upgrade. During that
window the database is online but under heavy I/O, and queries of any real
complexity get terrible query plans because they lack the optimizer
statistics that the previous major version had just moments ago but that
weren't carried over to the new version.
Version 18 introduced the ability to import stats at the relation level and
attribute level [3], and those stats were now dumped and restored by
default via pg_upgrade. This meant that most customers could reduce the
time in which the database was online but in a state of degraded
performance. It was, however, not a complete solution, because it still
lacked statistics for extended statistics objects (i.e. CREATE STATISTICS
xyz ...) and custom statistic kinds like those found in extensions like
PostGIS. Still, this made things better for 99% of installations, and while
it was not trivial to determine whether a given instance was in that 1%,
enhancements were made to vacuumdb [5] to detect which tables were missing
statistics and analyze just those tables (see the invocation sketched
below), thus reducing the scope of the I/O-intensive rebuild period for
those in the unlucky 1%.
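For reference, the post-upgrade invocation looks something like this (one
plausible spelling; --missing-stats-only pairs with the analyze options):

    vacuumdb --all --analyze-in-stages --missing-stats-only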
    Work in this 19-dev cycle has sought to close that 1% gap by importing
    statistics for extended statistics objects. These stats are quite a bit
    more complicated than their relation and attribute equivalents, but good
    progress has been made [1], resulting in the carryover of many statistics
types: ndistinct, dependencies, and MCV. All of them, that is, except the
statistics associated with expressions in the definition of the statistics
object (i.e. CREATE STATISTICS xyz ON (upper(name)), ...). This last type
of statistics has proved to be a tough nut to crack, for reasons I will
describe in detail. We could stop here, but if we did we would actually
create work for vacuumdb, specifically the code that processes the
--missing-stats-only option. That code currently looks for extended
statistics data (pg_statistic_ext_data) rows that match the object
definition (pg_statistic_ext) rows, and considers any match to be
sufficient for "not missing stats". That assumption would no longer hold in
the case of stats objects that have expressions, because they'd be missing
their stxdexprs stats. While we could teach vacuumdb that distinction, we
could instead solve the expressions problem and close the statistics gap
even further [6].
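To make the vacuumdb concern concrete, here's a rough sketch of the kind of
check involved (a hypothetical query, not the actual vacuumdb SQL): a stats
object with expressions can have a matching data row and still be missing
its expression stats.

    SELECT e.stxname
    FROM pg_catalog.pg_statistic_ext e
    JOIN pg_catalog.pg_statistic_ext_data d ON d.stxoid = e.oid
    WHERE e.stxexprs IS NOT NULL   -- the object definition has expressions
      AND d.stxdexprs IS NULL;     -- ...but no stats were restored for them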
    We have a working but not thoroughly tested implementation (attached).
There remains one sticky problem: the serialization format of the
statistics stored in pg_statistic_ext_data.stxdexprs. The type of that
attribute is pg_catalog.pg_statistic[], which is to say an array of
records whose length equals the number of expressions in the extended
statistics object. pg_statistic is where attribute statistics are
imported, so the structure has all the complexity of attribute stats
itself, slightly reduced by the fact that the fields describing the
attribute are left as InvalidOid values, but still quite complicated.
Several of the attributes in pg_statistic are of type ANYARRAY, because
they are most-common-values/histogram/most-common-elements arrays, each of
which has an element datatype determined by the datatype of the
expression(s) and other columns in the object definition. This presents a
problem for utilities like deconstruct_array(): the datatype to be
deconstructed varies by column and by the datatype of the expression
definition, and that datatype could itself be an array, which
deconstruct_array() would then try to deconstruct. There is no way to get
deconstruct_array() to stop two levels deep.
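To see the structure in question on any instance with such a stats object
(a minimal sketch; 'xyz' is a placeholder name):

    SELECT d.stxdexprs::text
    FROM pg_catalog.pg_statistic_ext_data d
    JOIN pg_catalog.pg_statistic_ext e ON e.oid = d.stxoid
    WHERE e.stxname = 'xyz';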
This problem was solved for pg_restore_attribute_stats() by having pg_dump
export the ANYARRAY values CAST-ed to type "text" rather than "text[]",
which allowed each type of statistics to be decomposed according to its
own rules, and that worked fine when each statistic type became a parameter
in pg_restore_attribute_stats(). But now we've got all of those types, and
we're getting them multiple times (once per expression), so that method
doesn't quite scale.
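For context, the v18-era calls emitted by pg_dump look roughly like this
(abbreviated, with made-up table and values):

    SELECT * FROM pg_catalog.pg_restore_attribute_stats(
        'relation', 'public.t'::regclass,
        'attname', 'a'::name,
        'inherited', false::boolean,
        'null_frac', 0.0::real,
        'avg_width', 4::integer,
        'n_distinct', -0.5::real,
        'most_common_vals', '{1,2,3}'::text,
        'most_common_freqs', '{0.3,0.2,0.1}'::real[]
    );

Note most_common_vals arriving as a single text value, to be decomposed
according to the attribute's type on the receiving side.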
I've considered several ways around this issue (toy literals for the
candidate formats appear after this list):
1. Defining a strict order for the statistics types, following the order
they appear in pg_stats_ext (null_frac, avg_width, n_distinct,
most_common_elems, ...) and then exprs from pg_stats_ext_exprs in last
place. Each value is CAST-ed to "text", which means that we can
deconstruct them in a fashion very similar to how we did for attribute
stats. Those text values are put into an array in the strict order, and
those arrays are aggregated into a 2-D array.
    Pros:
- This method is implemented and it works, and the code for it reuses
tools and coding patterns we've already incorporated (deconstruct_array,
safe input functions, arglist arrays). Patch attached.
    Cons:
    - The ordering is completely opaque. Documenting that ordering might help a
    little, but there's nothing intuitive about it and checking it has been an
    irritant to author and reviewer alike.
- This opaque ordering must be kept in sync between
pg_restore_extended_stats() and pg_dump, or else statistical garbage will
result.
2. Defining a record type specifically for this purpose.
    Pros:
- It could be decomposed via the standard composite input function, and
then each type deconstructed on its own terms.
    Cons:
    - It's type-clutter, and a type that is likely only used during upgrades.
    - When new stats types are introduced, the structure would also have to
    change, breaking typecasts of composite values from older versions. This
    alone makes this option unpalatable to most reviewers, and I can't advocate
    for it.
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).
    Pros:
    - Flexibility in ordering
- Clarity at a glance, provided that the reader has seen the kwargs
convention of the pg_restore_*_stats functions.
    - Still uses existing tooling like #1, and not that much more overhead.
    - The name-value pairing problem has the same solution as the arg-pairing
    that the function already does
    Cons:
- Adds the overhead of storing the stat type names and the key-value
pairing.
- The 2-D nature of the array requires that the number of elements be
fixed, so we couldn't leave a stat type out of one row unless we left it
out of the others as well.
- Adds the possibility of duplicate names.
4. JSON. The outer structure would be an array of objects, each object
holding the stats as key-value pairs.
    Pros:
    - Flexibility in ordering
- Clarity at a glance in a format well understood even without prior
knowledge of our kwargs convention.
- We have already implemented similar things for the new formats of
pg_ndistinct and pg_dependencies.
- This method currently has the interest of Michael Paquier, the committer
of all the v19 work to date.
    Cons:
- Requires implementing a state engine to parse the JSON, check for missing
values, and resolve duplicates. We do that for pg_dependencies, and it
takes 800-ish lines of code to handle 3 key names; this would have 10.
- The escaping of values in a composite record cast to text and then
further encoded as a JSON value would be extremely unreadable, and likely
quite bloated.
- Using JSON for stats serialization met with immediate strong opposition
from several reviewers. That resistance may not be there for this vastly
reduced scope, especially in light of the new JSON-compatible formats for
pg_ndistinct and pg_dependencies, but it does give me pause.
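To make the remaining candidates concrete, here are toy literals for a
single expression's stats in each format (values invented, and only a few
stat kinds shown):

Option 1, strict positional order:

    '{{"0.5","4","-0.5","{1,2,3}","{0.3,0.2,0.1}"}}'::text[]

Option 3, kwargs-like pairs:

    '{{null_frac,"0.5",avg_width,"4",most_common_vals,"{1,2,3}"}}'::text[]

Option 4, an array of JSON objects:

    '[{"null_frac": 0.5, "avg_width": 4, "most_common_vals": "{1,2,3}"}]'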
    And...that's the major decision point. If we solve that, the rest is far
    less controversial. My apologies that this summary itself needs a summary.
Thanks for reading. Eager to hear perspectives on the serialization methods
proposed, or suggestions of other methods.
    [1] https://commitfest.postgresql.org/patch/5517/
    [2]
    https://www.postgresql.org/message-id/flat/aTE4AL7U0dp1Jjrx%40paquier.xyz#72116daf9d37828a47ce477a852a78d3
    [3] https://commitfest.postgresql.org/patch/4538/
    [4]
    https://www.postgresql.org/message-id/flat/CADkLM%3DcB0rF3p_FuWRTMSV0983ihTRpsH%2BOCpNyiqE7Wk0vUWA%40mail.gmail.com
    [5] https://commitfest.postgresql.org/patch/5523/
    [6] The issue of custom statistic kinds like those found in PostGIS would
    still remain.
      Michael Paquier<michael@paquier.xyz>
      Jan 30, 2026, 5:56 AM UTC
      On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).

4. JSON. The outer structure would be an array of objects, each object
holding the stats as key-value pairs.
I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
more suitable for readability. We could also force some checks based
on the order of the arguments in the input array, so that duplicates are
not an issue, I guess?
Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statattbuildstavalues() and statattsetslot(),
perhaps.
      I have also posted a few more arguments here, for reference:
      https://www.postgresql.org/message-id/aXvqN2fhDJZhL2RS@paquier.xyz
      --
      Michael
        Corey Huinker<corey.huinker@gmail.com>
        Jan 30, 2026, 8:07 AM UTC
        On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael@paquier.xyz>
        wrote:
        On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).

4. JSON. The outer structure would be an array of objects, each object
holding the stats as key-value pairs.

I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
more suitable for readability.
        Hmm, maybe it isn't so bad:
        SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
                           text
        ---------------------------------------------------
         {"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
        (1 row)
        
        SELECT
        array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
                       array_to_json
        ---------------------------------------------------
         ["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
        (1 row)
Mind you, this is an ANYARRAY first cast to text. If we cast
pg_stats_ext_exprs.most_common_vals directly to JSON then it'll drill
down into the innards of the composite values, because it can see the
underlying datatypes, and that breaks our ability to use regular input
functions. I learned that the hard way when using JSON for serializing
attribute stats when this effort first began.
        Before seeing that, I wanted to try option 3 first, as it brings clarity
        with no real increase in tooling other than looking for repeated keywords,
but if you're hyped for JSON then I'll try that first.
We could also force some checks based
on the order of the arguments in the input array, so that duplicates are
not an issue, I guess?
        If we're doing a kwargs-thing then I may as well just track which keywords
        were already used. We already bail out on the whole expressions array at
        the first sign of inconsistency, so it's not like we have to decide which
        of the duplicates to keep.
Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statattbuildstavalues() and statattsetslot(),
perhaps.
        I don't think we can get around those. It's a limitation of how the
        sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works. We want
        to fill in each stakind as we find it, and we don't know how many of them
        we've already filled out. An array of records would have been better, but
        we've got 5 parallel arrays of scalars and we have to live with it.
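For anyone following along, this is the slot layout in question ('t' is a
placeholder table name):

    -- pg_statistic stores up to five "slots", each spread across five
    -- parallel scalar columns rather than an array of records
    SELECT staattnum,
           stakind1, staop1, stacoll1, stanumbers1, stavalues1,
           stakind2, staop2, stacoll2, stanumbers2, stavalues2
    FROM pg_catalog.pg_statistic
    WHERE starelid = 'public.t'::regclass;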