pgsql-hackers
Add expressions to pg_restore_extended_stats()
Corey Huinker <corey.huinker@gmail.com>, Jan 30, 2026, 5:09 AM UTC

This is a new thread that continues the work [1] of the Extended Statistics
set/restore/clear functions thread [2] which itself was a continuation of
the work [3] of the Statistics Export and Import thread [4], all of which
is too much for anyone to review, so I'll give a recap:
A longstanding complaint about Postgres has been the amount of time
required to rebuild stats after a major version upgrade. During that
window the database is online but under heavy I/O, and queries of any real
complexity get terrible query plans because they lack the optimizer
statistics which the previous major version had just moments ago, but
which weren't carried over to the new version.
Version 18 introduced the ability to import stats at the relation level and
attribute level [3], and these stats were now dumped and restored by
default via pg_upgrade. This meant that most customers could reduce the
time in which the database was online but in a state of degraded
performance. It was, however, not a complete solution, because it still
lacked statistics for extended objects (i.e. CREATE STATISTICS xyz...) and
custom statistic kinds like those found in extensions like PostGIS. Still,
this made things better for 99% of installations, and while it was not
trivial to determine if a given instance was in that 1%, enhancements were
made to vacuumdb [5] to detect what tables were missing statistics and
analyze just those tables, thus reducing the scope of the I/O-intensive
rebuild period for those in the unlucky 1%.
Work in this 19-dev cycle has sought to close that 1% gap by importing
statistics for extended statistics objects. These stats are quite a bit
more complicated than their relation and attribute equivalents, but good
progress has been made [1], resulting in the carryover of many statistics
types: ndistinct, dependencies, and MCV. All of them, that is, except the
statistics associated with expressions in the definition of the statistics
object (i.e. CREATE STATISTICS xyz ON (upper(name)), ...). This last type
of statistics has proved to be a tough nut to crack, for reasons I will
describe in detail. We could stop here, but if we did we would actually
create work for vacuumdb, specifically the code that processes the
--missing-stats-only option, which currently looks for extended
statistics data (pg_statistic_ext_data) rows that match the object
definition (pg_statistic_ext) rows, and considers any match to be
sufficient for "not missing stats". That assumption would no longer hold in
the case of stats objects that have expressions, because they'd be missing
their stxdexpr stats. While we can teach vacuumdb that difference, we
could instead solve the expressions problem, and close the statistics gap
even further [6].
We have a working but not thoroughly tested implementation (attached).
There remains one sticky problem: the serialization format of the
statistics stored in pg_statistic_ext_data.stxdexpr. The type of that
attribute is pg_catalog.pg_statistic[], which is to say an array of
records whose length equals the number of expressions in the extended
statistics object. pg_statistic is where attribute statistics are
imported, so the structure has all the complexity of attribute stats
itself, slightly reduced by the fact that the fields describing the
attribute are left as InvalidOid values, but still quite complicated.
Several of the attributes in pg_statistic are of type ANYARRAY, because
they are most-common-values/histogram/most-common-elements arrays, each of
which has a composite datatype determined by the datatype of the
expression(s) and other columns in the object definition. This presents a
problem for utilities like deconstruct_array(): the datatype to be
deconstructed varies by column and by the datatype of the expression
definition, and that datatype could itself be an array, which
deconstruct_array() would then try to deconstruct in turn. There is no way
to get deconstruct_array() to stop two levels deep.
This problem was solved for pg_restore_attribute_stats() by having pg_dump
export the ANYARRAY values CAST to type "text" rather than "text[]",
which allowed each type of statistics to be decomposed according to its
own rules, and that worked fine when each statistic type became a parameter
of pg_restore_attribute_stats(). But now we've got all of those types, and
we're getting them multiple times, so that method doesn't quite scale.
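To make the one-level-deep requirement concrete, here is a minimal Python sketch (not from the patch; the helper name is invented) of splitting a Postgres array literal a single level deep while leaving every element as opaque text, which is effectively what the CAST-to-"text" trick buys:

```python
def split_pg_array(text):
    """Split a Postgres array literal one level deep, keeping each
    element as opaque text.  Hypothetical helper for illustration;
    the backend's deconstruct_array() offers no such depth limit."""
    assert text.startswith("{") and text.endswith("}")
    body, out, buf = text[1:-1], [], []
    depth, in_quotes, i = 0, False, 0
    while i < len(body):
        c = body[i]
        if in_quotes:
            if c == "\\":                 # backslash-escaped char
                buf.append(body[i + 1]); i += 2; continue
            if c == '"':
                in_quotes = False
            else:
                buf.append(c)
        elif c == '"':
            in_quotes = True
        elif c == "{":
            depth += 1; buf.append(c)     # nested array stays text
        elif c == "}":
            depth -= 1; buf.append(c)
        elif c == "," and depth == 0:
            out.append("".join(buf)); buf = []
        else:
            buf.append(c)
        i += 1
    out.append("".join(buf))
    return out

# An MCV list of int[] values: the inner arrays remain opaque text,
# so nothing tries to deconstruct a second level.
print(split_pg_array('{"{1,1}","{2,1}","{3,-1}"}'))
# -> ['{1,1}', '{2,1}', '{3,-1}']
```

Each returned element can then be fed to the input function appropriate to its own statistic type, which is the property the text cast preserves.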
I've considered several ways around this issue:
1. Defining a strict order for the statistics types, following the order
they appear in pg_stats_ext (null_frac, avg_width, n_distinct,
most_common_elems, ...), with the exprs from pg_stats_ext_exprs in last
place. Each value is CAST to "text", which means that we can deconstruct
them in a fashion very similar to how we did for attribute stats. Those
text values are put into an array in the strict order, and those arrays
are aggregated into a 2-D array.
Pros:
- This method is implemented and it works, and the code for it is reusing
tools and coding patterns we've already incorporated (deconstruct_array,
safe input functions, arglist arrays). Patch attached.
Cons:
- The ordering is completely opaque. Documenting that ordering might help a
little, but there's nothing intuitive about it and checking it has been an
irritant to author and reviewer alike.
- This opaque ordering must be kept in sync between
pg_restore_extended_stats() and pg_dump, or else statistical garbage will
result.
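As a rough illustration of option 1's packing scheme (the column names and their order here are hypothetical, not the patch's actual ordering), this is the pack/unpack pair that pg_dump and the restore function would have to keep in sync:

```python
# Illustrative strict order; the real ordering lives in both pg_dump
# and pg_restore_extended_stats() and must match exactly.
STRICT_ORDER = ["null_frac", "avg_width", "n_distinct", "most_common_vals"]

def pack(exprs_stats):
    """One positional row of text values per expression; a missing
    stat type still occupies its slot (as None/NULL)."""
    return [[s.get(k) for k in STRICT_ORDER] for s in exprs_stats]

def unpack(rows):
    """The restore side reads values back purely by position."""
    return [dict(zip(STRICT_ORDER, row)) for row in rows]

stats = [{"null_frac": "0.5", "avg_width": "4"}]
packed = pack(stats)
assert packed == [["0.5", "4", None, None]]   # order is opaque on disk
assert unpack(packed)[0]["null_frac"] == "0.5"
```

The con is visible in the packed form: nothing in the data says which position is which, so any drift between the two ends silently mislabels every value.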
2. Defining a record type specifically for purpose.
Pros:
- It could be decomposed via standard composite input function, and then
each type deconstructed on its own terms
Cons:
- It's type-clutter, and a type that is likely only used during upgrades.
- When new stats types are introduced, the structure would also have to
change, breaking typecasts of composite values from older versions. This
alone makes this option unpalatable to most reviewers, and I can't advocate
for it.
3. Keeping the 2-D text array of #1, but making each "row" a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).
Pros:
- Flexibility in ordering
- Clarity at a glance, provided that the reader has seen the kwargs
convention of the pg_restore_*_stats() functions.
- Still uses existing tooling like #1, and not that much more overhead.
- The name-value pairing problem has the same solution as the arg-pairing
that the function already does.
Cons:
- Adds the overhead of storing the stat type names and the key-value pairing.
- The 2-D nature of the array requires that the number of elements be
fixed, so we couldn't leave a stat type out of one row unless we left it
out of every other row as well.
- Adds the possibility of duplicate names.
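A minimal sketch of the row decoding option 3 would need, assuming the bail-at-first-inconsistency behavior the thread describes for the expressions array (names and error wording illustrative):

```python
def parse_kwargs_row(row):
    """Decode one flat ['name','value',...] row, in the style of the
    pg_restore_attribute_stats() argument convention, rejecting the
    duplicate-name and odd-length cases listed as cons above."""
    if len(row) % 2:
        raise ValueError("stat name with no value")
    seen = {}
    for name, value in zip(row[0::2], row[1::2]):
        if name in seen:
            # duplicate names are possible in this format, so the
            # parser must detect and reject them explicitly
            raise ValueError(f"duplicate stat name: {name}")
        seen[name] = value
    return seen

row = ["null_frac", "0.5", "avg_width", "1.0"]
assert parse_kwargs_row(row) == {"null_frac": "0.5", "avg_width": "1.0"}
```

Tracking already-seen keywords is the only tooling beyond what option 1 needs, which is why the overhead is modest.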
4. JSON. The outer structure would be an array of objects, each object
a key-value map of stat names to values.
Pros:
- Flexibility in ordering
- Clarity at a glance in a format well understood even without prior
knowledge of our kwargs convention
- We have already implemented similar things for the new formats of
pg_ndistinct and pg_dependencies.
- This method currently has the interest of Michael Paquier, the committer
of all the v19 work to date.
Cons:
- Requires implementing a state engine to parse the JSON, check for
missing values, and resolve duplicates. We do that for pg_dependencies,
and it takes 800-ish lines of code to handle 3 key names; this would have
10.
- the escaping of values in a composite record CASTed to text and then
further encoded as a JSON value would be extremely unreadable, and likely
quite bloated.
- Using JSON for stats serialization met with immediate, strong opposition
from several reviewers. That resistance may not be there for this vastly
reduced scope, especially in light of the new JSON-compatible formats for
pg_ndistinct and pg_dependencies, but it does give me pause.
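For a sense of what option 4's payload would look like, a small Python sketch (key names hypothetical) of an array of objects carrying text-cast ANYARRAY values, including the quote-escaping bloat noted in the cons:

```python
import json

# Option 4 sketch: one JSON object per expression, keyed by stat name.
# Composite ANYARRAY values stay as single text strings inside the
# JSON rather than being decomposed into JSON arrays.
serialized = json.dumps([
    {"null_frac": "0.5",
     "most_common_vals": '{"{1,1}","{2,1}"}'},  # text-cast ANYARRAY value
    {"null_frac": "0"},                          # keys can vary per row
])
decoded = json.loads(serialized)

# Round-trips cleanly, and ordering/omission are both flexible...
assert decoded[0]["most_common_vals"] == '{"{1,1}","{2,1}"}'
assert "most_common_vals" not in decoded[1]

# ...but every quote in the array literal gets escaped on the wire,
# which is the readability/bloat concern raised above.
assert '\\"' in serialized
```

The flexibility cons of option 3 (fixed row width, duplicate keys needing detection by hand) largely disappear here; the cost is the parsing state engine and the escaping noise.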
And... that's the major decision point. If we solve that, the rest is far
less controversial. My apologies that this summary itself needs a summary.
Thanks for reading. I'm eager to hear perspectives on the serialization
methods proposed, or suggestions of other methods.
[1] https://commitfest.postgresql.org/patch/5517/
[2] https://www.postgresql.org/message-id/flat/aTE4AL7U0dp1Jjrx%40paquier.xyz#72116daf9d37828a47ce477a852a78d3
[3] https://commitfest.postgresql.org/patch/4538/
[4] https://www.postgresql.org/message-id/flat/CADkLM%3DcB0rF3p_FuWRTMSV0983ihTRpsH%2BOCpNyiqE7Wk0vUWA%40mail.gmail.com
[5] https://commitfest.postgresql.org/patch/5523/
[6] The issue of custom statistic kinds like those found in PostGIS would
still remain.

Michael Paquier <michael@paquier.xyz>, Jan 30, 2026, 5:56 AM UTC

On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
> 3. Keeping the 2-D text array of #1, but making each "row" a list of
> kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
> (i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).
> 4. JSON. The outer structure would be an array of objects, each object
> a key-value map of stat names to values.

I'd still favor 4, on the grounds that it's easier to edit and read,
and it would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite record
values casted into a single text value could be confusing (double-quote
issues, I guess?), so perhaps a text[] as in 3 would be better adapted
for readability. We could also enforce some checks based on the order
of the arguments in the input array, so that duplicates are not an
issue, I guess?

Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statattbuildstavalues() and statattsetslot(),
perhaps.

I have also posted a few more arguments here, for reference:
https://www.postgresql.org/message-id/aXvqN2fhDJZhL2RS@paquier.xyz
--
Michael

Corey Huinker <corey.huinker@gmail.com>, Jan 30, 2026, 8:07 AM UTC

On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
>> 3. Keeping the 2-D text array of #1, but making each "row" a list of
>> kwargs-like pairs like the arguments used in pg_restore_attribute_stats()
>> (i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_vals',...]).
>> 4. JSON. The outer structure would be an array of objects, each object
>> a key-value map of stat names to values.
>
> I'd still favor 4, on the grounds that it's easier to edit and read,
> and it would be more in line with the MCV, dependencies, ndistinct and
> att/rel statistics. The format proposed in the attached patch is hard
> to work with, anyway. Now, I do take your point that composite record
> values casted into a single text value could be confusing (double-quote
> issues, I guess?), so perhaps a text[] as in 3 would be better adapted
> for readability.

Hmm, maybe it isn't so bad:

SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
                       text
---------------------------------------------------
 {"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
(1 row)

SELECT array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
                   array_to_json
---------------------------------------------------
 ["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
(1 row)

Mind you, this is an ANYARRAY first casted to text. If we cast
pg_stats_ext_exprs.most_common_vals directly to JSON, it'll drill down
into the innards of the composite values, because it can see the local
datatypes, and that breaks our ability to use regular input functions. I
learned that the hard way when using JSON for serializing attribute stats
when this effort first began.

Before seeing that, I wanted to try option 3 first, as it brings clarity
with no real increase in tooling other than looking for repeated keywords,
but if you're hyped for JSON then I'll try that first.

> We could also enforce some checks based on the order of the arguments
> in the input array, so that duplicates are not an issue, I guess?

If we're doing a kwargs thing then I may as well just track which keywords
were already used. We already bail out on the whole expressions array at
the first sign of inconsistency, so it's not like we have to decide which
of the duplicates to keep.

> Structurally, I feel that import_expressions() is overcomplicated, and
> with the correct structure tracking the state of each field, I would
> try to reduce a bit the duplication that the patch presents, aiming at
> fewer callers of statattbuildstavalues() and statattsetslot(),
> perhaps.

I don't think we can get around those. It's a limitation of how the
sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works. We want
to fill in each stakind as we find it, and we don't know how many of them
we've already filled out. An array of records would have been better, but
we've got 5 parallel arrays of scalars and we have to live with it.
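A toy model of the slot-filling constraint described here, assuming pg_statistic's five parallel slot groups (the helper name is invented; the real code works over staopN/stacollN/stanumbersN/stavaluesN as well):

```python
# pg_statistic has a fixed number of stat slots, each a column group
# (stakindN, staopN, stacollN, stanumbersN, stavaluesN for N = 1..5).
STATISTIC_NUM_SLOTS = 5

def set_slot(slots, kind, values):
    """Place (kind, values) into the first free slot, which is why
    stakinds must be filled in as they are found: we never know in
    advance how many slots a given row will end up using."""
    for i in range(STATISTIC_NUM_SLOTS):
        if slots[i] is None:
            slots[i] = (kind, values)
            return i
    raise ValueError("no free pg_statistic slot")

slots = [None] * STATISTIC_NUM_SLOTS
assert set_slot(slots, "mcv", ["a", "b"]) == 0
assert set_slot(slots, "histogram", ["c"]) == 1
# Remaining slots stay empty; a row need not use all five.
assert slots[2] is None
```

The parallel-scalar-array layout means there is no single record to build per stat kind, hence the repeated slot-filling calls the patch cannot easily consolidate.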