pgsql-hackers
❮
Wrong results from inner-unique joins caused by collation mismatch
- Jump to comment-1Richard Guo<guofenglinux@gmail.com>Apr 24, 2026, 11:42 AM UTCWhile working on the wrong results issue caused by collation mismatch
in GROUP BY and havingQual [1], I noticed $subject:
create collation ci (provider = icu, locale = 'und-u-ks-level2',
deterministic = false);
create table t (a text);
insert into t values ('A'), ('a');
create unique index on t (a);
-- wrong results: should be 4 rows
select * from t t1 join t t2 on t1.a = t2.a collate ci;(2 rows)a a A A a a
The root cause is explained by the XXX comment in
relationhasuniqueindexfor():
That assumption stopped being safe when nondeterministic collations/* * XXX at some point we may need to check collations here too. * For the moment we assume all collations reduce to the same * notion of equality. */
were introduced in PG 12. A unique index enforces uniqueness under
its own collation; if a query's equality clause uses a different
collation, and either side is nondeterministic, the index's uniqueness
does not imply uniqueness under the clause.
Several planner optimizations use this uniqueness proof, and all of
them can yield wrong results in this scenario. These include
inner-unique join execution, left-join removal, semijoin-to-innerjoin
reduction, and self-join elimination.
My first thought was to fix this by:
+ if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+ exprInputCollation((Node *) rinfo->clause)))
+ continue;
However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).
Hence, I got attached patch. Thoughts?
[1] https://postgr.es/m/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com
- Richard- Jump to comment-1Tom Lane<tgl@sss.pgh.pa.us>Apr 24, 2026, 2:53 PM UTCRichard Guo <guofenglinux@gmail.com> writes:
My first thought was to fix this by:
+ if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+ exprInputCollation((Node *) rinfo->clause)))+ continue;
However, this caused an unexpected plan diff in join.out where a
Yes, we'd be taking a serious performance hit if we insisted on
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.Hence, I got attached patch. Thoughts?
I don't love doing it like this, for two reasons:
1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".
2. I find the test next to unreadable as written --- for example,
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.regards, tom lane- Jump to comment-1Richard Guo<guofenglinux@gmail.com>Apr 24, 2026, 3:44 PM UTCOn Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
My first thought was to fix this by:+ if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+ exprInputCollation((Node *) rinfo->clause)))+ continue;
However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).Yes, we'd be taking a serious performance hit if we insisted on
Thanks for taking a look!
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.Hence, I got attached patch. Thoughts?
I don't love doing it like this, for two reasons:
Right. I just found several other places that need this same logic,
1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".
which are in queryisdistinct_for(). Without it, we produce wrong
results:
select * from t t1 join(select distinct a from t) t2 on t1.a = t2.a COLLATE "ci";(2 rows)a a A a a a
select * from t t1 join(select a from t group by a) t2 on t1.a = t2.a COLLATE "ci";(2 rows)a a A a a a 2. I find the test next to unreadable as written --- for example,
Agreed. Will wrap the logic in a subroutine.
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.
- Richard- Jump to comment-1Richard Guo<guofenglinux@gmail.com>Apr 25, 2026, 9:24 AM UTCOn Sat, Apr 25, 2026 at 12:44 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".Right. I just found several other places that need this same logic,
0001 wrapped the logic in subroutine collationsarecompatible().
which are in queryisdistinct_for(). Without it, we produce wrong
results:
(I'm a little unsure about the InvalidOid cases. The current
implementation treats InvalidOid on either side as compatible, as
absence of a collation can't conflict with the other side. This
generalizes the asymmetric treatment in IndexCollMatchesExprColl().)
0002 fixed queryisdistinct_for(), using that subroutine.
- Richard