Postgres Email Lists

pgsql-hackers

❮

Wrong results from inner-unique joins caused by collation mismatch

Jump to comment-1
Richard Guo<guofenglinux@gmail.com>
Apr 24, 2026, 11:42 AM UTC
While working on the wrong results issue caused by collation mismatch
in GROUP BY and havingQual [1], I noticed $subject:
create collation ci (provider = icu, locale = 'und-u-ks-level2',
deterministic = false);
create table t (a text);
insert into t values ('A'), ('a');
create unique index on t (a);
-- wrong results: should be 4 rows
select * from t t1 join t t2 on t1.a = t2.a collate ci;
a a
A A
a a
(2 rows)
The root cause is explained by the XXX comment in
relationhasuniqueindexfor():
```
/*
* XXX at some point we may need to check collations here too.
* For the moment we assume all collations reduce to the same
* notion of equality.
*/
```
That assumption stopped being safe when nondeterministic collations
were introduced in PG 12. A unique index enforces uniqueness under
its own collation; if a query's equality clause uses a different
collation, and either side is nondeterministic, the index's uniqueness
does not imply uniqueness under the clause.
Several planner optimizations use this uniqueness proof, and all of
them can yield wrong results in this scenario. These include
inner-unique join execution, left-join removal, semijoin-to-innerjoin
reduction, and self-join elimination.
My first thought was to fix this by:
+ if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+ exprInputCollation((Node *) rinfo->clause)))
+ continue;
However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).
Hence, I got attached patch. Thoughts?
[1] https://postgr.es/m/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com
- Richard
View in PostgreSQL Archives →
- Jump to comment-1
  Tom Lane<tgl@sss.pgh.pa.us>
  Apr 24, 2026, 2:53 PM UTC
  Richard Guo <guofenglinux@gmail.com> writes:
  My first thought was to fix this by:
  + if (!IndexCollMatchesExprColl(ind->indexcollations[c],
  + exprInputCollation((Node *) rinfo->clause)))
  + continue;
  However, this caused an unexpected plan diff in join.out where a
  left-join removal over (name, text) stopped working, because name and
  text use different collations. So this check is too strict: a
  mismatch between two deterministic collations should be OK for
  uniqueness proof, as a deterministic collation treats two strings as
  equal iff they are byte-wise equal (see CREATE COLLATION).
  Yes, we'd be taking a serious performance hit if we insisted on
  exact collation matches for this purpose. I agree that disallowing
  non-matching non-deterministic collations is the right fix.
  Hence, I got attached patch. Thoughts?
  I don't love doing it like this, for two reasons:
  1. I think there are other places in the planner that will need
  substantially this same logic. I recommend breaking out a
  subroutine defined more or less as "do these collations have
  equivalent notions of equality".
  2. I find the test next to unreadable as written --- for example,
  it's more difficult than it should be to figure out what happens
  if one collation is deterministic and the other not. Using a
  subroutine would help here by letting you break down the test
  into multiple steps.
  regards, tom lane
  View in PostgreSQL Archives →
  Jump to comment-1
  Richard Guo<guofenglinux@gmail.com>
  Apr 24, 2026, 3:44 PM UTC
  On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
  Richard Guo <guofenglinux@gmail.com> writes:
  My first thought was to fix this by:
  + if (!IndexCollMatchesExprColl(ind->indexcollations[c],
  + exprInputCollation((Node *) rinfo->clause)))
  + continue;
  However, this caused an unexpected plan diff in join.out where a
  left-join removal over (name, text) stopped working, because name and
  text use different collations. So this check is too strict: a
  mismatch between two deterministic collations should be OK for
  uniqueness proof, as a deterministic collation treats two strings as
  equal iff they are byte-wise equal (see CREATE COLLATION).
  Yes, we'd be taking a serious performance hit if we insisted on
  exact collation matches for this purpose. I agree that disallowing
  non-matching non-deterministic collations is the right fix.
  Thanks for taking a look!
  Hence, I got attached patch. Thoughts?
  I don't love doing it like this, for two reasons:
  
  1. I think there are other places in the planner that will need
  substantially this same logic. I recommend breaking out a
  subroutine defined more or less as "do these collations have
  equivalent notions of equality".
  Right. I just found several other places that need this same logic,
  which are in queryisdistinct_for(). Without it, we produce wrong
  results:
  select * from t t1 join
  (select distinct a from t) t2 on t1.a = t2.a COLLATE "ci";
  a a
  A a
  a a
  (2 rows)
  select * from t t1 join
  (select a from t group by a) t2 on t1.a = t2.a COLLATE "ci";
  a a
  A a
  a a
  (2 rows)
  2. I find the test next to unreadable as written --- for example,
  it's more difficult than it should be to figure out what happens
  if one collation is deterministic and the other not. Using a
  subroutine would help here by letting you break down the test
  into multiple steps.
  Agreed. Will wrap the logic in a subroutine.
  - Richard
  View in PostgreSQL Archives →
  Jump to comment-1
  Richard Guo<guofenglinux@gmail.com>
  Apr 25, 2026, 9:24 AM UTC
  On Sat, Apr 25, 2026 at 12:44 AM Richard Guo <guofenglinux@gmail.com> wrote:
  On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
  1. I think there are other places in the planner that will need
  substantially this same logic. I recommend breaking out a
  subroutine defined more or less as "do these collations have
  equivalent notions of equality".
  Right. I just found several other places that need this same logic,
  which are in queryisdistinct_for(). Without it, we produce wrong
  results:
  0001 wrapped the logic in subroutine collationsarecompatible().
  (I'm a little unsure about the InvalidOid cases. The current
  implementation treats InvalidOid on either side as compatible, as
  absence of a collation can't conflict with the other side. This
  generalizes the asymmetric treatment in IndexCollMatchesExprColl().)
  0002 fixed queryisdistinct_for(), using that subroutine.
  - Richard
  View in PostgreSQL Archives →