pgsql-general
pg_kazsearch: Full-text search extension for Kazakh language
Darkhan <darkhanahmetov2005@gmail.com>, Apr 5, 2026, 1:32 PM UTC
Hi all,
I built pg_kazsearch, a PostgreSQL extension that adds full-text search
support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
stop word list available in PostgreSQL, so anyone searching Kazakh text is
stuck with trigram matching or application-level workarounds.
Kazakh is agglutinative — a single word can carry 5-6 suffixes, which makes
standard search approaches miss most relevant results. pg_kazsearch
provides a custom Kazakh stemmer (core written in Rust), a stop word list,
and a text search dictionary that plugs into the standard PostgreSQL FTS
infrastructure — GIN indexes, ts_rank, phrase search all work out of the
box.
I tested it on a dataset of 3,000 real Kazakh news articles. On the same
query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
with a 23% improvement in recall overall.
You can install it with a single command via deb package or Docker image,
no compilation needed.
Repo: https://github.com/darkhanakh/pg-kazsearch
I'd appreciate any feedback, especially from anyone working on text search
internals or with experience supporting non-Latin or agglutinative
languages in PostgreSQL.
Thanks, Darkhan

Adrien Nayrat <adrien.nayrat@anayrat.info>, Apr 8, 2026, 2:43 PM UTC
Hello,
Thanks for your work.
I don't know anything about Kazakh.
But have you tried adding it to the Snowball stemmer [1]?
Since Postgres uses Snowball, that would improve the chances of Kazakh
being supported in future versions.
[1] https://github.com/snowballstem/snowball
--
Adrien NAYRAT
https://pro.anayrat.info

Darkhan <darkhanahmetov2005@gmail.com>, Apr 8, 2026, 2:55 PM UTC
Thanks for the suggestion!
I did look into Snowball early on. It already has a Turkish stemmer, and
Turkish is structurally very similar to Kazakh (both are agglutinative
Turkic languages). But honestly the Turkish one is pretty lobotomized: it
only handles nominal suffixes and doesn’t account for verb morphology at
all. The author even mentions this in the comments. So it kind of works for
basic noun cases but falls apart on real text.
The reason I went with a standalone extension is that Kazakh has suffix
chains where vowel harmony interacts with each layer and you need
context-aware decisions, not just stripping patterns from the end of the
word. My stemmer uses a penalty-scored BFS over possible suffix
decompositions instead of the linear step-by-step stripping that Snowball
does. With 5-6 suffixes stacked on one word you really need to evaluate
multiple decomposition paths to find the best one.
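To make the idea concrete, here is a minimal Python sketch of a breadth-first search over suffix decompositions with a penalty score. It is only an illustration, not the pg_kazsearch implementation (whose core is in Rust): the suffix list, the penalty weights, the minimum stem length, and the names `harmony`, `penalty`, and `stem_word` are all assumptions made up for this example.

```python
from collections import deque

# Hypothetical, tiny suffix inventory; a real stemmer needs far more.
SUFFIXES = ["лар", "лер", "тар", "тер", "ымыз", "іміз",
            "да", "де", "ға", "ге", "ма", "ме"]
BACK, FRONT = set("аоұы"), set("әөүіе")
MIN_STEM = 3  # assumed minimum plausible stem length, in characters


def harmony(text):
    """Return 'back' or 'front' from the last harmonizing vowel, else None."""
    for ch in reversed(text):
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return None


def penalty(stem, stripped):
    """Lower is better: short stems and harmony clashes are penalized."""
    score = len(stem)                 # prefer stripping more material
    if len(stem) < MIN_STEM:
        score += 10                   # assumed cost of over-stripping
    for suffix in stripped:
        if harmony(suffix) != harmony(stem):
            score += 5                # assumed cost of a harmony clash
    return score


def stem_word(word):
    """Explore every suffix decomposition breadth-first, keep the cheapest."""
    best = (penalty(word, ()), word)
    queue = deque([(word, ())])
    while queue:
        stem, stripped = queue.popleft()
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) > len(suffix):
                rest = stem[: -len(suffix)]
                path = stripped + (suffix,)
                best = min(best, (penalty(rest, path), rest))
                queue.append((rest, path))
    return best[1]


print(stem_word("мектептерімізде"))  # мектеп ("in our schools" -> "school")
print(stem_word("алмалар"))          # алма ("apples" -> "apple", not "ал")
```

In this toy scoring, "алмалар" could also be decomposed as ал + ма + лар, but the two-letter stem draws the short-stem penalty, so the candidate "алма" wins. A real scorer would need a much richer model than these three terms.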
That said, contributing a simplified Kazakh stemmer to Snowball is something
I’d like to explore longer term. Even a basic version would be better than
nothing, which is what exists today. I would need to figure out how much of
the BFS logic can fit into the Snowball language, or whether a simpler
approach gets close enough.
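For contrast, a deliberately naive linear stripper might look like the Python sketch below. It is not how Snowball's Turkish stemmer works and is not taken from pg_kazsearch; the suffix list, the `min_stem` default, and the name `strip_linear` are assumptions for illustration. It simply removes the first matching suffix over and over, with no comparison between alternative decompositions.

```python
# Hypothetical suffix list, longest first; not taken from any real stemmer.
SUFFIXES = ["ымыз", "іміз", "лар", "лер", "тар", "тер",
            "да", "де", "ға", "ге", "ма", "ме"]


def strip_linear(word, min_stem=2):
    """Repeatedly remove the first matching suffix until none applies."""
    stripped = []
    while True:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                stripped.append(suffix)
                break
        else:
            # No suffix matched in this pass, so stripping is finished.
            return word, stripped


print(strip_linear("мектептерімізде"))  # ('мектеп', ['де', 'іміз', 'тер'])
print(strip_linear("алмалар"))          # ('ал', ['лар', 'ма'])
```

The first word comes out fine, but "алмалар" (apples) is reduced past "алма" (apple) to "ал" (the verb root "take"), because the ending "ма" looks like the negation suffix. Guard conditions can patch individual cases like this, which is the trade-off a simpler approach would have to weigh.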
Appreciate the pointer!
Darkhan

Philip Johnston <philip@pgcache.com>, Apr 10, 2026, 3:24 PM UTC
Darkhan,
Great work! I'm a former archaeologist, and your comment about Kazakh being
agglutinative reminded me of ancient Sumerian, which has a similar structure.
Your work might find some interest among philologists and ancient Near
Eastern historians.
Philip