pg_kazsearch: Full-text search extension for Kazakh language

    Darkhan<darkhanahmetov2005@gmail.com>
    Apr 5, 2026, 1:32 PM UTC
    Hi all,
    I built pg_kazsearch, a PostgreSQL extension that adds full-text search
    support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
    stop word list available in PostgreSQL, so anyone searching Kazakh text is
    stuck with trigram matching or application-level workarounds.
    Kazakh is agglutinative — a single word can carry 5-6 suffixes, which makes
    standard search approaches miss most relevant results. pg_kazsearch
    provides a custom Kazakh stemmer (core written in Rust), a stop word list,
    and a text search dictionary that plugs into the standard PostgreSQL FTS
    infrastructure — GIN indexes, ts_rank, phrase search all work out of the
    box.
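To make the suffix-stacking problem concrete, here is a minimal sketch of trigram similarity computed the way the pg_trgm documentation describes it (lowercase the word, pad it with two leading spaces and one trailing space, then divide shared trigrams by the size of the union). The word pair is an illustrative assumption, not taken from the test dataset: "қала" (city) and "қалаларымыздағы" (roughly "the one in our cities", i.e. қала + лар + ымыз + да + ғы). The helper names are hypothetical and the code handles single words only.

```python
def trigrams(word: str) -> set[str]:
    """Distinct trigrams of one word, padded as pg_trgm documents it."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the union of trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    base = "қала"
    inflected = "қалаларымыздағы"  # қала + лар + ымыз + да + ғы
    # 4 shared trigrams out of 16 distinct ones
    print(similarity(base, inflected))  # 0.25
```

The score of 0.25 is below pg_trgm's default similarity threshold of 0.3, so in this toy case a whole-word trigram comparison would not treat the inflected form as a match for its own root.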
    I tested it on a dataset of 3,000 real Kazakh news articles. On the same
    query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
    with a 23% improvement in recall overall.
    You can install it with a single command via deb package or Docker image,
    no compilation needed.
    Repo: https://github.com/darkhanakh/pg-kazsearch
    I'd appreciate any feedback, especially from anyone working on text search
    internals or with experience supporting non-Latin or agglutinative
    languages in PostgreSQL.
    Thanks, Darkhan
      Adrien Nayrat<adrien.nayrat@anayrat.info>
      Apr 8, 2026, 2:43 PM UTC
      Hello,
      Thanks for your work.
      I don't know anything about Kazakh.
      But have you tried adding it to the Snowball stemmer [1]?
      Since Postgres uses Snowball, that would give Kazakh a better chance
      of being supported in future versions.
      [1]: https://github.com/snowballstem/snowball
      --
      Adrien NAYRAT
      https://pro.anayrat.info
        Darkhan<darkhanahmetov2005@gmail.com>
        Apr 8, 2026, 2:55 PM UTC
        Thanks for the suggestion!
        I did look into Snowball early on. There is actually a Turkish stemmer in
        Snowball already and Turkish is structurally very similar to Kazakh (both
        agglutinative Turkic languages). But honestly the Turkish one is pretty
        limited: it only handles nominal suffixes and doesn’t account for verb
        morphology at all. The author even mentions this in the comments. So it
        kind of works for basic noun cases but falls apart on real text.
        The reason I went with a standalone extension is that Kazakh has suffix
        chains where vowel harmony interacts with each layer and you need
        context-aware decisions, not just stripping patterns from the end of the
        word. My stemmer uses a penalty-scored breadth-first search (BFS) over
        possible suffix decompositions instead of the linear step-by-step
        stripping that Snowball
        does. With 5-6 suffixes stacked on one word you really need to evaluate
        multiple decomposition paths to find the best one.
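As a rough illustration of what a penalty-scored search over suffix decompositions can look like, here is a minimal sketch. It is not the pg_kazsearch implementation: the suffix table, the penalty values, the minimum stem length, and the scoring rule (stem length plus accumulated penalty, lowest wins) are all assumptions made up for this example, and the real stemmer's rules are certainly richer.

```python
from collections import deque

BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")

# Hypothetical toy suffix table (suffix -> base penalty), illustrative only.
SUFFIXES = {
    "лар": 0.5, "лер": 0.5, "дар": 0.5, "дер": 0.5, "тар": 0.5, "тер": 0.5,
    "ымыз": 0.5, "іміз": 0.5,
    "да": 0.5, "де": 0.5, "та": 0.5, "те": 0.5,
    "ғы": 0.5, "гі": 0.5,
    "ы": 1.0, "і": 1.0,  # short, ambiguous endings cost more
}
MIN_STEM_LEN = 3        # assumed lower bound on stem length
HARMONY_PENALTY = 3.0   # assumed cost of a vowel-harmony mismatch


def vowel_class(text):
    """Return 'back' or 'front' for the last vowel in text, or None."""
    for ch in reversed(text):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return None


def strip_penalty(rest, suffix):
    """Base penalty, plus an extra cost if stem and suffix disagree in harmony."""
    penalty = SUFFIXES[suffix]
    rest_cls, suffix_cls = vowel_class(rest), vowel_class(suffix)
    if rest_cls and suffix_cls and rest_cls != suffix_cls:
        penalty += HARMONY_PENALTY
    return penalty


def stem(word):
    """Explore every suffix-stripping path breadth-first; keep the best score."""
    word = word.lower()
    best = {word: 0.0}  # candidate stem -> lowest accumulated penalty
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for suffix in SUFFIXES:
            if not current.endswith(suffix):
                continue
            rest = current[:-len(suffix)]
            if len(rest) < MIN_STEM_LEN:
                continue
            penalty = best[current] + strip_penalty(rest, suffix)
            if penalty < best.get(rest, float("inf")):
                best[rest] = penalty
                queue.append(rest)
    # Lower is better: short stems win unless the path to them was costly.
    return min(best, key=lambda s: (len(s) + best[s], s))


if __name__ == "__main__":
    print(stem("қалаларымыздағы"))    # қала
    print(stem("мектептеріміздегі"))  # мектеп
```

In this toy table the word "қалаларымыздағы" can lose either "ғы" or the shorter "ы" first. The "ы" path dead-ends immediately, while the "ғы" path continues through "да", "ымыз" and "лар" down to "қала". A stripper that commits to the first matching rule could get stuck on the dead end; scoring all paths avoids that.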
        That said, contributing a simplified Kazakh stemmer to Snowball is
        something I’d like to explore longer term. Even a basic version would be
        better than nothing, which is what exists today. I would need to figure
        out how much of the BFS logic can fit into the Snowball language, or
        whether a simpler approach gets close enough.
        Appreciate the pointer!
        Darkhan
          Philip Johnston<philip@pgcache.com>
          Apr 10, 2026, 3:24 PM UTC
          Darkhan,
          Great work! As a former archaeologist, I was reminded by your comment
          about Kazakh being agglutinative of ancient Sumerian, which has a
          similar structure. You might find some interest in your work among
          philologists and ancient Near Eastern historians.
          Philip