July 31, 2025
Full text search is an advanced search technique that enhances query results, thereby improving the UX.
In this blog post, I'll use examples to show how to implement full text search in Arabic, which I believe is one of the most challenging languages for FTS because it is so hard to normalize.
Pattern matching is not considered full text search (FTS), but it can be sufficient for simple projects where search is not a critical feature.
The difference between LIKE and ILIKE is case sensitivity: LIKE is case-sensitive, while ILIKE is case-insensitive. So searching with LIKE 'Apple%' will only match Apple or Apples, whereas ILIKE 'Apple%' will match all case variations like apples, APPLE, etc.
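For illustration, here is what pattern matching looks like against a hypothetical books table (the table and column names are just examples):
SELECT * FROM books WHERE title LIKE 'Harry Potter%';   -- case-sensitive prefix match
SELECT * FROM books WHERE title ILIKE '%harry potter%'; -- case-insensitive substring match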
But Arabic has no letter case, so this distinction has no effect there. Pattern matching can also become slow when you have a large dataset.
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages - postgresql.org
The pg_trgm extension can easily be enabled by running this SQL command:
CREATE EXTENSION pg_trgm; -- enable trigram search module
You can read more about the operators that pg_trgm provides on the official website. Trigram search is a good technique for fuzzy search and handling users' typos, for example:
SELECT similarity('book', 'bok'); -- returns 0.5
The matching threshold is 0.3 by default, and you can modify it.
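The threshold used by the % match operator lives in the pg_trgm.similarity_threshold setting, so you can inspect and change it per session:
SHOW pg_trgm.similarity_threshold;        -- 0.3 by default
SET pg_trgm.similarity_threshold = 0.25;  -- loosen the threshold for this session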
SELECT similarity('مكتبة', 'مكتبي'); -- returns 0.5
But pg_trgm does not always give good results with Arabic word derivations, for example:
SELECT similarity('كتب', 'كتاب'); -- returns 0.2857143
Since that is below the default 0.3 threshold, it is not considered a match.
Also, when the word includes diacritics (tashkeel), it is treated as completely different characters:
SELECT similarity('تشكيل', 'تَشْكِيلٍ'); -- returns 0.06666667
Therefore, you should be cautious and attempt to normalize the text, at the very least by removing the diacritics.
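As a minimal sketch of such normalization, you could strip the tashkeel range (roughly U+064B through U+0652) with regexp_replace before comparing; a fuller solution would also handle alef and hamza variants:
SELECT similarity(regexp_replace('تَشْكِيلٍ', '[\u064B-\u0652]', '', 'g'), 'تشكيل');
-- with the diacritics removed the strings are identical, so this returns 1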
I won't go deep into the other trigram operators that pg_trgm provides; I'll probably write another blog post to explore trigram search in more detail.
That said, it's considered a good solution when you’re working with a small number of words, but it can be very slow on large texts.
pg_trgm also provides GIN and GiST indexes to make your queries much faster:
-- GIN index
CREATE INDEX trgm_idx ON table_name USING GIN (col_name gin_trgm_ops);
-- Gist index
CREATE INDEX trgm_idx ON table_name USING GIST (col_name gist_trgm_ops);
Final example of a bookstore search:
SELECT * FROM books WHERE title % 'صحيح ملم';
This should return the book "صحيح مسلم" even though the query contains a typo.
If you run the same query with LIKE/ILIKE, it will not return anything, and this is the good thing about trigram search: catching typos.
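For comparison, the pattern-matching version of the same search finds nothing, because the typo breaks the literal match:
SELECT * FROM books WHERE title ILIKE '%صحيح ملم%'; -- returns no rows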
First of all, before understanding full text search (FTS), we must ask: why do we need it?
Full text search shines when we want to implement a "search by meaning" feature. For example, searching for "سيارة" should also return results that mention "عربة".
Full text search in PostgreSQL is based on the match operator @@, which returns true if a tsvector matches a tsquery.
Basic example of the FTS operators:
fts=# select to_tsvector('arabic', 'كيف يسد الخيال فجوات التاريخ؟');
to_tsvector
--------------------------------------------
'تاريخ':5 'جوا':4 'خيال':3 'كيف':1 'يسد':2
The to_tsvector function converts text into a tsvector type. A tsvector is a highly optimized, sorted list of distinct lexemes along with their positions in the original document.
Normalization (Stemming/Lemmatization): It reduces tokens to their base or root forms, called lexemes. For example, "يجري" and "يجرون" would both be reduced to the lexeme "جري". This allows a search for "جري" to match all its variations.
But if you look closely, it normalized "فجوات" to "جوا", which is simply wrong; we will come back to this later, as Arabic support in FTS is very poor.
A simpler example of why it's very poor:
fts=# select to_tsvector('arabic', 'يجري'), to_tsvector('arabic', 'يجرون');
-[ RECORD 1 ]----------
to_tsvector | 'يجر':1
to_tsvector | 'يجرون':1
to_tsquery returns the same normalized lexemes as to_tsvector, so the query text stays consistent with the document when performing the match:
fts=# select plainto_tsquery('arabic', 'كيف يسد الخيال فجوات التاريخ؟');
plainto_tsquery
------------------------------------------
'كيف' & 'يسد' & 'خيال' & 'جوا' & 'تاريخ'
As you can see, it's still converting "فجوات" to "جوا". Example of the usage of both:
fts=# select to_tsvector('arabic', 'كيف يسد الخيال فجوات التاريخ؟') @@ to_tsquery('arabic', 'تاريخ');
?column?
----------
t
The t stands for true, as the document and the query matched.
Normalization is probably the most challenging part in FTS for the Arabic language, as you need a good dictionary if you want to have good results.
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. - postgresql.org
There are 6 main dictionaries in PostgreSQL that we can use to lexemize a word, which are:
Stop Words Dictionary: Eliminates common stop words in the specified language (e.g., the, of, عن, على, فـ...).
Simple Dictionary: The simple dictionary template operates by converting the word to lower case and checking it against a file of stop words. It's the simplest dictionary that you can get started with.
Synonym Dictionary: Used to replace a word with a synonym, for example, we want the search for "عربة، عربية، كرهبة" to be lexmized to "سيارة"
Thesaurus Dictionary: Same as the synonym dictionary, but it supports phrases, so, for example, the phrase "المدينة المنورة" can be mapped to "المدينة".
Ispell Dictionary: The Ispell dictionary template supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme, converting "أكلنا، أكلتم، يأكلون" to "أكل"
Snowball Dictionary: A Snowball dictionary is a very basic stemmer. It maps different forms of the same word to a common "stem" (e.g., connections, connective, connected, and connecting all map to connect), as the ts_lexize example below shows.
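You can check what a single dictionary does to a word with ts_lexize; here it is with english_stem, the built-in English Snowball dictionary:
SELECT ts_lexize('english_stem', 'connections'); -- {connect}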
The default to_tsvector('arabic', ...) configuration actually uses the Snowball dictionary to lexemize words, and as you saw above, it converted "فجوات" to "جوا", which is wrong.
You can try Snowball stemming in different languages on their official website.
To implement Arabic full text search, we will use the Hunspell format files, since Ispell files do not exist for the Arabic language.
You can download them from here.
You just need the ar.dic and ar.aff files.
I'm running PostgreSQL in a Docker container, so if you want to follow along, you can do the same.
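If you want to spin up the same setup, something along these lines should do (the container name ftsdb and the postgres:16 image match what I use below):
docker run --name ftsdb -e POSTGRES_PASSWORD=postgres -d postgres:16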
docker cp ./ar.aff ftsdb:/usr/share/postgresql/16/tsearch_data/ar.affix
docker cp ./ar.dic ftsdb:/usr/share/postgresql/16/tsearch_data/ar.dict
Ispell expects the files to have the .affix and .dict extensions, which is why they are renamed during the copy.
ftsdb is the name of the container, so you are copying the files from your machine into the container.
These files have to end up in PostgreSQL's tsearch_data directory. I'm using Postgres 16; if you're not sure which version you have or where that directory is, run pg_config --sharedir to find it.
If you run ls /usr/share/postgresql/16/tsearch_data, you will find files and templates for different languages. You can copy some of them and build your own.
Example of a synonym dictionary file template
root@efc3952f8152:/# cat usr/share/postgresql/16/tsearch_data/synonym_sample.syn
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
As I mentioned before, here searching for "postgres", "postgresql", or "postgre" will be lexemized to "pgsql", i.e., they are treated as the same word.
Let's go back.
Now, after copying the files, we will create a custom dictionary that uses them:
CREATE TEXT SEARCH DICTIONARY arabic_hunspell (
    TEMPLATE = ispell,
    DictFile = ar,
    AffFile = ar
    -- StopWords = ar
);
CREATE TEXT SEARCH CONFIGURATION public.arabic (
    COPY = pg_catalog.english
);
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR
        asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH arabic_hunspell;
I commented out the StopWords option because we don't have a file for it, but as an exercise, you can copy a template from the tsearch_data folder, fill it with Arabic stop words, and reconfigure the dictionary to use your custom file.
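A rough sketch of that exercise, assuming a hypothetical ar.stop file with one Arabic stop word per line (في، من، على، عن، ...):
docker cp ./ar.stop ftsdb:/usr/share/postgresql/16/tsearch_data/ar.stop
ALTER TEXT SEARCH DICTIONARY arabic_hunspell (StopWords = ar);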
What we did above is create a custom Ispell dictionary, then create a basic configuration copied from the English FTS configuration, and finally override its mappings so they use our custom dictionary.
Now, to test the new configuration that uses the custom files, we can run:
fts=# SELECT ts_debug('arabic', 'فجوات');
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------
ts_debug | (word,"Word, all letters",فجوات,"{arabic_hunspell}",arabic_hunspell,"{جو,فجوة,جوة,جو,فجوة,جوة,جو,فجوة,جوة,جو,فجوة,جوة}")
As you can see, it's using our arabic_hunspell dictionary, and the output for the word فجوات is different.
Is it good? Nope, it's still not perfect, as we can see down below:
fts=# select to_tsquery('arabic', 'كتب');
-[ RECORD 1 ]---------------------------------------------------------
to_tsquery | 'كتب' | 'تب' | 'كتب' | 'تب' | 'تبي' | 'تب' | 'كتب' | 'تب'
fts=# select to_tsvector('arabic', 'يكتبون');
-[ RECORD 1 ]-----------
to_tsvector | 'يكتبون':1
يكتبون is not properly lexemized, and therefore the search might not give an accurate result.
Let's try to configure the synonym dictionary and see it in action. I created a file called ar.syn with this content:
يثرب المدينة
طيبة المدينة
This tells PostgreSQL to lexemize both words to "المدينة".
After copying the file to the tsearch_data directory, we create a synonym dictionary that uses it and change the configuration to include it, just as we did before with the Hunspell dictionary.
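If you're using the same Docker setup as before, the copy step looks like this:
docker cp ./ar.syn ftsdb:/usr/share/postgresql/16/tsearch_data/ar.syn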
CREATE TEXT SEARCH DICTIONARY arabic_syn (
TEMPLATE = synonym,
SYNONYMS = ar
);
ALTER TEXT SEARCH CONFIGURATION arabic
ALTER MAPPING FOR
asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH arabic_syn, arabic_hunspell;
Postgres is now lexemizing those words to the synonym we defined in the file:
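For example, this is roughly what you should see once the configuration above is in place:
fts=# select to_tsvector('arabic', 'يثرب'), to_tsvector('arabic', 'طيبة');
-[ RECORD 1 ]---------------
to_tsvector | 'المدينة':1
to_tsvector | 'المدينة':1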
And this is an example of the query results in action:
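A minimal sketch of such a query: searching for طيبة matches a document that only mentions يثرب, because both are lexemized to المدينة:
fts=# select to_tsvector('arabic', 'هاجر النبي إلى يثرب') @@ to_tsquery('arabic', 'طيبة');
 ?column?
----------
 t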
Full text search can become very slow when your data grows large, so there are two things you should do: index the tsvector expression, and precompute the vector in a stored column.
CREATE INDEX arabic_col_idx ON table_name USING GIN (to_tsvector('arabic', col_name));
You can use a GiST index too:
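For example, the same expression indexed with GiST instead (the index name is just illustrative):
CREATE INDEX arabic_col_gist_idx ON table_name USING GIST (to_tsvector('arabic', col_name));
The second optimization is to precompute the tsvector in a stored generated column, so it isn't rebuilt on every query: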
CREATE TABLE my_table (
id serial PRIMARY KEY,
content text,
content_vector tsvector GENERATED ALWAYS AS (to_tsvector('arabic', content)) STORED
);
Now you can just use that column directly when searching.
SELECT * FROM my_table WHERE content_vector @@ to_tsquery('arabic', 'طيبة');
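You can also index the generated column directly (the index name here is just an example):
CREATE INDEX content_vector_idx ON my_table USING GIN (content_vector);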
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first - postgres.org
There are 2 ranking functions in PostgreSQL:
ts_rank(): ranks results based on the frequency of their matching lexemes
ts_rank_cd(): computes the "cover density" ranking
SELECT
name,
ts_rank_cd(to_tsvector('arabic', name), to_tsquery('arabic', 'text')) rank
FROM documents
WHERE to_tsvector('arabic', name) @@ to_tsquery('arabic', 'text')
ORDER BY rank DESC
Full text search is a powerful technique that enhances query results and the experience of your users. As we saw, support for some languages like Arabic is still very limited due to the nature of the language, but if you plan to use it for other languages, you will get very nice results, especially when you combine it with a synonym dictionary.