Version:

This documentation is sourced from a third-party project and is not maintained by pgEdge.

Text Analyzer

text analyzer has 3 main components, character filters, pre tokenizer, token filters.

character filter: It is used to filter out some characters before tokenization. e.g. to_lowercase, unicode_normalization.
pre tokenizer: It is used to split the text into tokens. e.g. unicode segmentation will split texts according to the Unicode Standard Annex #29
token filter: It is used to filter out some tokens after tokenization. e.g. stopwords, stemmer.

Character Filter

We support following character filters:

to_lowercase: Convert all characters to lowercase.
unicode_normalization: Normalize the text according to the Unicode Normalization Forms (NFC, NFD, NFKC, NFKD).

Pre Tokenizer

We support following pre tokenizers:

regex: Generate tokens by matching the regular expression.
unicode_segmentation: Split the text into tokens according to the Unicode Standard Annex #29.
jieba: Chinese text segmentation using the Jieba library.

Token Filter

We support following token filters:

skip_non_alphanumeric: Skip tokens where all characters are not alphanumeric.
stemmer: Stem tokens using the Snowball stemmer algorithm.
stopwords: Filter out tokens that are in the stop words list.
synonym: Replace tokens with their synonyms.
pg_dict: Process tokens using the PostgreSQL dictionary. You can integrate this with the PostgreSQL dictionary or other extensions that provide dictionaries.

Supported algorithms for `stemmer`

arabic, armenian, basque, catalan, danish, dutch, english_porter, english_porter2, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish

Customize dictionary for `stopwords` and `synonym`

We support customize stopwords and synonym by providing a dictionary.

-- Create a dictionary for stopwords, each line is a stopword.
SELECT create_stopwords('stop1', $$
it
is
an
$$);

SELECT tokenizer_catalog.create_text_analyzer('test_stopwords', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "stop1"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('It is an apple.', 'test_stopwords');
----
{apple}

-- Create a dictionary for synonyms, each line is a synonym.
SELECT create_synonym('syn1', $$
pgsql postgres postgresql
index indices
$$);

SELECT tokenizer_catalog.create_text_analyzer('test_synonym', $$
pre_tokenizer = "unicode_segmentation"
[[token_filters]]
synonym = "syn1"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('postgresql indices', 'test_synonym');
----
{pgsql,index}