This documentation is sourced from a third-party project and is not maintained by pgEdge.

Text Analyzer

A text analyzer has three main components: character filters, a pre tokenizer, and token filters.

  • character filter: Transforms the text before tokenization, e.g. to_lowercase, unicode_normalization.
  • pre tokenizer: Splits the text into tokens, e.g. unicode_segmentation splits text according to Unicode Standard Annex #29.
  • token filter: Removes or rewrites tokens after tokenization, e.g. stopwords, stemmer.
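As a minimal sketch of how the three stages fit together, the analyzer below lowercases the input, splits it with unicode_segmentation, and then drops tokens that contain no alphanumeric characters. The analyzer name demo_pipeline and the sample sentence are illustrative only, and the `skip_non_alphanumeric = {}` form is an assumption modeled on how `to_lowercase = {}` is written elsewhere on this page.

```sql
-- Hypothetical analyzer name; one option per pipeline stage.
SELECT tokenizer_catalog.create_text_analyzer('demo_pipeline', $$
# Pre tokenizer: split the text into tokens.
# It is written first because TOML top-level keys must precede any [[...]] table.
pre_tokenizer = "unicode_segmentation"
# Character filter: runs on the raw text before tokenization.
[[character_filters]]
to_lowercase = {}
# Token filter: runs on the tokens after tokenization.
# Assumed to take an empty table, like to_lowercase.
[[token_filters]]
skip_non_alphanumeric = {}
$$);

-- Apply the analyzer to a sample sentence.
SELECT tokenizer_catalog.apply_text_analyzer('PostgreSQL is Fast!', 'demo_pipeline');
```

Although the config lists the pre tokenizer first, the stages still run in the order described above: character filters, then the pre tokenizer, then token filters.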

Character Filter

We support the following character filters:

  • to_lowercase: Convert all characters to lowercase.
  • unicode_normalization: Normalize the text according to the Unicode Normalization Forms (NFC, NFD, NFKC, NFKD).
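A sketch of chaining both character filters. It assumes that unicode_normalization takes the normalization form as a lowercase string such as "nfkc", and that sections left out of the config (here token_filters) are optional. Both points, and the analyzer name demo_normalize, are assumptions to check against the option reference.

```sql
-- Hypothetical analyzer name. Each [[character_filters]] entry holds one filter.
SELECT tokenizer_catalog.create_text_analyzer('demo_normalize', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
unicode_normalization = "nfkc"   # assumed spelling of the form name
[[character_filters]]
to_lowercase = {}
$$);

-- The input is written in full-width Latin letters, which NFKC maps to
-- their ordinary ASCII counterparts before lowercasing.
SELECT tokenizer_catalog.apply_text_analyzer('ＰｏｓｔｇｒｅＳＱＬ', 'demo_normalize');
```

Normalizing before tokenization means visually equivalent spellings of the same word end up as the same token.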

Pre Tokenizer

We support the following pre tokenizers:

  • regex: Generate one token for each match of the regular expression.
  • unicode_segmentation: Split the text into tokens according to Unicode Standard Annex #29 (Unicode Text Segmentation).
  • jieba: Chinese text segmentation using the Jieba library.
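A sketch of a regex pre tokenizer. This page does not show how the pattern is passed, so the `pre_tokenizer.regex` key form, the pattern itself, and the analyzer name demo_regex are all assumptions. The intent is that every match of the pattern becomes one token, so identifiers containing underscores stay whole.

```sql
-- Hypothetical analyzer name and assumed key form for the regex option.
SELECT tokenizer_catalog.create_text_analyzer('demo_regex', $$
# TOML literal string (single quotes), so backslashes would not need escaping.
pre_tokenizer.regex = '[A-Za-z0-9_]+'
$$);

-- Each run of letters, digits, and underscores should come out as one token.
SELECT tokenizer_catalog.apply_text_analyzer('error_code=42; retry_count=3', 'demo_regex');
```

A regex pre tokenizer is a reasonable choice for log lines or code-like text where the Unicode word rules would split on characters you want to keep.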

Token Filter

We support the following token filters:

  • skip_non_alphanumeric: Skip tokens that contain no alphanumeric characters.
  • stemmer: Stem tokens using the Snowball stemmer algorithm.
  • stopwords: Filter out tokens that are in the stop words list.
  • synonym: Replace tokens with their synonyms.
  • pg_dict: Process tokens using a PostgreSQL dictionary. This lets you reuse dictionaries that ship with PostgreSQL or that are provided by other extensions.

Supported algorithms for stemmer

arabic, armenian, basque, catalan, danish, dutch, english_porter, english_porter2, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish
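A sketch of selecting one of the algorithms above. It assumes the stemmer filter takes the algorithm name as a string, in the same way stopwords takes a dictionary name in the examples on this page. The analyzer name demo_stemmer is illustrative.

```sql
-- Hypothetical analyzer name; the stemmer value is taken from the list above.
SELECT tokenizer_catalog.create_text_analyzer('demo_stemmer', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stemmer = "english_porter2"   # assumed to be passed as a plain string
$$);

-- Inflected forms of the same word should reduce to a shared stem.
SELECT tokenizer_catalog.apply_text_analyzer('connected connecting connections', 'demo_stemmer');
```

Lowercasing before stemming keeps the stemmer from treating capitalized and lowercase forms as different words.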

Custom dictionaries for stopwords and synonyms

You can customize stopwords and synonyms by providing a dictionary.

-- Create a dictionary for stopwords; each line is one stopword.
SELECT tokenizer_catalog.create_stopwords('stop1', $$
it
is
an
$$);

SELECT tokenizer_catalog.create_text_analyzer('test_stopwords', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "stop1"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('It is an apple.', 'test_stopwords');
----
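{apple}

-- Create a dictionary for synonyms; each line is one group of synonyms.
-- In the result below, a matching token is replaced by the first word of its group.
SELECT tokenizer_catalog.create_synonym('syn1', $$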
pgsql postgres postgresql
index indices
$$);

SELECT tokenizer_catalog.create_text_analyzer('test_synonym', $$
pre_tokenizer = "unicode_segmentation"
[[token_filters]]
synonym = "syn1"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('postgresql indices', 'test_synonym');
----
{pgsql,index}