Full Text Search in Postgresql

Full Text Search (FTS) provides the capabilites to search the natural language document with the search terms and return the results in the order of most relevant similarities documents.

Purpose of FTS is to look for a word or group of words in the collection of documents. FTS is more like a performing contains operation rather than doing exact term search. In PostgreSQL, document is a textual field in a row of the table. Field can be a made of single textual column or created by concatenating the multiple columns. To note, strings in jsonb column also supported.

Generally any text search engine should have a capability to parse the documents into tokens and converts the tokens into lexemes.

Lexemes

Tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching. Parser normalises the similar words in the documents in the different forms and eliminates the stop words. This is why FTS doesn’t support searching on stop words.

tsvector (document)

tsvector is the data type used in PostgreSQL for storing the preprocessed document. Document must be converted to tsvector format. Searching and ranking are performed on tsvector format of the document. Original document will be used only during the select statement.

To convert a document into tsvector format, to_tsvector function can be used. First argument takes the text search config name, where we use english. Search engine uses this config to generates the lexemes. For each languages, lexemes will be generated differently. Below is the sample query that uses to_tsvector function. During text search, we use this in where clause.

select to_tsvector('pg_catalog.english', coalesce(field,''));

tsquery (search terms)

tsquery type is used for search terms, that also converts the search terms into lexemes where each term in the query will be combined with boolean operators & (AND), | (OR), and ! (NOT).

SELECT to_tsquery('english', 'The & Fat & Rats'); to_tsquery --------------- 'fat' & 'rat'

match operator

@@ - represents the match operator which returns true if tsvector matches tsquery. Position doesn't matter. Both below queries are same.

1 SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat'); 
SELECT to_tsquery('fat & rat') @@ to_tsvector('fat cats ate fat rats');

Performance improvement 

With the explanation so far, it is clear that to search a document, it needs to be parsed using tsvector with specific linguistic configuration. Performing this parsing during query execution will be slower. For each record, search column needs to be converted to to_tsvector.

SELECT * FROM employee WHERE to_tsvector('english', department) @@ to_tsquery('english', 'finance');

 In this case, to speed up the query performance, search column (department) needs to be indexed; However indexing the plain task_subject column will not be helpful since we are performing the query against tsvector representation of task_subject.

So in this case creating Inverted index will improve the performance. It can be created using GIN index in Postgres. Also Postgresql supports creating functional indexes.  Below is the syntax for creating GIN (functional) index. In this way, previous query should be able to utilize the index.

CREATE INDEX idx_tsve_department ON employee USING GIN (to_tsvector('english', department);

 

But, wait still we haven't avoided the converting task_subject into tsvector format while executing the query. Do we have any mechanism to avoid that? 

Yes, lets create a separate column that stores the converted value. And handle it in the API while storing the data or create a trigger that update the column.

It is the ideal way to have a tsvector type column designated for search and created an inverted index on that column for better performance.

Prefix search 

Prefix search is one of the widely used approach. Using * this can be achieved.  This needs to be attached with query search term with : (colon).

SELECT to_tsvector('finance.insurance') @@ to_tsquery('finance:*'); 

SELECT * from employee where to_tsvector('english', department) @@ to_tsquery('english', 'rep:*');

Ranking

Ranking is another important factor to consider when performing Full Text Search. E.g. Ranking helps to measure how relevant the query search results are. Based on the ranking, most relevant records can be shown at the top of the search results. 

Function ts_rank_cd can be used to generate the ranking based on the document and search term.

syntax: ts_rank_cd(vector tsvector, query tsquery) returns float4

SELECT ts_rank_cd(to_tsvector('english', department), to_tsquery('english', 'rep:*')) as subjectrank, * FROM employee WHERE to_tsvector('english', department) @@ to_tsquery('english', 'rep:*') ORDER BY subjectrank desc;

What if we want to perform search on multiple columns?

Since we will be creating a separate column of type tsvector, on each insert or update operation, lets concatenate all the relevant columns into tsvector and store in the specialised search column.

Below script will do all the respective operations for full text search.

  • Create a search column of type tsvector.

    ALTER TABLE employee ADD COLUMN tsv_department_search tsvector;
  • Create GIN index

    CREATE INDEX idx_tsv_departmenr_search ON employee USING GIN (tsv_department_search);
  • Create trigger function to update the search column. If we want to set different weightages for each column, we could set this in trigger function. setweight function can be used to set the weightage of each search term.

    CREATE FUNCTION department_update_trigger() RETURNS trigger AS $$ begin new.tsv_department_search := to_tsvector('pg_catalog.english', coalesce(new.department,'')) || to_tsvector('pg_catalog.english', coalesce(new.subdepartment,'{}')); return new; end $$ LANGUAGE plpgsql;
  • Create trigger based on above trigger function.

    CREATE TRIGGER tr_tsv_department_search BEFORE INSERT OR UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE department_update_trigger();

Once above scripts are executed, full text search can be performed using query like below.

SELECT * FROM employee WHERE tsv_department_search @@ to_tsquery('finance:* & insurance);

Note: Postgres supports searching in jsonb column where function to_tsvector or json(b)_to_tsvector reduce each string value in the document to a tsvector, and then concatenate those in document order to produce a single tsvector. However, behaviour on datetime is unpredictable. It is not a general practice to perform FTS on datetime. For more information on function refer https://www.postgresql.org/docs/11/functions-textsearch.html.

Comments