Build multi-tenant RAG with Neon's database-per-user model — no nosy neighbors, max isolation, minimal costs

Postgres tsvector data type

Optimize full-text search in Postgres with the tsvector data type

tsvector is a specialized Postgres data type designed for full-text search operations. It represents a document in a form optimized for text search, where each word is reduced to its root form (lexeme) and stored with information about its position and importance.

In Postgres, the tsvector data type is useful for implementing efficient full-text search capabilities, allowing for fast and flexible searching across large volumes of text data.

Try it on Neon!

Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.

Sign Up

Storage and syntax

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Each lexeme can be followed by position(s) and/or weight(s).

The general syntax for a tsvector is:

'word1':1,3 'word2':2A ...

Where:

  • word1, word2, etc., are the lexemes
  • 1, 3, etc. are integers indicating the position of the word in the document
  • positions can sometimes be followed by a letter to indicate a weight ('A', 'B', 'C' or 'D'), like 2A. The default weight is 'D'.

For example:

  • 'a':1A 'cat':2 'sat':3 'on':4 'the':5 'mat':6

When a document is cast to tsvector, it doesn't perform any normalization and just splits the text into lexemes. To normalize the text, you can use the to_tsvector function with a specific text search configuration. For example:

SELECT
    'The quick brown fox jumps over the lazy dog.'::tsvector as colA,
    to_tsvector('english', 'The quick brown fox jumps over the lazy dog.') as colB;

This query produces the following output. The function to_tsvector() tokenizes the input document and computes the normalized lexemes based on the specified text search configuration (in this case, 'english'). The output is a tsvector with the normalized lexemes and their positions.

cola                              |                         colb
----------------------------------------------------------------+-------------------------------------------------------
 'The' 'brown' 'dog.' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the' | 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
(1 row)

Example usage

Consider a scenario where we're building a blog platform and want to implement full-text search for articles. We'll use tsvector to store the searchable content of each article.

The query below creates a table and inserts some sample blog data:

CREATE TABLE blog_posts (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    search_vector tsvector
);

INSERT INTO blog_posts (title, content)
VALUES
    ('PostgreSQL Full-Text Search', 'PostgreSQL offers powerful full-text search capabilities using tsvector and tsquery.'),
    ('Indexing in Databases', 'Proper indexing is crucial for database performance. It can significantly speed up query execution.'),
    ('ACID Properties', 'ACID (Atomicity, Consistency, Isolation, Durability) properties ensure reliable processing of database transactions.');

UPDATE blog_posts
SET search_vector = to_tsvector('english', title || ' ' || content);

CREATE INDEX idx_search_vector ON blog_posts USING GIN (search_vector);

To search for blog posts containing specific words, we can use the match operator @@, with a tsquery search expression:

SELECT title
FROM blog_posts
WHERE search_vector @@ to_tsquery('english', 'database & performance');

This query returns the following output:

title
-----------------------
 Indexing in Databases
(1 row)

Other examples

Use different text search configurations with tsvector

Postgres supports text search configurations for multiple languages. Here's an example using the 'spanish' configuration:

CREATE TABLE product_reviews (
    id SERIAL PRIMARY KEY,
    product_name TEXT NOT NULL,
    review TEXT NOT NULL,
    search_vector tsvector
);

INSERT INTO product_reviews (product_name, review)
VALUES
    ('Laptop XYZ', 'Este laptop es muy rápido y tiene una excelente batería.'),
    ('Smartphone ABC', 'La cámara del teléfono es increíble, pero la batería no dura mucho.'),
    ('Tablet 123', 'La tablet es ligera y fácil de usar, perfecta para leer libros.');

UPDATE product_reviews
SET search_vector = to_tsvector('spanish', product_name || ' ' || review);

SELECT product_name
FROM product_reviews
WHERE search_vector @@ to_tsquery('spanish', 'batería & (excelente | dura)');

This query returns the following output:

product_name
----------------
 Laptop XYZ
 Smartphone ABC
(2 rows)

Rank the search results from a tsvector column

We can use the ts_rank function to rank search results based on relevance:

CREATE TABLE news_articles (
    id SERIAL PRIMARY KEY,
    headline TEXT NOT NULL,
    body TEXT NOT NULL,
    search_vector tsvector
);

INSERT INTO news_articles (headline, body)
VALUES
    ('Climate Change Summit Concludes', 'World leaders agreed on new measures to combat global warming at the climate summit.'),
    ('New Study on Climate Change', 'Scientists publish groundbreaking research on the effects of climate change on biodiversity.'),
    ('Tech Giant Announces Green Initiative', 'Major tech company pledges to be carbon neutral by 2030 in fight against climate change.');

UPDATE news_articles
SET search_vector = to_tsvector('english', headline || ' ' || body);

SELECT headline, ts_rank(search_vector, query) AS rank
FROM news_articles, to_tsquery('english', 'climate & change') query
WHERE search_vector @@ query
ORDER BY rank DESC;

This query returns the following output:

headline                |    rank
---------------------------------------+------------
 New Study on Climate Change           |  0.2532141
 Climate Change Summit Concludes       | 0.10645772
 Tech Giant Announces Green Initiative | 0.09910322
(3 rows)

All the articles were related to climate change, but the first article was ranked higher due to the higher relevance for the search terms.

Additional considerations

  • Performance: While tsvector enables fast full-text search, creating and updating tsvector columns can be computationally expensive. Consider using triggers or background jobs to update tsvector columns asynchronously.
  • Storage: tsvector columns can significantly increase the size of your database. Monitor your database size and consider using partial indexes if full-text search is only needed for a subset of your data.
  • Language support: PostgreSQL supports many languages out of the box, but you may need to install additional dictionaries for some languages.
  • Stemming and stop words: The text search configuration determines how words are stemmed and which words are ignored as stop words. Choose the appropriate configuration for your use case.

Resources

Need help?

Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more details, see Getting Support.

Last updated on

Was this page helpful?