The unaccent
extension for Postgres enables handling of text data in a more user-friendly and language-tolerant way. It allows you to remove accents/stress (diacritic signs) from text strings, making it easier to perform searches and comparisons that are insensitive to accents. This is particularly useful in multilingual applications where users might not consistently use accents when typing search queries.
Imagine a user searching for "Hôtel" but only typing "Hotel". Without unaccent
, the database might not find the intended results. With unaccent
, you can ensure that searches are more forgiving and return relevant results regardless of accent variations.
Try it on Neon!
Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.
This guide will walk you through the essentials of using the unaccent
extension. You'll learn how to enable the extension on Neon, understand its key concepts, use it in queries, optimize performance with indexing, and consider its limitations.
unaccent
extension
Enable the You can enable the extension by running the following CREATE EXTENSION
statement in the Neon SQL Editor or from a client such as psql that is connected to your Neon database.
CREATE EXTENSION IF NOT EXISTS unaccent;
Version availability:
Please refer to the list of all extensions available in Neon for up-to-date extension version information.
unaccent()
Removing accents with The primary function provided by the unaccent
extension is unaccent()
. This function takes a text input and returns the same text with accents removed.
Let's see it in action with a few examples:
SELECT unaccent('Hôtel');
-- Hotel
SELECT unaccent('cliché');
-- cliche
SELECT unaccent('naïve');
-- naive
SELECT unaccent('café');
-- cafe
SELECT unaccent('Déjà vu');
-- Deja vu
As you can see, unaccent()
effectively strips the diacritics, transforming words with accents into their unaccented counterparts. This transformation is based on a set of rules that are configurable, allowing for customization to suit specific language needs.
Practical usage examples
unaccent
is most commonly used to enhance text searching, making it more forgiving and user-friendly. Let's explore some typical use cases:
Basic accent-insensitive searching
Imagine you have a product catalog and want users to be able to search for products regardless of whether they use accents or not. For instance, a user might search for "cafe" or "café" and expect to find products containing "café".
Consider a products
table with the following data:
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT
);
INSERT INTO products (name) VALUES
('cafe'),
('café'),
('Café'),
('Café au lait');
Without unaccent
, a simple WHERE
clause would differentiate between accented and unaccented characters:
SELECT * FROM products WHERE name = 'café';
id | name
----|------
2 | café
SELECT * FROM products WHERE name = 'cafe';
id | name
----|------
1 | cafe
By applying unaccent()
to both the column and the search term, you can achieve accent-insensitive matching:
SELECT * FROM products WHERE unaccent(name) = unaccent('cafe');
id | name
---|------
1 | cafe
2 | café
ILIKE
Case-insensitive and accent-insensitive searching with For even more flexible searching, you can combine unaccent()
with the ILIKE
operator for case-insensitive and accent-insensitive searches. This is particularly useful for free-text search scenarios.
SELECT * FROM product WHERE unaccent(name) ILIKE unaccent('%cafe%');
id | name
---+------
1 | cafe
2 | café
3 | Café
4 | Café au lait
In this example, ILIKE
handles case-insensitivity (matching 'cafe', 'Cafe', etc.), and unaccent()
ensures that accents are ignored during the comparison. Applying unaccent()
to both sides of the ILIKE
condition is crucial for this to work effectively.
Integration with Full-Text search
While unaccent()
can be used directly in WHERE
clauses, its true power for search applications is realized when integrated into with Postgres full-text search capabilities. unaccent
is designed as a text search dictionary. By incorporating it into your text search configurations, you can ensure that indexing and searching operations automatically handle accent removal.
This involves creating or modifying text search configurations to include the unaccent
dictionary in the analysis process. When text is indexed and queried using such a configuration, accents are automatically stripped, leading to efficient and accent-insensitive full-text searches.
Configuration Modifications in Neon
It's important to note that because unaccent
is managed by Neon, modifying the default unaccent.rules
file or other configuration settings requires administrative privileges that are not available to Neon users. If you have specific needs for customized unaccent
rules or configurations, please open a support ticket to discuss your requirements with Neon support.
Performance and Indexing Considerations
Using unaccent()
in queries can have performance implications, especially in large tables. Applying functions in WHERE
clauses often prevents the database from efficiently using standard indexes.
unaccent()
Indexing with Directly indexing an expression like unaccent(column)
typically doesn't work efficiently because, by default, unaccent()
is not marked as an IMMUTABLE
function. Postgres requires functions used in index expressions to be IMMUTABLE
to guarantee consistent index usage.
To enable indexing with unaccent()
, you can create an IMMUTABLE
wrapper function around it. This wrapper function essentially tells Postgres that the function's output will always be the same for a given input, allowing it to be used in index expressions.
Here's an example of creating an IMMUTABLE
wrapper function:
CREATE OR REPLACE FUNCTION f_unaccent(text) RETURNS text
AS $$
SELECT public.unaccent('public.unaccent', $1);
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;
Explanation:
CREATE OR REPLACE FUNCTION f_unaccent(text) RETURNS text
: This defines a new function namedf_unaccent
that takes text as input and returns text.AS $$ ... $$ LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
: This is the function body, written in SQL, and the important part is declaring itIMMUTABLE
.SELECT public.unaccent('public.unaccent', $1);
: Inside the wrapper, we are calling the originalunaccent
function, making sure to schema-qualify it aspublic.unaccent
for robustness.
Once you have this IMMUTABLE
wrapper function, you can create indexes on it:
CREATE INDEX idx_products_name_unaccent ON products (f_unaccent(name));
Now, queries using f_unaccent(name)
in the WHERE
clause can effectively utilize this index, significantly improving performance for accent-insensitive searches.
SELECT * FROM products WHERE f_unaccent(name) = f_unaccent('cafe');
-- This query can now use the 'idx_products_name_unaccent' index
Alternative: Generated columns
Another strategy for optimizing performance is to use generated columns. You can add a new column to your table that stores the unaccented version of your text data. This column can then be indexed directly and queried efficiently.
ALTER TABLE products ADD COLUMN name_unaccented text
GENERATED ALWAYS AS (f_unaccent(name)) STORED;
CREATE INDEX idx_products_name_unaccent_generated ON products (name_unaccented);
SELECT * FROM products WHERE name_unaccented = 'cafe';
-- This query will use the 'idx_products_name_unaccent_generated' index to find rows where unaccented 'name' matches 'cafe'
Generated columns add storage overhead but can offer performance benefits for read-heavy workloads.
Limitations
While unaccent
is very useful, it's important to be aware of its limitations:
- Rule-based:
unaccent
operates based on a predefined set of rules defined in its configuration file (unaccent.rules
). The effectiveness of accent removal depends on the completeness and accuracy of these rules for your target languages. - Language specificity: The default rules are primarily geared towards European languages. For languages with different diacritic systems or complex character transformations, the default rules might not be sufficient, and customization of the
unaccent.rules
file might be required. - No contextual understanding:
unaccent
performs a character-by-character transformation based on its rules. It does not understand the context or meaning of words. In some cases, this might lead to over-simplification or loss of subtle distinctions in meaning that accents might convey in certain languages.
Conclusion
The unaccent
extension is a valuable tool for handling text data in Postgres, especially in multilingual applications where accent-insensitive searching is essential. By enabling unaccent
, you can ensure that your database is more user-friendly and tolerant of accent variations in search queries.
Resources
Need help?
Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more details, see Getting Support.