Semantic search using OpenAI, pg_embedding and Neon
Learn how to build semantic search experiences using OpenAI, Neon and pg_embedding
A few weeks back, we published an AI-powered app where you can submit an idea for a startup and get a list of similar companies Y Combinator has invested in. The app got attention on Hacker News and Twitter, resulting in 5,000+ visitors and 2,500+ submissions.
If you haven’t had a chance to try it out, go to neon.tech/ycmatcher.
How the app works
The app uses semantic search, a search technique that understands the meaning behind a user’s search query. This is more powerful than keyword-based search, which looks for exact string matches.
For example, consider these two search queries:
- “Ride-sharing for friends.”
- “An app that allows you to get from point A to point B, and you split the fee.”
Even though the two queries have nothing in common from a lexical perspective, the app returns similar results for both. This is possible because we rely on vector embeddings and vector similarity search.
What are vector embeddings?
A vector embedding is a vector (list) of floating point numbers. We can use it to represent unstructured data (e.g., text, images, audio, or other types of information).
What’s powerful about embeddings is that they capture the meaning behind the text and can be used to measure how related two text strings are. The smaller the distance between two vectors, the more related they are, and vice versa.
Consider the following three sentences summarizing different books:
- “A young wizard attends a magical school and battles against an evil dark lord.”
- “A group of friends embarks on an adventurous journey to destroy a powerful ring and save the world.”
- “A detective investigates a series of mysterious murders in a small town.”
Suppose, for illustration, that these summaries map to the following embeddings (real embeddings have far more dimensions):
- Summary #1 → [0.1, 0.1, 0.1]
- Summary #2 → [-0.2, 0.2, 0.3]
- Summary #3 → [0.3, -0.2, 0.4]
If we want to find out which two summaries are most related, we can calculate the distance between each pair of embeddings (#1 and #2, #2 and #3, #1 and #3). The two embeddings closest to each other distance-wise are the most similar.
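The pairwise comparison described above can be sketched in a few lines of Python. This is only an illustration: it uses the toy three-dimensional vectors from the example and plain Euclidean distance, which is one of several possible distance measures.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+

# Toy 3-dimensional embeddings from the example above (not real model output).
embeddings = {
    1: [0.1, 0.1, 0.1],
    2: [-0.2, 0.2, 0.3],
    3: [0.3, -0.2, 0.4],
}

# Distance for every pair of summaries: (1, 2), (1, 3), (2, 3).
pair_distances = {
    (a, b): dist(embeddings[a], embeddings[b])
    for a, b in combinations(embeddings, 2)
}

# The pair with the smallest distance is the most related.
closest_pair = min(pair_distances, key=pair_distances.get)

for pair, d in sorted(pair_distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
print("closest:", closest_pair)
```

With these toy values, summaries #1 and #2 (the two fantasy stories) end up closest under Euclidean distance.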
At a high level, this is how the YC idea matcher app works:
- Convert the user’s search query into an embedding
- Go through every company description and return the most similar ones
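As a rough sketch of those two steps, the snippet below ranks a handful of companies by their distance to a query embedding. The company names and vectors are hypothetical, hand-written values; in the real app the embeddings come from a model and the ranking happens in the database.

```python
import heapq
from math import dist


def most_similar(query_embedding, companies, k=5):
    """Return the k companies whose embeddings are closest to the query."""
    return heapq.nsmallest(
        k, companies, key=lambda c: dist(query_embedding, c["embedding"])
    )


# Hypothetical data for illustration only.
companies = [
    {"name": "RideCo", "embedding": [0.9, 0.1, 0.0]},
    {"name": "FoodCo", "embedding": [0.0, 0.8, 0.3]},
    {"name": "CarpoolCo", "embedding": [0.8, 0.2, 0.1]},
]
query_embedding = [1.0, 0.0, 0.0]  # stands in for the embedded search query

top = most_similar(query_embedding, companies, k=2)
print([c["name"] for c in top])
```

Scanning every row like this is fine for a small data set; larger ones usually rely on an index.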
Generating vector embeddings using OpenAI
One way to generate embeddings is by using OpenAI’s Embeddings API. You send a text string to an API endpoint, and it returns the corresponding vector embedding.
The text-embedding-ada-002 model has an output dimension of 1536. This means that the embedding array included in the response will have a size of 1536. There are other embedding models out there that you can use. However, keep in mind that two vectors can only be compared if they have the same number of dimensions, so the stored embeddings and the query embedding should come from the same model.
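A minimal sketch of such a call, using only Python’s standard library, might look like the following. The helper names (build_embedding_request, extract_embedding, embed) are made up for this example, and the request and response shapes follow OpenAI’s public Embeddings API as documented at the time of writing; check the current API reference before relying on them.

```python
import json
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(text, api_key, model="text-embedding-ada-002"):
    """Build (but do not send) a POST request for the Embeddings API."""
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_embedding(response_json):
    """Pull the vector out of a parsed Embeddings API response."""
    return response_json["data"][0]["embedding"]


def embed(text, api_key):
    """Send the request and return the embedding (needs network and a valid key)."""
    with urllib.request.urlopen(build_embedding_request(text, api_key)) as resp:
        return extract_embedding(json.load(resp))
```

Calling embed("Ride-sharing for friends.", api_key) with a valid key should return a list of 1536 floats for this model.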
Vector similarity search in Postgres using pg_embedding
The process of representing data as embeddings and calculating the similarity between one or more items is known as vector search (or similarity search). It has a wide range of applications:
- Information Retrieval: you can retrieve relevant information based on the meaning of a user’s query rather than its exact wording. (This is what YC idea matcher does.)
- Natural Language Processing: since embeddings capture the meaning of the text, you can use them to classify text and run sentiment analysis.
- Recommendation Systems: you can recommend similar items based on a given set of items (e.g., movies, products, or books).
- Anomaly Detection: since you can determine the similarity between items in a given dataset, you can determine items that don’t belong.
Storing and retrieving vector embeddings can be done in Postgres. This is incredibly useful because it eliminates the need to introduce an external vector store when building AI and LLM applications if you’re already using Postgres.
YC idea matcher uses pg_embedding. However, since pg_embedding is compatible with pgvector, the YC idea matcher app can easily be converted to use pgvector instead.
To get started:
1. Enable pg_embedding
2. Create a column for storing vector data (this step was done in the data collection step for YC idea matcher)
3. Run similarity search queries
For example, the query SELECT id FROM documents ORDER BY embedding <=> ARRAY[1.1, 2.2, 3.3] LIMIT 1; retrieves the ID from the documents table, sorts the results by the shortest distance between the embedding column and the array [1.1, 2.2, 3.3], and returns only the first result.
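The three steps can be sketched as SQL statements held in Python strings, so that any Postgres client could run them. The documents table and embedding column come from the description above; the extension name (embedding) and the real[] column type follow the pg_embedding README as best I recall, so treat them as assumptions and confirm against the extension’s documentation. The helper names are invented for this example.

```python
# Step 1: enable the extension. Step 2: create a column for vector data.
ENABLE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS embedding;"
CREATE_TABLE = "CREATE TABLE documents (id integer PRIMARY KEY, embedding real[]);"


def to_pg_array(vector):
    """Format a Python list of numbers as a Postgres array literal such as {1.1,2.2,3.3}."""
    return "{" + ",".join(repr(float(x)) for x in vector) + "}"


def nearest_documents_sql(vector, limit=1):
    """Step 3: build the similarity query using the <=> (cosine) operator."""
    return (
        "SELECT id FROM documents "
        f"ORDER BY embedding <=> '{to_pg_array(vector)}'::real[] "
        f"LIMIT {int(limit)};"
    )


print(ENABLE_EXTENSION)
print(CREATE_TABLE)
print(nearest_documents_sql([1.1, 2.2, 3.3]))
```

Inlining the vector is safe here only because every element is forced through float(); application code would normally pass it as a query parameter instead.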
The <=> operator calculates the cosine distance between two vectors. pg_embedding also supports Euclidean (L2) distance using the <-> operator and Manhattan distance using the <~> operator. Cosine distance works best when comparing vector embeddings that represent text.
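To make the difference between the three metrics concrete, here are plain-Python reference formulas for each, applied to the toy book-summary vectors from earlier. These are textbook definitions, not pg_embedding’s implementation, and the operator noted beside each function simply mirrors the mapping described above.

```python
from math import sqrt


def l2_distance(a, b):  # corresponds to <->
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):  # corresponds to <=>
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a, b):  # corresponds to <~>
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy vectors from the book-summary example (not real model output).
summaries = {1: [0.1, 0.1, 0.1], 2: [-0.2, 0.2, 0.3], 3: [0.3, -0.2, 0.4]}
pairs = [(1, 2), (1, 3), (2, 3)]


def closest_pair(metric):
    """Return the pair of summaries with the smallest distance under the metric."""
    return min(pairs, key=lambda p: metric(summaries[p[0]], summaries[p[1]]))


for name, metric in [
    ("L2", l2_distance),
    ("cosine", cosine_distance),
    ("Manhattan", manhattan_distance),
]:
    print(name, closest_pair(metric))
```

On these toy vectors, L2 and Manhattan distance pick summaries #1 and #2 as the closest pair, while cosine distance picks #1 and #3, which shows that the choice of operator can change the ranking.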
Code deep dive
You can find the app’s source code on GitHub. In this section, we’ll cover how it works.
Gathering the data
The first step was to gather company data from the Y Combinator public API. The API returns the data in the following format:
- companies: an array of objects, where each object contains data about a specific company
- nextPage: the API endpoint URL with the page number specified as a query parameter
- page: the current page number
- totalPages: the total number of pages
There are 177 pages in total. Each page returns an array of 25 companies, except the last page, which returns three, so there are 4,403 companies in total (176 × 25 + 3). However, some companies didn’t have a long description, so we removed them.
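Walking through a paginated response like this one can be sketched as a small generator. The field names come from the description above; the URL, the fake_fetch_page stand-in, and the stop condition (comparing page with totalPages) are assumptions made so the example runs without network access.

```python
def iter_companies(fetch_page, first_url):
    """Yield every company by following nextPage links until the last page."""
    url = first_url
    while url:
        page = fetch_page(url)  # expected keys: companies, nextPage, page, totalPages
        yield from page["companies"]
        url = page["nextPage"] if page["page"] < page["totalPages"] else None


def fake_fetch_page(url):
    """Stand-in for an HTTP GET that returns parsed JSON (hypothetical URL scheme)."""
    page = int(url.rsplit("=", 1)[1])
    size = 25 if page < 177 else 3  # the last page holds three companies
    return {
        "companies": [{"id": (page - 1) * 25 + i} for i in range(size)],
        "nextPage": f"https://example.com/companies?page={page + 1}",
        "page": page,
        "totalPages": 177,
    }


total = sum(
    1 for _ in iter_companies(fake_fetch_page, "https://example.com/companies?page=1")
)
print(total)
```

With 176 full pages and a final page of three, the generator yields 4,403 companies, matching the count above.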
We then wrote a script that went through each page, and for each company’s long description, we generated an embedding and stored the company data in a Neon database.
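In outline, such a script could look like the sketch below. It is not the script we ran: the longDescription field name is a guess at the API’s schema, and embed and store are placeholders for an Embeddings API call and a database insert.

```python
def ingest(companies, embed, store):
    """Embed each company's long description and store it; return the number stored."""
    stored = 0
    for company in companies:
        # "longDescription" is a hypothetical field name.
        description = (company.get("longDescription") or "").strip()
        if not description:
            continue  # skip companies without a long description
        store(company, embed(description))
        stored += 1
    return stored


# Stand-ins so the sketch runs without OpenAI or a database.
companies = [
    {"name": "A", "longDescription": "Ride-sharing for friends."},
    {"name": "B", "longDescription": ""},
    {"name": "C"},
]
rows = []
stored = ingest(
    companies,
    embed=lambda text: [float(len(text))],  # fake one-dimensional "embedding"
    store=lambda company, embedding: rows.append((company["name"], embedding)),
)
print(stored, rows)
```

Passing embed and store in as functions keeps the loop easy to test and makes it simple to swap the embedding model or the storage layer later.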
Once we had the data, the last step was building the app. We used Next.js, a React framework for building full-stack web applications.
Building the UI and API
On the frontend, a form captures user submissions, tracks the submit event using Vercel analytics, and sends a POST request to an API endpoint.
We then render the API response, which is a list of companies. You can check out the frontend code in the page.tsx file.
As for the API, the endpoint is a serverless edge function deployed to the us-east-1 region (Washington, D.C., USA) on Vercel. We specified this region since it’s the same region where the Neon database is deployed.
Lastly, we generate an embedding for the user-submitted query and return the five most similar companies as JSON.
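The flow of that last step can be sketched in Python as follows. The real endpoint lives in the Next.js app, so this only mirrors the logic: the companies table, its column names, and the %s placeholder (the style used by psycopg) are all assumptions for illustration.

```python
import json

# Hypothetical table and column names; %s is a psycopg-style placeholder.
TOP_COMPANIES_SQL = (
    "SELECT name, long_description FROM companies "
    "ORDER BY embedding <=> %s::real[] LIMIT 5;"
)


def handle_search(query, embed, run_query):
    """Embed the query, fetch the five nearest companies, and return them as JSON."""
    query_embedding = embed(query)
    rows = run_query(TOP_COMPANIES_SQL, (query_embedding,))
    return json.dumps({"companies": rows})


# Stand-ins so the sketch runs without OpenAI or a database.
def fake_embed(text):
    return [0.1, 0.2, 0.3]


def fake_run_query(sql, params):
    return [{"name": f"Company {i}", "long_description": "..."} for i in range(1, 6)]


response_body = handle_search("Ride-sharing for friends.", fake_embed, fake_run_query)
print(response_body)
```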
The end result is fast since it uses a small data set. However, it’s important to create an index when working with a large amount of data (e.g., tens of thousands of rows or more). You can check out pg_embedding’s documentation to learn how to create an HNSW index to optimize search behavior.
Also, if you’re new to Neon, you can sign up for free.