A common SaaS pattern involves deploying a single application environment shared among all users. However, retrieval-augmented generation (RAG) applications present additional reliability and security challenges because they rely heavily on user-provided information and unpredictable model APIs. This article covers a multi-tenant RAG pipeline example (available on GitHub) that leverages Neon’s database-per-user design and the multi-tenancy features of the Inngest workflow engine.
The Noisy Neighbor Problem: RAG Edition
The noisy neighbor problem can be described as a situation:
“where a few customers make up a disproportionate fraction of the system’s overall activity, and other customers suffer higher latencies, locks, and unavailabilities.”
Because RAG applications rely heavily on data input, vector databases, and external API calls (models and tools), the noisy neighbor problem can arise at two levels:
- At the database level, an organization’s usage or amount of data can impact the overall performance of the application.
- At the RAG execution level, a spike in usage can lead to capacity issues (e.g., rate limits), slowing down or blocking other organizations’ RAG executions.
Cost and data security considerations add to the noisy neighbor problem:
- One organization’s usage should not lead to an unwanted surge in LLM cost
- One organization should not be able to access another organization’s data (database or vector)
This is where the concept of multi-tenancy, which consists of isolating the data of each organization and applying guaranteed capacity, comes into play:
- Each organization should only get access to its own contacts
- One organization’s usage of the RAG pipeline should not impact the capacity of others
Our example application is a CRM that lets users from different organizations upload contact CSV files. The contacts are enriched and embedded into a vector store, which is later used for semantic search.
Each organization is assigned dedicated application and vector Neon databases and benefits from guaranteed capacity through Inngest workflows.
Let’s walk through its implementation.
Data Isolation: Dedicated Neon Projects Per Tenant
Let’s first look at how our application isolates each workspace’s application and vector data into dedicated databases.
Database architecture
To provide complete data isolation and guaranteed database performance, our application leverages Neon’s “database-per-user” design, where each organization gets a new Neon project containing contacts and embeddings tables.
For the sake of simplicity, our demo application exposes a /api/setup endpoint that uses the Neon API to create two workspaces for testing the application.
In a real-world application, each workspace creation sets up the Neon project and creates the schema, as covered in the db-per-tenant repository. Creating a new workspace and its associated databases only takes a few seconds and can be easily integrated into a Next.js API route.
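The workspace creation step described above can be sketched as follows. This is a minimal sketch, not the code from the demo or the db-per-tenant repository: `provisionWorkspace`, `HttpPost`, `RunSql`, and the table columns are hypothetical names chosen for illustration, and the request and response shapes assume Neon’s create-project endpoint, so check them against the Neon API reference before relying on them.

```typescript
// Hypothetical transport so the sketch stays dependency-free; a real app
// would wrap fetch() or an HTTP client here.
type HttpPost = (
  url: string,
  headers: Record<string, string>,
  body: string,
) => Promise<string>;

// Hypothetical SQL runner: executes one statement against a connection string.
type RunSql = (connectionString: string, statement: string) => Promise<void>;

interface ProvisionedWorkspace {
  workspaceId: string;
  projectId: string;
  connectionString: string;
}

// Illustrative schema only; the demo's actual columns may differ.
const WORKSPACE_SCHEMA: string[] = [
  "CREATE EXTENSION IF NOT EXISTS vector",
  "CREATE TABLE IF NOT EXISTS contacts (id SERIAL PRIMARY KEY, name TEXT, email TEXT)",
  "CREATE TABLE IF NOT EXISTS embeddings (id SERIAL PRIMARY KEY, content TEXT, embedding vector(1536))",
];

async function provisionWorkspace(
  workspaceId: string,
  neonApiKey: string,
  post: HttpPost,
  runSql: RunSql,
): Promise<ProvisionedWorkspace> {
  // Assumed request/response shape of Neon's "create project" endpoint.
  const raw = await post(
    "https://console.neon.tech/api/v2/projects",
    { Authorization: `Bearer ${neonApiKey}`, "Content-Type": "application/json" },
    JSON.stringify({ project: { name: `workspace-${workspaceId}` } }),
  );
  const created = JSON.parse(raw) as {
    project?: { id?: string };
    connection_uris?: { connection_uri: string }[];
  };
  const projectId = created.project?.id;
  const connectionString = created.connection_uris?.[0]?.connection_uri;
  if (!projectId || !connectionString) {
    throw new Error(`Could not provision a database for workspace ${workspaceId}`);
  }
  // Create the per-tenant tables inside the new project, one statement at a time.
  for (const statement of WORKSPACE_SCHEMA) {
    await runSql(connectionString, statement);
  }
  return { workspaceId, projectId, connectionString };
}
```

Injecting the HTTP and SQL functions keeps the sketch testable; inside a Next.js API route they would be backed by a real HTTP client and a Postgres driver.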
Note that Neon’s database-per-user design differs from regular isolated database architecture. By leveraging Neon projects, our end-users benefit from complete data isolation without the usual associated cost of duplicated architecture.
Querying multi-tenant databases
Neon’s serverless SDK makes connecting to a database on the fly easy. Our project provides a getTenantConnectionString(workspaceId) helper that retrieves the Neon project associated with a given workspace.
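As a rough illustration of what such a helper might look like, here is a minimal sketch that resolves a workspace to its connection string. It assumes a hypothetical in-memory registry (`tenantRegistry`, `registerTenant`, `TenantRecord`); the demo’s actual helper may look up the project differently, for example through the Neon API.

```typescript
interface TenantRecord {
  projectId: string;
  connectionString: string;
}

// Hypothetical in-memory registry; a real app would persist this mapping
// or resolve it from the Neon API.
const tenantRegistry = new Map<string, TenantRecord>();

function registerTenant(workspaceId: string, record: TenantRecord): void {
  tenantRegistry.set(workspaceId, record);
}

function getTenantConnectionString(workspaceId: string): string {
  const record = tenantRegistry.get(workspaceId);
  if (!record) {
    // Fail closed: never fall back to a shared or default database.
    throw new Error(`Unknown workspace: ${workspaceId}`);
  }
  return record.connectionString;
}
```

Throwing on an unknown workspace, rather than returning a default, keeps a lookup mistake from silently routing one tenant’s query to another database.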
This helper is used to connect dynamically to each workspace’s application and vector databases.
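One way to picture the dynamic connection is a small resolver that opens, and then reuses, one client per workspace. This is a sketch under assumptions: `createTenantDbResolver`, `TenantQuery`, `ConnectFn`, and `listContacts` are hypothetical names, and the driver is injected rather than imported, so a thin wrapper around Neon’s serverless driver could be passed in as `connect`.

```typescript
// Minimal query function shape used for this sketch.
type TenantQuery = (text: string, params: unknown[]) => Promise<unknown[]>;

// Hypothetical driver factory: turns a connection string into a query function.
type ConnectFn = (connectionString: string) => TenantQuery;

function createTenantDbResolver(
  getConnectionString: (workspaceId: string) => string,
  connect: ConnectFn,
): (workspaceId: string) => TenantQuery {
  // One client per workspace, created lazily on first use.
  const clients = new Map<string, TenantQuery>();
  return (workspaceId) => {
    let client = clients.get(workspaceId);
    if (!client) {
      client = connect(getConnectionString(workspaceId));
      clients.set(workspaceId, client);
    }
    return client;
  };
}

// Every query names its workspace, so it can only reach that tenant's database.
async function listContacts(
  db: (workspaceId: string) => TenantQuery,
  workspaceId: string,
): Promise<unknown[]> {
  return db(workspaceId)("SELECT id, name FROM contacts ORDER BY id", []);
}
```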
By leveraging getTenantConnectionString(), our RAG now benefits from guaranteed database performance and data isolation for each workspace.
Let’s now see how our RAG guarantees fair processing performance by applying throttling and concurrency per workspace.
Noisy Neighbor Prevention: Guaranteed RAG Workflow Capacity
Our RAG relies on two rate-limited and costly third-party APIs: the SERP and OpenAI APIs.
To protect our RAG from spikes in usage that could affect the application’s performance and cost, it is implemented with Inngest workflows and Inngest’s multi-tenancy features.
Our first AI workflow performs enrichment (using the SERP and OpenAI APIs) and generates an embedding document for each contact in the CSV file.
The contact enrichment workflow guarantees capacity through a workflow-level throttling configuration keyed on event.data.workspaceId.
The throttling configuration assigns each workspace its own capacity, matching the OpenAI and SERP rate limits.
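To make the per-workspace throttling concrete, here is a minimal sketch. The settings object mirrors the shape of Inngest’s throttle options (`limit`, `period`, `key`), but its values are illustrative and not taken from the demo. The `createKeyedThrottle` function is a hypothetical toy model using a fixed-window counter per key; it is not Inngest’s actual scheduling algorithm, which queues excess runs rather than rejecting them.

```typescript
// Options shape mirroring Inngest's throttle settings.
interface ThrottleSettings {
  limit: number; // runs allowed to start per period, per key
  period: string; // e.g. "10s"
  key: string; // expression evaluated against the triggering event
}

// Illustrative values; tune them to your own OpenAI and SERP quotas.
const enrichContactThrottle: ThrottleSettings = {
  limit: 10,
  period: "10s",
  key: "event.data.workspaceId",
};

// Toy model of keyed throttling: a fixed-window counter per key (limit >= 1).
function createKeyedThrottle(limit: number, periodMs: number) {
  const windows = new Map<string, { startedAt: number; count: number }>();
  return function tryStart(key: string, nowMs: number): boolean {
    const current = windows.get(key);
    if (!current || nowMs - current.startedAt >= periodMs) {
      // First run for this key, or the previous window has expired.
      windows.set(key, { startedAt: nowMs, count: 1 });
      return true;
    }
    if (current.count < limit) {
      current.count += 1;
      return true;
    }
    return false; // over this workspace's budget: the run has to wait
  };
}
```

Because each key owns its own window, a workspace that exhausts its budget only delays its own runs; other workspaces keep their full capacity.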
The same principles are later used to insert the embeddings into the vector store in batches of 100 documents. By using event.data.workspaceId as the batch key, each batch is guaranteed to contain documents from the same workspace.
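The effect of keyed batching can be sketched with a plain grouping function. The settings object follows the shape of Inngest’s `batchEvents` options (`maxSize`, `timeout`, `key`), with an illustrative timeout that is not taken from the demo. `batchByWorkspace` and `EmbeddingEvent` are hypothetical and only model the outcome; Inngest forms batches as events arrive and flushes partial ones on timeout.

```typescript
// Illustrative batch settings; the timeout value is an assumption.
const embeddingBatchSettings = {
  maxSize: 100,
  timeout: "5s",
  key: "event.data.workspaceId",
};

interface EmbeddingEvent {
  data: { workspaceId: string; documentId: string };
}

// Groups events into batches of at most maxSize, never mixing workspaces.
function batchByWorkspace(events: EmbeddingEvent[], maxSize: number): EmbeddingEvent[][] {
  if (maxSize < 1) throw new Error("maxSize must be at least 1");
  const open = new Map<string, EmbeddingEvent[]>();
  const batches: EmbeddingEvent[][] = [];
  for (const event of events) {
    const key = event.data.workspaceId;
    const batch = open.get(key) ?? [];
    batch.push(event);
    if (batch.length === maxSize) {
      // Batch is full: emit it and start a fresh one for this workspace.
      batches.push(batch);
      open.delete(key);
    } else {
      open.set(key, batch);
    }
  }
  // Flush partial batches, as a batch timeout would.
  open.forEach((batch) => batches.push(batch));
  return batches;
}
```

Since every batch carries a single workspaceId, the handler can resolve that workspace’s vector store once and insert the whole batch into it.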
Finally, our loadVectorStore(workspaceId) helper comes into action, used in combination with the workspace-based batches.
Conclusion
This example illustrates how putting a RAG into users’ hands brings challenges similar to those of deploying SaaS applications:
- One customer’s spike in usage can affect the whole application’s performance
- Any data leak between organizations can lead to a loss of trust
- RAGs rely on costly external services (e.g., OpenAI) that should be shared fairly among all users of the application
Fortunately, Neon’s database-per-user design and Inngest’s per-key throttling and batching features make it easy to isolate workspaces into their own data and RAG workflows without any additional infrastructure work or cost.
Feel free to explore this article’s demo by looking at the repository on GitHub, running it locally, or quickly deploying it on Vercel.