Say you’re Walmart, trying to build an LLM-powered customer support tool. You want the LLM’s answers to take into account things like past orders, product descriptions, corporate brand guides, and Walmart’s documentation. To do this, you need RAG.
RAG (retrieval-augmented generation) is an opaque term for a very simple workflow: extracting relevant information from domain-specific documents, and providing it as specific context to a large language model. To implement it, there are a few main steps:
- Take the domain-specific documents, chunk them in some way, and embed the chunks.
- Store these embeddings, typically in a vector database.
- At query time, embed the query, and find the closest match in the vector database.
- Add that match into the query context.
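The four steps above can be sketched end to end. Everything here is a toy stand-in: the bag-of-words `embed` function and the hardcoded chunks substitute for a real embedding model and vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real system would call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2: chunk and embed the domain documents, then store the vectors.
chunks = [
    "Refunds are issued within 5 business days of return receipt.",
    "Our brand voice is friendly, concise, and avoids jargon.",
    "Order history is available under the account page.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 3: embed the query and find the closest stored chunk.
query = "How long do refunds take?"
query_vec = embed(query)
best_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))

# Step 4: add the match into the query context for the LLM.
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"
```

Here the refund question retrieves the refund-policy chunk, because it shares the most vocabulary with the query; real embeddings capture semantic overlap rather than literal word overlap, but the pipeline shape is the same.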
You might ask, why is RAG even necessary? As contexts get longer and longer (Anthropic’s Claude offers a 100K token context, for example), why not just stuff all the documents in the context and call it a day?
First, while 100K tokens is generous, it may still not be big enough to fit all the domain-specific documents. Second, feeding 100K tokens’ worth of context into a model tends to degrade performance: models don’t pay uniform attention to context (they tend to attend most to the beginning and the end), so providing only the most relevant information improves results. Finally, RAG reduces costs. Feeding all your data into the model on every query is expensive in compute time and money; retrieving only the relevant context is far more efficient.
At this point, many of the components involved in building RAG are fairly commoditized. Getting baseline embeddings, for example, is trivial – you can do it just by calling the OpenAI API or using a Hugging Face model. Similarly, off-the-shelf vector databases like Pinecone make indexing and searching embeddings straightforward.
But getting RAG to work well, consistently, for your use case, is hard. The issues tend to appear after hooking up an end-to-end RAG system, once you’re writing prompts using the retrieved context and testing full queries from users. Looking at the final results, teams realize they aren’t getting good answers to some queries. This post offers some tips on debugging RAG issues and avoiding common mistakes.
1. Make sure your embeddings encode similarity the way you want them to
The first step in RAG is taking your dataset of domain-specific documents and embedding them. It’s important to get “good” embeddings.
But what does that even mean?
It’s helpful to review what embeddings are. Embeddings map full-text documents to vectors of real numbers in a lower-dimensional space. In theory, the more semantically related two documents are, the closer their corresponding vectors should be in that space. The issue is that what it means for two documents to be “related” or “relevant” is subjective. Take a look at the three documents below:
1/ “The bright white cat jumped in front of the streetlight.”
2/ “The dog leaped at the birthday party.”
3/ “The driverless car was stuck in traffic.”
There’s one interpretation in which example 1/ and 2/ are closer together semantically, because dogs and cats are both animals, and they’re both performing leaping motions. But there’s another interpretation where 1/ and 3/ are closer together semantically because they’re both describing things that are happening on the road.
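To make the ambiguity concrete, here is a toy sketch with made-up 2-D embeddings whose axes represent “animal-ness” and “road-ness”. Depending on which dimension the embedding emphasizes, the cat sentence ends up closest to either the dog sentence or the car sentence:

```python
import math

# Made-up 2-D embeddings; the axes are (animal-ness, road-ness).
docs = {
    "cat": (0.9, 0.6),  # an animal, in a street scene
    "dog": (0.9, 0.0),  # an animal, at a party
    "car": (0.0, 0.9),  # no animal, a traffic scene
}

def weighted_dist(a, b, weights):
    # Euclidean distance with per-dimension weights: which dimensions
    # the embedding emphasizes determines which documents count as "close".
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

animal_first = (1.0, 0.1)  # an embedding that mostly encodes animals
road_first = (0.1, 1.0)    # an embedding that mostly encodes road context

closest_animal = min(("dog", "car"),
                     key=lambda d: weighted_dist(docs["cat"], docs[d], animal_first))
closest_road = min(("dog", "car"),
                   key=lambda d: weighted_dist(docs["cat"], docs[d], road_first))
```

With the animal-weighted distance, `closest_animal` is the dog sentence; with the road-weighted distance, `closest_road` is the car sentence. Both answers are “correct” – the embedding just encodes a different notion of similarity.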
When people talk about generating “better embeddings” for RAG, what they actually mean is generating embeddings that encode similarity in a way that matches how you want to retrieve documents. For example, a recycling sorting company would want images clustered by material and texture, not color. This sort of semantic mismatch is a prevalent source of poor retrieval quality: you may look at the top query results and understand why those documents were retrieved based on their embeddings, yet recognize that they aren’t actually the most relevant documents for the query.
Services like OpenAI provide generic embeddings trained on large corpora and optimized to work well for most uses. These offer a quick way to get started, but their semantic interpretation of your documents might differ from what you want. For example, OpenAI embeddings generically put documents in different languages far apart, since their surface forms are very different. Your use case, however, might require embeddings that group similar topics together across languages.
An alternative to the generic embedding models is to fine-tune your own embeddings. To do this you need to provide both positive and negative examples of pairs of documents that should be close together in latent space, and pairs that should not be. The linked tutorial shows how you can generate some of these examples just by running DocQA over your documents.
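Fine-tuning typically optimizes a contrastive objective over those pairs. Below is a minimal sketch of one common variant – a triplet loss over cosine similarities. The function names and the margin value are illustrative, not any particular library’s API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Zero loss when the anchor is already at least `margin` more similar
    # to its positive pair than to its negative pair; positive loss otherwise.
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)

# A well-ordered triplet incurs no loss...
well_ordered = triplet_loss((1.0, 0.0), (1.0, 0.0), (0.0, 1.0))   # 0.0
# ...while a flipped one produces a signal to push the embeddings apart.
flipped = triplet_loss((1.0, 0.0), (0.0, 1.0), (1.0, 0.0))        # 1.2
```

During fine-tuning, this loss is minimized over your labeled pairs so that “should be close” documents actually end up close in the tuned space.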
While not super rigorous, one popular way to check the quality of embeddings is to visualize them: this allows you to do a spot-check on whether concepts and documents that you think should be close together are actually close together.
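A common way to do this spot-check is to project the embeddings down to two dimensions, e.g. with PCA, and scatter-plot them. A minimal sketch using NumPy (the 384-dimension random input is just a placeholder for real sentence embeddings):

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    # Project embeddings onto their top two principal components,
    # producing 2-D points suitable for a scatter plot (e.g. matplotlib).
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Placeholder for real document embeddings, e.g. 384-dim sentence vectors.
embeddings = np.random.rand(100, 384)
points = pca_2d(embeddings)  # shape (100, 2), ready to plot
```

Coloring the plotted points by known document categories makes it easy to see whether related documents actually cluster together.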
2. Chunk intelligently
If your original domain-specific documents are lengthy, you will likely have to chunk them into smaller pieces before embedding them. How you choose to do this can potentially cause issues.
If you chunk too large, and the chunk contains a broad range of topics, you may not get great embeddings.
On the other hand, if you chunk too small and the topic you’re trying to capture actually spans two or three sentences that the document gets split on, you’ve now awkwardly divided its semantic meaning between chunks.
- Consider logical chunking boundaries like sections and sentences.
- Chunks should overlap a bit.
- Use chunking techniques like sliding windows, which increase the indexing work but avoid unlucky splits.
- Further advice on chunking here.
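A minimal sliding-window chunker along these lines might look like the following. The word-based splitting and the specific sizes are illustrative choices, not recommendations:

```python
def sliding_window_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # Emit overlapping windows of `chunk_size` words; the `overlap` words
    # repeated at each boundary guard against splitting a topic mid-thought.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 250-word document yields three overlapping chunks:
# words 0-99, words 80-179, and words 160-249.
doc = " ".join(f"w{i}" for i in range(250))
chunks = sliding_window_chunks(doc)
```

A production chunker would usually split on logical boundaries (sentences, sections) rather than raw word counts, but the overlap idea carries over directly.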
3. Make sure your query and your embedded documents aren’t too far apart semantically
Say you want to write a natural language query on top of your database – “What are my daily active users in the US for this product?” – and get back a chart. In this case, the documents being embedded are table schemas, and you’re hoping that your RAG setup retrieves the right table definitions to answer the question – maybe a table with login events and another with user location.
But as anyone who has worked at a real company – especially a real company with growth – knows, data warehouse schemas are hardly ever pristine. There are usually tons of duplicate tables lying around, and table and column names aren’t always human-readable. Additionally, the query doesn’t pose the question in schema terms: it asks about “in the US” rather than naming a country column, and that info may live in a non-obvious column like a language locale containing “English US”. In other words, there’s a semantic gap between the query and the table schema documents.
As the query language moves away from overt schema details, embeddings struggle to connect them. You may end up embedding a request like “Can you produce a chart showing X?” where the US part becomes a minor footnote.
One technique to address this is hypothetical document embeddings.
- Ask the LLM to generate a fake schema document that would answer the question. For example:
**IdealActiveUsers**

**Description:** This table contains information about the daily active users for various products, segmented by country.

**Columns:**

1. **Date**: Date type. Represents the day for which the data is recorded.
2. **ProductID**: Integer. A unique identifier for each product.
3. **Country**: String. The country from which the user accessed the product.
4. **ActiveUsers**: Integer. The number of users who were active on the given date for the specified product in the mentioned country.

**Sample Entries:**

| Date       | ProductID | Country | ActiveUsers |
|------------|-----------|---------|-------------|
| 2023-10-03 | 101       | US      | 5000        |
| 2023-10-03 | 102       | UK      | 3000        |

**Related Tables:**

- **UserLocation**: Contains information about the user's country based on their activity.
- **LoginEvents**: Contains logs of user logins, which can be used to determine daily activity.
This hypothetical schema is designed to be an ideal representation of the data structure that would answer the query directly.
- Embed the fake schema document.
- Search for the closest real document to the “hallucinated” ideal one.
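Putting the three steps together, here is a toy sketch of the hypothetical-document-embeddings flow. The `fake_llm` stub stands in for a real LLM call, and the bag-of-words embedding stands in for a real embedding model; the schema strings are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(prompt):
    # Stand-in for a real LLM call: returns a canned "ideal" schema.
    return "IdealActiveUsers: date, productid, country, activeusers by country"

def hyde_retrieve(query, documents, llm=fake_llm):
    # 1. Ask the LLM for a hypothetical document that would answer the query.
    fake_doc = llm(f"Write an ideal table schema answering: {query}")
    # 2. Embed the hallucinated document rather than the raw query.
    fake_vec = embed(fake_doc)
    # 3. Return the real document closest to the hallucinated ideal.
    return max(documents, key=lambda d: cosine(fake_vec, embed(d)))

schemas = [
    "tbl_evt_lgn: uid, ts, dev",  # cryptic login-events table
    "user_activity_daily: date, productid, country, activeusers",
    "dim_sku: sku_id, sku_desc",  # product dimension table
]
best = hyde_retrieve("What are my daily active users in the US?", schemas)
```

The hallucinated schema shares far more vocabulary (and, with real embeddings, semantics) with the right table than the natural-language question does, which is exactly the gap this technique bridges.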
4. Consider alternatives to nearest neighbor search
By default, RAG systems decide which documents to retrieve by doing a nearest neighbor search: they take the query, embed it, and then find the embedded vectors with the largest cosine similarity to the query’s embedding. In doing so, they collapse the rich embedded vectors down to just a single distance. For a given set of embeddings, each data point will always have the exact same nearest neighbors, regardless of what you’re actually trying to retrieve.
But sometimes, as we previously talked about, the absolute closest point is not what you want – your ideal match may be nearest along a semantic dimension that raw distance doesn’t capture. So how do you capture that complexity?
One alternative people have tried is to retrieve a larger set of candidates – say, the top 500 matches instead of just the single closest – and then re-rank this larger set using the full embedding vectors as input to a classifier such as an SVM. This is more expensive, since it requires retrieving more data and running a more complex model, but it squeezes more value out of the rich embeddings than nearest neighbor search alone.
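One way to implement this is to treat the query as the lone positive example and the retrieved candidates as negatives, fit a small linear SVM, and rank candidates by their decision score. A NumPy-only sketch, where the plain gradient-descent loop on the hinge loss is a stand-in for a real SVM solver:

```python
import numpy as np

def svm_rerank(query_vec, candidate_vecs, lr=0.1, steps=200, reg=0.01):
    # Train a tiny linear SVM with the query as the lone positive example
    # and every retrieved candidate as a negative, then rank candidates by
    # their decision score: the most "query-like" candidates score highest.
    X = np.vstack([query_vec, candidate_vecs])
    y = np.array([1.0] + [-1.0] * len(candidate_vecs))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        violators = margins < 1.0  # points inside or beyond the margin
        grad = reg * w - (y[violators, None] * X[violators]).sum(axis=0)
        w -= lr * grad
    scores = candidate_vecs @ w
    return np.argsort(-scores)  # candidate indices, best match first

query_vec = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.8, 0.1, 0.1],  # most aligned with the query direction
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
order = svm_rerank(query_vec, candidates)
```

In practice you would run this only over the few hundred nearest-neighbor candidates, since fitting even a small classifier per query over the full corpus would be prohibitively expensive.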
RAG promises to make large language models more useful by grounding them in real-world data. But there’s a big gap between a RAG prototype and something that works well in production. A lot of this comes down to ensuring that the embeddings, and the retrievals, match the user’s understanding of relevance, rather than some arbitrary statistical one.