Say you’re Walmart, trying to build an LLM-powered customer support tool. You want the LLM’s answers to take into account things like past orders, product descriptions, corporate brand guides, and Walmart’s documentation. To do this, you need RAG.
RAG (retrieval augmented generation) is an opaque term for a very simple workflow: extract relevant information from domain-specific documents and provide it as context to a large language model. Implementing it comes down to a few main steps: embedding your documents (usually after chunking them), retrieving the documents most relevant to each query, and passing them to the model alongside the prompt.
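At its core, the whole loop fits in a few lines. Here is a minimal sketch, where the `embed`, `vector_search`, and `llm` functions are placeholders for whatever embedding model, vector store, and LLM client you happen to use:

```python
# Minimal RAG loop (sketch). `embed`, `vector_search`, and `llm` are
# placeholders for your embedding model, vector store, and LLM client.

def answer_with_rag(question: str, top_k: int = 5) -> str:
    query_vector = embed(question)                  # 1. embed the user's query
    documents = vector_search(query_vector, top_k)  # 2. retrieve the closest documents
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                              # 3. generate an answer grounded in that context
```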
You might ask, why is RAG even necessary? As contexts get longer and longer (Anthropic’s Claude offers a 100K token context, for example), why not just stuff all the documents in the context and call it a day?
First, while 100K tokens is generous, it's possible that it's still not big enough to fit all the domain-specific documents. Second, when you feed 100K tokens' worth of context into a model, performance tends to degrade. Models don't pay uniform attention to context (they tend to pay the most attention to the beginning and the end), so providing only the most relevant information improves performance. Finally, RAG reduces costs. Feeding all the data into the model is expensive in computing time and money; retrieving only the relevant context is more efficient.
At this point, many of the components involved in building RAG are fairly commoditized. Getting baseline embeddings, for example, is trivial: you can do it just by calling the OpenAI API or using a Hugging Face model. Similarly, off-the-shelf vector databases like Pinecone make indexing and searching embeddings straightforward.
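For instance, here is roughly what getting baseline embeddings looks like with the current OpenAI Python client and with an open-source model via sentence-transformers (the model names are common defaults, not recommendations; check the respective docs):

```python
# Option 1: OpenAI's hosted embeddings endpoint (reads OPENAI_API_KEY from the environment).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The dog leaped over a pothole in the road."],
)
vector = response.data[0].embedding  # a plain list of floats

# Option 2: an open-source model from Hugging Face.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["The dog leaped over a pothole in the road."])  # numpy array, one row per input
```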
But getting RAG to work well, consistently, for your use case, is hard. The issues tend to appear after hooking up an end-to-end RAG system, once you’re writing prompts using the retrieved context and testing full queries from users. Looking at the final results, teams realize they aren’t getting good answers to some queries. This post offers some tips on debugging RAG issues and avoiding common mistakes.
The first step in RAG is taking your dataset of domain-specific documents and embedding them. It’s important to get “good” embeddings.
But what does that even mean?
It’s helpful to review what embeddings are. An embedding maps a full-text document to a vector of real numbers in a lower-dimensional space. In theory, the more semantically related two documents are, the closer their corresponding vectors should be in that space. The issue is that what it means for two documents to be “related” or “relevant” is subjective. Take a look at three documents along these lines:

1/ The dog leaped over a pothole in the road.
2/ The cat leaped onto the kitchen counter.
3/ The road was closed for repaving.
There’s one interpretation in which examples 1/ and 2/ are closer together semantically, because dogs and cats are both animals, and they’re both performing leaping motions. But there’s another interpretation in which 1/ and 3/ are closer together semantically, because they’re both describing things that are happening on the road.
When people talk about generating “better embeddings” for RAG, what they actually mean is generating embeddings that encode similarity in a way that matches how you want to retrieve documents. For example, a recycling sorting company would want images clustered by material and texture, not color. This sort of semantic mismatch is a prevalent source of poor retrieval quality. You may look at the top query results and understand why those documents were retrieved based on their embeddings, yet recognize they are not actually the most relevant documents for the query.
Services like OpenAI provide generic embeddings trained on large corpora and optimized to work well for most uses. These offer a quick way to get started, but their semantic interpretation of your documents might be different from what you want. For example, OpenAI embeddings generically put documents in different languages far apart, since the languages themselves are very different. However, your use case might require embeddings that group similar topics together across languages.
An alternative to the generic embedding models is to fine-tune your own embeddings. To do this, you need to provide positive examples (pairs of documents that should be close together in latent space) and negative examples (pairs that should not be). The linked tutorial shows how you can generate some of these examples just by running DocQA over your documents.
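As a rough sketch of what that fine-tuning can look like with the sentence-transformers library and a contrastive loss (the pairs below are invented placeholders; in practice you would feed in many generated pairs):

```python
# Fine-tuning sketch: label 1.0 marks a pair that should be close in the
# embedding space, 0.0 a pair that should not. Replace the toy pairs with
# your own generated examples.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["the dog leaped in the road", "the cat leaped onto the counter"], label=1.0),
    InputExample(texts=["the dog leaped in the road", "quarterly revenue grew 4%"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("fine-tuned-embeddings")
```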
If your original domain-specific documents are lengthy, you will likely have to chunk them into smaller pieces before embedding them. How you choose to do this can potentially cause issues.
If you chunk too coarsely, and a single chunk covers a broad range of topics, you may not get great embeddings: the resulting vector ends up blending all of those topics together.
On the other hand, if you chunk too finely, and the topic you’re trying to capture actually spans two or three sentences that get split across different chunks, you’ve now awkwardly divided its semantic meaning between chunks.
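A simple mitigation is fixed-size chunking with overlap, so a topic that straddles a boundary still appears intact in at least one chunk. A minimal sketch (the sizes here are arbitrary starting points, not recommendations):

```python
# Fixed-size chunking with overlap (character-based for simplicity).
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    assert 0 <= overlap < chunk_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back so adjacent chunks share some text
    return chunks
```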
Say you want to write a natural language query on top of your database – “What are my daily active users in the US for this product?” – and get back a chart. In this case, the documents being embedded are table schemas, and you’re hoping that your RAG setup retrieves the right table definitions to answer the question – maybe a table with login events and another with user location.
But as anyone who has worked at a real company (especially one that has grown quickly) knows, data warehouse schemas are hardly ever pristine. There are usually tons of duplicate tables lying around, and table and column names aren’t always human-readable. Additionally, the query isn’t posed in schema terms: it asks about “in the US” rather than naming a country column, and that information may live in a non-obvious column, such as a language locale column containing “English US”. In other words, there’s a semantic gap between the query and the table schema documents.
As the query’s language moves further away from overt schema details, embeddings struggle to connect the two. You may end up embedding a request like “Can you produce a chart showing X?” in which the “in the US” part becomes a minor footnote.
One technique to address this is hypothetical document embeddings (HyDE), sketched in code after the steps below.
1. Ask the LLM to generate a fake schema document that would answer the question. This hypothetical schema is meant to be an ideal representation of the data structure that would answer the query directly.
2. Embed the fake schema document.
3. Search for the closest real document to the “hallucinated” ideal one.
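Here is a sketch of those three steps, reusing the placeholder `embed` and `vector_search` helpers from earlier and using the OpenAI chat API for the hallucination step (the model name and prompt wording are just illustrative):

```python
# Hypothetical document embeddings (HyDE) for schema retrieval, as a sketch.
from openai import OpenAI

client = OpenAI()

def retrieve_with_hyde(question: str, top_k: int = 5) -> list[str]:
    # 1. Ask the LLM to hallucinate an ideal table schema for the question.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write a SQL CREATE TABLE statement for a table that "
                       f"could answer this question directly: {question}",
        }],
    )
    fake_schema = completion.choices[0].message.content

    # 2. Embed the fake schema instead of the raw question.
    fake_vector = embed(fake_schema)

    # 3. Find the real schema documents closest to the hallucinated one.
    return vector_search(fake_vector, top_k)
```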
By default, RAG systems decide which documents to retrieve with a nearest neighbor search: they embed the query, then find the document vectors with the highest cosine similarity to the query’s embedding. In doing so, they collapse all the structure in those vectors down to a single distance. For a given set of embeddings, a given query will always return the exact same nearest neighbors, regardless of which aspect of similarity actually matters for it.
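Concretely, that default retrieval step boils down to something like the following (a bare-bones version of what a vector database does for you):

```python
# Cosine-similarity nearest neighbor search over document embeddings.
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    query = query_vec / np.linalg.norm(query_vec)                      # normalize the query
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # normalize each document
    scores = docs @ query                                              # one similarity score per document
    return np.argsort(scores)[::-1][:k]                                # indices of the k closest documents
```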
But sometimes, as we discussed earlier, the absolute closest point is not what you want: your ideal match may be closest along a semantic dimension that the raw distance doesn’t capture. So how do you capture that complexity?
RAG promises to make large language models more useful by grounding them in real-world data. But there’s a big gap between a RAG prototype and something that works well in production. A lot of this comes down to ensuring that the embeddings, and the retrievals, match the user’s understanding of relevance, rather than some arbitrary statistical one.