Don’t Neglect Your RAG Prompts

Make sure your embeddings encode similarity the way you want them to

In discussions around improving the performance of Retrieval Augmented Generation (RAG) systems, a lot of focus is given to the retrieval step. Embedding model selection, vector databases, chunking strategies, distance functions, reranking, etc., all factor into retrieving the most relevant documents.

I’ll be discussing the impact of prompting on the augmented generation step, with an example of how changing just a few words of the prompt, swapping the order of the documents and the question, can lead to a large jump in accuracy (93% -> 99%).

Setup

To mimic a real-world RAG scenario, I’ll be working with questions from the TriviaQA dataset. This benchmark dataset contains trivia questions, each paired with one or more documents (Wikipedia articles or ranked web search results) containing enough information to answer the question.

I started with the questions with verified web search results, filtered out the questions that had documents that were too big for the LLM we’ll be using (16k tokens), and then picked 100 random questions to use for the rest of this discussion.
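Here is a minimal sketch of that filter-then-sample step. It makes a few assumptions that are not from the original code: each entry is a dict with a `documents` list, "too big" means the documents' combined size exceeds the limit, and `approx_token_count` is a crude stand-in for a real tokenizer.

```python
import random

TOKEN_LIMIT = 16_000  # context budget of the model used in this post


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return len(text) // 4 + 1


def select_questions(entries, limit=TOKEN_LIMIT, sample_size=100, seed=0,
                     count_tokens=approx_token_count):
    """Drop entries whose documents exceed the limit, then sample at random."""
    eligible = [
        entry for entry in entries
        # Assumption: the limit applies to the documents' combined size.
        if sum(count_tokens(doc) for doc in entry["documents"]) <= limit
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(eligible, min(sample_size, len(eligible)))
```

Passing a real tokenizer as `count_tokens` would give a tighter filter; the character heuristic only approximates the true token count.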

Our prompt templates will take in the question and the top 3 documents, and try to produce the correct answer. By always providing the correct documents, this resembles a RAG system with perfect retrieval. The dataset contains a correct answer for each question, along with potential alternate correct answers (e.g., “John F. Kennedy” vs “JFK”).

To prevent style or LLM verbosity from impacting this exercise too much, I define a correct LLM response as one where at least one of these is true:

The response is a substring of any correct answer
Any correct answer is a substring of the response
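The two rules above can be sketched as a small scoring function. The lowercasing and whitespace stripping are my assumptions (the rules don't say whether matching is case-sensitive), and the function name is hypothetical.

```python
def is_correct(response: str, correct_answers: list[str]) -> bool:
    """True if the response and any accepted answer contain one another."""
    normalized_response = response.strip().lower()
    if not normalized_response:
        # An empty string is a substring of everything, so reject it outright.
        return False
    for answer in correct_answers:
        normalized_answer = answer.strip().lower()
        if not normalized_answer:
            continue
        if (normalized_response in normalized_answer
                or normalized_answer in normalized_response):
            return True
    return False
```

Note that this criterion is deliberately lenient: a short response like "Kennedy" counts as correct against "John F. Kennedy", because it is a substring of an accepted answer.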

We’ll be using OpenAI’s gpt-3.5-turbo-16k model for this exercise, with temperature 0.7, and no custom system instructions or settings.

Our Initial Prompt

Let’s start with a simple RAG prompt:


return f"""
    Please answer the following question, using the following documents.

    Question:

    {trivia_entry.question}

    Documents:

    {documents}

    Write your answer in the json form:

    {{
        "answer": "your answer"
    }}

    Make sure your answer is just the answer in json form, with no commentary.

    Start!"""

We provide the question and the documents, and we ask the model to give us the answer in a JSON object. This simple prompt performs pretty well, answering 93% of our questions correctly!
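Since the prompt asks for a JSON object, the response has to be parsed before it can be scored. Below is one hedged way to do that; the function name and the fallback behavior (returning the raw text when parsing fails) are my assumptions, not necessarily what the original code does.

```python
import json


def extract_answer(raw_response: str) -> str:
    """Pull the "answer" field out of a JSON-ish LLM response.

    Falls back to the raw text if no valid JSON object is found, so a
    chatty or malformed response can still be scored by substring matching.
    """
    text = raw_response.strip()
    # Tolerate commentary around the object by slicing from the first "{"
    # to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            payload = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return text
        if isinstance(payload, dict) and "answer" in payload:
            return str(payload["answer"])
    return text
```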

It would be easy to stop working on the prompt now. These are really solid results, after all, and you might suspect that the next win is tweaking some vector DB settings to get better at retrieving relevant documents. However, a look at the failure cases shows that even when the documents are correct, the LLM can fail to use the information correctly.

In two of these failure cases, the information is available in the supporting documents, but answering may require reasoning across multiple parts of the documents (combining the publishing date of the Kiss Flights document with the phrase “last month”), or it may require interpreting the question in the context of the documents (“Prime Minister” is sort of a correct answer, but it’s clearly not what’s being asked).

A Small Prompt Change

Let’s take the same prompt, but put the question after the documents.


return f"""
    Please answer the following question, using the following documents.

    Documents:

    {documents}

    Question:

    {trivia_entry.question}

    Write your answer in the json form:

    {{
        "answer": "your answer"
    }}

    Make sure your answer is just the answer in json form, with no commentary.

    Start!"""

It’s a small change in our prompt, but it leads to really different results: accuracy rises from 93% to 99%, and both tricky questions we looked at earlier are now answered correctly.
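Since the two templates differ only in where the question sits, both can be built from one function. This is a sketch rather than the original code: the `question_last` flag and the section-joining approach are my own, and the whitespace differs slightly from the indented f-strings shown in this post.

```python
def build_prompt(question: str, documents: str, question_last: bool = True) -> str:
    """Assemble the RAG prompt with the question before or after the documents."""
    question_block = f"Question:\n\n{question}"
    documents_block = f"Documents:\n\n{documents}"
    if question_last:
        middle = [documents_block, question_block]
    else:
        middle = [question_block, documents_block]
    sections = [
        "Please answer the following question, using the following documents.",
        *middle,
        "Write your answer in the json form:",
        '{\n    "answer": "your answer"\n}',
        "Make sure your answer is just the answer in json form, with no commentary.",
        "Start!",
    ]
    return "\n\n".join(sections)
```

Keeping the ordering behind a single flag makes it harder for the two variants to drift apart in other ways while you compare them.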

Why might that simple change help? It’s well known that ordering and position within the prompt context matter, and the question-at-the-end structure is the default in OpenAI’s documentation. In general, you want to favor placing the most relevant or important information towards the end of your prompt, because LLMs are sensitive to the order of their input and often show a recency bias. In a question answering task, the question is the most critical piece, so it deserves the spot after the documents.

As illustrated in this recent blog post by Anthropic, even small changes to the end of a prompt can have a big impact on how long-context information is used by the LLM.

In fact, the ordering of your documents may also impact performance. One study even found that a sub-optimal placement of the relevant information within the context can perform worse than just asking the question without any additional information.

Here we come to the moral of the blog post: spend more time on prompt engineering, because you may be leaving accuracy on the table. If your prompt context is starting to get long, you should try testing various ways to reorder its components, especially at the very end.
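To make that kind of testing routine, a small harness can score several prompt variants on the same questions. The sketch below is written under assumptions: `ask_llm` is a hypothetical callable wrapping whatever model client you use, entries are dicts with `question`, `documents`, and `answers` keys, and scoring uses case-insensitive substring matching in either direction.

```python
from typing import Callable, Sequence


def matches(response: str, answers: Sequence[str]) -> bool:
    """Substring match in either direction, case-insensitive (an assumption)."""
    response = response.strip().lower()
    if not response:
        return False
    for answer in answers:
        answer = answer.strip().lower()
        if answer and (response in answer or answer in response):
            return True
    return False


def compare_prompts(
    builders: dict[str, Callable[[str, str], str]],
    ask_llm: Callable[[str], str],
    entries: Sequence[dict],
) -> dict[str, float]:
    """Return the accuracy of each named prompt builder on the same entries."""
    if not entries:
        raise ValueError("need at least one entry to compute accuracy")
    scores = {}
    for name, build in builders.items():
        correct = sum(
            matches(ask_llm(build(entry["question"], entry["documents"])),
                    entry["answers"])
            for entry in entries
        )
        scores[name] = correct / len(entries)
    return scores
```

With a nonzero temperature, responses vary between runs, so a small gap on 100 questions could be noise; repeating the comparison a few times is a cheap sanity check.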

Resources

I’ve uploaded the code used for this blog post here. Please let me know if you spot anything obviously wrong in the methodology, and I’ll make sure to update this post.

Now for the obligatory work plug: this was inspired by all the prompt engineering we’ve done while working on Tidepool, which is an analytics tool to answer questions about large text datasets. If you ever find yourself trying to make sense of a large pile of text, please check us out!

Quinn Johnson

Cofounder and CTO at Aquarium. Formerly leading AI/ML data engineering and labeling at Cruise and Ouster. Now yelling prompts into the LLM void.