Providing examples can greatly improve your LLM prompts (both OpenAI and Anthropic include it in their prompting guides), but did you know that the specific choice of examples can dramatically impact performance?
In this post, we’ll look at a few different scenarios where we use LLMs to classify documents. These cases demonstrate that the way you pick your examples impacts accuracy, and that the best way to pick them varies from task to task.
Throughout this post, we’ll be using gpt-3.5-turbo on relatively straightforward benchmark datasets, but you’ll see the same behaviors when you use a more powerful model to solve harder, domain-specific problems.
We’ll start by looking at this Twitter emotion classification dataset. It contains filtered English Twitter posts that express strong emotions, each labeled with the one of six emotions (sadness, joy, love, anger, fear, or surprise) it best represents.
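If you want to peek at a few examples yourself, here’s a minimal sketch, assuming you’re using the dair-ai/emotion hosting of the dataset on Hugging Face:

```python
from datasets import load_dataset

# Load the Twitter emotion dataset (assuming the dair-ai/emotion hosting on Hugging Face).
dataset = load_dataset("dair-ai/emotion", split="train")
label_names = dataset.features["label"].names  # sadness, joy, love, anger, fear, surprise

# Print a handful of rows with their label names.
for row in dataset.select(range(5)):
    print(f"{label_names[row['label']]:>8}: {row['text']}")
```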
We’ll make a simple prompt to categorize each post:
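Something along these lines; the exact wording below is a reconstruction rather than the literal prompt behind the numbers that follow, and the client setup is the standard openai Python package:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT_PROMPT = """Classify the emotion expressed in the following tweet.
Respond with exactly one of: sadness, joy, love, anger, fear, surprise.

Tweet: {text}
Emotion:"""

def classify_zero_shot(text: str) -> str:
    # One chat completion per tweet; temperature 0 keeps the labels deterministic.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": ZERO_SHOT_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().lower()
```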
We’ll also make a version that has space for examples:
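The few-shot version just adds a slot for formatted examples (again, a sketch rather than the exact prompt):

```python
FEW_SHOT_PROMPT = """Classify the emotion expressed in the following tweet.
Respond with exactly one of: sadness, joy, love, anger, fear, surprise.

Here are some examples:

{examples}

Tweet: {text}
Emotion:"""

def format_examples(examples: list[dict]) -> str:
    # Each example is a dict with "text" and "label" (the label name as a string).
    return "\n\n".join(f"Tweet: {ex['text']}\nEmotion: {ex['label']}" for ex in examples)
```

At call time you’d fill it in with FEW_SHOT_PROMPT.format(examples=format_examples(examples), text=text).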
How well does it work?
Without any examples (zero-shot), it gets 54/100 correct. What about with some examples? If we pick 5 random examples per category (30 total), it jumps up to 59/100. Awesome! It does better, as we’d expect.
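Picking a fixed number of random examples per category only takes a few lines; a sketch, reusing the dataset and label_names from above:

```python
import random

def sample_per_category(dataset, label_names, per_category: int = 5) -> list[dict]:
    # Group the training rows by label name...
    by_label = {name: [] for name in label_names}
    for row in dataset:
        name = label_names[row["label"]]
        by_label[name].append({"text": row["text"], "label": name})
    # ...then draw the same number of random examples from each group.
    examples = []
    for name in label_names:
        examples.extend(random.sample(by_label[name], per_category))
    random.shuffle(examples)  # don't present the examples grouped by label
    return examples
```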
Well, if adding examples helped, why don’t we add even more examples? Let’s try picking 15 per category instead! With that, we have 90 total examples, and an accuracy of… 59/100. Our GPT calls are more expensive, with negligible (if any) benefit, so we’re clearly past the point of diminishing returns.
Let’s try to pick more relevant examples instead. In this report, OpenAI got their best results on medical classification tasks by dynamically picking few-shot examples based on the specific question being asked, so we can do something similar. We’ll compute an embedding for every example, and when assembling our prompt, always pick the 30 examples that are most similar (lowest cosine distance) to the message we’re trying to classify.
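Here’s a sketch of that selection step. The embedding model below (text-embedding-3-small) is a placeholder choice on my part; any reasonable text embedding model works the same way.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Batch-embed a list of strings into one matrix (one row per text).
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def most_similar_examples(query: str, examples: list[dict],
                          example_embeddings: np.ndarray, k: int = 30) -> list[dict]:
    # Cosine similarity of L2-normalized vectors; lowest cosine distance = highest similarity.
    query_vec = embed([query])[0]
    query_vec /= np.linalg.norm(query_vec)
    normed = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    scores = normed @ query_vec
    return [examples[i] for i in np.argsort(scores)[::-1][:k]]
```

You’d compute example_embeddings = embed([ex["text"] for ex in pool]) once up front, then call most_similar_examples for every message you classify.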
After that change, we get 68/100 correct. A 15% relative increase, for a small prompt construction change. Not bad!
Let’s look at a task that requires more logic. There’s a classic Stanford benchmark dataset that presents a fact and a statement, and asks whether the statement is true, false, or neutral given the fact. Here are a few example facts and statements with their labels:
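That description matches the SNLI (Stanford Natural Language Inference) dataset; a sketch of pulling a few rows, assuming the stanfordnlp/snli hosting on Hugging Face and mapping its entailment/neutral/contradiction labels onto true/neutral/false:

```python
from datasets import load_dataset

snli = load_dataset("stanfordnlp/snli", split="train")
snli = snli.filter(lambda row: row["label"] != -1)  # drop rows with no gold label

# SNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_MAP = {0: "true", 1: "neutral", 2: "false"}

for row in snli.select(range(3)):
    print("Fact:     ", row["premise"])
    print("Statement:", row["hypothesis"])
    print("Label:    ", LABEL_MAP[row["label"]])
    print()
```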
Once again, we’ll make a simple zero-shot prompt:
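Again, a reconstruction of roughly what that looks like:

```python
NLI_ZERO_SHOT_PROMPT = """You are given a fact and a statement.
Decide whether the statement is true, false, or neutral given the fact.
Respond with exactly one word: true, false, or neutral.

Fact: {premise}
Statement: {hypothesis}
Answer:"""
```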
And a slight modification to support examples:
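As before, the few-shot version just adds a slot for formatted examples (a sketch, not the exact prompt):

```python
NLI_FEW_SHOT_PROMPT = """You are given a fact and a statement.
Decide whether the statement is true, false, or neutral given the fact.
Respond with exactly one word: true, false, or neutral.

Here are some examples:

{examples}

Fact: {premise}
Statement: {hypothesis}
Answer:"""

def format_nli_examples(examples: list[dict]) -> str:
    # Each example is a dict with "premise", "hypothesis", and "label" ("true"/"false"/"neutral").
    return "\n\n".join(
        f"Fact: {ex['premise']}\nStatement: {ex['hypothesis']}\nAnswer: {ex['label']}"
        for ex in examples
    )
```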
Without any examples, we get an accuracy of 43/100, and picking 10 random examples per category (30 total) gets us 48/100.
Earlier we saw that picking similar examples helped, so let’s do that again. Replacing the 30 examples with the 30 most similar examples by embedding cosine distance gets us… 44/100. It got worse.
In retrospect, this sort of makes sense. Most text embeddings focus on the topic of the text, but that’s not really relevant here: the most useful examples would be ones with similar logical relationships, not ones that just happen to share a topic (say, taking place in a restaurant). Our original 10 examples per category were guaranteed to include all three types of logical relationships, so that was probably a more diverse set of examples.
We can try sampling more examples again. Let’s go back to random examples from each category, and pick 30 from each instead of 10 (90 total). Unlike the earlier emotion-classification task, we see a clear, though small, improvement: 51/100. Since this is a more complicated task, it seems that adding more examples keeps helping for longer, though it also uses up many more tokens.
Since we hypothesized that the balance of categories matters, what if we mess with it? Instead of 10 from each, let’s see what happens when we pick examples from just one label:
Just true: 55/100
Just false: 41/100
Just neutral: 39/100
Yeah, it looks like the distribution of categories in the examples really matters. Focusing just on the true examples ended up being the most useful for the LLM, while focusing just on false or neutral examples was worse than providing no examples at all.
If we wanted to tune this further, the next step would probably be to test out the right balance between example types, likely weighting it toward mostly true examples.
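A sketch of what that weighted sampling could look like, reusing the by-label grouping idea from the earlier sampling sketch; the 60/20/20 split here is only a starting point to sweep over, not a recommendation:

```python
import random

def sample_weighted(by_label: dict[str, list[dict]], total: int = 30,
                    weights: dict[str, float] | None = None) -> list[dict]:
    # weights: fraction of the total examples to draw from each label.
    weights = weights or {"true": 0.6, "false": 0.2, "neutral": 0.2}
    examples = []
    for label, weight in weights.items():
        examples.extend(random.sample(by_label[label], round(total * weight)))
    random.shuffle(examples)
    return examples
```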
In both scenarios, we saw that adding examples was an easy way to improve prompt accuracy, but the right choice of examples is both important and inconsistent from prompt to prompt. The two methods we highlighted here (choosing examples similar to the question, and controlling the distribution of your examples’ categories) can be great starting points for your experimentation.
It does take a bit of analysis of your task and of the mistakes the LLM makes, but the results are worth it!