Providing examples can greatly improve your LLM prompts (both OpenAI and Anthropic include it in their prompting guides), but did you know that the specific choice of examples can dramatically impact performance?
In this post, we’ll look at a few different scenarios where we use LLMs to classify documents. These cases demonstrate that the way you pick your examples impacts accuracy, and that the best way to pick them varies from task to task.
Throughout this post, we’ll be using gpt-3.5-turbo on relatively straightforward benchmark datasets, but you’ll see the same behaviors when you use a more powerful model to solve harder, domain-specific problems.
We’ll start by looking at this Twitter emotion classification dataset. It contains filtered English Twitter posts that express strong emotions, each labeled with the one of six emotions (sadness, joy, love, anger, fear, or surprise) it best represents.
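If you want to peek at a few examples yourself, here’s a minimal sketch, assuming you’re using the dair-ai/emotion hosting of the dataset on Hugging Face:

```python
from datasets import load_dataset

# Load the Twitter emotion dataset (assuming the dair-ai/emotion hosting on Hugging Face).
dataset = load_dataset("dair-ai/emotion", split="train")
label_names = dataset.features["label"].names  # sadness, joy, love, anger, fear, surprise

# Print a handful of rows with their label names.
for row in dataset.select(range(5)):
    print(f"{label_names[row['label']]:>8}: {row['text']}")
```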
We’ll make a simple prompt to categorize each post:
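Something along these lines; the exact wording below is a reconstruction rather than the literal prompt behind the numbers that follow, and the client setup is the standard openai Python package:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT_PROMPT = """Classify the emotion expressed in the following tweet.
Respond with exactly one of: sadness, joy, love, anger, fear, surprise.

Tweet: {text}
Emotion:"""

def classify_zero_shot(text: str) -> str:
    # One chat completion per tweet; temperature 0 keeps the labels deterministic.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": ZERO_SHOT_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().lower()
```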
We’ll also make a version that has space for examples:
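The few-shot version just adds a slot for formatted examples (again, a sketch rather than the exact prompt):

```python
FEW_SHOT_PROMPT = """Classify the emotion expressed in the following tweet.
Respond with exactly one of: sadness, joy, love, anger, fear, surprise.

Here are some examples:

{examples}

Tweet: {text}
Emotion:"""

def format_examples(examples: list[dict]) -> str:
    # Each example is a dict with "text" and "label" (the label name as a string).
    return "\n\n".join(f"Tweet: {ex['text']}\nEmotion: {ex['label']}" for ex in examples)
```

At call time you’d fill it in with FEW_SHOT_PROMPT.format(examples=format_examples(examples), text=text).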
How well does it work?
Without any examples (zero-shot), it gets 54/100 correct. What about with some examples? If we pick 5 random examples per category (30 total), it jumps up to 59/100. Awesome! It does better, as we’d expect.
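Picking a fixed number of random examples per category only takes a few lines; a sketch, reusing the dataset and label_names from above:

```python
import random

def sample_per_category(dataset, label_names, per_category: int = 5) -> list[dict]:
    # Group the training rows by label name...
    by_label = {name: [] for name in label_names}
    for row in dataset:
        name = label_names[row["label"]]
        by_label[name].append({"text": row["text"], "label": name})
    # ...then draw the same number of random examples from each group.
    examples = []
    for name in label_names:
        examples.extend(random.sample(by_label[name], per_category))
    random.shuffle(examples)  # don't present the examples grouped by label
    return examples
```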
Well, if adding examples helped, why don’t we add even more examples? Let’s try picking 15 per category instead! With that, we have 90 total examples, and an accuracy of… 59/100. Our GPT calls are more expensive, with negligible (if any) benefit, so we’re clearly past the point of diminishing returns.
Let’s try to pick more relevant examples instead. In this report, OpenAI got their best results on medical classification tasks by dynamically picking few-shot examples based on the specific question being asked, so we can do something similar. We’ll compute an embedding for every example, and when assembling our prompt, always pick the 30 examples that are most similar (lowest cosine distance) to the message we’re trying to classify.
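Here’s a sketch of that selection step. The embedding model below (text-embedding-3-small) is a placeholder choice on my part; any reasonable text embedding model works the same way.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Batch-embed a list of strings into one matrix (one row per text).
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def most_similar_examples(query: str, examples: list[dict],
                          example_embeddings: np.ndarray, k: int = 30) -> list[dict]:
    # Cosine similarity of L2-normalized vectors; lowest cosine distance = highest similarity.
    query_vec = embed([query])[0]
    query_vec /= np.linalg.norm(query_vec)
    normed = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    scores = normed @ query_vec
    return [examples[i] for i in np.argsort(scores)[::-1][:k]]
```

You’d compute example_embeddings = embed([ex["text"] for ex in pool]) once up front, then call most_similar_examples for every message you classify.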
After that change, we get 68/100 correct. A 15% relative increase, for a small prompt construction change. Not bad!
Let’s look at a task that requires more logic. There’s a classic Stanford benchmark dataset that presents a fact and a statement, and asks whether the statement is true, false, or neutral given the fact. Here are a few example facts and statements with their labels:
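That description matches the SNLI (Stanford Natural Language Inference) dataset; a sketch of pulling a few rows, assuming the stanfordnlp/snli hosting on Hugging Face and mapping its entailment/neutral/contradiction labels onto true/neutral/false:

```python
from datasets import load_dataset

snli = load_dataset("stanfordnlp/snli", split="train")
snli = snli.filter(lambda row: row["label"] != -1)  # drop rows with no gold label

# SNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_MAP = {0: "true", 1: "neutral", 2: "false"}

for row in snli.select(range(3)):
    print("Fact:     ", row["premise"])
    print("Statement:", row["hypothesis"])
    print("Label:    ", LABEL_MAP[row["label"]])
    print()
```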
Once again, we’ll make a simple zero-shot prompt:
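Again, a reconstruction of roughly what that looks like:

```python
NLI_ZERO_SHOT_PROMPT = """You are given a fact and a statement.
Decide whether the statement is true, false, or neutral given the fact.
Respond with exactly one word: true, false, or neutral.

Fact: {premise}
Statement: {hypothesis}
Answer:"""
```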
And a slight modification to support examples:
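As before, the few-shot version just adds a slot for formatted examples (a sketch, not the exact prompt):

```python
NLI_FEW_SHOT_PROMPT = """You are given a fact and a statement.
Decide whether the statement is true, false, or neutral given the fact.
Respond with exactly one word: true, false, or neutral.

Here are some examples:

{examples}

Fact: {premise}
Statement: {hypothesis}
Answer:"""

def format_nli_examples(examples: list[dict]) -> str:
    # Each example is a dict with "premise", "hypothesis", and "label" ("true"/"false"/"neutral").
    return "\n\n".join(
        f"Fact: {ex['premise']}\nStatement: {ex['hypothesis']}\nAnswer: {ex['label']}"
        for ex in examples
    )
```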
Without any examples, we get an accuracy of 43/100, and picking 10 random examples per category (30 total) gets us 48/100.
Earlier we saw that picking similar examples helped, so let’s do that again. Replacing the 30 examples with the 30 most similar examples by embedding cosine distance gets us… 44/100. It got worse.
In retrospect, this sort of makes sense. Most text embeddings focus on the topic of the text, but that’s not really relevant here: the most useful examples would be ones with similar logical relationships, not ones that just happen to share a topic (say, taking place in a restaurant). Our original 10 examples per category were guaranteed to include all three types of logical relationships, so that was probably a more diverse set of examples.
We can try sampling more examples again. Let’s go back to random examples from each category, and pick 30 from each instead of 10 (90 total). Unlike the earlier emotion-classification task, we see a clear, though small, improvement: 51/100. Since this is a more complicated task, it seems that adding more examples keeps helping for longer, though it also uses up many more tokens.
Since we hypothesized that the balance of categories matters, what if we mess with it? Instead of 10 from each, let’s see what happens when we pick examples from just one label:
Just true: 55/100
Just false: 41/100
Just neutral: 39/100
Yeah, it looks like the distribution of categories in the examples really matters. Focusing just on the true examples ended up being the most useful for the LLM, while focusing just on false or neutral examples was worse than providing no examples at all.
If we wanted to tune this further, the next step would probably be to test out the right balance between example types, likely weighting it toward mostly true examples.
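A sketch of what that weighted sampling could look like, reusing the by-label grouping idea from the earlier sampling sketch; the 60/20/20 split here is only a starting point to sweep over, not a recommendation:

```python
import random

def sample_weighted(by_label: dict[str, list[dict]], total: int = 30,
                    weights: dict[str, float] | None = None) -> list[dict]:
    # weights: fraction of the total examples to draw from each label.
    weights = weights or {"true": 0.6, "false": 0.2, "neutral": 0.2}
    examples = []
    for label, weight in weights.items():
        examples.extend(random.sample(by_label[label], round(total * weight)))
    random.shuffle(examples)
    return examples
```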
In both scenarios, we saw that adding examples was an easy way to improve prompt accuracy, but the right choice of examples is both important and inconsistent from prompt to prompt. The two methods we highlighted here (choosing examples similar to the question, and controlling the distribution of your examples’ categories) can be great starting points for your experimentation.
It does take a bit of analysis of your task and of the mistakes the LLM makes, but the results are worth it!