As large language models (LLMs) like GPT-4 become more widely used, companies are integrating them into customer-facing products and relying on them for business-critical applications. This presents a challenge: how can you systematically evaluate where your LLM is performing well or poorly?
In this post, we’ll explore techniques to thoroughly understand your LLM’s capabilities and uncover areas for improvement. The goal is to measure success on your app’s particular use cases, rather than just producing an overall score against some public benchmark.
We assume that you’re an application company building on top of an existing LLM API or fine-tuning an off-the-shelf model. You want to apply the LLM’s capabilities to a specific problem domain, rather than training a generic model from scratch.
Let’s say that you’re building an application to automatically clean up podcast transcripts into human-readable, “edited for clarity” blog posts. This is a task that doesn’t have a “right answer”, but a casual reader can probably tell whether the LLM has done a good job, because the blog posts will be clear, concise, and coherent, while still preserving the voice of the original podcast guest.
The most basic approach for evaluation, which most people start with, is to simply try out your LLM + prompt extensively across a range of representative inputs and manually “eyeball” the results to evaluate quality. Let’s say you’re testing out the following prompt against the Claude API:
Please “lightly edit for clarity” the podcast transcript below into a human-readable blog post, preserving as much of the original content as possible. Do not make stuff up: {transcript}.
You can spot-check the prompt against a few examples – for instance, a Jack Ma talk, an Odd Lots transcript, and a Databricks fireside chat transcript. Spending 15 minutes trying a few different inputs and reading the model outputs can give you a surprisingly good “gut check” on model performance without needing to build complicated evaluation infrastructure.
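If you’re calling the Claude API directly, this spot-check can be a short script. The sketch below is only illustrative: it assumes the anthropic Python SDK, a placeholder model name, and hypothetical local transcript files, and simply runs the prompt over the three examples so you can read the outputs.

```python
# A minimal spot-check sketch, assuming the anthropic Python SDK and an
# ANTHROPIC_API_KEY in the environment; model name and file paths are placeholders.
import anthropic

PROMPT = (
    'Please "lightly edit for clarity" the podcast transcript below into a '
    "human-readable blog post, preserving as much of the original content as "
    "possible. Do not make stuff up: {transcript}"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def edit_transcript(transcript: str) -> str:
    """Run one transcript through the prompt and return the model's blog post."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
    )
    return response.content[0].text

if __name__ == "__main__":
    # Hypothetical local files holding the three example transcripts.
    for path in ["jack_ma_talk.txt", "odd_lots_episode.txt", "databricks_fireside.txt"]:
        with open(path) as f:
            print(f"=== {path} ===")
            print(edit_transcript(f.read()))  # eyeball the result
```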
However, since this isn’t particularly systematic, you should think of these explorations as a way of building up an evaluation set rather than the final word on whether your LLM is working well.
A high-quality evaluation set for an LLM-powered application should provide broad, realistic coverage of expected use cases from your application, not just a narrow sampling. This includes diverse examples across different topics, formats, and complexity levels to expose different failure modes. Think of it as unit test cases for your LLM + prompt combo.
You can build up this set incrementally:
Continuing from our example above, the Jack Ma transcript, an Odd Lots transcript, and a Databricks transcript represent different conversation styles that might provide challenges for the prompt + LLM. In some responses, the blog post may claim the speaker said things they did not. In others, the LLM may have inappropriately summarized the content or simplified the tone. These are cases that you should add to your evaluation set.
At this point, you will probably still be manually reading model outputs to evaluate how good they are. However, having an evaluation set gives you higher confidence that you’re evaluating the model across a more diverse and representative set of data than haphazardly prompting whatever comes to mind.
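To make the “unit test cases for your LLM + prompt combo” idea concrete, each example can be stored as a small structured record. Here is a minimal sketch, with hypothetical file paths and illustrative notes about why each case was added:

```python
# A sketch of an evaluation set kept like unit test cases; paths, notes, and the
# must_mention phrases are illustrative, not real data.
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    name: str                     # short identifier for the example
    transcript_path: str          # raw podcast transcript fed to the prompt
    notes: str = ""               # why this case was added (observed failure mode)
    must_mention: list[str] = field(default_factory=list)  # phrases the output should keep

EVAL_SET = [
    EvalCase(
        name="jack_ma_talk",
        transcript_path="transcripts/jack_ma_talk.txt",
        notes="Earlier run attributed statements the speaker never made.",
    ),
    EvalCase(
        name="odd_lots_episode",
        transcript_path="transcripts/odd_lots_episode.txt",
        notes="Two hosts plus a guest; earlier run over-summarized and flattened the tone.",
    ),
    EvalCase(
        name="databricks_fireside",
        transcript_path="transcripts/databricks_fireside.txt",
        notes="Heavy technical jargon; check that terminology is preserved.",
        must_mention=["Databricks"],
    ),
]
```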
As your evaluation set grows larger, it becomes more difficult to manually inspect model outputs for every example. This is a good time to come up with simple automated metrics to evaluate your model performance. Which metrics are “right” depends on the specific use cases and desired outcomes for your LLM application:
If there are definitive correct answers for certain queries or tasks, accuracy against ground truth is a straightforward metric. You can measure the % of cases where the LLM generates the expected result.
Even without set answers, consistency in output formatting, tone, and other attributes can be easily evaluated. If your task involves producing structured output like JSON, it’s simple to check that the model output is valid JSON. If your task involves generating text about a certain topic, the model output should probably mention that topic explicitly.
On the other hand, for open-ended tasks without a single right answer, defining reference responses provides a baseline for comparison. You can compare LLM outputs against references using similarity metrics like BLEU (sketched below).
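Here is a minimal sketch of these checks. It assumes nltk is installed for the BLEU computation, and the function names are illustrative rather than any standard API:

```python
# Simple automated checks: exact-match accuracy, JSON validity, topic mention,
# and similarity to a reference response via (smoothed) sentence-level BLEU.
import json
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(output: str, expected: str) -> bool:
    """Accuracy against ground truth: does the output match the expected result?"""
    return output.strip() == expected.strip()

def is_valid_json(output: str) -> bool:
    """Formatting check: can the output be parsed as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def mentions_topic(output: str, topic: str) -> bool:
    """Content check: does the output explicitly mention the required topic?"""
    return topic.lower() in output.lower()

def bleu_vs_reference(output: str, reference: str) -> float:
    """Similarity between the output and a reference response, in [0, 1]."""
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference.split()], output.split(), smoothing_function=smoothing)
```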
Together, your evaluation set and evaluation metrics will allow you to automatically evaluate any changes made to your LLM and prompt. While this won’t be 100% correct all the time (for example, it’s difficult to evaluate text generation tasks where quality is very subjective), it should be directionally correct and catch “obvious” mistakes.
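A bare-bones harness along these lines, assuming the EvalCase records, metric functions, and edit_transcript() call sketched earlier, can then be re-run after every prompt or model change:

```python
# A minimal auto-eval sketch: run every case through the LLM + prompt under test
# and report how many pass their checks. Builds on the earlier sketches.
def run_auto_eval(eval_set, generate):
    results = []
    for case in eval_set:
        with open(case.transcript_path) as f:
            output = generate(f.read())
        # Only the topic-mention check is shown; other metrics slot in the same way.
        checks = {topic: mentions_topic(output, topic) for topic in case.must_mention}
        results.append((case.name, all(checks.values()), checks))
    passed = sum(1 for _, ok, _ in results if ok)
    print(f"{passed}/{len(results)} cases passed their checks")
    return results

# e.g. run_auto_eval(EVAL_SET, edit_transcript) after changing the prompt
```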
Something that the field is increasingly moving towards is using LLMs to automatically evaluate other LLMs, unlocking faster iteration. For example, you might ask GPT-4 how similar Claude’s response is to a reference answer, or whether Claude’s response is coherent on a scale of 1-5.
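For instance, a coherence judge might look something like the sketch below. It assumes the openai Python SDK and an OPENAI_API_KEY in the environment; the judge prompt and the way the score is parsed are illustrative:

```python
# An LLM-as-judge sketch: ask GPT-4 to rate another model's output on a 1-5 scale.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a blog post that was produced from a podcast transcript.
Rate how coherent the blog post is on a scale of 1 (incoherent) to 5 (fully coherent).
Reply with a single digit and nothing else.

Blog post:
{output}"""

def rate_coherence(output: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=output)}],
    )
    # Assumes the judge follows the instruction to reply with a single digit.
    return int(response.choices[0].message.content.strip())
```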
Prompt engineering the evaluating model can take some effort, but it clarifies an LLM developer’s thinking about what they want their model to do. By explicitly defining what counts as a good or bad response, they are essentially writing labeling instructions that the evaluating model uses to generate noisy labels for the data.
Note that these labels are noisy! Labels, whether from human labelers or an LLM, still need to be inspected by a domain expert who knows what they want their model to do and can evaluate the quality of human / machine labelers.
While automated evaluations are fast and cheap and enable you to iterate quickly, you’ll want to periodically confirm their results with human evaluations.
At many machine learning companies, human evaluation and labeling are performed by a team of trained raters. For example, these raters can simply evaluate whether a model response is “good” or “bad” according to a set of labeling instructions – you can probably reuse the prompt that you gave to the evaluating model for this. Another approach is to present raters with two outputs (sometimes generated by a model, sometimes by a human) and ask them to choose which one is “better” according to the labeling instructions.
However, evaluating subjective tasks like copywriting can still be very difficult and inconsistent across raters. Instead of asking your raters to give a holistic rating, you could ask raters to assess specific dimensions of quality, for example, coherence, tone, or accuracy. Think of this like a rubric – by providing sufficient structure and training for your raters, you standardize their judgment.
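As a rough illustration, rating tasks and rubric scores might be structured along the lines below; the dimensions and field names are hypothetical, not any standard tooling format:

```python
# A sketch of structured rating tasks for human raters: pairwise comparisons plus
# per-dimension rubric scores. All field names and dimensions are illustrative.
from dataclasses import dataclass, field

RUBRIC_DIMENSIONS = ["coherence", "tone preservation", "factual accuracy"]

@dataclass
class PairwiseTask:
    case_name: str       # which evaluation-set example this came from
    output_a: str        # e.g. the current prompt's output
    output_b: str        # e.g. a new prompt's output, or a human-written reference
    instructions: str    # the same labeling instructions given to the judge LLM

@dataclass
class RubricRating:
    case_name: str
    rater_id: str
    # dimension name -> 1-5 score; pre-filled so raters grade every dimension
    scores: dict[str, int] = field(default_factory=lambda: {d: 0 for d in RUBRIC_DIMENSIONS})
```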
That said, the cost of hiring humans means that you probably won’t be able to do human evaluation as frequently. Every time you change your LLM + prompt combo, you should run an auto-eval on your metrics; human evaluations, since they are much more expensive, should be performed periodically or on a subsample of the evaluation set.
Ultimately, you want to deploy your LLM and test it out in the real world with real users.
One way to do this is to ask your users for explicit feedback. For example, for conversational LLM applications, you might show a simple “Was this conversation helpful?” popup at the conclusion of the session with a “thumbs up” or “thumbs down” response for a user to pick from.
However, in order to reduce the amount of work required from users, you may want to look at implicit metrics. These include metrics like engagement time and conversation length for conversational applications. For generative content platforms, this can be the number of times a user re-ran generation before being satisfied with the response.
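As an illustration, one such implicit metric (how often users re-ran generation before accepting a result) might be computed from raw event logs along these lines; the log format here is hypothetical:

```python
# A sketch of an implicit metric computed from a hypothetical event log, where each
# entry looks like {"session": "abc123", "event": "generate"} or {"...": "accept"}.
from collections import defaultdict

def regeneration_rate(events) -> float:
    """Average number of extra generations per session before the user stopped."""
    generations = defaultdict(int)
    for e in events:
        if e["event"] == "generate":
            generations[e["session"]] += 1
    sessions = len(generations)
    retries = sum(count - 1 for count in generations.values() if count > 1)
    return retries / sessions if sessions else 0.0
```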
At the end of the day, there’s no substitute for looking at user interactions. The best way to tell if your LLM is working well is to directly read the text logs of interactions between your users and your model – it’s often easy to tell at a glance where the model is generating a bad output or see when your users are dissatisfied with an answer.
When you don’t have much user activity, it’s pretty easy to read through all of your logs to understand where the model works well or badly. However, as your user base grows and the number of interactions increases, it can take hours to find the critical failures in a sea of interactions.
There are numerous ways to tackle this problem, from query systems to Jupyter notebooks and more. Our company’s product, Tidepool, solves this by using embedding analysis. With Tidepool, you can understand trends in the text data: when a user was unsatisfied or had negative sentiment, which topics users most commonly want to talk about, or which workflows they most often want to run. You can also find where the model performed well or badly in production: when the model refused to answer a user’s question, when the model’s output failed to achieve the user’s goal, or when the user was dissatisfied with a model response.
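The underlying idea can be sketched generically (this is not Tidepool’s API). A minimal version, assuming sentence-transformers and scikit-learn are installed, embeds user messages and clusters them so common topics and failure modes surface:

```python
# A generic embedding-analysis sketch: embed raw user messages, cluster them, and
# inspect a few examples per cluster to name the topic or failure mode by hand.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_user_messages(messages: list[str], n_clusters: int = 20):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder
    embeddings = model.encode(messages)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)
    # Return up to five sample messages per cluster for manual review.
    return {
        cluster: [m for m, l in zip(messages, labels) if l == cluster][:5]
        for cluster in range(n_clusters)
    }
```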
Evaluating LLMs requires going beyond traditional ML model analysis. However, there’s a spectrum of options that range from “simple but unreliable” to “more complex but more rigorous.”
The most effective strategy is combining these approaches to get a comprehensive view. Both qualitative human assessment and quantitative usage data are critical for optimizing LLM performance in customer-facing applications.