You can’t scroll through Twitter these days without coming across a new open-source LLM evaluation framework, guide, or startup; it’s clearly a topic on people’s minds. Now that the first flush of LLM hype has faded, people who are deploying LLMs in production are struggling to answer basic practical questions.
- Is the LLM doing what users want?
- When the LLM fails, how does it fail?
- How do you know whether a change you made improved or regressed LLM quality?
While new open-source models always brag about their performance on academic benchmarks, the reality is that those benchmarks are inherently limited, especially for difficult or subjective tasks. They don’t capture what actually matters: is the model solving your problem or use case?
Before starting Aquarium, I worked in the self-driving industry as an early engineer at Cruise. One of the things we spent a lot of time figuring out was our framework for developing the car’s AI software. I believe our lessons there can also be applied to improving LLMs.
One approach we used was to develop the cars entirely in simulation. Once we had built a basic simulation framework, we could quickly script specific scenarios, run many simulations faster than real time, and get results back quickly. However, it’s very hard to build high-fidelity simulations that capture all the complexity of the real world. It was also near impossible to think of simulation scenarios representative of the bizarre situations we would encounter in real-world driving. So our mantra was: “If it doesn’t work in simulation, it won’t work in the real world. But something working in simulation may not necessarily work in the real world.”
Another approach was to define a set of predetermined test scenarios and requirements. This can be thought of as the “systems engineering” approach to development. This involved setting up test tracks or courses, specifying requirements for certain driving scenarios, and evaluating the cars against those preset requirements in real-world tests. However, it was very time consuming to rigorously define requirements and to physically set up tests. This approach also inevitably failed to anticipate many of the edge cases and situations that emerge in real-world city driving.
Then there was a third approach, which was just to get the cars on real roads with human safety drivers overseeing them. We would then use the unlabeled production data to determine what needed to be fixed and worked on.
There were a few key things that made this successful.
- We were testing the car in a real-world environment, but with trained human operators at the wheel who could take over when needed and mark where the car made mistakes. This allowed us to easily log when and why the human operator had to intervene to take over the car.
- Once we had these takeover events logged, our QA team would go through and categorize these failures into buckets of the top 5-10 most frequent or severe issues.
- The engineering team would review these key failure buckets weekly and prioritize solutions to develop – “Initiatives A, B, and C should solve the top 3 failure types.” These solutions were validated in development by testing against previously recorded failure scenarios.
- The solutions would then be deployed to the production fleet to collect more real-world data and evaluate whether the performance improved.
One limitation was that this workflow could only identify system-level failures that required human intervention, like near-collisions. It couldn’t diagnose issues with individual modules that didn’t directly cause a serious incident.
Another limitation was that development of solutions happened only after encountering a failure on the road. For certain classes of extremely obvious, rare or dangerous scenarios, it was much safer to develop safeguards ahead of time using simulated / systems engineering tests. Otherwise, a human safety driver would be forced to do a split-second takeover to avoid disaster when they first encountered that situation in real life.
Still, overall this was a very effective workflow for knocking out the majority of common failure cases and making progress very quickly and safely. The important pieces of this strategy were:
- Setting up a good feedback loop between production and development.
- Effectively prioritizing the highest-impact problems to solve and the most efficient solutions to them.
Lessons for LLMs
So how does this apply to LLMs?
As I’ve been talking to customers the last few weeks, I’ve seen many LLM evaluation approaches that remind me of either the simulation approach or the systems engineering approach.
For example, some companies are setting up simulated test environments with fake documents where QA engineers run commands and log failures. Others are constructing representative test cases to run with evaluation tools like LangSmith. But as with self-driving, simulated test environments are generally not representative of the situations that arise in real-world usage. And it’s difficult to build a set of representative test fixtures through pure “top-down” analysis of system requirements.
In my experience working on ML systems, I believe the most efficient way to improve model performance is to utilize feedback from production. By finding and collating issues in the production data, you can gain a clear picture of what are the most important problems to solve with the model performance and then prioritize work accordingly. You can then convert problem examples into test fixtures that accurately represent the production distribution and ensure that you do not regress on important edge cases.
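To make that fixture workflow concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `FAILURE_FIXTURES` stands in for examples mined from your production logs, and `call_model` is a stub for your real LLM call.

```python
# Sketch: turning logged production failures into regression fixtures.
FAILURE_FIXTURES = [
    # (prompt pulled from production logs, substring the known-bad answer contained)
    ("Can I get a refund on a non-refundable fare?", "I don't know"),
    ("My flight was cancelled, what now?", "contact someone"),
]

def call_model(prompt: str) -> str:
    # Stub standing in for the production LLM; replace with your API call.
    return "Here is what you can do: request a rebooking or a travel credit."

def run_regression_suite(model=call_model):
    """Re-run each failure prompt and flag fixtures whose known-bad
    pattern reappears in the new output."""
    regressions = []
    for prompt, bad_pattern in FAILURE_FIXTURES:
        output = model(prompt)
        if bad_pattern.lower() in output.lower():
            regressions.append((prompt, output))
    return regressions
```

An empty result means none of the previously observed failure patterns recurred; in a real setup you would run this in CI against every model or prompt change.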
Of course, there are some limitations to this approach. LLMs can be run at high throughput (for example, if you have a lot of users interacting with your LLM app), making exhaustive human evaluation of every LLM output infeasible. You won’t always have a human “safety driver” who can detect and log every model failure. There are ways to work around this – for example, by crowdsourcing “thumbs up / down” feedback from users, or having a human expert inspect a subset of the data – but these are limited samples of the entire dataset of production interactions.
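If you go the expert-inspection route, even a simple reproducible random sample is a workable starting point. A sketch, where `sample_for_review` and the review budget `k` are illustrative choices, not a prescribed API:

```python
import random

def sample_for_review(interactions, k=50, seed=0):
    """Draw a reproducible random subset of production interactions
    for human review. `k` is a hypothetical review budget; a fixed
    seed keeps the sample stable across runs."""
    rng = random.Random(seed)
    return rng.sample(interactions, min(k, len(interactions)))
```

Stratifying the sample (e.g. oversampling sessions with thumbs-down feedback) is a natural next step once explicit user signals exist.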
The good news is that you may not need a human to double-check model outputs. LLMs are uniquely capable of critiquing LLM outputs, allowing you to have one model double-check another model’s work!
Anthropic pioneered a technique called RLAIF (Reinforcement Learning from AI Feedback), in which a human writes instructions for a critique LLM that define good and bad behavior for their main LLM. The critique LLM can then inspect the outputs of the main LLM and label which outputs are good and which are bad. Not only is this technique more scalable than hiring humans, but a team at Google Research showed that LLM critique of a model produces results similar in quality to human critique of that model.
Of course, a critique LLM will likely have to be customized for specific use cases – an airline customer support bot has different standards of good and bad than an e-commerce or banking support bot. Content moderation teams also have varying guidelines based on audience – say, Pinterest may be willing to show content to adults that Roblox would not want to show to kids.
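Here is a rough sketch of what a customized critique step can look like. The prompt wording, the airline-specific criteria, and the `complete` callable (any text-completion function) are all illustrative assumptions, not the actual setup used by Anthropic or Google:

```python
# Hypothetical critique prompt for an airline support bot; the good/bad
# criteria are exactly what you would customize per use case.
CRITIQUE_PROMPT = """You are reviewing an airline customer support bot.
Good behavior: accurate policy answers, empathetic tone.
Bad behavior: invented policies, dismissive or curt replies.

Conversation:
{conversation}

Answer with exactly one word: GOOD or BAD."""

def is_failure(conversation: str, complete) -> bool:
    """Label one conversation with a critique LLM.
    `complete` is any prompt -> text callable (an assumed interface)."""
    verdict = complete(CRITIQUE_PROMPT.format(conversation=conversation))
    return verdict.strip().upper().startswith("BAD")
```

Swapping in a different `CRITIQUE_PROMPT` is all it takes to adapt the same loop to e-commerce, banking, or content-moderation standards.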
Once a critique LLM can reliably identify failures in the production model, then similarly to the self-driving example, we’ll want to cluster and prioritize these failures so we can fix them. However, in high-scale LLM applications, this process may surface hundreds or thousands of issues, which can be difficult to trawl through and categorize into failure buckets manually.
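One way to automate that bucketing is to embed each flagged failure and greedily group similar ones. The sketch below uses a toy bag-of-words embedding so it runs standalone; in practice you would swap in a real embedding model, and the similarity threshold is a tunable assumption:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; replace with a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bucket_failures(failures, threshold=0.3):
    """Greedy clustering: each failure joins the first bucket whose
    representative is similar enough, else it starts a new bucket.
    The 0.3 threshold is an arbitrary starting point to tune."""
    buckets = []  # list of (representative_embedding, members)
    for failure in failures:
        emb = embed(failure)
        for rep, members in buckets:
            if cosine(emb, rep) >= threshold:
                members.append(failure)
                break
        else:
            buckets.append((emb, [failure]))
    # Largest buckets first, so the biggest failure modes surface on top.
    return sorted((members for _, members in buckets), key=len, reverse=True)
```

Sorting by bucket size is what turns a pile of individual failures into a prioritized list of the top failure modes.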
Here at Tidepool, we’ve been working on a new feature, Quality Reports, to make this workflow much easier. Quality Reports tracks all of your LLM inputs and outputs in production, highlights problems in model performance using LLM critique, automatically breaks them down into the biggest buckets of failures, and then gives you a visualization that you can inspect further and share with your team.
As you can see above, for this particular dataset, 28% of production chat sessions had model failures, which we have separated into 5 buckets. Zooming into each bucket – let’s say, “Lack of empathy or understanding,” since it’s something we probably want to prioritize – Tidepool further breaks down the issues into relevant business categories and topics.
Quality Reports is about efficiently extracting insights from production data so you can significantly speed up model improvement. If you want to try this out for yourself, visit us at https://www.tidepool.so/ to learn more or get started with a free, self-service account today.