Getting started with automated evaluations

Header image: a rocket ship taking off into AI enlightenment

Albert Zhang
24 April 2024

At Braintrust, when we chat with engineers building AI applications, one of the most common questions we hear is “How do we get started with automated evaluations?”

In this post, we will discuss the state of evals today and lay out some high-leverage ways to quickly get started with automated evaluations.

The state of evals today

Before adopting Braintrust, we typically see AI teams rely on a few common approaches to evals:

  • Vibes-based: engineers and PMs remember some interesting test cases and eyeball the results
  • Benchmarks and/or black-box evals: MMLU for general tasks, HellaSwag for common sense, TruthfulQA for truthfulness, HumanEval for code generation, and many more
  • Stitched-together manual review: a combination of examples saved in spreadsheets, a script to run through test cases, and humans (engineers/PMs/SMEs) manually checking examples

While the above approaches are all helpful, all three fall short in important ways. Vibes and manual review do not scale, and general benchmarks are hard to customize and not specific enough to any particular application. As a result, engineering teams struggle to understand product performance, leading to a very slow dev loop and frustrating behavior like:

  • Making updates without having a good sense of how they impact end users
  • Playing whack-a-mole to identify regressions
  • Manually scoring responses one by one
  • Manually tracking experiments and examples over time (or worse, not tracking them)

Automated evaluations

Automated evaluations are easy to set up and can make an immediate impact on AI development speed. In this section, we will walk through 3 great approaches: LLM evaluators, heuristics, and comparative evals.

LLM evaluators

LLMs are incredibly useful for evaluating responses out of the box, even with minimal prompting. Anything you can ask a human to evaluate, you can (at least partially) encode into an LLM evaluator. Here are some examples:

  • Comparing a generated output vs. an expected output - instead of having an engineer scroll through an Excel spreadsheet and manually compare generated responses vs. expected responses, you can use a factuality prompt to compare the two. Many of our customers use this type of test to detect and prevent hallucinations
  • Checking whether an output fully addresses a question - if you provide a task and a response, LLMs do a great job of scoring whether the response is relevant and addresses all parts of the task

The above two methods are great places to start, and we’ve seen customers successfully configure LLMs to score many other subjective characteristics - conciseness, tone, helpfulness, writing quality, and many more.
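
For example, here is a minimal sketch of the first bullet using Factuality from our open-source autoevals library. The question and answers are illustrative, and the scorer assumes an OpenAI API key is configured in your environment:

```python
# Compare a generated answer against an expected answer with an
# LLM-based factuality scorer. The example data is illustrative.
from autoevals.llm import Factuality

question = "Which country has the highest population?"
generated = "The People's Republic of China has the highest population."
expected = "China"

evaluator = Factuality()
result = evaluator(generated, expected, input=question)

print(result.score)     # a score in [0, 1]
print(result.metadata)  # includes the judge model's rationale
```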

Heuristics

Heuristics are a valuable objective way to score responses. We’ve found that the best heuristics fall into one of two buckets:

  • Functional - ensuring the output fulfills a specific functional criterion
    • Examples: testing if an output is valid markdown, if generated code is executable, if the model selected a valid option from a list, Levenshtein distance
  • Subjective - using objective heuristics as a proxy for subjective factors
    • Examples: checking if an output exceeds a certain number of words (conciseness), checking if an output contains the word “sorry” (usefulness/tone)

Importantly - to make heuristic scoring as valuable as possible, it should be extremely easy for engineering teams to see updated scores after every change, quickly drill down into interesting examples, and add/tweak heuristics.
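
To make this concrete, here is a minimal sketch of a few heuristic scorers based on the examples above. The function names, signatures, and the 150-word budget are illustrative:

```python
def valid_option_score(output: str, options: list[str]) -> float:
    """Functional: did the model select a valid option from the list?"""
    return 1.0 if output.strip() in options else 0.0


def conciseness_score(output: str, max_words: int = 150) -> float:
    """Subjective proxy: full credit under the word budget, scaled down above it."""
    n_words = len(output.split())
    return 1.0 if n_words <= max_words else max_words / n_words


def tone_score(output: str) -> float:
    """Subjective proxy for usefulness/tone: penalize outputs containing 'sorry'."""
    return 0.0 if "sorry" in output.lower() else 1.0
```

Each of these returns a score between 0 and 1, so heuristics and LLM evaluators can be tracked side by side.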

Comparative evals

Comparative evals compare an updated set of responses vs. a previous iteration. This is particularly helpful in understanding whether your application is improving as you make changes. Comparative evals also do not require expected responses, so they can be a great option for very subjective tasks. Here are a few examples:

  • Testing whether summarization is improving (example)
  • Comparing cost, token usage, duration (especially when switching between models)
  • Starting with a standard template like battle and tweaking the questions and scores over time to be use-case specific

Braintrust natively supports hill climbing, which makes it very easy to iteratively compare new outputs to previous ones.
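
As a sketch of how a battle-style comparison can work, here is a simple pairwise judge that picks between the previous output and the new one for the same task. The judge model and prompt wording are illustrative assumptions:

```python
# A pairwise "battle"-style comparative eval: an LLM judge picks between
# the previous output and the new output for the same task.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two responses to the same task.

Task: {task}

Response A: {a}

Response B: {b}

Which response better completes the task? Answer with exactly "A" or "B"."""


def battle_score(task: str, previous_output: str, new_output: str) -> float:
    """Return 1.0 if the judge prefers the new output, else 0.0."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    task=task, a=previous_output, b=new_output
                ),
            }
        ],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("B") else 0.0
```

In practice, you would also swap the A/B order and average the two verdicts to control for position bias.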

Continuous iteration

While there is no replacement for human review, setting up basic structure around automated evals lets developers start iterating quickly. The ideal AI dev loop enables teams to immediately understand performance, track experiments over time, identify and drill down into interesting examples, and codify what “good” looks like. This also makes human review time much higher leverage, since you can point reviewers to the most useful examples and continuously incorporate their scores.

Getting this foundation in place does not require a big time investment up front. A single scoring function with 10-30 examples is enough to enable teams to start iterating. We’ve seen teams start from that foundation and very quickly scale into making 50+ updates per day across their AI applications, evaluation methods, and test data.
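
As a rough sketch of what that starting point can look like with the Braintrust SDK, the snippet below runs a single heuristic scorer over a handful of examples. The project name, data, and task function are placeholders you would swap for your own application:

```python
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "My AI App",  # placeholder project name
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
        # ...grow this toward 10-30 representative examples
    ],
    task=lambda input: "Hi " + input,  # replace with a call to your model or app
    scores=[Levenshtein],              # a single scoring function to start
)
```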

At Braintrust, we obsess over making the AI development process as smooth and iterative as possible. Setting up evaluations in Braintrust takes less than an hour and makes a huge difference. If you want to learn more, sign up, check out our docs, or get in touch!