AutoEvals library


AutoEvals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)

AutoEvals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with AutoEvals. It's easy to add custom prompts, parse outputs, and manage exceptions. AutoEvals is an open source library available on GitHub.


AutoEvals is distributed as a Python library on PyPI and a Node.js library on NPM.

npm install autoevals


yarn add autoevals
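
For Python, the library can be installed from PyPI (the package name is assumed to match the npm package):

pip install autoevals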


Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Factuality uses an LLM to grade the output against the expected answer.
  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();

Using Braintrust with AutoEvals

Once you grade an output using AutoEvals, it's convenient to use Braintrust to log and compare your evaluation results.

Create a file named example.eval.js (it must take the form *.eval.[ts|tsx|js|jsx]):

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("AutoEvals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});
Then, run:

npx braintrust run example.eval.js

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers*

Embeddings

  • BERTScore*
  • Ada Embedding distance*

Heuristic

  • Levenshtein distance
  • Numeric difference
  • JSON diff
  • Jaccard distance*

Statistical

  • BLEU*
  • ROUGE*

* Coming soon!
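
All of the scorers share the same interface: pass in input, output, and expected values and get back a score between 0 and 1. For example, a heuristic scorer like Levenshtein distance can be swapped in wherever a model-graded scorer is used. A minimal sketch (the export name Levenshtein is assumed; heuristic scorers don't require an LLM call):

import { Levenshtein } from "autoevals";

(async () => {
  // Heuristic scorers use the same { output, expected } call shape as the
  // model-graded ones, so they can be swapped in without changing the call site.
  const result = await Levenshtein({
    output: "People's Republic of China",
    expected: "China",
  });
  console.log(`Levenshtein score: ${result.score}`);
})();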

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:

import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.
I'm going to provide you with the issue description, and two possible titles.
Issue Description: {{input}}
1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  // The name, prompt template, and choice scores define the classifier.
  const evaluator = LLMClassifierFromTemplate({
    name: "TitleQuality",
    promptTemplate,
    choiceScores,
    useCoT: false,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference:`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });
  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();

Why does this library exist?

There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:

  • Normalizing metrics between 0 and 1 is tough. For example, AutoEvals' numeric difference scorer has to map an unbounded gap between two numbers onto that range (a rough sketch of the idea appears after this list).
  • Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. AutoEvals makes these tasks easy.
  • Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to AutoEvals, we couldn't find an open source library where you can simply pass in input, output, and expected values through a bunch of different evaluation methods.
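
To illustrate the first point, here is a rough sketch of one way to collapse a numeric difference into [0, 1] by scaling it against the magnitude of the values being compared. This is an illustration only, not the exact calculation AutoEvals uses:

// A sketch of one way to normalize a numeric difference into [0, 1].
// Illustration only; not the exact calculation AutoEvals uses.
function numericDiffScore(output, expected) {
  if (output === expected) return 1;
  const denom = Math.max(Math.abs(output), Math.abs(expected));
  if (denom === 0) return 1; // both values are zero
  return Math.max(0, 1 - Math.abs(output - expected) / denom);
}

console.log(numericDiffScore(90, 100)); // 0.9
console.log(numericDiffScore(1, 1000)); // 0.001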