Evaluation Suite

LOTUS includes LLM-as-judge tools for evaluating model outputs, application responses, and content quality directly from pandas DataFrames.

The evaluation suite has two DataFrame accessors:

  • llm_as_judge evaluates each row independently.

  • pairwise_judge compares two response columns and chooses the better response for each row.

Setup

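import pandas as pd
import lotus
from lotus.models import LM

# Configure the default language model; the judge accessors below use it.
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)

# Sample data: one answer to evaluate per question.
df = pd.DataFrame({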
    "question": [
        "What is cross-validation?",
        "What is gradient descent?",
    ],
    "answer": [
        "Cross-validation estimates generalization by evaluating on held-out splits.",
        "Gradient descent iteratively updates parameters to reduce a loss function.",
    ],
})

Choose the Right Evaluator

Use llm_as_judge when each row has one response to score, classify, or annotate.

scored = df.llm_as_judge(
    "Rate the accuracy of {answer} for {question} from 1 to 10. "
    "Return only the score."
)
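Even when the prompt asks for only the score, a judge model may wrap the number in extra text. The sketch below shows one way to normalize raw judge strings into integers before aggregating them. The parse_score helper and the sample strings are hypothetical and are not part of LOTUS; the name of the column that holds the judge output is not shown here, so inspect the returned DataFrame to find it.

```python
import re
from typing import Optional

# Hypothetical helper, not part of LOTUS.
# Matches a standalone integer from 1 to 10 (not part of a longer number).
_SCORE_RE = re.compile(r"(?<!\d)(10|[1-9])(?!\d)")


def parse_score(text: str) -> Optional[int]:
    """Return the first standalone integer from 1 to 10 in text, or None."""
    match = _SCORE_RE.search(text)
    return int(match.group(1)) if match else None


# Illustrative judge outputs (made up for this example).
raw_outputs = ["8", "Score: 10", "I would rate this 7/10.", "no rating given"]
scores = [parse_score(t) for t in raw_outputs]
```

Rows that parse to None can then be reviewed by hand or re-judged instead of being silently counted as a score.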

Use pairwise_judge when each row has two responses and you want a direct comparison.

pairwise_df = pd.DataFrame({
    "question": ["What is cross-validation?"],
    "model_a": ["It evaluates a model on several held-out splits."],
    "model_b": ["It checks whether a model knows the answer."],
})

compared = pairwise_df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    permute_cols=True,
    n_trials=2,
)
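Judge models can favor whichever response appears first in the prompt, which is the usual motivation for showing the pair in both orders across trials. The sketch below illustrates that idea in plain Python, assuming permute_cols alternates the presentation order between trials. The resolve_votes function, the (swapped, verdict) tuple format, the "first"/"second" labels, and the tie rule are all assumptions made for illustration; they do not describe the output format or internals of pairwise_judge.

```python
from collections import Counter
from typing import List, Tuple


def resolve_votes(trials: List[Tuple[bool, str]]) -> str:
    """Majority vote over trials, mapped back to the original columns.

    Each trial is (swapped, verdict). `swapped` is True when the responses
    were shown in reverse order; `verdict` is "first" or "second", meaning
    the position the judge preferred as it saw them.
    """
    votes: Counter = Counter()
    for swapped, verdict in trials:
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        picked_first = verdict == "first"
        # Undo the swap so the vote refers to the original column.
        winner = "model_a" if picked_first != swapped else "model_b"
        votes[winner] += 1
    if votes["model_a"] == votes["model_b"]:
        return "tie"
    return votes.most_common(1)[0][0]


# A judge that prefers the same content in both orders.
consistent = resolve_votes([(False, "first"), (True, "second")])
# A judge that always picks whatever is shown first.
position_biased = resolve_votes([(False, "first"), (True, "first")])
```

With the order swapped on the second trial, a purely position-driven judge cancels itself out into a tie, while a judge that tracks content produces the same winner both times.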

Caching Behavior

Evaluation calls temporarily disable LOTUS operator caching inside the judge loop, so repeated trials can produce independent judgments instead of replaying a cached response. The global cache setting is restored after the evaluation call finishes.
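The disable-then-restore behavior described above follows a common save-and-restore pattern. The sketch below illustrates that pattern with a context manager. The Settings class, its enable_cache field, and cache_disabled are hypothetical stand-ins written for this example; they are not the LOTUS implementation.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Settings:
    """Hypothetical stand-in for a global settings object."""
    enable_cache: bool = True


settings = Settings()


@contextmanager
def cache_disabled(cfg: Settings) -> Iterator[None]:
    """Turn caching off for the duration of the block, then restore it."""
    previous = cfg.enable_cache  # remember the caller's setting
    cfg.enable_cache = False
    try:
        yield
    finally:
        # Runs even if the block raises, so the setting is never left off.
        cfg.enable_cache = previous


with cache_disabled(settings):
    inside = settings.enable_cache  # caching is off while judging
after = settings.enable_cache  # original value is back afterwards
```

Restoring in a finally clause means a failed evaluation does not leave caching switched off for later operators.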