Evaluation Suite
LOTUS includes LLM-as-judge tools for evaluating model outputs, application responses, and content quality directly from pandas DataFrames.
The evaluation suite has two DataFrame accessors:
llm_as_judge evaluates each row independently.
pairwise_judge compares two response columns and chooses the better response for each row.
Setup
import pandas as pd
import lotus
from lotus.models import LM
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)
df = pd.DataFrame({
    "question": [
        "What is cross-validation?",
        "What is gradient descent?",
    ],
    "answer": [
        "Cross-validation estimates generalization by evaluating on held-out splits.",
        "Gradient descent iteratively updates parameters to reduce a loss function.",
    ],
})
Choose the Right Evaluator
Use llm_as_judge when each row has one response to score, classify, or
annotate.
scored = df.llm_as_judge(
    "Rate the accuracy of {answer} for {question} from 1 to 10. "
    "Return only the score."
)
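Even when the prompt asks for "only the score", a judge model may wrap the number in extra text. The following is a minimal sketch of one way to normalize such outputs into integers. It is not part of LOTUS: parse_score and the raw_outputs list are hypothetical, and the sketch assumes you have already pulled the judge's text outputs out of the result DataFrame.

```python
import re
from typing import Optional


def parse_score(text: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer in `text` that lies in [low, high], else None.

    Hypothetical helper for illustration; not a LOTUS function.
    """
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


# Made-up examples of what a judge might return for the prompt above.
raw_outputs = ["8", "Score: 9", "I would rate this 7/10.", "not sure"]
scores = [parse_score(text) for text in raw_outputs]
```

Outputs that contain no in-range integer map to None, so they can be inspected or re-judged rather than silently counted as a score.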
Use pairwise_judge when each row has two responses and you want a direct
comparison.
pairwise_df = pd.DataFrame({
    "question": ["What is cross-validation?"],
    "model_a": ["It evaluates a model on several held-out splits."],
    "model_b": ["It checks whether a model knows the answer."],
})
compared = pairwise_df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    permute_cols=True,
    n_trials=2,
)
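Pairwise judges can favor whichever response is shown first. A common mitigation is to repeat the comparison with the two responses swapped and tally the votes. The sketch below illustrates that general idea with stub judges. It is an assumption about why one might combine permuted columns with multiple trials, not a description of how pairwise_judge is implemented; compare_with_swaps and both judge functions are hypothetical.

```python
from collections import Counter
from typing import Callable

# A judge sees (question, first, second) and answers "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_with_swaps(
    judge: Judge, question: str, resp_a: str, resp_b: str, n_trials: int = 2
) -> Counter:
    """Tally votes for "A" and "B", swapping presentation order on odd trials."""
    votes: Counter = Counter()
    for trial in range(n_trials):
        swapped = trial % 2 == 1
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        picked_first = judge(question, first, second) == "first"
        # Map the positional verdict back to the original response label.
        if swapped:
            winner = "B" if picked_first else "A"
        else:
            winner = "A" if picked_first else "B"
        votes[winner] += 1
    return votes


def longer_is_better(question: str, first: str, second: str) -> str:
    """Stub judge that decides on content (here: length), not position."""
    return "first" if len(first) >= len(second) else "second"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


question = "What is cross-validation?"
a = "It evaluates a model on several held-out splits."
b = "It checks whether a model knows the answer."

content_votes = compare_with_swaps(longer_is_better, question, a, b)
biased_votes = compare_with_swaps(always_first, question, a, b)
```

With these stubs, the content-based judge votes for A in both orderings, while the position-biased judge splits its votes evenly, which makes the bias visible as a tie instead of a spurious win.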
Caching Behavior
Evaluation calls temporarily disable LOTUS operator caching inside the judge loop so repeated trials can produce independent judgments. The global cache setting is restored after the evaluation call finishes.
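The disable-then-restore pattern described above can be sketched with a context manager. This is only an illustration of the pattern under stated assumptions: the Settings class and its enable_cache field are stand-ins defined here, not LOTUS's actual settings object or internals.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Settings:
    """Stand-in settings object (hypothetical, for illustration only)."""
    enable_cache: bool = True


@contextmanager
def cache_disabled(settings: Settings):
    """Turn caching off inside the block and restore the previous value after."""
    previous = settings.enable_cache
    settings.enable_cache = False
    try:
        yield settings
    finally:
        # Runs even if the block raises, so the global value is always restored.
        settings.enable_cache = previous


settings = Settings(enable_cache=True)
with cache_disabled(settings):
    inside = settings.enable_cache  # caching is off while trials run
after = settings.enable_cache  # previous value is back
```

Restoring in a finally clause means a failed judge call does not leave caching switched off for later operators.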