LLM as judge

llm_as_judge evaluates each row of a DataFrame against a natural-language judge instruction. Reference input columns in the instruction with curly braces, for example {answer} and {question}.
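Because the instruction refers to columns by name, it can help to check the placeholders against the DataFrame before spending model calls. The sketch below uses only the Python standard library; referenced_columns and missing_columns are hypothetical helpers written for illustration, not part of LOTUS, and they do not describe how LOTUS parses instructions internally.

```python
from string import Formatter
from typing import Iterable


def referenced_columns(instruction: str) -> list[str]:
    """Return the {name} placeholders in the instruction, in first-seen order."""
    names: list[str] = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for _literal, field_name, _spec, _conversion in Formatter().parse(instruction):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_columns(instruction: str, available: Iterable[str]) -> list[str]:
    """Return placeholders that have no matching column name (hypothetical check)."""
    present = set(available)
    return [name for name in referenced_columns(instruction) if name not in present]


instruction = "Rate the accuracy of {answer} for {question} from 1 to 10."
print(referenced_columns(instruction))
print(missing_columns(instruction, ["answer"]))
```

With a DataFrame, missing_columns(instruction, df.columns) should return an empty list before the instruction is passed to llm_as_judge.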

Basic Usage

import pandas as pd
import lotus
from lotus.models import LM

# Configure the language model that LOTUS uses as the judge.
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

df = pd.DataFrame({
    "question": [
        "Explain supervised learning.",
        "Explain cross-validation.",
    ],
    "answer": [
        "Supervised learning trains on labeled examples.",
        "Cross-validation evaluates a model on multiple held-out splits.",
    ],
})

# Run the judge twice per row; each trial gets its own output column.
results = df.llm_as_judge(
    "Rate the accuracy and completeness of {answer} for {question} "
    "from 1 to 10. Return only the score.",
    n_trials=2,
)

print(results)
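When n_trials is greater than 1, a common next step is to combine the per-trial scores into one number per row. The following is a minimal sketch that assumes each trial output is a string containing an integer score; parse_score and mean_score are hypothetical helpers, not LOTUS functions, and the 1 to 10 range simply mirrors the instruction above.

```python
import re
from statistics import mean


def parse_score(text, low=1, high=10):
    """Extract the first integer from a judge output; None if absent or out of range."""
    match = re.search(r"-?\d+", str(text))
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def mean_score(trial_outputs):
    """Average the parseable scores across trials; None if no trial parsed."""
    scores = [s for s in map(parse_score, trial_outputs) if s is not None]
    return mean(scores) if scores else None


print(mean_score(["8", "Score: 9"]))
print(mean_score(["not sure"]))
```

Assuming the trial columns are named _judge_0 and _judge_1, the same helper could be applied row-wise with results[["_judge_0", "_judge_1"]].apply(mean_score, axis=1). Dropping unparseable trials instead of raising is a design choice; stricter pipelines may prefer to fail loudly.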

Output Columns

For each trial, LOTUS adds one output column named {suffix}_{trial}, where trial counts from 0. The default suffix is _judge, so the first trial is written to _judge_0 and, with n_trials=2, the second to _judge_1.

Set return_raw_outputs=True to also add a raw_output{suffix}_{trial} column for each trial. Set return_explanations=True to add an explanation{suffix}_{trial} column for each trial.
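The naming rules above can be written out as a small function, which is handy for selecting the judge columns programmatically. judge_output_columns is a hypothetical helper that only restates the documented patterns; the order in which LOTUS actually appends the columns is an assumption here and is not verified.

```python
def judge_output_columns(
    n_trials=1,
    suffix="_judge",
    return_raw_outputs=False,
    return_explanations=False,
):
    """List the column names expected for the given settings (illustrative only)."""
    columns = []
    for trial in range(n_trials):
        columns.append(f"{suffix}_{trial}")
        if return_raw_outputs:
            columns.append(f"raw_output{suffix}_{trial}")
        if return_explanations:
            columns.append(f"explanation{suffix}_{trial}")
    return columns


print(judge_output_columns(n_trials=2))
print(judge_output_columns(suffix="_evaluation", return_explanations=True))
```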

Structured Output

Pass a Pydantic model as response_format when you want structured judge outputs.

from pydantic import BaseModel, Field

class Evaluation(BaseModel):
    score: int = Field(description="Score from 1 to 10")
    reasoning: str = Field(description="Reason for the score")

results = df.llm_as_judge(
    "Evaluate {answer} for {question}.",
    response_format=Evaluation,
    suffix="_evaluation",
)

# Each cell holds an Evaluation instance, so its fields are available as attributes.
first = results.loc[0, "_evaluation_0"]
print(first.score)
print(first.reasoning)

response_format is not supported with ReasoningStrategy.COT or ReasoningStrategy.ZS_COT. Put reasoning fields in the structured output model instead.

Few-Shot Examples

Pass examples as a DataFrame that has the same input columns as the data being judged, plus an Answer column containing the expected judge output.

examples = pd.DataFrame({
    "question": ["What is supervised learning?"],
    "answer": ["It uses labeled examples to train a model."],
    "Answer": ["9"],
})

results = df.llm_as_judge(
    "Rate {answer} for {question} from 1 to 10.",
    examples=examples,
)

If you use ReasoningStrategy.COT with examples, include a Reasoning column in the examples DataFrame.
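The column requirements for examples are easy to get wrong, especially since answer (an input column) and Answer (the expected judge output) differ only by case. The sketch below is a hypothetical pre-flight check, not a LOTUS function; it takes column names only, so it works with examples.columns or a plain list.

```python
def missing_example_columns(example_columns, input_columns, uses_cot=False):
    """Return the required example columns that are absent (illustrative check)."""
    required = list(input_columns) + ["Answer"]
    if uses_cot:
        # The text above asks for a Reasoning column when COT is combined with examples.
        required.append("Reasoning")
    present = set(example_columns)  # column labels are compared case-sensitively
    return [name for name in required if name not in present]


example_columns = ["question", "answer", "Answer"]
print(missing_example_columns(example_columns, ["question", "answer"]))
print(missing_example_columns(example_columns, ["question", "answer"], uses_cot=True))
```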

Extra Context Columns

extra_cols_to_include adds columns to the judge input even when the instruction does not reference them directly.

results = df.llm_as_judge(
    "Evaluate the answer: {answer}",
    extra_cols_to_include=["question"],
)

Parameters

DataFrame.llm_as_judge(
    judge_instruction,
    response_format=None,
    n_trials=1,
    system_prompt=None,
    postprocessor=map_postprocess,
    return_raw_outputs=False,
    return_explanations=False,
    suffix="_judge",
    examples=None,
    cot_reasoning=None,
    strategy=None,
    extra_cols_to_include=None,
    safe_mode=False,
    progress_bar_desc="Evaluating",
    **model_kwargs,
)

judge_instruction: Natural language judge instruction.
response_format: Optional Pydantic model for structured output.
n_trials: Number of independent judge trials.
system_prompt: Optional system prompt for the judge.
postprocessor: Function that parses raw model outputs.
return_raw_outputs: Include raw model text columns.
return_explanations: Include explanation columns.
suffix: Base suffix for output columns.
examples: Few-shot examples with an Answer column.
cot_reasoning: Reasoning strings for direct function use.
strategy: Optional reasoning strategy.
extra_cols_to_include: Extra columns to include in judge inputs.
safe_mode: Estimate cost before execution.
progress_bar_desc: Progress bar label.
model_kwargs: Extra keyword arguments passed to the LM.