LLM as judge

llm_as_judge evaluates each row of a DataFrame against a natural-language judge instruction. Reference columns in the instruction with placeholders such as {answer} and {question}.

Basic Usage

import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

df = pd.DataFrame({
    "question": [
        "Explain supervised learning.",
        "Explain cross-validation.",
    ],
    "answer": [
        "Supervised learning trains on labeled examples.",
        "Cross-validation evaluates a model on multiple held-out splits.",
    ],
})

results = df.llm_as_judge(
    "Rate the accuracy and completeness of {answer} for {question} "
    "from 1 to 10. Return only the score.",
    n_trials=2,
)

print(results)
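Without response_format, each judge output is a string, so scores usually need parsing before they can be aggregated across trials. The following is a minimal sketch, not part of LOTUS: the rows stand in for results.to_dict("records"), their values are made up for illustration rather than real model output, and parse_score and mean_score are hypothetical helpers.

```python
import re
from statistics import mean

# Stand-in for results.to_dict("records"); values are illustrative only,
# not real model output.
rows = [
    {"question": "Explain supervised learning.", "_judge_0": "8", "_judge_1": "Score: 7"},
    {"question": "Explain cross-validation.", "_judge_0": "9", "_judge_1": "n/a"},
]


def parse_score(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", str(text))
    return int(match.group()) if match else None


def mean_score(row, suffix="_judge", n_trials=2):
    """Average the parseable scores across trial columns; None if none parse."""
    scores = [parse_score(row[f"{suffix}_{trial}"]) for trial in range(n_trials)]
    valid = [score for score in scores if score is not None]
    return mean(valid) if valid else None


for row in rows:
    print(row["question"], mean_score(row))
```

Trials that fail to parse are skipped rather than counted as zero, so one malformed response does not drag the average down.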

Output Columns

For each trial, LOTUS adds one output column named {suffix}_{trial}. The default suffix is _judge, so the first trial is _judge_0.

Set return_raw_outputs=True to add raw_output{suffix}_{trial}. Set return_explanations=True to add explanation{suffix}_{trial}.
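The naming scheme above can be sketched as a small helper. expected_columns is hypothetical and not a LOTUS API; it only reproduces the naming pattern described here, and the order in which LOTUS actually appends the columns is an assumption.

```python
def expected_columns(suffix="_judge", n_trials=1,
                     return_raw_outputs=False, return_explanations=False):
    """List the output column names described above (hypothetical helper)."""
    columns = []
    for trial in range(n_trials):
        # One judged-output column per trial.
        columns.append(f"{suffix}_{trial}")
        if return_raw_outputs:
            columns.append(f"raw_output{suffix}_{trial}")
        if return_explanations:
            columns.append(f"explanation{suffix}_{trial}")
    return columns


print(expected_columns(n_trials=2))
print(expected_columns(suffix="_evaluation", return_explanations=True))
```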

Structured Output

Pass a Pydantic model as response_format when you want structured judge outputs.

from pydantic import BaseModel, Field

class Evaluation(BaseModel):
    score: int = Field(description="Score from 1 to 10")
    reasoning: str = Field(description="Reason for the score")

results = df.llm_as_judge(
    "Evaluate {answer} for {question}.",
    response_format=Evaluation,
    suffix="_evaluation",
)

first = results.loc[0, "_evaluation_0"]
print(first.score)
print(first.reasoning)

response_format is not supported with ReasoningStrategy.COT or ReasoningStrategy.ZS_COT. Put reasoning fields in the structured output model instead.

Few-Shot Examples

Pass examples with the same input columns and an Answer column.

examples = pd.DataFrame({
    "question": ["What is supervised learning?"],
    "answer": ["It uses labeled examples to train a model."],
    "Answer": ["9"],
})

results = df.llm_as_judge(
    "Rate {answer} for {question} from 1 to 10.",
    examples=examples,
)

If you use ReasoningStrategy.COT with examples, include a Reasoning column in the examples DataFrame.
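A quick check before calling the judge can catch a malformed examples table early. This is a sketch under the column requirements stated above; check_examples is a hypothetical helper, not something LOTUS provides, and it takes plain lists of column names (for instance list(examples.columns)).

```python
def check_examples(example_columns, input_columns, use_cot=False):
    """Return a list of problems found in a few-shot examples table."""
    problems = []
    # Examples should carry the same input columns as the judged DataFrame.
    for column in input_columns:
        if column not in example_columns:
            problems.append(f"missing input column: {column}")
    # The Answer column holds the expected judge output for each example.
    if "Answer" not in example_columns:
        problems.append("missing required column: Answer")
    # With ReasoningStrategy.COT, examples also need a Reasoning column.
    if use_cot and "Reasoning" not in example_columns:
        problems.append("missing column for COT: Reasoning")
    return problems


print(check_examples(["question", "answer", "Answer"], ["question", "answer"]))
print(check_examples(["question", "answer", "Answer"], ["question", "answer"], use_cot=True))
```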

Extra Context Columns

extra_cols_to_include lets you include columns in the judge input even when they are not referenced directly in the instruction.

results = df.llm_as_judge(
    "Evaluate the answer: {answer}",
    extra_cols_to_include=["question"],
)
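To make the effect of extra_cols_to_include concrete, the sketch below lists the columns a judge input would draw on: those referenced as {column} placeholders, plus the extras. It relies only on the standard library's string.Formatter. judge_input_columns is a hypothetical helper, and LOTUS may resolve columns differently internally.

```python
from string import Formatter


def judge_input_columns(instruction, extra_cols_to_include=None):
    """Columns referenced in the instruction, followed by any extra columns."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    referenced = [name for _, name, _, _ in Formatter().parse(instruction) if name]
    extras = [col for col in (extra_cols_to_include or []) if col not in referenced]
    return referenced + extras


print(judge_input_columns("Evaluate the answer: {answer}", ["question"]))
```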

Parameters

DataFrame.llm_as_judge(
    judge_instruction,
    response_format=None,
    n_trials=1,
    system_prompt=None,
    postprocessor=map_postprocess,
    return_raw_outputs=False,
    return_explanations=False,
    suffix="_judge",
    examples=None,
    cot_reasoning=None,
    strategy=None,
    extra_cols_to_include=None,
    safe_mode=False,
    progress_bar_desc="Evaluating",
    **model_kwargs,
)
  • judge_instruction: Natural language judge instruction.

  • response_format: Optional Pydantic model for structured output.

  • n_trials: Number of independent judge trials.

  • system_prompt: Optional system prompt for the judge.

  • postprocessor: Function that parses raw model outputs.

  • return_raw_outputs: Include raw model text columns.

  • return_explanations: Include explanation columns.

  • suffix: Base suffix for output columns.

  • examples: Few-shot examples with an Answer column.

  • cot_reasoning: Chain-of-thought reasoning strings for the few-shot examples, for direct function use.

  • strategy: Optional reasoning strategy.

  • extra_cols_to_include: Extra columns to include in judge inputs.

  • safe_mode: Estimate cost before execution.

  • progress_bar_desc: Progress bar label.

  • model_kwargs: Extra keyword arguments passed to the LM.

API Reference

class lotus.evals.llm_as_judge.LLMAsJudgeDataframe(pandas_obj: DataFrame)

Bases: object

Judge the given docs based on the judging criteria, context and grading scale.

Parameters:
  • judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.

  • response_format (BaseModel | None) – The response format for the judge. If None, the judge will return a string. Defaults to None.

  • n_trials (int) – The number of trials to run. Defaults to 1.

  • system_prompt (str | None, optional) – The system prompt to use.

  • postprocessor (Callable, optional) – A function to post-process the model outputs. Should take (outputs, model, use_cot) and return SemanticMapPostprocessOutput. Defaults to map_postprocess.

  • return_raw_outputs (bool, optional) – Whether to return the raw outputs of the model. Defaults to False.

  • return_explanations (bool, optional) – Whether to return the explanations of the model. Defaults to False.

  • suffix (str, optional) – The suffix for the output column names. Defaults to “_judge”.

  • examples (pd.DataFrame | None, optional) – Example DataFrame for few-shot learning. Should have the same column structure as the input DataFrame plus an “Answer” column. Defaults to None.

  • strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.

  • extra_cols_to_include (list[str] | None, optional) – Extra columns to include in the input for judge. Defaults to None.

  • safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.

  • progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.

  • **model_kwargs (Any) – Additional keyword arguments to pass to the model.

Returns:

A DataFrame containing the original data plus the judged outputs. Additional columns will be added for explanations and raw outputs if requested.

Return type:

pd.DataFrame

Raises:

ValueError – If the language model is not configured, if specified columns don’t exist in the DataFrame, or if the examples DataFrame doesn’t have the required “Answer” column.

__call__(judge_instruction: str, response_format: BaseModel | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], LM, bool], SemanticMapPostprocessOutput] = map_postprocess, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, extra_cols_to_include: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', **model_kwargs: Any) -> DataFrame

Call self as a function.

lotus.evals.llm_as_judge.llm_as_judge(docs: list[dict[str, Any]], model: LM, judge_instruction: str, response_format: BaseModel | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], LM, bool], SemanticMapPostprocessOutput] = map_postprocess, examples_multimodal_data: list[dict[str, Any]] | None = None, examples_answers: list[str] | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', **model_kwargs: Any) -> list[SemanticMapOutput | list[BaseModel]]

Judge the given docs based on the judging criteria, context and grading scale.

Parameters:
  • docs (list[dict[str, Any]]) – The list of documents to judge. Each document should be a dictionary containing multimodal information (text, images, etc.).

  • model (lotus.models.LM) – The language model instance to use for judging. Must be properly configured with appropriate API keys and settings.

  • judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.

  • response_format (BaseModel | None) – The response format for the judge. If None, the judge will return a string. Defaults to None.

  • n_trials (int) – The number of trials to run. Defaults to 1.

  • system_prompt (str | None, optional) – The system prompt to use.

  • postprocessor (Callable, optional) – A function to post-process the model outputs. Should take (outputs, model, use_cot) and return SemanticMapPostprocessOutput. Defaults to map_postprocess.

  • examples_multimodal_data (list[dict[str, Any]] | None, optional) – Example documents for few-shot learning. Each example should have the same structure as the input docs. Defaults to None.

  • examples_answers (list[str] | None, optional) – Expected outputs for the example documents. Should have the same length as examples_multimodal_data. Defaults to None.

  • cot_reasoning (list[str] | None, optional) – Chain-of-thought reasoning for the example documents. Used when strategy includes COT reasoning. Defaults to None.

  • strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.

  • safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.

  • progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.

  • **model_kwargs (Any) – Additional keyword arguments to pass to the model.

Returns:

The output of the judge. Will be of shape (n_trials, n_docs).

Return type:

list[SemanticMapOutput | list[BaseModel]]
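
Because the result is organized trial-first, per-document analysis often needs the opposite layout. The sketch below regroups a (n_trials, n_docs) structure by document. It uses plain nested lists of made-up strings as a stand-in: the real elements are SemanticMapOutput objects or lists of BaseModel instances, and how to extract their outputs is not covered here.

```python
# Stand-in for parsed judge outputs with shape (n_trials, n_docs);
# the values are illustrative only.
by_trial = [
    ["8", "9", "6"],  # trial 0: one output per document
    ["7", "9", "5"],  # trial 1
]

# zip(*...) transposes the nesting to shape (n_docs, n_trials).
by_doc = [list(doc_outputs) for doc_outputs in zip(*by_trial)]

for doc_index, outputs in enumerate(by_doc):
    print(doc_index, outputs)
```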