LLM as judge
llm_as_judge evaluates each DataFrame row against a natural-language judge
instruction. Reference columns in the instruction with curly braces, such as
{answer} and {question}.
Basic Usage
import pandas as pd
import lotus
from lotus.models import LM

# Configure the language model used as the judge.
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

df = pd.DataFrame({
    "question": [
        "Explain supervised learning.",
        "Explain cross-validation.",
    ],
    "answer": [
        "Supervised learning trains on labeled examples.",
        "Cross-validation evaluates a model on multiple held-out splits.",
    ],
})

# Judge every row twice; each trial gets its own output column.
results = df.llm_as_judge(
    "Rate the accuracy and completeness of {answer} for {question} "
    "from 1 to 10. Return only the score.",
    n_trials=2,
)
print(results)
Output Columns
For each trial, LOTUS adds one output column named {suffix}_{trial}.
The default suffix is _judge, so the first trial is _judge_0.
Set return_raw_outputs=True to add raw_output{suffix}_{trial}.
Set return_explanations=True to add explanation{suffix}_{trial}.
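The naming scheme above can be sketched as a small helper that lists the columns to expect for a given call. This is only an illustration of the templates stated in this section: expected_judge_columns is a hypothetical name, not part of LOTUS, and the order in which LOTUS actually appends the columns is an assumption.

```python
def expected_judge_columns(
    n_trials: int = 1,
    suffix: str = "_judge",
    return_raw_outputs: bool = False,
    return_explanations: bool = False,
) -> list[str]:
    """List the column names described above (hypothetical helper, not a LOTUS API)."""
    columns: list[str] = []
    for trial in range(n_trials):  # trials are numbered from 0
        columns.append(f"{suffix}_{trial}")
        if return_raw_outputs:
            columns.append(f"raw_output{suffix}_{trial}")
        if return_explanations:
            columns.append(f"explanation{suffix}_{trial}")
    return columns


print(expected_judge_columns(n_trials=2))
```

With the defaults and n_trials=2, the helper lists _judge_0 and _judge_1, matching the Basic Usage call.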
Structured Output
Pass a Pydantic model as response_format when you want structured judge
outputs.
from pydantic import BaseModel, Field

class Evaluation(BaseModel):
    score: int = Field(description="Score from 1 to 10")
    reasoning: str = Field(description="Reason for the score")

results = df.llm_as_judge(
    "Evaluate {answer} for {question}.",
    response_format=Evaluation,
    suffix="_evaluation",
)

# Each cell in the output column holds an Evaluation instance.
first = results.loc[0, "_evaluation_0"]
print(first.score)
print(first.reasoning)
response_format is not supported with ReasoningStrategy.COT or
ReasoningStrategy.ZS_COT. Put reasoning fields in the structured output
model instead.
Few-Shot Examples
Pass examples with the same input columns and an Answer column.
examples = pd.DataFrame({
    "question": ["What is supervised learning?"],
    "answer": ["It uses labeled examples to train a model."],
    "Answer": ["9"],
})

results = df.llm_as_judge(
    "Rate {answer} for {question} from 1 to 10.",
    examples=examples,
)
If you use ReasoningStrategy.COT with examples, include a Reasoning
column in the examples DataFrame.
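A quick pre-flight check can catch a malformed examples DataFrame before any model call is made. The sketch below works on a plain list of column names (for example list(examples.columns)) and encodes only the two requirements stated above. missing_example_columns is a hypothetical helper, not a LOTUS function.

```python
def missing_example_columns(example_columns, input_columns, use_cot=False):
    """Return the required example columns that are absent (hypothetical helper)."""
    # Examples need the same input columns plus an "Answer" column.
    required = list(input_columns) + ["Answer"]
    # With chain-of-thought examples, a "Reasoning" column is needed as well.
    if use_cot:
        required.append("Reasoning")
    return [col for col in required if col not in example_columns]


missing = missing_example_columns(
    ["question", "answer", "Answer"],
    ["question", "answer"],
    use_cot=True,
)
print(missing)
```

An empty list means the examples satisfy the requirements described in this section.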
Extra Context Columns
extra_cols_to_include lets you include columns in the judge input even
when they are not referenced directly in the instruction.
results = df.llm_as_judge(
    "Evaluate the answer: {answer}",
    extra_cols_to_include=["question"],
)
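To see which columns a call like this refers to, the placeholders can be pulled out of the instruction with the standard library. The sketch below only illustrates the idea of "referenced columns plus extra columns". judge_input_columns is a hypothetical helper, and how LOTUS itself resolves columns is not shown here and may differ.

```python
from collections.abc import Sequence
from string import Formatter


def judge_input_columns(instruction: str, extra_cols: Sequence[str] = ()) -> list[str]:
    """Return placeholder names in order of first appearance, then any extra columns."""
    columns: list[str] = []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    for _, field_name, _, _ in Formatter().parse(instruction):
        if field_name and field_name not in columns:
            columns.append(field_name)
    # Extra columns are appended even though the instruction never names them.
    for col in extra_cols:
        if col not in columns:
            columns.append(col)
    return columns


print(judge_input_columns("Evaluate the answer: {answer}", ["question"]))
```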
Parameters
DataFrame.llm_as_judge(
    judge_instruction,
    response_format=None,
    n_trials=1,
    system_prompt=None,
    postprocessor=map_postprocess,
    return_raw_outputs=False,
    return_explanations=False,
    suffix="_judge",
    examples=None,
    cot_reasoning=None,
    strategy=None,
    extra_cols_to_include=None,
    safe_mode=False,
    progress_bar_desc="Evaluating",
    **model_kwargs,
)
- judge_instruction: Natural language judge instruction.
- response_format: Optional Pydantic model for structured output.
- n_trials: Number of independent judge trials.
- system_prompt: Optional system prompt for the judge.
- postprocessor: Function that parses raw model outputs.
- return_raw_outputs: Include raw model text columns.
- return_explanations: Include explanation columns.
- suffix: Base suffix for output columns.
- examples: Few-shot examples with an Answer column.
- cot_reasoning: Reasoning strings for direct function use.
- strategy: Optional reasoning strategy.
- extra_cols_to_include: Extra columns to include in judge inputs.
- safe_mode: Estimate cost before execution.
- progress_bar_desc: Progress bar label.
- model_kwargs: Extra keyword arguments passed to the LM.
API Reference
- class lotus.evals.llm_as_judge.LLMAsJudgeDataframe(pandas_obj: DataFrame)
Bases: object
Judge the given docs based on the judging criteria, context and grading scale.
- Parameters:
judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.
response_format (BaseModel | None) – The response format for the judge. If None, the judge will return a string. Defaults to None.
n_trials (int) – The number of trials to run. Defaults to 1.
system_prompt (str | None, optional) – The system prompt to use.
postprocessor (Callable, optional) – A function to post-process the model outputs. Should take (outputs, model, use_cot) and return SemanticMapPostprocessOutput. Defaults to map_postprocess.
return_raw_outputs (bool, optional) – Whether to return the raw outputs of the model. Defaults to False.
return_explanations (bool, optional) – Whether to return the explanations of the model. Defaults to False.
suffix (str, optional) – The suffix for the output column names. Defaults to “_judge”.
examples (pd.DataFrame | None, optional) – Example DataFrame for few-shot learning. Should have the same column structure as the input DataFrame plus an “Answer” column. Defaults to None.
strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.
extra_cols_to_include (list[str] | None, optional) – Extra columns to include in the input for judge. Defaults to None.
safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.
progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.
**model_kwargs (Any) – Additional keyword arguments to pass to the model.
- Returns:
A DataFrame containing the original data plus the judged outputs. Additional columns are added for explanations and raw outputs if requested.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the language model is not configured, if specified columns don’t exist in the DataFrame, or if the examples DataFrame doesn’t have the required “Answer” column.
- __call__(judge_instruction: str, response_format: BaseModel | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], LM, bool], SemanticMapPostprocessOutput] = map_postprocess, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, extra_cols_to_include: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', **model_kwargs: Any) → DataFrame
Call self as a function.
- lotus.evals.llm_as_judge.llm_as_judge(docs: list[dict[str, Any]], model: LM, judge_instruction: str, response_format: BaseModel | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], LM, bool], SemanticMapPostprocessOutput] = map_postprocess, examples_multimodal_data: list[dict[str, Any]] | None = None, examples_answers: list[str] | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', **model_kwargs: Any) → list[SemanticMapOutput | list[BaseModel]]
Judge the given docs based on the judging criteria, context and grading scale.
- Parameters:
docs (list[dict[str, Any]]) – The list of documents to judge. Each document should be a dictionary containing multimodal information (text, images, etc.).
model (lotus.models.LM) – The language model instance to use for judging. Must be properly configured with appropriate API keys and settings.
judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.
response_format (BaseModel | None) – The response format for the judge. If None, the judge will return a string. Defaults to None.
n_trials (int) – The number of trials to run. Defaults to 1.
system_prompt (str | None, optional) – The system prompt to use.
postprocessor (Callable, optional) – A function to post-process the model outputs. Should take (outputs, model, use_cot) and return SemanticMapPostprocessOutput. Defaults to map_postprocess.
examples_multimodal_data (list[dict[str, Any]] | None, optional) – Example documents for few-shot learning. Each example should have the same structure as the input docs. Defaults to None.
examples_answers (list[str] | None, optional) – Expected outputs for the example documents. Should have the same length as examples_multimodal_data. Defaults to None.
cot_reasoning (list[str] | None, optional) – Chain-of-thought reasoning for the example documents. Used when strategy includes COT reasoning. Defaults to None.
strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.
safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.
progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.
**model_kwargs (Any) – Additional keyword arguments to pass to the model.
- Returns:
The output of the judge. Will be of shape (n_trials, n_docs).
- Return type:
list[SemanticMapOutput | list[BaseModel]]
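Because the result is indexed by trial first and document second, per-document aggregation needs a transpose. The sketch below uses plain nested lists of score strings as a stand-in for the judge outputs; the real return type wraps them in SemanticMapOutput objects, whose fields are not shown here. parse_score and mean_score_per_doc are hypothetical helpers, and taking the first integer in a reply is a simplifying assumption.

```python
import re
from statistics import mean
from typing import Optional


def parse_score(text: str) -> Optional[int]:
    """Return the first integer found in a judge reply, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def mean_score_per_doc(scores_by_trial: list[list[str]]) -> list[Optional[float]]:
    """Average scores across trials; the input has shape (n_trials, n_docs)."""
    result: list[Optional[float]] = []
    # zip(*...) transposes the input to (n_docs, n_trials).
    for replies in zip(*scores_by_trial):
        parsed = [s for s in map(parse_score, replies) if s is not None]
        # Replies that contain no number are skipped rather than counted as zero.
        result.append(mean(parsed) if parsed else None)
    return result


# Two trials over two documents (made-up replies for illustration only).
trials = [["8", "9"], ["Score: 6", "n/a"]]
print(mean_score_per_doc(trials))
```

Skipping unparseable replies keeps one malformed trial from dragging a document's average down, at the cost of averaging over fewer trials for that document.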