Evaluation Advanced Features
This page collects evaluation features that apply across the evaluation suite.
Reasoning Strategies
Use ReasoningStrategy.COT (chain-of-thought) or ReasoningStrategy.ZS_COT
(zero-shot chain-of-thought) when you want the judge to reason step by step
before giving its verdict.
from lotus.types import ReasoningStrategy

results = df.llm_as_judge(
    "Evaluate the quality of {answer} for {question}.",
    strategy=ReasoningStrategy.COT,
    return_explanations=True,
)
Reasoning strategies cannot be combined with response_format in
llm_as_judge. For structured outputs with reasoning, add a reasoning field
to the Pydantic response model and do not set a CoT strategy.
Structured Output
llm_as_judge accepts a Pydantic response_format.
from pydantic import BaseModel, Field

class SafetyResult(BaseModel):
    is_safe: bool = Field(description="Whether the content is safe")
    risk_level: str = Field(description="low, medium, or high")
    reasoning: str = Field(description="Explanation for the decision")

results = df.llm_as_judge(
    "Evaluate whether {content} is safe for a general audience.",
    response_format=SafetyResult,
)
Few-Shot Examples
Both evaluation accessors accept an examples DataFrame of few-shot
demonstrations. It should contain the same input columns as the evaluated
DataFrame plus an Answer column with the expected judgment.
import pandas as pd

# One few-shot row: the input columns plus the expected judgment in "Answer".
examples = pd.DataFrame({
    "question": ["What is gradient descent?"],
    "answer": ["An optimization method that follows the loss gradient."],
    "Answer": ["9"],
})

results = df.llm_as_judge(
    "Rate {answer} for {question} from 1 to 10.",
    examples=examples,
)
If the examples are used with ReasoningStrategy.COT, include a
Reasoning column.
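The column requirements above can be checked before calling the judge. The following is a minimal sketch, not part of LOTUS: missing_example_columns is a hypothetical helper, and it assumes the input columns are exactly the placeholders named in the prompt template.

```python
from string import Formatter


def missing_example_columns(template, example_columns, uses_cot=False):
    """Return the required few-shot columns that are absent, sorted.

    Hypothetical helper for illustration; it is not a LOTUS function.
    """
    # Assumption: placeholders such as {question} name the input columns.
    required = {name for _, name, _, _ in Formatter().parse(template) if name}
    required.add("Answer")
    if uses_cot:
        # CoT examples additionally need a Reasoning column.
        required.add("Reasoning")
    return sorted(required - set(example_columns))


template = "Rate {answer} for {question} from 1 to 10."
columns = ["question", "answer", "Answer"]

print(missing_example_columns(template, columns))                 # []
print(missing_example_columns(template, columns, uses_cot=True))  # ['Reasoning']
```

With a pandas DataFrame, list(examples.columns) can be passed as example_columns.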
Custom System Prompts
Use system_prompt to set judge role, rubric context, or domain expertise.
results = df.llm_as_judge(
    "Evaluate {answer} for {question}.",
    system_prompt=(
        "You are an expert computer science instructor. "
        "Grade for correctness, completeness, and clarity."
    ),
)
Pairwise Cascades
pairwise_judge supports filter cascades through cascade_args and
helper_examples. This routes confident comparisons through a helper model
and sends uncertain comparisons to the main LM.
from lotus.types import CascadeArgs

cascade_args = CascadeArgs(
    recall_target=0.9,
    precision_target=0.9,
    sampling_percentage=0.5,
    failure_probability=0.2,
)

results, stats = df.pairwise_judge(
    "model_a",
    "model_b",
    "Which response better answers {question}?",
    cascade_args=cascade_args,
    return_stats=True,
)
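The routing idea behind a cascade can be sketched in plain Python. This is a conceptual illustration only, not the LOTUS implementation: route_comparisons, the helper scores, and the fixed low and high thresholds are all assumptions, and how LOTUS derives its thresholds from cascade_args is not shown here.

```python
def route_comparisons(helper_scores, low, high):
    """Split comparisons by helper-model confidence.

    helper_scores: hypothetical probabilities that response A is better.
    Scores >= high or <= low count as confident and are decided directly;
    everything in between is deferred to the main LM.
    """
    decided, deferred = {}, []
    for idx, score in enumerate(helper_scores):
        if score >= high:
            decided[idx] = "A"
        elif score <= low:
            decided[idx] = "B"
        else:
            deferred.append(idx)
    return decided, deferred


decided, deferred = route_comparisons([0.97, 0.52, 0.04, 0.71], low=0.1, high=0.9)
print(decided)   # {0: 'A', 2: 'B'}
print(deferred)  # [1, 3]
```

Only the deferred indices would need a call to the main LM, which is where a cascade saves cost.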
Cache Isolation
Evaluation trials disable LOTUS operator caching while the judge calls run. This prevents repeated trials from returning cached judgments. LOTUS restores the original cache setting after evaluation completes.
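The disable-then-restore behavior can be pictured as a context manager. This is a minimal sketch under stated assumptions, not LOTUS internals: Settings, its enable_cache attribute, and cache_disabled are hypothetical names used only for illustration.

```python
from contextlib import contextmanager


class Settings:
    """Hypothetical stand-in for a global settings object."""

    def __init__(self, enable_cache=True):
        self.enable_cache = enable_cache


@contextmanager
def cache_disabled(settings):
    """Turn caching off inside the block, then restore the original value."""
    original = settings.enable_cache
    settings.enable_cache = False
    try:
        yield settings
    finally:
        # Restore the previous setting even if a judge call raises.
        settings.enable_cache = original


settings = Settings(enable_cache=True)
with cache_disabled(settings):
    inside = settings.enable_cache  # caching is off while trials run
after = settings.enable_cache       # original setting is back
```

Restoring in a finally clause means a failed trial does not leave caching switched off for later operators.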