Evaluation Advanced Features
============================

This page collects evaluation features that apply across the evaluation suite.

Reasoning Strategies
--------------------

Use ``ReasoningStrategy.COT`` or ``ReasoningStrategy.ZS_COT`` when you want chain-of-thought style reasoning from the judge.

.. code-block:: python

    from lotus.types import ReasoningStrategy

    results = df.llm_as_judge(
        "Evaluate the quality of {answer} for {question}.",
        strategy=ReasoningStrategy.COT,
        return_explanations=True,
    )

Reasoning strategies cannot be combined with ``response_format`` in ``llm_as_judge``. For structured outputs with reasoning, add a reasoning field to the Pydantic response model and do not set a CoT strategy.

Structured Output
-----------------

``llm_as_judge`` accepts a Pydantic ``response_format``.

.. code-block:: python

    from pydantic import BaseModel, Field

    class SafetyResult(BaseModel):
        is_safe: bool = Field(description="Whether the content is safe")
        risk_level: str = Field(description="low, medium, or high")
        reasoning: str = Field(description="Explanation for the decision")

    results = df.llm_as_judge(
        "Evaluate whether {content} is safe for a general audience.",
        response_format=SafetyResult,
    )

Few-Shot Examples
-----------------

Both evaluation accessors accept ``examples`` DataFrames. Include the same input columns as the evaluated DataFrame plus an ``Answer`` column.

.. code-block:: python

    import pandas as pd

    examples = pd.DataFrame({
        "question": ["What is gradient descent?"],
        "answer": ["An optimization method that follows the loss gradient."],
        "Answer": ["9"],
    })

    results = df.llm_as_judge(
        "Rate {answer} for {question} from 1 to 10.",
        examples=examples,
    )

If the examples are used with ``ReasoningStrategy.COT``, include a ``Reasoning`` column.

Custom System Prompts
---------------------

Use ``system_prompt`` to set the judge's role, rubric context, or domain expertise.

.. code-block:: python

    results = df.llm_as_judge(
        "Evaluate {answer} for {question}.",
        system_prompt=(
            "You are an expert computer science instructor. "
            "Grade for correctness, completeness, and clarity."
        ),
    )

Pairwise Cascades
-----------------

``pairwise_judge`` supports filter cascades through ``cascade_args`` and ``helper_examples``. This routes confident comparisons through a helper model and sends uncertain comparisons to the main LM.

.. code-block:: python

    from lotus.types import CascadeArgs

    cascade_args = CascadeArgs(
        recall_target=0.9,
        precision_target=0.9,
        sampling_percentage=0.5,
        failure_probability=0.2,
    )

    results, stats = df.pairwise_judge(
        "model_a",
        "model_b",
        "Which response better answers {question}?",
        cascade_args=cascade_args,
        return_stats=True,
    )

Cache Isolation
---------------

Evaluation trials disable LOTUS operator caching while the judge calls run. This prevents repeated trials from returning cached judgments. LOTUS restores the original cache setting after evaluation completes.
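
The save-and-restore behavior described above can be pictured with a small context manager. This is only a sketch of the general pattern, not the LOTUS implementation: ``Settings``, ``enable_cache``, ``cache_disabled``, and ``run_trials`` are hypothetical names introduced for illustration.

.. code-block:: python

    from contextlib import contextmanager
    from dataclasses import dataclass


    @dataclass
    class Settings:
        """Hypothetical stand-in for a global settings object (not the LOTUS API)."""

        enable_cache: bool = True


    @contextmanager
    def cache_disabled(settings):
        """Turn caching off for the duration of the block, then restore it."""
        original = settings.enable_cache  # remember the caller's setting
        settings.enable_cache = False
        try:
            yield settings
        finally:
            # Restore even if a judge call raises, so the caller's
            # configuration is never left modified.
            settings.enable_cache = original


    def run_trials(settings, judge, n_trials):
        """Call a judge function once per trial with caching disabled."""
        with cache_disabled(settings):
            return [judge(trial) for trial in range(n_trials)]

Restoring in a ``finally`` clause is the key design choice in this sketch: the original setting comes back whether the trials finish normally or fail partway through.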