LLM as judge
============

``llm_as_judge`` evaluates each row with a natural language judge instruction.
Use column references such as ``{answer}`` and ``{question}`` in the instruction.

Basic Usage
-----------

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

    df = pd.DataFrame({
        "question": [
            "Explain supervised learning.",
            "Explain cross-validation.",
        ],
        "answer": [
            "Supervised learning trains on labeled examples.",
            "Cross-validation evaluates a model on multiple held-out splits.",
        ],
    })

    results = df.llm_as_judge(
        "Rate the accuracy and completeness of {answer} for {question} "
        "from 1 to 10. Return only the score.",
        n_trials=2,
    )
    print(results)

Output Columns
--------------

For each trial, LOTUS adds one output column named ``{suffix}_{trial}``.
The default suffix is ``_judge``, so the first trial is ``_judge_0``.

Set ``return_raw_outputs=True`` to add ``raw_output{suffix}_{trial}``.
Set ``return_explanations=True`` to add ``explanation{suffix}_{trial}``.

Structured Output
-----------------

Pass a Pydantic model as ``response_format`` when you want structured judge outputs.

.. code-block:: python

    from pydantic import BaseModel, Field

    class Evaluation(BaseModel):
        score: int = Field(description="Score from 1 to 10")
        reasoning: str = Field(description="Reason for the score")

    results = df.llm_as_judge(
        "Evaluate {answer} for {question}.",
        response_format=Evaluation,
        suffix="_evaluation",
    )

    first = results.loc[0, "_evaluation_0"]
    print(first.score)
    print(first.reasoning)

``response_format`` is not supported with ``ReasoningStrategy.COT`` or
``ReasoningStrategy.ZS_COT``. Put reasoning fields in the structured output
model instead.

Few-Shot Examples
-----------------

Pass examples with the same input columns and an ``Answer`` column.

.. code-block:: python

    examples = pd.DataFrame({
        "question": ["What is supervised learning?"],
        "answer": ["It uses labeled examples to train a model."],
        "Answer": ["9"],
    })

    results = df.llm_as_judge(
        "Rate {answer} for {question} from 1 to 10.",
        examples=examples,
    )

If you use ``ReasoningStrategy.COT`` with examples, include a ``Reasoning``
column in the examples DataFrame.

Extra Context Columns
---------------------

``extra_cols_to_include`` lets you include columns in the judge input even
when they are not referenced directly in the instruction.

.. code-block:: python

    results = df.llm_as_judge(
        "Evaluate the answer: {answer}",
        extra_cols_to_include=["question"],
    )

Parameters
----------

.. code-block:: python

    DataFrame.llm_as_judge(
        judge_instruction,
        response_format=None,
        n_trials=1,
        system_prompt=None,
        postprocessor=map_postprocess,
        return_raw_outputs=False,
        return_explanations=False,
        suffix="_judge",
        examples=None,
        cot_reasoning=None,
        strategy=None,
        extra_cols_to_include=None,
        safe_mode=False,
        progress_bar_desc="Evaluating",
        **model_kwargs,
    )

- ``judge_instruction``: Natural language judge instruction.
- ``response_format``: Optional Pydantic model for structured output.
- ``n_trials``: Number of independent judge trials.
- ``system_prompt``: Optional system prompt for the judge.
- ``postprocessor``: Function that parses raw model outputs.
- ``return_raw_outputs``: Include raw model text columns.
- ``return_explanations``: Include explanation columns.
- ``suffix``: Base suffix for output columns.
- ``examples``: Few-shot examples with an ``Answer`` column.
- ``cot_reasoning``: Reasoning strings for direct function use.
- ``strategy``: Optional reasoning strategy.
- ``extra_cols_to_include``: Extra columns to include in judge inputs.
- ``safe_mode``: Estimate cost before execution.
- ``progress_bar_desc``: Progress bar label.
- ``model_kwargs``: Extra keyword arguments passed to the LM.

API Reference
-------------

.. automodule:: lotus.evals.llm_as_judge
    :members:
    :undoc-members:
    :show-inheritance:
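
Averaging Trial Scores
----------------------

When ``n_trials`` is greater than one, the per-trial columns described under
Output Columns can be combined with plain pandas. The sketch below is
illustrative only: the ``results`` DataFrame is a hand-built stand-in for judge
output (it is not produced by LOTUS here), it assumes each trial cell holds a
bare numeric string, and the ``judge_mean`` and ``judge_spread`` column names
are hypothetical. The actual cell type depends on the model and the
``postprocessor`` in use.

.. code-block:: python

    import pandas as pd

    # Stand-in for the DataFrame returned by llm_as_judge with n_trials=2.
    # Assumption: each trial column holds a numeric score as a string.
    results = pd.DataFrame({
        "question": [
            "Explain supervised learning.",
            "Explain cross-validation.",
        ],
        "_judge_0": ["9", "7"],
        "_judge_1": ["8", "7"],
    })

    suffix = "_judge"
    n_trials = 2

    # Rebuild the per-trial column names from the documented pattern
    # {suffix}_{trial}, with trials numbered from 0.
    trial_cols = [f"{suffix}_{trial}" for trial in range(n_trials)]

    # Convert to numbers; cells that are not numeric become NaN.
    scores = results[trial_cols].apply(pd.to_numeric, errors="coerce")

    # Hypothetical summary columns: mean score and max-min spread per row.
    results["judge_mean"] = scores.mean(axis=1)
    results["judge_spread"] = scores.max(axis=1) - scores.min(axis=1)

    print(results[["question", "judge_mean", "judge_spread"]])

A large spread between trials can flag rows where the judge is inconsistent and
the score may deserve a manual look.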