Pairwise judge

pairwise_judge compares two columns row by row. It returns A when col1 is better and B when col2 is better.

Basic Usage

import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

df = pd.DataFrame({
    "question": [
        "Explain cross-validation in one sentence.",
        "Suggest a subject line for a 1:1 meeting.",
    ],
    "model_a": [
        "Cross-validation evaluates a model across multiple held-out splits.",
        "Meeting request.",
    ],
    "model_b": [
        "Cross-validation is when the model checks its answers.",
        "Requesting time for a 1:1 next week",
    ],
})

results = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    n_trials=2,
    permute_cols=True,
)

print(results)

Position Bias Mitigation

Set permute_cols=True to run half the trials as col1 versus col2 and half as col2 versus col1. n_trials must be even when permute_cols=True.

results = df.pairwise_judge(
    "model_a",
    "model_b",
    "Which response is more helpful for {question}?",
    n_trials=4,
    permute_cols=True,
)
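
The idea behind splitting trials evenly across both orders can be sketched in plain Python. This is an illustration, not the LOTUS implementation: plan_trials and to_original_label are hypothetical helpers, and this page does not say whether LOTUS relabels swapped trials internally. The sketch only shows why a judge that always prefers the first-listed response ends up with a tie once verdicts are mapped back to the original labels.

```python
from collections import Counter


def plan_trials(n_trials: int, permute_cols: bool) -> list:
    """Return one flag per trial; True means the two columns are shown swapped."""
    if not permute_cols:
        return [False] * n_trials
    if n_trials % 2 != 0:
        raise ValueError("n_trials must be even when permute_cols=True")
    half = n_trials // 2
    return [False] * half + [True] * half


def to_original_label(verdict: str, swapped: bool) -> str:
    """Map a verdict from a swapped trial back to col1=A, col2=B (assumed convention)."""
    if not swapped:
        return verdict
    return {"A": "B", "B": "A"}[verdict]


# Hypothetical raw verdicts from a purely position-biased judge:
# it always picks whichever response it sees first.
schedule = plan_trials(4, permute_cols=True)
raw_verdicts = ["A", "A", "A", "A"]

normalized = [to_original_label(v, s) for v, s in zip(raw_verdicts, schedule)]
tally = Counter(normalized)

print(schedule)     # [False, False, True, True]
print(normalized)   # ['A', 'A', 'B', 'B']
print(dict(tally))  # {'A': 2, 'B': 2}
```

With an odd n_trials the two orders cannot be balanced, which is consistent with the even-count requirement stated above.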

Output Columns

For each trial, LOTUS adds one output column named {suffix}_{trial}. The default suffix is _judge.

Set return_raw_outputs=True to include raw model outputs. Set return_explanations=True to include explanations.
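
When several trials are run, the per-trial columns usually need to be reduced to one verdict per row. The following is a minimal pandas sketch of one way to do that. The results frame is hand-built, and the trial column names (_judge_0 and so on) are assumptions made for illustration; check the columns of your actual output before relying on them.

```python
import pandas as pd

# Hand-built stand-in for a pairwise_judge result with four trials.
# The trial column names are illustrative assumptions.
results = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "_judge_0": ["A", "B", "A"],
    "_judge_1": ["A", "B", "B"],
    "_judge_2": ["A", "B", "A"],
    "_judge_3": ["B", "B", "B"],
})

suffix = "_judge"
trial_cols = [c for c in results.columns if c.startswith(suffix)]

# Fraction of trials in which col1 (label A) won, per row.
a_share = results[trial_cols].eq("A").mean(axis=1)


def verdict(share: float) -> str:
    """Majority vote over trials; an even split is reported as a tie."""
    if share > 0.5:
        return "A"
    if share < 0.5:
        return "B"
    return "tie"


summary = results[["question"]].assign(a_share=a_share, winner=a_share.map(verdict))
print(summary)
```

If raw-output or explanation columns are requested and share the same prefix, exclude them from trial_cols before counting.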

Cascade Mode

pairwise_judge is implemented through semantic filtering and supports filter cascade options for lower-cost comparisons.

from lotus.types import CascadeArgs

cascade_args = CascadeArgs(
    recall_target=0.9,
    precision_target=0.9,
    sampling_percentage=0.5,
    failure_probability=0.2,
)

results, stats = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    cascade_args=cascade_args,
    return_stats=True,
)

When return_stats=True, the result is (DataFrame, stats).
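
Because the return type depends on return_stats, calling code that toggles the flag may want to normalize both shapes. A small sketch follows, assuming only what is stated above (a DataFrame, or a (DataFrame, stats) tuple). split_result is a hypothetical helper, and the stats key used in the demo is a placeholder, not a documented LOTUS field.

```python
from typing import Any, Dict, Tuple, Union

import pandas as pd

JudgeOutput = Union[pd.DataFrame, Tuple[pd.DataFrame, Dict[str, Any]]]


def split_result(out: JudgeOutput) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Always return (DataFrame, stats); stats is empty when none were returned."""
    if isinstance(out, tuple):
        frame, stats = out
        return frame, stats
    return out, {}


# Stand-ins for the two possible return shapes.
frame = pd.DataFrame({"model_a": ["x"], "model_b": ["y"]})
placeholder_stats = {"example_key": 1}  # placeholder, not a real LOTUS stats field

df_only, no_stats = split_result(frame)
df_with, some_stats = split_result((frame, placeholder_stats))

print(no_stats)    # {}
print(some_stats)  # {'example_key': 1}
```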

Parameters

DataFrame.pairwise_judge(
    col1,
    col2,
    judge_instruction,
    n_trials=1,
    permute_cols=False,
    system_prompt=None,
    return_raw_outputs=False,
    return_explanations=False,
    default_to_col1=True,
    suffix="_judge",
    examples=None,
    helper_examples=None,
    strategy=None,
    cascade_args=None,
    return_stats=False,
    safe_mode=False,
    progress_bar_desc="Evaluating",
    additional_cot_instructions="",
    **model_kwargs,
)

col1: First response column. Results map this column to A.
col2: Second response column. Results map this column to B.
judge_instruction: Natural language comparison criteria.
n_trials: Number of comparison trials.
permute_cols: Run both response orders to reduce position bias.
system_prompt: Optional system prompt for the judge.
return_raw_outputs: Include raw model text columns.
return_explanations: Include explanation columns.
default_to_col1: Default decision when parsing is uncertain.
suffix: Base suffix for output columns.
examples: Few-shot examples for the main judge.
helper_examples: Few-shot examples for the helper LM in cascade mode.
strategy: Optional reasoning strategy.
cascade_args: Optional filter cascade configuration.
return_stats: Return cascade statistics with the DataFrame.
safe_mode: Estimate cost before execution.
progress_bar_desc: Progress bar label.
additional_cot_instructions: Extra CoT instructions for sem-filter mode.
model_kwargs: Extra keyword arguments passed to the LM.

API Reference

class lotus.evals.pairwise_judge.PairwiseJudgeDataframe(pandas_obj: Any)

Bases: object

Judge the given DataFrame’s col1 and col2 against each other, based on the judging criteria, context, and grading scale.

Parameters:

col1 (str) – The name of the first column to judge.
col2 (str) – The name of the second column to judge.
judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.
n_trials (int) – The number of trials to run. Defaults to 1.
permute_cols (bool) – Whether to run half of the trials with the column order swapped to reduce position bias. Requires an even n_trials. Defaults to False.
system_prompt (str | None, optional) – The system prompt to use.
return_raw_outputs (bool, optional) – Whether to return the raw outputs of the model. Defaults to False.
return_explanations (bool, optional) – Whether to return the explanations of the model. Defaults to False.
suffix (str, optional) – The suffix for the output column names. Defaults to “_judge”.
examples (pd.DataFrame | None, optional) – Example DataFrame for few-shot learning. Should have the same column structure as the input DataFrame plus an “Answer” column. Defaults to None.
strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.
safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.
progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.
default_to_col1 (bool, optional) – [sem_filter mode only] The default filter decision when the model is uncertain. Defaults to True.
helper_examples (pd.DataFrame | None, optional) – [sem_filter mode only] Example DataFrame for the helper LM in cascade filtering. Defaults to None.
cascade_args (CascadeArgs | None, optional) – [sem_filter mode only] Arguments for cascade filtering to reduce cost via a proxy model. Defaults to None.
return_stats (bool, optional) – [sem_filter mode only] Whether to return a stats dict alongside the DataFrame as a (DataFrame, stats) tuple. Defaults to False.
additional_cot_instructions (str, optional) – [sem_filter mode only] Extra instructions appended to the chain-of-thought prompt. Defaults to “”.
**model_kwargs (Any) – Additional keyword arguments to pass to the model.

Returns:

A DataFrame containing the original data plus the judged outputs. When return_stats=True, returns a (DataFrame, stats_dict) tuple. Additional columns are added for explanations and raw outputs if requested.

Return type:

pd.DataFrame | tuple[pd.DataFrame, dict]

Raises:

ValueError – If the language model is not configured, if specified columns don’t exist in the DataFrame, or if the examples DataFrame doesn’t have the required “Answer” column.
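
The column and examples conditions listed above can also be checked up front, before any model calls are made. The sketch below is a hypothetical pre-flight helper written for illustration. It is not part of LOTUS, its error messages are invented, and it does not cover the language-model configuration check.

```python
import pandas as pd


def check_judge_inputs(df, col1, col2, examples=None) -> None:
    """Raise ValueError for the input problems described in the Raises section."""
    missing = [c for c in (col1, col2) if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
    if examples is not None and "Answer" not in examples.columns:
        raise ValueError('examples must contain an "Answer" column')


df = pd.DataFrame({"question": ["q"], "model_a": ["x"], "model_b": ["y"]})

# Passes silently: both columns exist and no examples are supplied.
check_judge_inputs(df, "model_a", "model_b")

# A misspelled column name is caught before any LM call would be made.
try:
    check_judge_inputs(df, "model_a", "model_c")
except ValueError as err:
    print(err)  # Columns not found in DataFrame: ['model_c']
```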

__call__(col1: str, col2: str, judge_instruction: str, n_trials: int = 1, permute_cols: bool = False, system_prompt: str | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, default_to_col1: bool = True, suffix: str = '_judge', examples: DataFrame | None = None, helper_examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', additional_cot_instructions: str = '', **model_kwargs: Any) → DataFrame | tuple[DataFrame, dict[str, Any]]

Call self as a function.