Pairwise judge
pairwise_judge compares two columns row by row. It returns A when
col1 is better and B when col2 is better.
Basic Usage
import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

df = pd.DataFrame({
    "question": [
        "Explain cross-validation in one sentence.",
        "Suggest a subject line for a 1:1 meeting.",
    ],
    "model_a": [
        "Cross-validation evaluates a model across multiple held-out splits.",
        "Meeting request.",
    ],
    "model_b": [
        "Cross-validation is when the model checks its answers.",
        "Requesting time for a 1:1 next week",
    ],
})

results = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    n_trials=2,
    permute_cols=True,
)
print(results)
Position Bias Mitigation
Set permute_cols=True to run half the trials as col1 versus col2
and half as col2 versus col1. n_trials must be even when
permute_cols=True.
results = df.pairwise_judge(
    "model_a",
    "model_b",
    "Which response is more helpful for {question}?",
    n_trials=4,
    permute_cols=True,
)
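The idea behind permuting the columns can be sketched in plain Python. This is a minimal illustration, not the LOTUS implementation: the `judge` callable, the helper names, and the choice to raise `ValueError` for an odd `n_trials` are all assumptions made for the example. The point it shows is that verdicts from the swapped trials are flipped back, so every verdict still refers to the original columns.

```python
from typing import Callable, List

# Hypothetical judge signature: returns "A" if the first argument wins,
# "B" if the second wins. A real judge would call a language model.
Judge = Callable[[str, str], str]


def judge_with_permutation(resp1: str, resp2: str, judge: Judge, n_trials: int) -> List[str]:
    """Run half the trials in each order; report verdicts as A = resp1, B = resp2."""
    if n_trials <= 0 or n_trials % 2 != 0:
        # Assumption: an odd trial count is rejected outright.
        raise ValueError("n_trials must be a positive even number when permuting")

    verdicts = []
    for trial in range(n_trials):
        if trial < n_trials // 2:
            # Original order: the verdict already refers to resp1/resp2.
            verdicts.append(judge(resp1, resp2))
        else:
            # Swapped order: flip the verdict back to the original labels.
            swapped = judge(resp2, resp1)
            verdicts.append("B" if swapped == "A" else "A")
    return verdicts


def always_first(first: str, second: str) -> str:
    """A fully position-biased judge: whatever is shown first wins."""
    return "A"


def prefer_longer(first: str, second: str) -> str:
    """A position-independent judge: the longer response wins."""
    return "A" if len(first) >= len(second) else "B"


biased = judge_with_permutation("short", "a much longer answer", always_first, 4)
fair = judge_with_permutation("short", "a much longer answer", prefer_longer, 4)
print(biased)  # ['A', 'A', 'B', 'B']
print(fair)    # ['B', 'B', 'B', 'B']
```

A judge that only looks at position splits its votes evenly once the order is permuted, while a judge that looks at content gives the same verdict in both orders.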
Output Columns
For each trial, LOTUS adds one output column named {suffix}_{trial}.
The default suffix is _judge.
Set return_raw_outputs=True to include raw model outputs. Set
return_explanations=True to include explanations.
Cascade Mode
pairwise_judge is implemented through semantic filtering and supports
filter cascade options for lower-cost comparisons.
from lotus.types import CascadeArgs

cascade_args = CascadeArgs(
    recall_target=0.9,
    precision_target=0.9,
    sampling_percentage=0.5,
    failure_probability=0.2,
)

results, stats = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction="Which response better answers {question}?",
    cascade_args=cascade_args,
    return_stats=True,
)
When return_stats=True, the result is (DataFrame, stats).
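Since the return type depends on return_stats, calling code that toggles the flag may want to handle both shapes in one place. The helper below is a hypothetical convenience, not part of LOTUS; it is exercised here with stand-in values instead of a real DataFrame, and the stats key shown is a placeholder rather than a documented field.

```python
from typing import Any, Dict, Tuple


def unpack_judge_result(result: Any) -> Tuple[Any, Dict[str, Any]]:
    """Return (frame, stats) whether or not stats were requested."""
    if isinstance(result, tuple) and len(result) == 2:
        frame, stats = result
        return frame, dict(stats)
    # No stats were returned: pair the frame with an empty dict.
    return result, {}


# Stand-in values; "placeholder_key" is not a real LOTUS stats field.
with_stats = unpack_judge_result(("frame", {"placeholder_key": 1}))
without_stats = unpack_judge_result("frame")
print(with_stats)     # ('frame', {'placeholder_key': 1})
print(without_stats)  # ('frame', {})
```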
Parameters
DataFrame.pairwise_judge(
    col1,
    col2,
    judge_instruction,
    n_trials=1,
    permute_cols=False,
    system_prompt=None,
    return_raw_outputs=False,
    return_explanations=False,
    default_to_col1=True,
    suffix="_judge",
    examples=None,
    helper_examples=None,
    strategy=None,
    cascade_args=None,
    return_stats=False,
    safe_mode=False,
    progress_bar_desc="Evaluating",
    additional_cot_instructions="",
    **model_kwargs,
)
- col1: First response column. Results map this column to A.
- col2: Second response column. Results map this column to B.
- judge_instruction: Natural language comparison criteria.
- n_trials: Number of comparison trials.
- permute_cols: Run both response orders to reduce position bias.
- system_prompt: Optional system prompt for the judge.
- return_raw_outputs: Include raw model text columns.
- return_explanations: Include explanation columns.
- default_to_col1: Default decision when parsing is uncertain.
- suffix: Base suffix for output columns.
- examples: Few-shot examples for the main judge.
- helper_examples: Few-shot examples for the helper LM in cascade mode.
- strategy: Optional reasoning strategy.
- cascade_args: Optional filter cascade configuration.
- return_stats: Return cascade statistics with the DataFrame.
- safe_mode: Estimate cost before execution.
- progress_bar_desc: Progress bar label.
- additional_cot_instructions: Extra CoT instructions for sem-filter mode.
- model_kwargs: Extra keyword arguments passed to the LM.
API Reference
- class lotus.evals.pairwise_judge.PairwiseJudgeDataframe(pandas_obj: Any)
Bases: object
Judge the given df’s col1 and col2, based on the judging criteria, context and grading scale.
- Parameters:
col1 (str) – The name of the first column to judge.
col2 (str) – The name of the second column to judge.
judge_instruction (str) – The natural language instruction that guides the judging process. This instruction tells the model how to judge each input document.
n_trials (int) – The number of trials to run. Defaults to 1.
permute_cols (bool) – Whether to permute the columns in each trial. Defaults to False.
system_prompt (str | None, optional) – The system prompt to use.
return_raw_outputs (bool, optional) – Whether to return the raw outputs of the model. Defaults to False.
return_explanations (bool, optional) – Whether to return the explanations of the model. Defaults to False.
suffix (str, optional) – The suffix for the output column names. Defaults to “_judge”.
examples (pd.DataFrame | None, optional) – Example DataFrame for few-shot learning. Should have the same column structure as the input DataFrame plus an “Answer” column. Defaults to None.
strategy (ReasoningStrategy | None, optional) – The reasoning strategy to use. Can be None, COT, or ZS_COT. Defaults to None.
safe_mode (bool, optional) – Whether to enable safe mode with cost estimation. Defaults to False.
progress_bar_desc (str, optional) – Description for the progress bar. Defaults to “Evaluating”.
default_to_col1 (bool, optional) – [sem_filter mode only] The default filter decision when the model is uncertain. Defaults to True.
helper_examples (pd.DataFrame | None, optional) – [sem_filter mode only] Example DataFrame for the helper LM in cascade filtering. Defaults to None.
cascade_args (CascadeArgs | None, optional) – [sem_filter mode only] Arguments for cascade filtering to reduce cost via a proxy model. Defaults to None.
return_stats (bool, optional) – [sem_filter mode only] Whether to return a stats dict alongside the DataFrame as a (DataFrame, stats) tuple. Defaults to False.
additional_cot_instructions (str, optional) – [sem_filter mode only] Extra instructions appended to the chain-of-thought prompt. Defaults to “”.
**model_kwargs (Any) – Additional keyword arguments to pass to the model.
- Returns:
- A DataFrame containing the original data plus the judged outputs. When return_stats=True, returns a (DataFrame, stats_dict) tuple. Additional columns are added for explanations and raw outputs if requested.
- Return type:
pd.DataFrame | tuple[pd.DataFrame, dict]
- Raises:
ValueError – If the language model is not configured, if specified columns don’t exist in the DataFrame, or if the examples DataFrame doesn’t have the required “Answer” column.
- __call__(col1: str, col2: str, judge_instruction: str, n_trials: int = 1, permute_cols: bool = False, system_prompt: str | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, default_to_col1: bool = True, suffix: str = '_judge', examples: DataFrame | None = None, helper_examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', additional_cot_instructions: str = '', **model_kwargs: Any) -> DataFrame | tuple[DataFrame, dict[str, Any]]
Call self as a function.
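The column-related ValueError conditions listed under Raises can be checked before any model call is made. The following is a minimal sketch of such a pre-flight check, assuming only what that entry states; the function name and error messages are hypothetical, and it takes plain column-name lists so that it runs without a configured model.

```python
from typing import Iterable, Optional


def check_judge_inputs(
    df_columns: Iterable[str],
    col1: str,
    col2: str,
    examples_columns: Optional[Iterable[str]] = None,
) -> None:
    """Raise ValueError if a judged column or the examples' Answer column is missing."""
    available = set(df_columns)
    missing = [name for name in (col1, col2) if name not in available]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
    if examples_columns is not None and "Answer" not in set(examples_columns):
        raise ValueError('examples must include an "Answer" column')


# Passes silently: both judged columns are present.
check_judge_inputs(["question", "model_a", "model_b"], "model_a", "model_b")

# Fails: model_b is not among the available columns.
try:
    check_judge_inputs(["question", "model_a"], "model_a", "model_b")
except ValueError as err:
    error_message = str(err)
    print(error_message)  # Columns not found in DataFrame: ['model_b']
```

With a real DataFrame, the first argument would be `df.columns` and the last would be `examples.columns` when few-shot examples are supplied.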