LLM-based Evaluation Suite
Overview
LOTUS provides an evaluation framework built on LLM-as-a-Judge methods. The evaluation module supports both single-response evaluation and pairwise comparison, which covers model evaluation, response quality assessment, and A/B testing.
The evaluation framework includes two main components:
LLM-as-Judge: Evaluate individual responses using customizable criteria
Pairwise Judge: Compare two responses side-by-side to determine which is better
Key Features
Flexible Evaluation Criteria: Define custom judging instructions in natural language
Structured Output Support: Use Pydantic models for consistent, structured evaluation results
Position Bias Mitigation: Built-in column permutation to reduce ordering effects in pairwise comparisons
Multiple Trial Support: Run multiple evaluation trials for improved reliability
Chain-of-Thought Reasoning: Optional reasoning strategies for more explainable evaluations
Integration with LOTUS: Seamless integration with other LOTUS semantic operators
LLM-as-Judge
The LLM-as-Judge functionality allows you to evaluate individual responses using natural language instructions.
Basic Usage
import pandas as pd
import lotus
from lotus.models import LM
# Configure the language model
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)
# Sample data representing responses to evaluate
data = {
    "student_id": [1, 2, 3, 4],
    "question": [
        "Explain the difference between supervised and unsupervised learning",
        "What is the purpose of cross-validation in machine learning?",
        "Describe how gradient descent works",
        "What are the advantages of ensemble methods?",
    ],
    # Each answer is listed in the same order as its question
    "answer": [
        "Supervised learning uses labeled data to train models, while unsupervised learning finds patterns in unlabeled data. For example, classification is supervised, clustering is unsupervised.",
        "Cross-validation helps assess model performance by splitting data into training and validation sets multiple times to get a better estimate of how the model generalizes.",
        "Gradient descent is an optimization algorithm that minimizes cost functions by iteratively moving in the direction of steepest descent of the gradient.",
        "Ensemble methods combine multiple models to improve performance. They reduce overfitting and variance, often leading to better generalization than individual models.",
    ],
}
df = pd.DataFrame(data)
# Define evaluation criteria
judge_instruction = "Rate the accuracy and completeness of this {answer} to the {question} on a scale of 1-10, where 10 is excellent. Only output the score."
# Run evaluation
results = df.llm_as_judge(
    judge_instruction=judge_instruction,
    n_trials=2,  # Run multiple trials for reliability
)
print(results)
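Because the instruction above asks for a bare score, each trial yields a string that still has to be parsed before it can be averaged. The following is a minimal sketch of that post-processing step, not part of the LOTUS API: `parse_score` and `mean_score` are hypothetical helper names, and the sketch assumes scores are whole numbers from 1 to 10.

```python
import re
from statistics import mean
from typing import Optional

# Matches the first standalone integer between 1 and 10.
_SCORE_RE = re.compile(r"\b(10|[1-9])\b")


def parse_score(raw: str) -> Optional[int]:
    """Return the first standalone integer in 1..10 found in a judge output, or None."""
    match = _SCORE_RE.search(raw)
    return int(match.group(1)) if match else None


def mean_score(trial_outputs: list[str]) -> Optional[float]:
    """Average the parseable scores from one row's trials; None if none parse."""
    scores = [s for s in (parse_score(t) for t in trial_outputs) if s is not None]
    return mean(scores) if scores else None


print(mean_score(["8", "Score: 9"]))
print(mean_score(["n/a"]))
```

If the trial outputs land in columns such as `_judge_0` and `_judge_1` (an assumption based on the `_evaluation_0` column read in the structured-output example on this page), `mean_score` could be applied to each row's list of those column values.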
Structured Output with Response Formats
For more detailed and consistent evaluations, use Pydantic models to define structured output formats:
from pydantic import BaseModel, Field
class EvaluationScore(BaseModel):
    score: int = Field(description="Score from 1-10")
    reasoning: str = Field(description="Detailed reasoning for the score")
    strengths: list[str] = Field(description="Key strengths of the answer")
    improvements: list[str] = Field(description="Areas for improvement")
# Use structured output format
results = df.llm_as_judge(
    judge_instruction="Evaluate the student {answer} for the {question}",
    response_format=EvaluationScore,
    suffix="_evaluation",
)
# Access structured fields
for idx, row in results.iterrows():
    evaluation = row['_evaluation_0']
    print(f"Score: {evaluation.score}")
    print(f"Reasoning: {evaluation.reasoning}")
    print(f"Strengths: {evaluation.strengths}")
    print(f"Improvements: {evaluation.improvements}")
Pairwise Judge
The Pairwise Judge functionality enables side-by-side comparison of two responses to determine which is better according to specified criteria.
Basic Pairwise Comparison
import pandas as pd
import lotus
from lotus.models import LM
# Configure the language model
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)
# Example dataset with prompts and two candidate responses
data = {
    "prompt": [
        "Write a one-sentence summary of the benefits of regular exercise.",
        "Explain the difference between supervised and unsupervised learning in one sentence.",
        "Suggest a polite email subject line to schedule a 1:1 meeting.",
    ],
    "model_a": [
        "Regular exercise improves physical health and mental well-being by boosting energy, mood, and resilience.",
        "Supervised learning uses labeled data to learn mappings, while unsupervised learning finds patterns without labels.",
        "Meeting request.",
    ],
    "model_b": [
        "Exercise is good.",
        "Supervised learning and unsupervised learning are both machine learning approaches.",
        "Requesting a 1:1: finding time to connect next week?",
    ],
}
df = pd.DataFrame(data)
# Define comparison criteria
judge_instruction = (
    "Given the prompt {prompt}, compare the two responses.\n"
    "- Response A: {model_a}\n"
    "- Response B: {model_b}\n\n"
    "Choose the better response based on helpfulness, correctness, and clarity. "
    "Output only 'A' or 'B', or 'Tie' if the responses are equally good."
)
# Run pairwise evaluation
results = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=judge_instruction,
    n_trials=2,
    permute_cols=True,  # Mitigate position bias by evaluating both (A,B) and (B,A)
)
print(results)
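Once the verdicts are available, a common next step is to turn them into win rates. The sketch below is one way to do that in plain Python. It assumes the verdict strings from all rows and trials have already been collected into a list, and `normalize_verdict` and `win_rates` are hypothetical helpers rather than LOTUS functions.

```python
from collections import Counter

LABELS = ("A", "B", "Tie", "Invalid")


def normalize_verdict(raw: str) -> str:
    """Map a raw judge output to 'A', 'B', 'Tie', or 'Invalid'."""
    cleaned = raw.strip().strip("'\".").lower()
    return {"a": "A", "b": "B", "tie": "Tie"}.get(cleaned, "Invalid")


def win_rates(verdicts: list[str]) -> dict[str, float]:
    """Return the share of each label among the verdicts (empty dict if no verdicts)."""
    counts = Counter(normalize_verdict(v) for v in verdicts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS}


print(win_rates(["A", "a", "B", "Tie"]))
```

Tracking an explicit "Invalid" share makes it visible when the judge ignores the "Output only" constraint, which is usually a sign that the instruction needs tightening.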
Position Bias Mitigation
Position bias occurs when judges systematically prefer responses in certain positions (e.g., always preferring the first response). The permute_cols parameter helps mitigate this:
# This will evaluate both (model_a, model_b) and (model_b, model_a) orderings
results = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=judge_instruction,
    n_trials=4,  # Must be even when permute_cols=True
    permute_cols=True,
)
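To see why evaluating both orderings helps, and why an even number of trials keeps the two orderings balanced, the idea can be sketched in plain Python. This is an illustration of the technique only, not the LOTUS implementation: `judge_with_permutation` and the stub judges are hypothetical, and a real judge would call a language model instead.

```python
from collections import Counter
from typing import Callable

# A judge sees two responses in order and returns "first", "second", or "tie".
Judge = Callable[[str, str], str]


def judge_with_permutation(judge: Judge, resp_a: str, resp_b: str, n_trials: int) -> str:
    """Alternate (A, B) and (B, A) orderings, map verdicts back, and take a majority vote."""
    if n_trials <= 0 or n_trials % 2 != 0:
        raise ValueError("n_trials must be a positive even number")
    to_label = {
        False: {"first": "A", "second": "B", "tie": "Tie"},  # original order
        True: {"first": "B", "second": "A", "tie": "Tie"},   # swapped order
    }
    votes = Counter()
    for trial in range(n_trials):
        swapped = trial % 2 == 1
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        votes[to_label[swapped][judge(first, second)]] += 1
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "Tie"


def always_first(first: str, second: str) -> str:
    """A maximally position-biased stub judge."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """A stub judge with a real, position-independent preference."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(judge_with_permutation(always_first, "x", "y", 4))
print(judge_with_permutation(prefers_longer, "long answer", "short", 2))
```

The purely position-biased judge cancels itself out and ends in a tie, while the judge with a genuine preference still picks the same response under both orderings.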
Advanced Features
Chain-of-Thought Reasoning
Enable chain-of-thought reasoning for more explainable evaluations:
from lotus.types import ReasoningStrategy
results = df.llm_as_judge(
    judge_instruction="Evaluate the quality of this {answer}",
    strategy=ReasoningStrategy.COT,  # Enable chain-of-thought
    n_trials=1,
)
results = df.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=judge_instruction,
    n_trials=4,  # Must be even when permute_cols=True
    permute_cols=True,
    strategy=ReasoningStrategy.COT,
)
Few-Shot Learning
Provide examples to guide the evaluation process:
# Create examples DataFrame
examples_data = {
    "question": ["What is machine learning?"],
    "answer": ["Machine learning is a subset of AI that enables computers to learn from data."],
    "Answer": ["8"],  # Expected score - note the capital 'A'
}
examples_df = pd.DataFrame(examples_data)
# Use examples in evaluation
results = df.llm_as_judge(
    judge_instruction="Rate this {answer} to the {question} from 1-10",
    examples=examples_df,
)
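Since the examples DataFrame needs both the columns referenced in the instruction and the capitalized "Answer" column, a quick pre-flight check can catch a missing or misspelled column before any model call is made. The sketch below uses only the standard library. `referenced_columns` and `missing_example_columns` are hypothetical helpers, not LOTUS functions, and the sketch assumes placeholders are plain `{column_name}` fields.

```python
from string import Formatter


def referenced_columns(judge_instruction: str) -> list[str]:
    """Return the {placeholder} names in order of first appearance."""
    seen: list[str] = []
    for _, field_name, _, _ in Formatter().parse(judge_instruction):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


def missing_example_columns(judge_instruction: str, example_columns: list[str]) -> list[str]:
    """List required columns (placeholders plus 'Answer') absent from the examples."""
    required = referenced_columns(judge_instruction)
    if "Answer" not in required:
        required.append("Answer")
    return [col for col in required if col not in example_columns]


instruction = "Rate this {answer} to the {question} from 1-10"
print(referenced_columns(instruction))
print(missing_example_columns(instruction, ["question", "answer"]))
```

With a pandas DataFrame, `list(examples_df.columns)` can be passed as `example_columns`; an empty result means the examples carry every column the check looks for.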
Custom System Prompts
Customize the system prompt for specific evaluation contexts:
custom_system_prompt = (
    "You are an expert educator with 20 years of experience in computer science. "
    "Evaluate student responses with attention to technical accuracy and clarity."
)
results = df.llm_as_judge(
    judge_instruction="Evaluate this {answer}",
    system_prompt=custom_system_prompt,
)
API Reference
llm_as_judge
- DataFrame.llm_as_judge(judge_instruction, response_format=None, n_trials=1, system_prompt=None, suffix='_judge', examples=None, strategy=None, safe_mode=False, **model_kwargs)
Evaluate responses using LLM-as-Judge methodology.
- Parameters:
judge_instruction (str) – Natural language instruction for evaluation. Use {column_name} to reference DataFrame columns.
response_format (BaseModel | None) – Pydantic model for structured output. If None, returns string.
n_trials (int) – Number of evaluation trials to run.
system_prompt (str | None) – Custom system prompt for the judge.
suffix (str) – Suffix for output column names.
examples (pd.DataFrame | None) – Example DataFrame for few-shot learning. Must include “Answer” column.
strategy (ReasoningStrategy | None) – Reasoning strategy (None, COT, ZS_COT).
safe_mode (bool) – Enable cost estimation before execution.
model_kwargs – Additional arguments passed to the language model.
- Returns:
DataFrame with original data plus evaluation results.
- Return type:
pd.DataFrame
pairwise_judge
- DataFrame.pairwise_judge(col1, col2, judge_instruction, response_format=None, n_trials=1, permute_cols=False, system_prompt=None, suffix='_judge', examples=None, strategy=None, safe_mode=False, **model_kwargs)
Compare two responses using pairwise evaluation.
- Parameters:
col1 (str) – Name of the first column to compare.
col2 (str) – Name of the second column to compare.
judge_instruction (str) – Natural language instruction for comparison. Use {column_name} to reference DataFrame columns.
response_format (BaseModel | None) – Pydantic model for structured output. If None, returns string.
n_trials (int) – Number of evaluation trials to run.
permute_cols (bool) – Whether to permute column order to mitigate position bias. If True, n_trials must be even.
system_prompt (str | None) – Custom system prompt for the judge.
suffix (str) – Suffix for output column names.
examples (pd.DataFrame | None) – Example DataFrame for few-shot learning. Must include “Answer” column.
strategy (ReasoningStrategy | None) – Reasoning strategy (None, COT, ZS_COT).
safe_mode (bool) – Enable cost estimation before execution.
model_kwargs – Additional arguments passed to the language model.
- Returns:
DataFrame with original data plus comparison results.
- Return type:
pd.DataFrame
Best Practices
Evaluation Design
Clear Instructions: Write specific, unambiguous evaluation criteria
Multiple Trials: Use multiple trials to improve reliability and account for model variability
Position Bias: Use permute_cols=True in pairwise comparisons to mitigate ordering effects
Structured Output: Use Pydantic models for consistent, parseable results
Appropriate Models: Choose models with strong reasoning capabilities for complex evaluations
Performance Considerations
Batch Size: Larger DataFrames will result in more API calls
Model Selection: Balance evaluation quality with cost and latency
Safe Mode: Enable safe mode for cost estimation on large datasets
Caching: LOTUS automatically caches results to avoid redundant evaluations
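For a rough sense of scale before running a large evaluation, the number of model requests can be estimated by hand. The sketch below assumes one request per row per trial and no cache hits; that is an assumption about how the work scales, not a documented guarantee, and `estimated_judge_calls` is a hypothetical helper.

```python
def estimated_judge_calls(n_rows: int, n_trials: int = 1) -> int:
    """Estimate model requests, assuming one request per row per trial and no caching."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return n_rows * n_trials


# Example: 5,000 rows judged with 4 trials each.
print(estimated_judge_calls(5_000, n_trials=4))
```

Safe mode remains the better source for an actual cost estimate; this arithmetic is only a quick sanity check when choosing n_trials.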
Common Patterns
A/B Testing:
# Compare two model versions
results = df.pairwise_judge(
    col1="model_v1_output",
    col2="model_v2_output",
    judge_instruction="Which response better answers {user_query}?",
    permute_cols=True,
    n_trials=4,
)
Content Moderation:
class ModerationResult(BaseModel):
    is_safe: bool = Field(description="Whether the content is safe")
    risk_level: str = Field(description="Risk level: low, medium, high")
    reasoning: str = Field(description="Explanation for the decision")
results = df.llm_as_judge(
    judge_instruction="Evaluate if this {content} is safe for a general audience",
    response_format=ModerationResult,
)
Response Quality Assessment:
class QualityScore(BaseModel):
    helpfulness: int = Field(description="Helpfulness score 1-10")
    accuracy: int = Field(description="Accuracy score 1-10")
    clarity: int = Field(description="Clarity score 1-10")
    overall: int = Field(description="Overall score 1-10")
results = df.llm_as_judge(
    judge_instruction="Evaluate the quality of this {response} to {question}",
    response_format=QualityScore,
)