LLM-based Evaluation Suite
===================

Overview
--------
LOTUS provides a comprehensive evaluation framework instantiating LLM-as-a-Judge methods. The evaluation module supports both single response evaluation and pairwise comparisons, making it ideal for model evaluation, response quality assessment, and A/B testing scenarios.

The evaluation framework includes two main components:

- **LLM-as-Judge**: Evaluate individual responses using customizable criteria
- **Pairwise Judge**: Compare two responses side-by-side to determine which is better

Key Features
------------

- **Flexible Evaluation Criteria**: Define custom judging instructions in natural language
- **Structured Output Support**: Use Pydantic models for consistent, structured evaluation results
- **Position Bias Mitigation**: Built-in column permutation to reduce ordering effects in pairwise comparisons
- **Multiple Trial Support**: Run multiple evaluation trials for improved reliability
- **Chain-of-Thought Reasoning**: Optional reasoning strategies for more explainable evaluations
- **Integration with LOTUS**: Seamless integration with other LOTUS semantic operators

LLM-as-Judge
============

The LLM-as-Judge functionality allows you to evaluate individual responses using natural language instructions.

Basic Usage
-----------

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    # Configure the language model
    lm = LM(model="gpt-4o-mini")
    lotus.settings.configure(lm=lm)

    # Sample data representing responses to evaluate
    data = {
        "student_id": [1, 2, 3, 4],
        "question": [
            "Explain the difference between supervised and unsupervised learning",
            "What is the purpose of cross-validation in machine learning?",
            "Describe how gradient descent works",
            "What are the advantages of ensemble methods?"
        ],
        "answer": [
            "Supervised learning uses labeled data to train models, while unsupervised learning finds patterns in unlabeled data. For example, classification is supervised, clustering is unsupervised.",
            "Gradient descent is an optimization algorithm that minimizes cost functions by iteratively moving in the direction of steepest descent of the gradient.",
            "Cross-validation helps assess model performance by splitting data into training and validation sets multiple times to get a better estimate of how the model generalizes.",
            "Ensemble methods combine multiple models to improve performance. They reduce overfitting and variance, often leading to better generalization than individual models."
        ]
    }

    df = pd.DataFrame(data)
    
    # Define evaluation criteria
    judge_instruction = "Rate the accuracy and completeness of this {answer} to the {question} on a scale of 1-10, where 10 is excellent. Only output the score."

    # Run evaluation
    results = df.llm_as_judge(
        judge_instruction=judge_instruction,
        n_trials=2,  # Run multiple trials for reliability
    )

    print(results)

Structured Output with Response Formats
---------------------------------------

For more detailed and consistent evaluations, use Pydantic models to define structured output formats:

.. code-block:: python

    from pydantic import BaseModel, Field

    class EvaluationScore(BaseModel):
        score: int = Field(description="Score from 1-10")
        reasoning: str = Field(description="Detailed reasoning for the score")
        strengths: list[str] = Field(description="Key strengths of the answer")
        improvements: list[str] = Field(description="Areas for improvement")

    # Use structured output format
    results = df.llm_as_judge(
        judge_instruction="Evaluate the student {answer} for the {question}",
        response_format=EvaluationScore,
        suffix="_evaluation",
    )

    # Access structured fields
    for idx, row in results.iterrows():
        evaluation = row['_evaluation_0']
        print(f"Score: {evaluation.score}")
        print(f"Reasoning: {evaluation.reasoning}")
        print(f"Strengths: {evaluation.strengths}")
        print(f"Improvements: {evaluation.improvements}")

Pairwise Judge
==============

The Pairwise Judge functionality enables side-by-side comparison of two responses to determine which is better according to specified criteria.

Basic Pairwise Comparison
-------------------------

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    # Configure the language model
    lm = LM(model="gpt-4o-mini")
    lotus.settings.configure(lm=lm)

    # Example dataset with prompts and two candidate responses
    data = {
        "prompt": [
            "Write a one-sentence summary of the benefits of regular exercise.",
            "Explain the difference between supervised and unsupervised learning in one sentence.",
            "Suggest a polite email subject line to schedule a 1:1 meeting.",
        ],
        "model_a": [
            "Regular exercise improves physical health and mental well-being by boosting energy, mood, and resilience.",
            "Supervised learning uses labeled data to learn mappings, while unsupervised learning finds patterns without labels.",
            "Meeting request.",
        ],
        "model_b": [
            "Exercise is good.",
            "Supervised learning and unsupervised learning are both machine learning approaches.",
            "Requesting a 1:1: finding time to connect next week?",
        ],
    }

    df = pd.DataFrame(data)

    # Define comparison criteria
    judge_instruction = (
        "Given the prompt {prompt}, compare the two responses.\\n"
        "- Response A: {model_a}\\n"
        "- Response B: {model_b}\\n\\n"
        "Choose the better response based on helpfulness, correctness, and clarity. "
        "Output only 'A' or 'B' or 'Tie' if the responses are equally good."
    )

    # Run pairwise evaluation
    results = df.pairwise_judge(
        col1="model_a",
        col2="model_b",
        judge_instruction=judge_instruction,
        n_trials=2,
        permute_cols=True,  # Mitigate position bias by evaluating both (A,B) and (B,A)
    )

    print(results)

Position Bias Mitigation
------------------------

Position bias occurs when judges systematically prefer responses in certain positions (e.g., always preferring the first response). The ``permute_cols`` parameter helps mitigate this:

.. code-block:: python

    # This will evaluate both (model_a, model_b) and (model_b, model_a) orderings
    results = df.pairwise_judge(
        col1="model_a",
        col2="model_b",
        judge_instruction=judge_instruction,
        n_trials=4,  # Must be even when permute_cols=True
        permute_cols=True,
    )


Advanced Features
=================

Chain-of-Thought Reasoning
---------------------------

Enable chain-of-thought reasoning for more explainable evaluations:

.. code-block:: python

    from lotus.types import ReasoningStrategy

    results = df.llm_as_judge(
        judge_instruction="Evaluate the quality of this {answer}",
        strategy=ReasoningStrategy.COT,  # Enable chain-of-thought
        n_trials=1,
    )

    results = df.pairwise_judge(
        col1="model_a",
        col2="model_b",
        judge_instruction=judge_instruction,
        n_trials=4,  # Must be even when permute_cols=True
        permute_cols=True,
        strategy=ReasoningStrategy.COT,
    )

Few-Shot Learning
-----------------

Provide examples to guide the evaluation process:

.. code-block:: python

    # Create examples DataFrame
    examples_data = {
        "question": ["What is machine learning?"],
        "answer": ["Machine learning is a subset of AI that enables computers to learn from data."],
        "Answer": ["8"]  # Expected score - note the capital 'A'
    }
    examples_df = pd.DataFrame(examples_data)

    # Use examples in evaluation
    results = df.llm_as_judge(
        judge_instruction="Rate this {answer} to the {question} from 1-10",
        examples=examples_df,
    )

Custom System Prompts
---------------------

Customize the system prompt for specific evaluation contexts:

.. code-block:: python

    custom_system_prompt = (
        "You are an expert educator with 20 years of experience in computer science. "
        "Evaluate student responses with attention to technical accuracy and clarity."
    )

    results = df.llm_as_judge(
        judge_instruction="Evaluate this {answer}",
        system_prompt=custom_system_prompt,
    )

API Reference
=============

llm_as_judge
------------

.. function:: DataFrame.llm_as_judge(judge_instruction, response_format=None, n_trials=1, system_prompt=None, suffix="_judge", examples=None, strategy=None, safe_mode=False, **model_kwargs)

   Evaluate responses using LLM-as-Judge methodology.

   :param judge_instruction: Natural language instruction for evaluation. Use {column_name} to reference DataFrame columns.
   :type judge_instruction: str
   :param response_format: Pydantic model for structured output. If None, returns string.
   :type response_format: BaseModel | None
   :param n_trials: Number of evaluation trials to run.
   :type n_trials: int
   :param system_prompt: Custom system prompt for the judge.
   :type system_prompt: str | None
   :param suffix: Suffix for output column names.
   :type suffix: str
   :param examples: Example DataFrame for few-shot learning. Must include "Answer" column.
   :type examples: pd.DataFrame | None
   :param strategy: Reasoning strategy (None, COT, ZS_COT).
   :type strategy: ReasoningStrategy | None
   :param safe_mode: Enable cost estimation before execution.
   :type safe_mode: bool
   :param model_kwargs: Additional arguments passed to the language model.
   :return: DataFrame with original data plus evaluation results.
   :rtype: pd.DataFrame

pairwise_judge
--------------

.. function:: DataFrame.pairwise_judge(col1, col2, judge_instruction, response_format=None, n_trials=1, permute_cols=False, system_prompt=None, suffix="_judge", examples=None, strategy=None, safe_mode=False, **model_kwargs)

   Compare two responses using pairwise evaluation.

   :param col1: Name of the first column to compare.
   :type col1: str
   :param col2: Name of the second column to compare.
   :type col2: str
   :param judge_instruction: Natural language instruction for comparison. Use {column_name} to reference DataFrame columns.
   :type judge_instruction: str
   :param response_format: Pydantic model for structured output. If None, returns string.
   :type response_format: BaseModel | None
   :param n_trials: Number of evaluation trials to run.
   :type n_trials: int
   :param permute_cols: Whether to permute column order to mitigate position bias. If True, n_trials must be even.
   :type permute_cols: bool
   :param system_prompt: Custom system prompt for the judge.
   :type system_prompt: str | None
   :param suffix: Suffix for output column names.
   :type suffix: str
   :param examples: Example DataFrame for few-shot learning. Must include "Answer" column.
   :type examples: pd.DataFrame | None
   :param strategy: Reasoning strategy (None, COT, ZS_COT).
   :type strategy: ReasoningStrategy | None
   :param safe_mode: Enable cost estimation before execution.
   :type safe_mode: bool
   :param model_kwargs: Additional arguments passed to the language model.
   :return: DataFrame with original data plus comparison results.
   :rtype: pd.DataFrame

Best Practices
==============

Evaluation Design
-----------------

1. **Clear Instructions**: Write specific, unambiguous evaluation criteria
2. **Multiple Trials**: Use multiple trials to improve reliability and account for model variability
3. **Position Bias**: Use ``permute_cols=True`` in pairwise comparisons to mitigate ordering effects
4. **Structured Output**: Use Pydantic models for consistent, parseable results
5. **Appropriate Models**: Choose models with strong reasoning capabilities for complex evaluations

Performance Considerations
--------------------------

1. **Batch Size**: Larger DataFrames will result in more API calls
2. **Model Selection**: Balance evaluation quality with cost and latency
3. **Safe Mode**: Enable safe mode for cost estimation on large datasets
4. **Caching**: LOTUS automatically caches results to avoid redundant evaluations

Common Patterns
---------------

**A/B Testing**:

.. code-block:: python

    # Compare two model versions
    results = df.pairwise_judge(
        col1="model_v1_output",
        col2="model_v2_output", 
        judge_instruction="Which response better answers {user_query}?",
        permute_cols=True,
        n_trials=4
    )

**Content Moderation**:

.. code-block:: python

    class ModerationResult(BaseModel):
        is_safe: bool = Field(description="Whether the content is safe")
        risk_level: str = Field(description="Risk level: low, medium, high")
        reasoning: str = Field(description="Explanation for the decision")

    results = df.llm_as_judge(
        judge_instruction="Evaluate if this {content} is safe for a general audience",
        response_format=ModerationResult
    )

**Response Quality Assessment**:

.. code-block:: python

    class QualityScore(BaseModel):
        helpfulness: int = Field(description="Helpfulness score 1-10")
        accuracy: int = Field(description="Accuracy score 1-10") 
        clarity: int = Field(description="Clarity score 1-10")
        overall: int = Field(description="Overall score 1-10")

    results = df.llm_as_judge(
        judge_instruction="Evaluate the quality of this {response} to {question}",
        response_format=QualityScore
    )