Evaluation Suite
================

LOTUS includes LLM-as-judge tools for evaluating model outputs, application
responses, and content quality directly from pandas DataFrames.

The evaluation suite has two DataFrame accessors:

- ``llm_as_judge`` evaluates each row independently.
- ``pairwise_judge`` compares two response columns and chooses the better
  response for each row.

Setup
-----

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    lm = LM(model="gpt-4o-mini")
    lotus.settings.configure(lm=lm)

    df = pd.DataFrame({
        "question": [
            "What is cross-validation?",
            "What is gradient descent?",
        ],
        "answer": [
            "Cross-validation estimates generalization by evaluating on held-out splits.",
            "Gradient descent iteratively updates parameters to reduce a loss function.",
        ],
    })

Choose the Right Evaluator
--------------------------

Use ``llm_as_judge`` when each row has one response to score, classify, or
annotate.

.. code-block:: python

    scored = df.llm_as_judge(
        "Rate the accuracy of {answer} for {question} from 1 to 10. "
        "Return only the score."
    )

Use ``pairwise_judge`` when each row has two responses and you want a direct
comparison.

.. code-block:: python

    pairwise_df = pd.DataFrame({
        "question": ["What is cross-validation?"],
        "model_a": ["It evaluates a model on several held-out splits."],
        "model_b": ["It checks whether a model knows the answer."],
    })

    compared = pairwise_df.pairwise_judge(
        col1="model_a",
        col2="model_b",
        judge_instruction="Which response better answers {question}?",
        permute_cols=True,
        n_trials=2,
    )

Caching Behavior
----------------

Evaluation calls temporarily disable LOTUS operator caching inside the judge
loop so repeated trials can produce independent judgments. The global cache
setting is restored after the evaluation call finishes.

Related Pages
-------------

- :doc:`llm_as_judge`
- :doc:`pairwise_judge`
- :doc:`evaluation_advanced`