sem_agg
========

``sem_agg`` aggregates many rows into one answer. It is useful for
summarization, synthesis, and reasoning across text-heavy DataFrames.

Motivation
----------

Traditional aggregations compute values such as sums, counts, and averages.
Many language-heavy tasks need a different kind of aggregation: read many
rows, identify the shared themes, and produce one synthesized answer.

Use ``sem_agg`` when the output depends on the dataset as a whole rather than
one row at a time. Common uses include summarizing a collection of documents,
writing a cross-record report, identifying themes across tickets, or producing
one structured summary per group.

Article Summary Example
-----------------------

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

    articles = pd.DataFrame({
        "ArticleTitle": [
            "Advancements in Quantum Computing",
            "Climate Change and Renewable Energy",
            "The Rise of Artificial Intelligence",
            "A Journey into Deep Space Exploration",
        ],
        "ArticleContent": [
            (
                "Quantum computing harnesses the properties of quantum mechanics "
                "to perform computations at speeds unimaginable with classical "
                "machines. Emerging quantum algorithms show promise in solving "
                "previously intractable problems."
            ),
            (
                "Global temperatures continue to rise, and societies worldwide "
                "are turning to renewable resources like solar and wind power. "
                "The shift to green technology is expected to reshape economies."
            ),
            (
                "Artificial Intelligence has grown rapidly across industries. "
                "Machine learning models improve efficiency and uncover hidden "
                "patterns, while privacy and bias concerns remain important."
            ),
            (
                "Deep space exploration studies the cosmos beyond our solar "
                "system. Recent missions focus on exoplanets, black holes, and "
                "interstellar objects."
            ),
        ],
    })

    summary = articles.sem_agg(
        "Provide a concise summary of all {ArticleContent} in a single "
        "paragraph, highlighting key technological progress and implications "
        "for the future."
    )

    print(summary["_output"].iloc[0])

Output:

.. code-block:: text

    Recent technological advances are reshaping computation, energy, AI, and
    space exploration. Quantum computing may unlock new classes of algorithms,
    renewable energy can reduce climate impact and reshape economies, AI is
    improving data-driven decision making while raising governance concerns,
    and deep-space research is expanding what future missions may make possible.

The result is a one-row DataFrame. The default output column is ``_output``.


Grouped Aggregation
-------------------

Use ``group_by`` to produce one aggregation per group.

.. code-block:: python

    grouped = articles.assign(
        Category=["Tech", "Env", "Tech", "Space"]
    ).sem_agg(
        "Summarize the {ArticleContent} for this category.",
        group_by=["Category"],
    )

``grouped`` has one output row per category.

Long Context Handling
------------------
When documents exceed the language model's context length, sem_agg supports automatic strategies to handle large contents:

.. code-block:: python

    from lotus.types import LongContextStrategy

    # Use TRUNCATE strategy (default) - simply cuts off excess content
    result_truncate = df.sem_agg(
        "Summarize the key points from {content}",
        long_context_strategy=LongContextStrategy.TRUNCATE
    )

    # Use CHUNK strategy - intelligently splits largest column
    result_chunk = df.sem_agg(
        "Summarize the key points from {content}",
        long_context_strategy=LongContextStrategy.CHUNK
    )

**LongContext Strategies:**

- **TRUNCATE**: Simple truncation that cuts documents at the token limit with "..." appended
- **CHUNK**: Intelligent splitting that identifies the largest column and splits it while preserving other columns

**When to Use:**

- Use **TRUNCATE** when the most important information is at the beginning of documents
- Use **CHUNK** when all parts of the document are potentially important and you need to preserve complete information

Structured Output
-----------------

Pass ``response_format`` when the final answer should follow a Pydantic model
or JSON schema. By default, ``split_fields_into_cols=True`` turns structured
fields into separate DataFrame columns.

.. code-block:: python

    from pydantic import BaseModel, Field

    class ArticleSummary(BaseModel):
        theme: str = Field(description="Main theme across the articles")
        future_impact: str = Field(description="Likely future implication")

    structured = articles.sem_agg(
        "Summarize the shared theme and future impact of {ArticleContent}.",
        response_format=ArticleSummary,
    )

Set ``split_fields_into_cols=False`` if you want the structured model response
to stay in the output column instead of becoming separate fields.

Return Value
------------

``sem_agg`` returns one row for the full DataFrame or one row per group. With
plain text output, the result column is ``suffix``. With structured output and
``split_fields_into_cols=True``, fields become individual columns.

Required Parameters
-------------------

- ``user_instruction``: Natural language aggregation instruction. Reference
  columns with ``{column_name}``.

Optional Parameters
-------------------

- ``all_cols``: Use all DataFrame columns instead of only columns referenced in
  ``user_instruction``.
- ``suffix``: Output column name for plain text output. Defaults to
  ``"_output"``.
- ``group_by``: Columns to group by before aggregation. Produces one output row
  per group.
- ``safe_mode``: Accepted for API consistency; aggregation safe mode is not
  fully implemented.
- ``progress_bar_desc``: Progress bar label.
- ``long_context_strategy``: Strategy for long inputs. Defaults to
  ``LongContextStrategy.CHUNK``.
- ``split_fields_into_cols``: Split structured output fields into columns when
  ``response_format`` is provided.
- ``response_format``: Pydantic model or JSON schema for structured output.