sem_agg

sem_agg aggregates many rows into one answer. It is useful for summarization, synthesis, and reasoning across text-heavy DataFrames.

Motivation

Traditional aggregations compute values such as sums, counts, and averages. Many language-heavy tasks need a different kind of aggregation: read many rows, identify the shared themes, and produce one synthesized answer.

Use sem_agg when the output depends on the dataset as a whole rather than one row at a time. Common uses include summarizing a collection of documents, writing a cross-record report, identifying themes across tickets, or producing one structured summary per group.

Article Summary Example

import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

articles = pd.DataFrame({
    "ArticleTitle": [
        "Advancements in Quantum Computing",
        "Climate Change and Renewable Energy",
        "The Rise of Artificial Intelligence",
        "A Journey into Deep Space Exploration",
    ],
    "ArticleContent": [
        (
            "Quantum computing harnesses the properties of quantum mechanics "
            "to perform computations at speeds unimaginable with classical "
            "machines. Emerging quantum algorithms show promise in solving "
            "previously intractable problems."
        ),
        (
            "Global temperatures continue to rise, and societies worldwide "
            "are turning to renewable resources like solar and wind power. "
            "The shift to green technology is expected to reshape economies."
        ),
        (
            "Artificial Intelligence has grown rapidly across industries. "
            "Machine learning models improve efficiency and uncover hidden "
            "patterns, while privacy and bias concerns remain important."
        ),
        (
            "Deep space exploration studies the cosmos beyond our solar "
            "system. Recent missions focus on exoplanets, black holes, and "
            "interstellar objects."
        ),
    ],
})

summary = articles.sem_agg(
    "Provide a concise summary of all {ArticleContent} in a single "
    "paragraph, highlighting key technological progress and implications "
    "for the future."
)

print(summary["_output"].iloc[0])

Example output (exact wording varies by model and run):

Recent technological advances are reshaping computation, energy, AI, and
space exploration. Quantum computing may unlock new classes of algorithms,
renewable energy can reduce climate impact and reshape economies, AI is
improving data-driven decision making while raising governance concerns,
and deep-space research is expanding what future missions may make possible.

The result is a one-row DataFrame. The default output column is _output.

Grouped Aggregation

Use group_by to produce one aggregation per group.

grouped = articles.assign(
    Category=["Tech", "Env", "Tech", "Space"]
).sem_agg(
    "Summarize the {ArticleContent} for this category.",
    group_by=["Category"],
)

grouped has one output row per category: three rows here, one each for Tech, Env, and Space.
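
Conceptually, group_by partitions the rows by key and runs the aggregation once per partition. The sketch below illustrates only that partitioning, using the standard library. fake_aggregate is a hypothetical stand-in for the language model call, and none of this is LOTUS's actual implementation.

```python
from collections import defaultdict

# Rows mirror the grouped example: (Category, ArticleTitle).
rows = [
    ("Tech", "Advancements in Quantum Computing"),
    ("Env", "Climate Change and Renewable Energy"),
    ("Tech", "The Rise of Artificial Intelligence"),
    ("Space", "A Journey into Deep Space Exploration"),
]


def fake_aggregate(titles):
    """Hypothetical stand-in for the LM call: joins the inputs into one string."""
    return " | ".join(titles)


# Partition rows by group key, preserving first-seen order.
groups = defaultdict(list)
for category, title in rows:
    groups[category].append(title)

# One aggregated output per group.
grouped_output = {
    category: fake_aggregate(titles) for category, titles in groups.items()
}

for category, output in grouped_output.items():
    print(f"{category}: {output}")
```

Four input rows collapse to three outputs because the two Tech rows share a partition.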

Long Context Handling

When the input exceeds the language model’s context length, sem_agg can apply one of two strategies, selected with long_context_strategy:

from lotus.types import LongContextStrategy

# df is a placeholder for any DataFrame with a long-text "content" column.
# Use TRUNCATE strategy - simply cuts off excess content
result_truncate = df.sem_agg(
    "Summarize the key points from {content}",
    long_context_strategy=LongContextStrategy.TRUNCATE
)

# Use CHUNK strategy - intelligently splits largest column
result_chunk = df.sem_agg(
    "Summarize the key points from {content}",
    long_context_strategy=LongContextStrategy.CHUNK
)

Long Context Strategies:

TRUNCATE: Cuts each document at the token limit and appends “…”.
CHUNK: Finds the largest column and splits it into pieces, keeping the other columns intact.

When to Use:

Use TRUNCATE when the most important information is at the beginning of documents
Use CHUNK when all parts of the document are potentially important and you need to preserve complete information
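
The difference between the two strategies can be sketched in plain Python. This is an illustration under simplifying assumptions, not LOTUS's implementation: tokens are approximated by whitespace-separated words, and truncate_doc and chunk_row are hypothetical helper names.

```python
def truncate_doc(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens words and mark the cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens]) + "..."


def chunk_row(row: dict[str, str], max_tokens: int) -> list[dict[str, str]]:
    """Split the longest column into pieces; copy the other columns into each piece."""
    largest = max(row, key=lambda col: len(row[col].split()))
    words = row[largest].split()
    pieces = [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
    return [{**row, largest: piece} for piece in pieces]


row = {
    "title": "Greek letters",
    "content": "alpha beta gamma delta epsilon zeta eta",
}

truncated = truncate_doc(row["content"], max_tokens=3)
chunks = chunk_row(row, max_tokens=3)

print(truncated)  # alpha beta gamma...
for chunk in chunks:
    print(chunk)
```

Truncation discards everything after the third word, while chunking keeps all seven words across three pieces that each carry the same title.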

Structured Output

Pass response_format when the final answer should follow a Pydantic model or JSON schema. By default, split_fields_into_cols=True turns structured fields into separate DataFrame columns.

from pydantic import BaseModel, Field

class ArticleSummary(BaseModel):
    theme: str = Field(description="Main theme across the articles")
    future_impact: str = Field(description="Likely future implication")

structured = articles.sem_agg(
    "Summarize the shared theme and future impact of {ArticleContent}.",
    response_format=ArticleSummary,
)

Set split_fields_into_cols=False if you want the structured model response to stay in the output column instead of becoming separate fields.
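
The two layouts can be pictured with plain dictionaries. This is a conceptual sketch only: to_row is a hypothetical helper, the field values are made up, and the exact object LOTUS stores in the output column may differ.

```python
def to_row(
    fields: dict,
    split_fields_into_cols: bool = True,
    suffix: str = "_output",
) -> dict:
    """Lay out one structured response as a result row."""
    if split_fields_into_cols:
        # Each structured field becomes its own column.
        return dict(fields)
    # The whole response stays together under the output column.
    return {suffix: dict(fields)}


# Made-up values standing in for a parsed ArticleSummary response.
response = {
    "theme": "emerging technology",
    "future_impact": "broad economic change",
}

split_row = to_row(response, split_fields_into_cols=True)
single_row = to_row(response, split_fields_into_cols=False)

print(list(split_row))   # ['theme', 'future_impact']
print(list(single_row))  # ['_output']
```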

Return Value

sem_agg returns a DataFrame with one row for the full input or one row per group. With plain text output, the result column is named by suffix ("_output" by default). With structured output and split_fields_into_cols=True, each field becomes its own column.

Required Parameters

user_instruction: Natural language aggregation instruction. Reference columns with {column_name}.
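
To check which columns an instruction references before spending a model call, the placeholders can be extracted with the standard library. The sketch below assumes the {column_name} syntax behaves like Python format fields. referenced_columns is a hypothetical helper, not part of LOTUS, which does its own parsing.

```python
from string import Formatter


def referenced_columns(instruction: str) -> list[str]:
    """Return the {column} names in an instruction, in order, without duplicates."""
    seen: list[str] = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(instruction):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


instruction = "Summarize {ArticleContent} and cite each {ArticleTitle}."
columns = referenced_columns(instruction)
print(columns)  # ['ArticleContent', 'ArticleTitle']

# Compare the placeholders against the DataFrame's columns to catch typos early.
available = ["ArticleTitle", "ArticleContent"]
missing = [col for col in columns if col not in available]
print(missing)  # []
```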

Optional Parameters

all_cols: Use all DataFrame columns instead of only columns referenced in user_instruction.
suffix: Output column name for plain text output. Defaults to "_output".
group_by: Columns to group by before aggregation. Produces one output row per group.
safe_mode: Accepted for API consistency; aggregation safe mode is not fully implemented.
progress_bar_desc: Progress bar label.
long_context_strategy: Strategy for long inputs. Defaults to LongContextStrategy.CHUNK.
split_fields_into_cols: Split structured output fields into columns when response_format is provided.
response_format: Pydantic model or JSON schema for structured output.