LazyFrame API
=============

``LazyFrame`` is LOTUS' lazy execution API for semantic operator programs. It
lets you define a pipeline first, then execute it later on a DataFrame. Nothing
runs until you call ``execute()``.

Why LazyFrame?
--------------

Eager LOTUS execution is useful when you are exploring data and want each
operator to run immediately, just like pandas. LazyFrame is useful when you
have a multi-step LLM program and want LOTUS to see the whole plan before any
expensive model calls happen.

That global plan makes several things possible:

- inspect the semantic and pandas operations that will run
- move cheap pandas filters before expensive semantic filters
- optimize prompts across the whole pipeline instead of one operator at a time
- pre-learn cascade thresholds so cheaper models can handle easy rows
- save an optimized pipeline and reuse it in a later session

In other words, LazyFrame gives LOTUS the same kind of planning boundary that
a query engine has: you describe what should happen, then LOTUS decides how to
execute it efficiently.

What You Can Build
------------------

LazyFrame is useful for LLM-based data workflows where the result depends on a
pipeline rather than a single prompt. Examples include:

- filtering agent traces and aggregating the failures into a taxonomy
- running LLM-as-judge or pairwise-judge evaluations over model outputs
- building RAG-style pipelines that search, transform, and aggregate evidence
- extracting structured tables from long documents or web pages
- combining semantic operators with pandas cleanup, grouping, and slicing

Quick Start
-----------

This example builds a semantic filter pipeline over GitHub-style issue titles.
The pipeline is defined first and executed later.

.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.ast import LazyFrame
    from lotus.models import LM

    lm = LM(model="gpt-4.1-nano")
    lotus.settings.configure(lm=lm)

    issues = pd.DataFrame({
        "issue_title": [
            "Fix typo in README",
            "Add dark mode support to dashboard",
            "Refactor entire auth system to use OAuth2",
            "Update copyright year in LICENSE",
            "Implement distributed transaction support across microservices",
            "Change button color on settings page",
            "Migrate database from Postgres 13 to 16 with zero downtime",
            "Add missing comma in error message",
            "Build custom query planner to replace third-party dependency",
            "Bump lodash to fix known CVE",
            "Support multi-region active-active replication",
            "Remove unused import in utils.py",
        ]
    })

    pipeline = LazyFrame().sem_filter(
        "The {issue_title} describes a small, self-contained task that a new "
        "open source contributor could tackle without deep knowledge of the codebase"
    )

    good_first_issues = pipeline.execute(issues)

Output:

+----+----------------------------------------------+
|    | issue_title                                  |
+====+==============================================+
| 0  | Fix typo in README                           |
+----+----------------------------------------------+
| 3  | Update copyright year in LICENSE             |
+----+----------------------------------------------+
| 5  | Change button color on settings page         |
+----+----------------------------------------------+
| 7  | Add missing comma in error message           |
+----+----------------------------------------------+
| 9  | Bump lodash to fix known CVE                 |
+----+----------------------------------------------+
| 11 | Remove unused import in utils.py             |
+----+----------------------------------------------+

This has the same user-facing result as eager ``issues.sem_filter(...)``, but
the lazy version can also be inspected, optimized, saved, and reused.

How Lazy Execution Works
------------------------

Each LazyFrame operation appends a node to a logical plan. Semantic operators,
pandas operations, evaluation operators, joins, and custom functions are all
represented in that plan. When you call ``execute()``, LOTUS walks the plan and
materializes the final DataFrame.

You can inspect the plan before execution:

.. code-block:: python

    pipeline.print_tree()

Output:

.. code-block:: text

    sem_filter('The {issue_title} describes a small, self-containe...')
        -- Source(bound=False)

This is useful when a pipeline has multiple semantic operators or nested
LazyFrames and you want to confirm the execution plan before spending LM calls.

Source Data
-----------

You can pass data at execution time, bind it when constructing the LazyFrame,
or provide a schema that is checked at execution time.

.. code-block:: python

    # Pass data at execution time.
    pipeline = LazyFrame().sem_filter("{issue_title} is documentation-only")
    result = pipeline.execute(issues)

    # Bind data in the LazyFrame.
    pipeline = LazyFrame(df=issues).sem_filter("{issue_title} is documentation-only")
    result = pipeline.execute({})

    # Validate execution input.
    pipeline = LazyFrame(schema={"issue_title": "object"}).sem_filter(
        "{issue_title} is documentation-only"
    )
    result = pipeline.execute(issues)

Chaining Operators
------------------

LazyFrame supports LOTUS semantic operators and common pandas operations in the
same pipeline.

.. code-block:: python

    pipeline = (
        LazyFrame()
        .assign(title_length=lambda df: df["issue_title"].str.len())
        .filter(lambda df: df["title_length"] < 80)
        .sem_filter("{issue_title} is a good first issue")
        .sem_map("Summarize {issue_title} as a contributor task", suffix="_task")
        .head(5)
    )

The semantic operator methods mirror the DataFrame API, including
``sem_filter``, ``sem_map``, ``sem_extract``, ``sem_agg``, ``sem_topk``,
``sem_join``, ``sem_sim_join``, ``sem_search``, ``sem_index``,
``load_sem_index``, ``sem_cluster_by``, ``sem_dedup``, and
``sem_partition_by``. LazyFrame also supports evaluation operators:
``llm_as_judge`` and ``pairwise_judge``.

Multi-Source Pipelines
----------------------

For one source, pass a DataFrame directly to ``execute()``.

.. code-block:: python

    result = pipeline.execute(issues)

For multiple sources, create one source LazyFrame per input and pass a
dictionary keyed by those source objects.

.. code-block:: python

    issues_lf = LazyFrame()
    labels_lf = LazyFrame()

    joined = issues_lf.sem_join(
        labels_lf,
        "The issue {issue_title:left} should receive the label {label:right}",
    )

    result = joined.execute({
        issues_lf: issues,
        labels_lf: labels,
    })

Composition
-----------

Use ``LazyFrame.concat`` to combine LazyFrame results and ``LazyFrame.from_fn``
when you need to apply a custom callable after one or more LazyFrames are
resolved.

.. code-block:: python

    docs = LazyFrame().sem_filter("{issue_title} is about documentation")
    frontend = LazyFrame().sem_filter("{issue_title} is about UI work")

    combined = LazyFrame.concat([docs, frontend], ignore_index=True)
    result = combined.execute({docs: issues, frontend: issues})

.. code-block:: python

    def dedupe_by_title(df):
        return df.drop_duplicates(subset=["issue_title"])

    deduped = LazyFrame.from_fn(dedupe_by_title, combined)
    result = deduped.execute({docs: issues, frontend: issues})

Persistence
-----------

Save and load pipelines with ``save()`` and ``LazyFrame.load()``. This is most
useful after optimization, because the optimized instructions and learned
cascade thresholds are stored with the pipeline.

.. code-block:: python

    pipeline.save("good_first_issue_pipeline.pkl")

    loaded = LazyFrame.load("good_first_issue_pipeline.pkl")
    result = loaded.execute(issues)

Pipelines that include local callables, lambdas, or closures may not be
portable across Python environments because they are serialized with pickle.

Related Pages
-------------

- :doc:`lazyframe_optimizations`
- :doc:`lazyframe_api`