LazyFrame API
LazyFrame is LOTUS’ lazy execution API for semantic operator programs. It
lets you define a pipeline first, then execute it later on a DataFrame. Nothing
runs until you call execute().
Why LazyFrame?
Eager LOTUS execution is useful when you are exploring data and want each operator to run immediately, just like pandas. LazyFrame is useful when you have a multi-step LLM program and want LOTUS to see the whole plan before any expensive model calls happen.
That global plan makes several things possible:
inspect the semantic and pandas operations that will run
move cheap pandas filters before expensive semantic filters
optimize prompts across the whole pipeline instead of one operator at a time
pre-learn cascade thresholds so cheaper models can handle easy rows
save an optimized pipeline and reuse it in a later session
In other words, LazyFrame gives LOTUS the same kind of planning boundary that a query engine has: you describe what should happen, then LOTUS decides how to execute it efficiently.
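To make the filter-reordering idea concrete, here is a minimal sketch in plain Python. It is not the LOTUS optimizer: `FilterStep`, `reorder`, and `run` are hypothetical names, the "expensive" predicate is a keyword check standing in for an LM call, and the reordering is only valid because both steps are pure row filters whose order does not change the result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FilterStep:
    name: str
    predicate: Callable[[dict], bool]
    expensive: bool  # True stands in for "needs an LM call"


def reorder(steps):
    # Stable sort: cheap filters move to the front, ties keep their order.
    return sorted(steps, key=lambda step: step.expensive)


def run(steps, rows):
    """Apply the filters in order; return surviving rows and expensive-call count."""
    expensive_calls = 0
    for step in steps:
        kept = []
        for row in rows:
            if step.expensive:
                expensive_calls += 1
            if step.predicate(row):
                kept.append(row)
        rows = kept
    return rows, expensive_calls


rows = [
    {"title": "Fix typo in README", "open": True},
    {"title": "Refactor entire auth system to use OAuth2", "open": False},
    {"title": "Add missing comma in error message", "open": True},
    {"title": "Support multi-region active-active replication", "open": False},
]

# Keyword check used as a stand-in for a semantic filter.
looks_small = FilterStep(
    "looks_small",
    lambda row: any(word in row["title"].lower() for word in ("typo", "comma")),
    expensive=True,
)
is_open = FilterStep("is_open", lambda row: row["open"], expensive=False)

naive_rows, naive_calls = run([looks_small, is_open], rows)
planned_rows, planned_calls = run(reorder([looks_small, is_open]), rows)
```

Both orderings keep the same two rows, but the reordered plan evaluates the expensive predicate on two rows instead of four, because the cheap filter has already discarded the closed issues.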
What You Can Build
LazyFrame is useful for LLM-based data workflows where the result depends on a pipeline rather than a single prompt. Examples include:
filtering agent traces and aggregating the failures into a taxonomy
running LLM-as-judge or pairwise-judge evaluations over model outputs
building RAG-style pipelines that search, transform, and aggregate evidence
extracting structured tables from long documents or web pages
combining semantic operators with pandas cleanup, grouping, and slicing
Quick Start
This example builds a semantic filter pipeline over GitHub-style issue titles. The pipeline is defined first and executed later.
import pandas as pd
import lotus
from lotus.ast import LazyFrame
from lotus.models import LM
lm = LM(model="gpt-4.1-nano")
lotus.settings.configure(lm=lm)
issues = pd.DataFrame({
    "issue_title": [
        "Fix typo in README",
        "Add dark mode support to dashboard",
        "Refactor entire auth system to use OAuth2",
        "Update copyright year in LICENSE",
        "Implement distributed transaction support across microservices",
        "Change button color on settings page",
        "Migrate database from Postgres 13 to 16 with zero downtime",
        "Add missing comma in error message",
        "Build custom query planner to replace third-party dependency",
        "Bump lodash to fix known CVE",
        "Support multi-region active-active replication",
        "Remove unused import in utils.py",
    ]
})

pipeline = LazyFrame().sem_filter(
    "The {issue_title} describes a small, self-contained task that a new "
    "open source contributor could tackle without deep knowledge of the codebase"
)
good_first_issues = pipeline.execute(issues)
Output:
|    | issue_title |
|---|---|
| 0 | Fix typo in README |
| 3 | Update copyright year in LICENSE |
| 5 | Change button color on settings page |
| 7 | Add missing comma in error message |
| 9 | Bump lodash to fix known CVE |
| 11 | Remove unused import in utils.py |
This returns the same rows as the eager issues.sem_filter(...) call, but
the lazy version can also be inspected, optimized, saved, and reused.
How Lazy Execution Works
Each LazyFrame operation appends a node to a logical plan. Semantic operators,
pandas operations, evaluation operators, joins, and custom functions are all
represented in that plan. When you call execute(), LOTUS walks the plan and
materializes the final DataFrame.
You can inspect the plan before execution:
pipeline.print_tree()
Output:
sem_filter('The {issue_title} describes a small, self-containe...')
-- Source(bound=False)
This is useful when a pipeline has multiple semantic operators or nested LazyFrames and you want to confirm the execution plan before spending LM calls.
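The plan-then-execute behavior described in this section can be illustrated with a small toy in plain Python. This is a sketch of the general pattern only, not the LOTUS implementation: `Node`, `ToyLazy`, and their methods are hypothetical, and rows are plain strings rather than a DataFrame.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    op: str                          # "source", "filter", or "map"
    fn: Optional[Callable] = None
    label: str = ""
    parent: Optional["Node"] = None


class ToyLazy:
    """Hypothetical miniature of a lazy pipeline; not the LOTUS classes."""

    def __init__(self, node=None):
        self.node = node or Node("source")

    def filter(self, fn, label):
        # Record the operation and return a new pipeline; nothing runs here.
        return ToyLazy(Node("filter", fn, label, self.node))

    def map(self, fn, label):
        return ToyLazy(Node("map", fn, label, self.node))

    def plan(self):
        """Return plan lines from the last operator down to the source."""
        lines, node = [], self.node
        while node is not None:
            lines.append("source" if node.op == "source" else f"{node.op}({node.label!r})")
            node = node.parent
        return lines

    def execute(self, rows):
        # Collect nodes leaf-first, then apply them source-first.
        chain, node = [], self.node
        while node is not None:
            chain.append(node)
            node = node.parent
        for node in reversed(chain):
            if node.op == "filter":
                rows = [row for row in rows if node.fn(row)]
            elif node.op == "map":
                rows = [node.fn(row) for row in rows]
        return rows


calls = []


def is_short(title):
    calls.append(title)  # track when the predicate actually runs
    return len(title) < 20


pipeline = ToyLazy().filter(is_short, "short title").map(str.upper, "uppercase")
plan_before = pipeline.plan()
calls_before_execute = len(calls)
result = pipeline.execute(["Fix typo in README", "Migrate database from Postgres 13 to 16"])
```

Building the pipeline and reading its plan never calls `is_short`; the predicate runs only inside `execute()`, which is the property that lets a planner inspect or rewrite the chain first.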
Source Data
You can pass data at execution time, bind it when constructing the LazyFrame, or provide a schema that is checked at execution time.
# Pass data at execution time.
pipeline = LazyFrame().sem_filter("{issue_title} is documentation-only")
result = pipeline.execute(issues)
# Bind data in the LazyFrame.
pipeline = LazyFrame(df=issues).sem_filter("{issue_title} is documentation-only")
result = pipeline.execute({})
# Validate execution input.
pipeline = LazyFrame(schema={"issue_title": "object"}).sem_filter(
    "{issue_title} is documentation-only"
)
result = pipeline.execute(issues)
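As a rough illustration of what "checked at execution time" can mean, the sketch below validates input only when `execute()` is called. It is an assumption-laden toy, not LOTUS's validation logic: `SchemaError`, `check_schema`, and `CheckedPipeline` are hypothetical, rows are dictionaries, and the schema maps column names to Python types rather than pandas dtype strings.

```python
class SchemaError(ValueError):
    """Raised when execution input does not match the declared schema."""


def check_schema(rows, schema):
    """Verify every row has each declared column with the declared Python type."""
    for index, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                raise SchemaError(f"row {index}: missing column {column!r}")
            if not isinstance(row[column], expected_type):
                raise SchemaError(
                    f"row {index}: column {column!r} is "
                    f"{type(row[column]).__name__}, expected {expected_type.__name__}"
                )


class CheckedPipeline:
    """Hypothetical pipeline that validates its input when executed."""

    def __init__(self, schema, predicate):
        # Nothing is validated here; no data exists yet.
        self.schema = schema
        self.predicate = predicate

    def execute(self, rows):
        check_schema(rows, self.schema)  # fail fast, before any filtering
        return [row for row in rows if self.predicate(row)]


pipeline = CheckedPipeline(
    schema={"issue_title": str},
    predicate=lambda row: "README" in row["issue_title"],
)

good = pipeline.execute([
    {"issue_title": "Fix typo in README"},
    {"issue_title": "Bump lodash to fix known CVE"},
])

try:
    pipeline.execute([{"title": "wrong column name"}])
    error_message = None
except SchemaError as exc:
    error_message = str(exc)
```

Failing before the first operator runs is the point of such a check: a misnamed column is reported immediately instead of surfacing partway through a sequence of model calls.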
Chaining Operators
LazyFrame supports LOTUS semantic operators and common pandas operations in the same pipeline.
pipeline = (
    LazyFrame()
    .assign(title_length=lambda df: df["issue_title"].str.len())
    .filter(lambda df: df["title_length"] < 80)
    .sem_filter("{issue_title} is a good first issue")
    .sem_map("Summarize {issue_title} as a contributor task", suffix="_task")
    .head(5)
)
The semantic operator methods mirror the DataFrame API, including
sem_filter, sem_map, sem_extract, sem_agg, sem_topk,
sem_join, sem_sim_join, sem_search, sem_index,
load_sem_index, sem_cluster_by, sem_dedup, and
sem_partition_by. LazyFrame also supports evaluation operators:
llm_as_judge and pairwise_judge.
Multi-Source Pipelines
For one source, pass a DataFrame directly to execute().
result = pipeline.execute(issues)
For multiple sources, create one source LazyFrame per input and pass a dictionary keyed by those source objects.
issues_lf = LazyFrame()
labels_lf = LazyFrame()
joined = issues_lf.sem_join(
    labels_lf,
    "The issue {issue_title:left} should receive the label {label:right}",
)
# labels is assumed to be a second DataFrame with a "label" column.
result = joined.execute({
    issues_lf: issues,
    labels_lf: labels,
})
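The dictionary-keyed-by-source pattern above can be sketched without LOTUS to show why it works. In this toy, `Source` and `Join` are hypothetical classes, rows are dictionaries, and the join predicate is a keyword lookup standing in for an LM judgment. The key point is that each source object is a distinct, hashable placeholder, so data can be attached to it at execution time.

```python
class Source:
    """Placeholder for an input that is supplied at execution time."""

    def __init__(self, name):
        self.name = name
    # No __eq__/__hash__ override: instances hash by identity, so two
    # sources are distinct dictionary keys even if they share a name.


class Join:
    """Hypothetical two-input node resolved from a bindings dictionary."""

    def __init__(self, left, right, predicate):
        self.left = left
        self.right = right
        self.predicate = predicate

    def execute(self, bindings):
        missing = [src.name for src in (self.left, self.right) if src not in bindings]
        if missing:
            raise KeyError(f"no data bound for: {missing}")
        return [
            {**left_row, **right_row}
            for left_row in bindings[self.left]
            for right_row in bindings[self.right]
            if self.predicate(left_row, right_row)
        ]


issues_src = Source("issues")
labels_src = Source("labels")

issues = [
    {"issue_title": "Fix typo in README"},
    {"issue_title": "Bump lodash to fix known CVE"},
]
labels = [{"label": "documentation"}, {"label": "security"}]


def matches(issue, label):
    # Keyword lookup used as a stand-in for a semantic join condition.
    keywords = {"documentation": "README", "security": "CVE"}
    return keywords[label["label"]] in issue["issue_title"]


join = Join(issues_src, labels_src, matches)
joined = join.execute({issues_src: issues, labels_src: labels})

try:
    join.execute({issues_src: issues})
    missing_binding_detected = False
except KeyError:
    missing_binding_detected = True
```

Keying by object rather than by string name means the same pipeline definition can be executed against different inputs, and a forgotten input is detected before any pairwise comparison runs.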
Composition
Use LazyFrame.concat to combine LazyFrame results and LazyFrame.from_fn
when you need to apply a custom callable after one or more LazyFrames are
resolved.
docs = LazyFrame().sem_filter("{issue_title} is about documentation")
frontend = LazyFrame().sem_filter("{issue_title} is about UI work")
combined = LazyFrame.concat([docs, frontend], ignore_index=True)
result = combined.execute({docs: issues, frontend: issues})
def dedupe_by_title(df):
    # Rows matched by both filters appear twice after concat; keep the first.
    return df.drop_duplicates(subset=["issue_title"])
deduped = LazyFrame.from_fn(dedupe_by_title, combined)
result = deduped.execute({docs: issues, frontend: issues})
Persistence
Save and load pipelines with save() and LazyFrame.load(). This is most
useful after optimization, because the optimized instructions and learned
cascade thresholds are stored with the pipeline.
pipeline.save("good_first_issue_pipeline.pkl")
loaded = LazyFrame.load("good_first_issue_pipeline.pkl")
result = loaded.execute(issues)
Pipelines that include local callables, lambdas, or closures may not be portable across Python environments because they are serialized with pickle.
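The portability caveat follows from how pickling treats functions. The sketch below uses only the standard-library `pickle` module on a hypothetical, data-only pipeline state (the dictionary layout and the threshold values are invented for illustration). Whether LOTUS uses plain `pickle` or an extended pickler is not stated here, so treat this as background rather than a description of `save()`.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "optimized pipeline" state made of plain data only.
state = {
    "steps": [
        {"op": "sem_filter", "instruction": "{issue_title} is a good first issue"},
    ],
    "cascade_thresholds": {"accept": 0.9, "reject": 0.1},  # made-up values
}

# Plain data round-trips through a file without any special handling.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.pkl"
    path.write_bytes(pickle.dumps(state))
    restored = pickle.loads(path.read_bytes())

# The standard pickler stores functions by qualified name, not by value,
# so a lambda has no importable name to refer to and cannot be pickled.
try:
    pickle.dumps({"steps": [{"op": "filter", "fn": lambda row: True}]})
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False
```

Because functions are referenced by name, even a pipeline that does serialize can only be loaded where the same functions are importable under the same names. Keeping custom logic in named, module-level functions that ship with your code is the safer choice for pipelines you plan to reuse.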