LazyFrame API Reference

LazyFrame

LazyFrame builder for LOTUS AST operations.

LazyFrame is an immutable builder that records a sequence of semantic and pandas operations as AST nodes. Nothing is executed until .execute() (or .run().execute()) is called.

Example:

lf = LazyFrame().sem_filter("{text} is about sports").sem_map("Summarize {text}")
result = lf.execute(df)

Bases: object

Immutable lazy version of DataFrame and semantic operators.

Operations are recorded as AST nodes and only materialised when .execute() is called.

Parameters:

df – Optional bound DataFrame. When provided, the source data is stored directly on the LazyFrame, so no external input is required at execution time.
schema – Optional {col_name: dtype} dict validated at execution time against the source DataFrame.
_nodes – Internal — pre-built node list (used by copy/optimise).
_source – Internal — explicit source node reference.

Example:

>>> lf = LazyFrame().sem_filter("{text} is about sports").sem_map("Summarize {text}")
>>> result = lf.execute(df)
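The record-then-execute behaviour described above can be illustrated with a small stand-alone sketch. It does not use LOTUS or pandas: `ToyLazyFrame`, its tuple-of-steps storage, and the list-of-dicts "table" are all hypothetical stand-ins chosen for illustration, and the real LazyFrame stores AST nodes rather than plain callables.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]
Step = Callable[[list[Row]], list[Row]]


@dataclass(frozen=True)
class ToyLazyFrame:
    """Hypothetical stand-in for LazyFrame: records steps, runs nothing."""

    nodes: tuple[tuple[str, Step], ...] = ()

    def _with_node(self, name: str, step: Step) -> "ToyLazyFrame":
        # Build a new object; the existing builder is never mutated.
        return ToyLazyFrame(self.nodes + ((name, step),))

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        return self._with_node(
            "filter", lambda rows: [r for r in rows if predicate(r)]
        )

    def assign(self, **columns: Callable[[Row], Any]) -> "ToyLazyFrame":
        def step(rows: list[Row]) -> list[Row]:
            return [{**r, **{k: f(r) for k, f in columns.items()}} for r in rows]

        return self._with_node("assign", step)

    def execute(self, rows: list[Row]) -> list[Row]:
        # Only here are the recorded steps applied, in recording order.
        for _name, step in self.nodes:
            rows = step(rows)
        return rows


base = ToyLazyFrame()
pipeline = base.filter(lambda r: r["score"] > 1).assign(
    double=lambda r: r["score"] * 2
)
result = pipeline.execute([{"score": 1}, {"score": 2}, {"score": 3}])
```

Because each method returns a new object, `base` still has no recorded steps after `pipeline` is built, which is the property that lets one partially built pipeline be reused as the starting point for several others.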

add_source(df: DataFrame | None = None, schema: dict[str, str] | None = None) → LazyFrame

Set the LazyFrame source (optional bound DataFrame and schema).

Replaces the single source node. Use this to bind a df or add schema validation.

assign(**kwargs: Any) → LazyFrame

Add column assignments and return a new LazyFrame.

Values may be scalars, callables (df -> Series), or other LazyFrame instances (resolved lazily at execution time).
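The three kinds of value accepted by assign can be pictured with the following sketch of how such values might be resolved at execution time. It is an assumption-laden illustration, not the LOTUS implementation: `ToyLazy`, `resolve_assignment`, and the list-of-dicts table are hypothetical, and the real method works on pandas objects.

```python
from typing import Any, Callable

Row = dict[str, Any]
Table = list[Row]


class ToyLazy:
    """Hypothetical deferred computation that yields one value per row."""

    def __init__(self, fn: Callable[[Table], list[Any]]) -> None:
        self.fn = fn

    def execute(self, table: Table) -> list[Any]:
        return self.fn(table)


def resolve_assignment(value: Any, table: Table) -> list[Any]:
    """Turn a scalar, a callable, or a deferred pipeline into a column."""
    if isinstance(value, ToyLazy):
        return value.execute(table)  # resolved only now, at execution time
    if callable(value):
        return value(table)  # callable receives the whole table
    return [value] * len(table)  # scalar is broadcast to every row


def assign(table: Table, **kwargs: Any) -> Table:
    columns = {name: resolve_assignment(v, table) for name, v in kwargs.items()}
    return [
        {**row, **{name: col[i] for name, col in columns.items()}}
        for i, row in enumerate(table)
    ]


table = [{"x": 1}, {"x": 2}]
out = assign(
    table,
    tag="a",
    double=lambda t: [r["x"] * 2 for r in t],
    neg=ToyLazy(lambda t: [-r["x"] for r in t]),
)
```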

classmethod concat(objs: list['LazyFrame'] | 'LazyFrame', **kwargs: Any) → LazyFrame: Concatenate one or more LazyFrame results via pd.concat.

copy() → LazyFrame

Return a deep copy of this LazyFrame.

SourceNode.lazyframe_ref values are restored to match the original graph so copied pipelines still resolve dict[LazyFrame, DataFrame] inputs correctly (including nested child LazyFrames).

execute(inputs: DataFrame | dict[LazyFrame, DataFrame], *, cache: Cache | None = None) → DataFrame | Any

Execute the LazyFrame and return the result.

Parameters:

inputs – Single DataFrame (for this LazyFrame) or dict of LazyFrame -> DataFrame.

filter(predicate: Callable[[DataFrame], Series]) → LazyFrame: Add a pandas boolean filter operation.

classmethod from_fn(fn: Callable[[...], Any], *args: Any, **kwargs: Any) → LazyFrame: Create a LazyFrame node that applies a callable to resolved inputs.

llm_as_judge(judge_instruction: str, *, response_format: Any | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], Any, bool], SemanticMapPostprocessOutput] | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, extra_cols_to_include: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', mark_optimizable: list[str] | None = None, **model_kwargs: Any) → LazyFrame: Add an LLM-as-judge evaluation operation.

classmethod load(path: str | Path) → LazyFrame

Load a LazyFrame pipeline from a file saved with save().

Parameters:

path – File path previously written by save().

Returns:

A reconstructed LazyFrame with the same pipeline structure.

load_sem_index(col_name: str, index_dir: str) → LazyFrame: Add a semantic index load operation.

mark_optimizable(node_idx: int, params: list[str]) → LazyFrame

Mark specific parameters on a node for GEPA optimization.

Parameters:

node_idx – Index of the node in the LazyFrame’s node list.
params – List of parameter names to optimize, e.g. ["user_instruction"]. Pass an empty list to explicitly exclude the node from optimization.

Returns:

New LazyFrame with the targeted node annotated.

optimize(optimizers: list[BaseOptimizer] = [], *, inplace: bool = False, train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None, auto_include_default_optimizers: bool = True) → LazyFrame

Apply optimizations to this LazyFrame.

Parameters:

optimizers – List of optimizers to apply.
inplace – If True, modify this LazyFrame in place.
train_data – Optional training data for optimizers that require it.
auto_include_default_optimizers – If True (default), include the following default optimizers: PredicatePushdownOptimizer.

Returns:

The optimized LazyFrame (same object if inplace, new otherwise).

pairwise_judge(col1: str, col2: str, judge_instruction: str, *, n_trials: int = 1, permute_cols: bool = False, system_prompt: str | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', default_to_col1: bool = True, helper_examples: DataFrame | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, additional_cot_instructions: str = '', mark_optimizable: list[str] | None = None, **model_kwargs: Any) → LazyFrame: Add a pairwise judge evaluation operation.

print_tree() → None: Print the LazyFrame structure as a tree.

run(inputs: pd.DataFrame | dict['LazyFrame', pd.DataFrame], *, cache: Cache | None = None) → LazyFrameRun

Create a LazyFrameRun for this LazyFrame.

Parameters:

inputs – Single DataFrame (for this LazyFrame) or dict mapping LazyFrame objects to DataFrames.

save(path: str | Path) → None

Serialize this LazyFrame pipeline to a file.

The pipeline structure (all AST nodes) is persisted using pickle. Bound DataFrames, callables, and nested LazyFrame references are included — the file is not portable across different Python environments if custom callables are used.

Parameters:

path – Destination file path (e.g. "pickle.pkl" style names such as "pipeline.pkl").
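The portability caveat follows from how the standard-library pickle module treats functions: it stores a reference (module and name), not the function body. The sketch below shows this with plain Python objects only. The `nodes` list of dicts is a hypothetical stand-in for a pipeline, not the format that save() writes.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a recorded pipeline: plain data round-trips cleanly.
nodes = [
    {"op": "sem_filter", "user_instruction": "{text} is about sports"},
    {"op": "sem_map", "user_instruction": "Summarize {text}"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.pkl"
    path.write_bytes(pickle.dumps(nodes))
    restored = pickle.loads(path.read_bytes())

# A function is pickled by reference: the payload names "builtins" and "len"
# but contains no code, so loading requires the same function to be importable.
payload = pickle.dumps(len)
same_function = pickle.loads(payload) is len
```

A pipeline file that refers to a custom callable can therefore only be loaded where that callable can be imported under the same module path.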

sem_agg(user_instruction: str, *, all_cols: bool = False, suffix: str = '_output', group_by: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Aggregating', long_context_strategy: LongContextStrategy | None = LongContextStrategy.CHUNK, response_format: type[BaseModel] | dict | None = None, split_fields_into_cols: bool = True, mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic aggregation operation.

sem_cluster_by(col_name: str, ncentroids: int, *, return_scores: bool = False, return_centroids: bool = False, niter: int = 20, verbose: bool = False) → LazyFrame: Add a semantic clustering operation.

sem_dedup(col_name: str, threshold: float) → LazyFrame: Add a semantic deduplication operation.

sem_extract(input_cols: list[str], output_cols: dict[str, str | None], *, extract_quotes: bool = False, postprocessor: Callable[[list[str], Any, bool], SemanticExtractPostprocessOutput] | None = None, return_raw_outputs: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Extracting', return_explanations: bool = False, strategy: ReasoningStrategy | None = None, mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic extract operation.

sem_filter(user_instruction: str, *, return_raw_outputs: bool = False, return_explanations: bool = False, return_all: bool = False, default: bool = True, suffix: str = '_filter', examples: DataFrame | None = None, helper_examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Filtering', additional_cot_instructions: str = '', system_prompt: str | None = None, output_tokens: tuple[str, str] = ('True', 'False'), mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic filter operation.

sem_index(col_name: str, index_dir: str) → LazyFrame: Add a semantic index operation.

sem_join(right: 'LazyFrame' | pd.DataFrame, join_instruction: str, *, return_explanations: bool = False, how: str = 'inner', suffix: str = '_join', examples: pd.DataFrame | None = None, strategy: ReasoningStrategy | None = None, default: bool = True, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Join comparisons', mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic join operation.

sem_map(user_instruction: str, *, system_prompt: str | None = None, postprocessor: Callable[[list[str], Any, bool], SemanticMapPostprocessOutput] | None = None, return_explanations: bool = False, return_raw_outputs: bool = False, suffix: str = '_map', examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Mapping', mark_optimizable: list[str] | None = None, **model_kwargs: Any) → LazyFrame: Add a semantic map operation.

sem_partition_by(partition_fn: Callable[[DataFrame], list[int]]) → LazyFrame: Add a semantic partition operation.

sem_search(col_name: str, query: str, *, K: int | None = None, n_rerank: int | None = None, return_scores: bool = False, suffix: str = '_sim_score', mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic search operation.

sem_sim_join(right: 'LazyFrame' | pd.DataFrame, left_on: str, right_on: str, K: int, *, lsuffix: str = '', rsuffix: str = '', score_suffix: str = '', keep_index: bool = False) → LazyFrame: Add a semantic similarity join operation.

sem_topk(user_instruction: str, K: int, *, method: str = 'quick', strategy: ReasoningStrategy | None = None, group_by: list[str] | None = None, cascade_threshold: float | None = None, return_stats: bool = False, safe_mode: bool = False, return_explanations: bool = False, mark_optimizable: list[str] | None = None) → LazyFrame: Add a semantic top-k operation.

show() → str: Return the LazyFrame structure as a tree-like string.

Optimizers

Optimizer module for LOTUS LazyFrames.

class lotus.ast.optimizer.BaseOptimizer

Bases: ABC

Base class for LazyFrame optimizers.

Each optimizer implements a specific optimization strategy that transforms a list of nodes to improve performance.

abstract optimize(nodes: list[BaseNode], train_data: dict['LazyFrame', pd.DataFrame] | pd.DataFrame | None = None) → list[BaseNode]

Apply optimization to a list of nodes.

Parameters:

nodes – List of nodes to optimize
train_data – Optional training data: a single DataFrame or a dict (LazyFrame -> DataFrame). Only provided if requires_train_data is True.

Returns:

Optimized list of nodes (may be the same list if no changes)

requires_train_data: bool = False
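A custom optimizer following the interface above might look like the sketch below. It is self-contained and does not import LOTUS: `ToyBaseOptimizer` only mirrors the documented shape (an abstract optimize method plus a requires_train_data flag), while `DropRepeatedFilters`, `apply_optimizers`, and the tuple-based nodes are hypothetical names invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Any

Node = tuple[str, str]  # (kind, argument): hypothetical node representation


class ToyBaseOptimizer(ABC):
    """Stand-in mirroring the documented BaseOptimizer interface."""

    requires_train_data: bool = False

    @abstractmethod
    def optimize(self, nodes: list[Node], train_data: Any = None) -> list[Node]:
        ...


class DropRepeatedFilters(ToyBaseOptimizer):
    """Remove a filter that immediately repeats the previous, identical filter."""

    def optimize(self, nodes: list[Node], train_data: Any = None) -> list[Node]:
        out: list[Node] = []
        for node in nodes:
            if out and node[0] == "filter" and node == out[-1]:
                continue  # applying the same filter twice changes nothing
            out.append(node)
        return out


def apply_optimizers(
    nodes: list[Node], optimizers: list[ToyBaseOptimizer], train_data: Any = None
) -> list[Node]:
    """Run optimizers in order, passing train_data only to those that need it."""
    for opt in optimizers:
        if opt.requires_train_data and train_data is None:
            raise ValueError(f"{type(opt).__name__} requires train_data")
        nodes = opt.optimize(nodes, train_data if opt.requires_train_data else None)
    return nodes


pipeline = [
    ("filter", "price > 10"),
    ("filter", "price > 10"),
    ("sem_map", "Summarize {text}"),
]
optimized = apply_optimizers(pipeline, [DropRepeatedFilters()])
```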

class lotus.ast.optimizer.CascadeOptimizer

Bases: BaseOptimizer

Optimizer that pre-warms cascade thresholds on training data.

Runs the LazyFrame pipeline once on training data. SemFilterNode and SemJoinNode nodes that have cascade_args but no cached thresholds automatically learn and store them in self.cascade_args during __call__. Future executions then reuse the cached thresholds, skipping the threshold-learning sample.

This works recursively — nested LazyFrames (e.g. the right side of a sem_join) are resolved by the standard runner and their nodes self-update in the same way.

Requires train_data — a single DataFrame or a dict mapping LazyFrames to DataFrames.

Example:

from lotus.ast.optimizer import CascadeOptimizer

optimizer = CascadeOptimizer()
optimized_lf = lf.optimize([optimizer], train_data=df)
# Subsequent executions reuse the cached thresholds
result = optimized_lf.execute(df)
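The learn-once, reuse-afterwards pattern can be sketched in isolation as follows. This is a toy under stated assumptions: `ToyCascadeFilter` is hypothetical, and taking the median score as the threshold is a placeholder rule, not how LOTUS learns cascade thresholds.

```python
class ToyCascadeFilter:
    """Hypothetical node that learns a score threshold once, then reuses it."""

    def __init__(self) -> None:
        self.threshold: float | None = None  # nothing cached yet
        self.learn_calls = 0

    def _learn_threshold(self, scores: list[float]) -> float:
        # Placeholder rule: use the median of the scores seen on first call.
        self.learn_calls += 1
        ordered = sorted(scores)
        return ordered[len(ordered) // 2]

    def __call__(self, scores: list[float]) -> list[float]:
        if self.threshold is None:
            # First call (for example on training data): learn and cache.
            self.threshold = self._learn_threshold(scores)
        # Every call, including later ones, filters with the cached value.
        return [s for s in scores if s >= self.threshold]


node = ToyCascadeFilter()
warmup = node([0.1, 0.5, 0.9])  # learns the threshold from these scores
later = node([0.2, 0.6])        # reuses it, no second learning step
```

Running the pipeline once on training data plays the role of the `warmup` call here: it pays the learning cost up front so that later executions skip it.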

optimize(nodes: list[BaseNode], train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None) → list[BaseNode]: Run the pipeline on train_data so cascade nodes learn and cache thresholds.

requires_train_data: bool = True

class lotus.ast.optimizer.GEPAOptimizer

Bases: BaseOptimizer

GEPA-based prompt/instruction optimizer for LOTUS LazyFrames.

Automatically optimizes natural language instructions in semantic operator nodes using LLM-guided evolutionary search (GEPA).

By default, user_instruction on sem_filter/sem_map/sem_agg/sem_topk, join_instruction on sem_join, and query on sem_search are optimized. For sem_filter cascades that use HELPER_LM, the helper prompt target cascade_args.helper_filter_instruction is also optimized. The same helper prompt target is exposed for pairwise_judge nodes in mode="sem_filter" when using helper-LM cascades. Use mark_optimizable on the LazyFrame to customize which parameters to optimize or to exclude specific nodes entirely.

Parameters:

eval_fn – Scoring function called once per (candidate, example) pair. Signature: (output_df, example) -> float or (output_df, example) -> (float, side_info_dict). Higher scores are better. Return a side_info dict alongside the score to give the GEPA reflection LLM diagnostic context (e.g. expected vs. actual output, precision/recall breakdown).
valset – Optional held-out validation set (list of examples) for GEPA generalization mode. When provided, GEPA selects the best candidate based on valset performance rather than training performance.
gepa_config – GEPAConfig object controlling max LLM calls, model, temperature, etc. Defaults to GEPA’s built-in defaults when None.
objective – Natural language goal string passed to the reflection LLM. Auto-generated from the LazyFrame structure when None.
background – Domain context / constraints string for the reflection LLM. Auto-generated with LOTUS operator reference when None.

Example:

def eval_fn(output_df, example):
    # Score: fraction of reviews kept that are actually positive
    positive_kept = sum("great" in r or "ok" in r for r in output_df["review"])
    return positive_kept / max(len(output_df), 1)

optimizer = GEPAOptimizer(eval_fn=eval_fn)
lf = LazyFrame(df=df).sem_filter("{review} is a positive product review")
optimized_lf = lf.optimize([optimizer], train_data=df)
result = optimized_lf.execute({})
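The second eval_fn form, which returns a score together with a side_info dict, might be written along the lines below. Everything specific here is an assumption made for illustration: the `positive_reviews` field on the example, the choice of F1 as the score, and the `split_eval_result` helper (one way a caller could accept both return forms). A dict of lists stands in for the output DataFrame so the sketch runs without pandas.

```python
from typing import Any, Union

EvalResult = Union[float, tuple[float, dict[str, Any]]]


def eval_fn(output_df: Any, example: dict[str, Any]) -> EvalResult:
    """Score kept reviews against labels; also return diagnostics."""
    kept = list(output_df["review"])
    expected = set(example["positive_reviews"])  # hypothetical label field
    true_pos = sum(r in expected for r in kept)
    precision = true_pos / max(len(kept), 1)
    recall = true_pos / max(len(expected), 1)
    total = precision + recall
    score = 2 * precision * recall / total if total else 0.0
    side_info = {
        "precision": precision,
        "recall": recall,
        "false_positives": [r for r in kept if r not in expected],
    }
    return score, side_info


def split_eval_result(result: EvalResult) -> tuple[float, dict[str, Any]]:
    """Normalise either documented return form to (score, side_info)."""
    if isinstance(result, tuple):
        score, side_info = result
        return float(score), dict(side_info)
    return float(result), {}


output = {"review": ["great phone", "broke in a day"]}
example = {"positive_reviews": ["great phone", "works well"]}
score, info = split_eval_result(eval_fn(output, example))
```

Reporting the false positives alongside the score gives the reflection step something concrete to react to, which a bare number does not.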

optimize(nodes: list[BaseNode], train_data: dict[LazyFrame, pd.DataFrame] | pd.DataFrame | list[Any] | None = None) → list[BaseNode]

Optimize LazyFrame node parameters using GEPA.

Returns a new list of nodes with optimized parameter values.

requires_train_data: bool = True

class lotus.ast.optimizer.PredicatePushdownOptimizer

Bases: BaseOptimizer

Optimizer that moves pandas filters before sem_filters where safe.

This optimization reduces the number of rows processed by expensive semantic operations by filtering first with cheap pandas predicates.

A pandas filter can be pushed past a sem_filter because sem_filter only removes rows; it does not add or rename columns that the filter might depend on.
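The reordering rule can be sketched on a toy node list as shown below. This is a minimal illustration, not the LOTUS implementation: `Node`, its `kind` strings, and `push_down_filters` are hypothetical, and the sketch only ever swaps a pandas filter with a sem_filter directly in front of it, so it never crosses operators that may add or rename columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical pipeline node: an operator kind plus a readable label."""

    kind: str
    label: str


def push_down_filters(nodes: list[Node]) -> list[Node]:
    """Move pandas filters ahead of adjacent sem_filters until stable."""
    out = list(nodes)  # work on a copy; the input list is left untouched
    changed = True
    while changed:
        changed = False
        for i in range(len(out) - 1):
            if out[i].kind == "sem_filter" and out[i + 1].kind == "pandas_filter":
                # Both steps only drop rows, so their order can be exchanged.
                out[i], out[i + 1] = out[i + 1], out[i]
                changed = True
    return out


pipeline = [
    Node("sem_filter", "{text} is about sports"),
    Node("sem_filter", "{text} is recent"),
    Node("pandas_filter", "views > 100"),
    Node("sem_map", "Summarize {text}"),
    Node("pandas_filter", "summary is short"),
]
optimized = push_down_filters(pipeline)
```

In this run the first pandas filter moves to the front, ahead of both sem_filters, while the last one stays behind the sem_map because it may depend on a column that the map produces.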

optimize(nodes: list[BaseNode], train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None) → list[BaseNode]

Move pandas filter nodes before sem_filter nodes where safe.

Parameters:

nodes – List of nodes to optimize
train_data – Optional training data (not used by this optimizer)

Returns:

Optimized list of nodes with filters pushed earlier

requires_train_data: bool = False