LazyFrame API Reference
LazyFrame
LazyFrame builder for LOTUS AST operations.
LazyFrame is an immutable builder that records a sequence of semantic and
pandas operations as AST nodes. Nothing is executed until .execute() (or
.run().execute()) is called.
Example:
lf = LazyFrame().sem_filter("{text} is about sports").sem_map("Summarize {text}")
result = lf.execute(df)
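The record-then-execute pattern described above can be illustrated with a small stand-in. This is a minimal sketch, not the LOTUS implementation: `ToyLazyFrame`, its `filter` and `map` methods, and the list-of-dicts row format are all hypothetical names chosen for illustration. It only shows how a builder can stay immutable while steps accumulate, and how no work happens before `execute`.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list[dict]


@dataclass(frozen=True)
class ToyLazyFrame:
    """Hypothetical stand-in for illustration; not the LOTUS implementation."""

    nodes: tuple[tuple[str, Callable[[Rows], Rows]], ...] = ()

    def _with(self, name: str, fn: Callable[[Rows], Rows]) -> "ToyLazyFrame":
        # Return a new builder; the receiver is left untouched.
        return ToyLazyFrame(self.nodes + ((name, fn),))

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyLazyFrame":
        return self._with("filter", lambda rows: [r for r in rows if predicate(r)])

    def map(self, fn: Callable[[dict], dict]) -> "ToyLazyFrame":
        return self._with("map", lambda rows: [fn(r) for r in rows])

    def execute(self, rows: Rows) -> Rows:
        # Only here are the recorded steps actually applied, in order.
        for _name, step in self.nodes:
            rows = step(rows)
        return rows


base = ToyLazyFrame()
pipeline = base.filter(lambda r: "sport" in r["text"]).map(
    lambda r: {**r, "length": len(r["text"])}
)

rows = [{"text": "sports news"}, {"text": "weather report"}]
result = pipeline.execute(rows)
```

Chaining leaves `base` with no recorded nodes, while `pipeline` holds two; the rows are only touched inside `execute`.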
- class lotus.ast.lazyframe.LazyFrame(df: DataFrame | None = None, *, schema: dict[str, str] | None = None, _nodes: list[BaseNode] | None = None, _source: SourceNode | None = None, _default_cache: Cache | None = None)
Bases: object
Immutable lazy version of DataFrame and semantic operators.
Operations are recorded as AST nodes and only materialised when .execute() is called.
- Parameters:
df – Optional bound DataFrame. When provided the source data is stored directly on the LazyFrame so no external input is required at execution time.
schema – Optional {col_name: dtype} dict validated at execution time against the source DataFrame.
_nodes – Internal: pre-built node list (used by copy/optimise).
_source – Internal: explicit source node reference.
Example:
>>> lf = LazyFrame().sem_filter("{text} is about sports").sem_map("Summarize {text}")
>>> result = lf.execute(df)
- add_source(df: DataFrame | None = None, schema: dict[str, str] | None = None) LazyFrame
Set the LazyFrame source (optional bound DataFrame and schema).
Replaces the single source node. Use this to bind a df or add schema validation.
- assign(**kwargs: Any) LazyFrame
Add column assignments and return a new LazyFrame.
Values may be scalars, callables (df -> Series), or other LazyFrame instances (resolved lazily at execution time).
- classmethod concat(objs: list['LazyFrame'] | 'LazyFrame', **kwargs: Any) LazyFrame
Concatenate one or more LazyFrame results via pd.concat.
- copy() LazyFrame
Return a deep copy of this LazyFrame.
SourceNode.lazyframe_ref values are restored to match the original graph so copied pipelines still resolve dict[LazyFrame, DataFrame] inputs correctly (including nested child LazyFrames).
- execute(inputs: DataFrame | dict[LazyFrame, DataFrame], *, cache: Cache | None = None) DataFrame | Any
Execute the LazyFrame and return the result.
- Parameters:
inputs – Single DataFrame (for this LazyFrame) or dict of LazyFrame -> DataFrame.
- classmethod from_fn(fn: Callable[[...], Any], *args: Any, **kwargs: Any) LazyFrame
Create a LazyFrame node that applies a callable to resolved inputs.
- llm_as_judge(judge_instruction: str, *, response_format: Any | None = None, n_trials: int = 1, system_prompt: str | None = None, postprocessor: Callable[[list[str], Any, bool], SemanticMapPostprocessOutput] | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, cot_reasoning: list[str] | None = None, strategy: ReasoningStrategy | None = None, extra_cols_to_include: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', mark_optimizable: list[str] | None = None, **model_kwargs: Any) LazyFrame
Add an LLM-as-judge evaluation operation.
- classmethod load(path: str | Path) LazyFrame
Load a LazyFrame pipeline from a file saved with save().
- Parameters:
path – File path previously written by save().
- Returns:
A reconstructed LazyFrame with the same pipeline structure.
- mark_optimizable(node_idx: int, params: list[str]) LazyFrame
Mark specific parameters on a node for GEPA optimization.
- Parameters:
node_idx – Index of the node in the LazyFrame’s node list.
params – List of parameter names to optimize, e.g. ["user_instruction"]. Pass an empty list to explicitly exclude the node from optimization.
- Returns:
New LazyFrame with the targeted node annotated.
- optimize(optimizers: list[BaseOptimizer] = [], *, inplace: bool = False, train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None, auto_include_default_optimizers: bool = True) LazyFrame
Apply optimizations to this LazyFrame.
- Parameters:
optimizers – List of optimizers to apply.
inplace – If True, modify this LazyFrame in place.
train_data – Optional training data for optimizers that require it.
auto_include_default_optimizers – If True (default), include the following default optimizers: PredicatePushdownOptimizer.
- Returns:
The optimized LazyFrame (same object if inplace, new otherwise).
- pairwise_judge(col1: str, col2: str, judge_instruction: str, *, n_trials: int = 1, permute_cols: bool = False, system_prompt: str | None = None, return_raw_outputs: bool = False, return_explanations: bool = False, suffix: str = '_judge', examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Evaluating', default_to_col1: bool = True, helper_examples: DataFrame | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, additional_cot_instructions: str = '', mark_optimizable: list[str] | None = None, **model_kwargs: Any) LazyFrame
Add a pairwise judge evaluation operation.
- print_tree() None
Print the LazyFrame structure as a tree.
- run(inputs: pd.DataFrame | dict['LazyFrame', pd.DataFrame], *, cache: Cache | None = None) LazyFrameRun
Create a LazyFrameRun for this LazyFrame.
- Parameters:
inputs – Single DataFrame (for this LazyFrame) or dict mapping LazyFrame objects to DataFrames.
- save(path: str | Path) None
Serialize this LazyFrame pipeline to a file.
The pipeline structure (all AST nodes) is persisted using pickle. Bound DataFrames, callables, and nested LazyFrame references are included, so the file is not portable across different Python environments if custom callables are used.
- Parameters:
path – Destination file path (e.g. "pipeline.pkl").
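The portability caveat for save() follows from how the standard pickle module treats callables. The sketch below uses only the Python standard library and does not call LOTUS; how save() itself handles lambdas or locally defined functions is not stated in this reference, so treat this as background on pickle rather than a description of LOTUS internals.

```python
import json
import pickle

# Pickle a function: the payload holds only a reference (module + name),
# not the function's code.
payload = pickle.dumps(json.dumps)
references_by_name = b"json" in payload and b"dumps" in payload

# Loading re-imports the module and looks the name up again, so the
# loading environment must provide the same module and function.
restored = pickle.loads(payload)
same_object = restored is json.dumps

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda row: row)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

Because only the name travels with the file, a pipeline that references a custom function can be reloaded only where that function is importable under the same module path.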
- sem_agg(user_instruction: str, *, all_cols: bool = False, suffix: str = '_output', group_by: list[str] | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Aggregating', long_context_strategy: LongContextStrategy | None = LongContextStrategy.CHUNK, response_format: type[BaseModel] | dict | None = None, split_fields_into_cols: bool = True, mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic aggregation operation.
- sem_cluster_by(col_name: str, ncentroids: int, *, return_scores: bool = False, return_centroids: bool = False, niter: int = 20, verbose: bool = False) LazyFrame
Add a semantic clustering operation.
- sem_extract(input_cols: list[str], output_cols: dict[str, str | None], *, extract_quotes: bool = False, postprocessor: Callable[[list[str], Any, bool], SemanticExtractPostprocessOutput] | None = None, return_raw_outputs: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Extracting', return_explanations: bool = False, strategy: ReasoningStrategy | None = None, mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic extract operation.
- sem_filter(user_instruction: str, *, return_raw_outputs: bool = False, return_explanations: bool = False, return_all: bool = False, default: bool = True, suffix: str = '_filter', examples: DataFrame | None = None, helper_examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Filtering', additional_cot_instructions: str = '', system_prompt: str | None = None, output_tokens: tuple[str, str] = ('True', 'False'), mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic filter operation.
- sem_join(right: 'LazyFrame' | pd.DataFrame, join_instruction: str, *, return_explanations: bool = False, how: str = 'inner', suffix: str = '_join', examples: pd.DataFrame | None = None, strategy: ReasoningStrategy | None = None, default: bool = True, cascade_args: CascadeArgs | None = None, return_stats: bool = False, safe_mode: bool = False, progress_bar_desc: str = 'Join comparisons', mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic join operation.
- sem_map(user_instruction: str, *, system_prompt: str | None = None, postprocessor: Callable[[list[str], Any, bool], SemanticMapPostprocessOutput] | None = None, return_explanations: bool = False, return_raw_outputs: bool = False, suffix: str = '_map', examples: DataFrame | None = None, strategy: ReasoningStrategy | None = None, safe_mode: bool = False, progress_bar_desc: str = 'Mapping', mark_optimizable: list[str] | None = None, **model_kwargs: Any) LazyFrame
Add a semantic map operation.
- sem_partition_by(partition_fn: Callable[[DataFrame], list[int]]) LazyFrame
Add a semantic partition operation.
- sem_search(col_name: str, query: str, *, K: int | None = None, n_rerank: int | None = None, return_scores: bool = False, suffix: str = '_sim_score', mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic search operation.
- sem_sim_join(right: 'LazyFrame' | pd.DataFrame, left_on: str, right_on: str, K: int, *, lsuffix: str = '', rsuffix: str = '', score_suffix: str = '', keep_index: bool = False) LazyFrame
Add a semantic similarity join operation.
- sem_topk(user_instruction: str, K: int, *, method: str = 'quick', strategy: ReasoningStrategy | None = None, group_by: list[str] | None = None, cascade_threshold: float | None = None, return_stats: bool = False, safe_mode: bool = False, return_explanations: bool = False, mark_optimizable: list[str] | None = None) LazyFrame
Add a semantic top-k operation.
- show() str
Return the LazyFrame structure as a tree-like string.
Optimizers
Optimizer module for LOTUS LazyFrames.
- class lotus.ast.optimizer.BaseOptimizer
Bases: ABC
Base class for LazyFrame optimizers.
Each optimizer implements a specific optimization strategy that transforms a list of nodes to improve performance.
- abstract optimize(nodes: list[BaseNode], train_data: dict['LazyFrame', pd.DataFrame] | pd.DataFrame | None = None) list[BaseNode]
Apply optimization to a list of nodes.
- Parameters:
nodes – List of nodes to optimize
train_data – Optional training data dict (LazyFrame -> DataFrame). Only provided if requires_train_data is True.
- Returns:
Optimized list of nodes (may be the same list if no changes)
- requires_train_data: bool = False
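The interface above (an abstract optimize method plus a requires_train_data flag) can be sketched as follows. To stay runnable without LOTUS installed, the example defines a local stand-in for BaseOptimizer that mirrors the documented signature; `DropConsecutiveDuplicates` and the string "nodes" are hypothetical, and the transformation assumes adjacent identical nodes are idempotent, which need not hold for real LOTUS nodes.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseOptimizer(ABC):
    """Local stand-in mirroring the documented interface (not imported from LOTUS)."""

    requires_train_data: bool = False

    @abstractmethod
    def optimize(self, nodes: list[Any], train_data: Any = None) -> list[Any]:
        ...


class DropConsecutiveDuplicates(BaseOptimizer):
    """Hypothetical optimizer: collapses runs of identical adjacent nodes."""

    # No training data is needed, so train_data is expected to stay None.
    requires_train_data = False

    def optimize(self, nodes: list[Any], train_data: Any = None) -> list[Any]:
        optimized: list[Any] = []
        for node in nodes:
            # Keep a node only if it differs from the one just kept.
            if not optimized or optimized[-1] != node:
                optimized.append(node)
        return optimized


nodes = ["filter:a", "filter:a", "map:b", "filter:a"]
optimized = DropConsecutiveDuplicates().optimize(nodes)
```

The method returns a new list and leaves its input untouched, which matches the documented contract of returning an optimized list of nodes.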
- class lotus.ast.optimizer.CascadeOptimizer
Bases: BaseOptimizer
Optimizer that pre-warms cascade thresholds on training data.
Runs the LazyFrame pipeline once on training data. SemFilterNode and SemJoinNode nodes that have cascade_args but no cached thresholds automatically learn and store them in self.cascade_args during __call__. Future executions then reuse the cached thresholds, skipping the threshold-learning sample.
This works recursively: nested LazyFrames (e.g. the right side of a sem_join) are resolved by the standard runner and their nodes self-update in the same way.
Requires train_data: a single DataFrame or a dict mapping LazyFrames to DataFrames.
Example:
from lotus.ast.optimizer import CascadeOptimizer
optimizer = CascadeOptimizer()
optimized_lf = lf.optimize([optimizer], train_data=df)
# Subsequent executions reuse the cached thresholds
result = optimized_lf.execute(df)
- optimize(nodes: list[BaseNode], train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None) list[BaseNode]
Run the pipeline on train_data so cascade nodes learn and cache thresholds.
- requires_train_data: bool = True
- class lotus.ast.optimizer.GEPAOptimizer(eval_fn: UserEvalFn, *, valset: dict[LazyFrame, pd.DataFrame] | pd.DataFrame | list[Any] | None = None, gepa_config: GEPAConfig | None = None, objective: str | None = None, background: str | None = None, cache: Cache | None = None, include_output_in_side_info: bool = True)
Bases: BaseOptimizer
GEPA-based prompt/instruction optimizer for LOTUS LazyFrames.
Automatically optimizes natural language instructions in semantic operator nodes using LLM-guided evolutionary search (GEPA).
By default, user_instruction on sem_filter/sem_map/sem_agg/sem_topk, join_instruction on sem_join, and query on sem_search are optimized. For sem_filter cascades that use HELPER_LM, the helper prompt target cascade_args.helper_filter_instruction is also optimized. The same helper prompt target is exposed for pairwise_judge nodes in mode="sem_filter" when using helper-LM cascades. Use mark_optimizable on the LazyFrame to customize which parameters to optimize or to exclude specific nodes entirely.
- Parameters:
eval_fn – Scoring function called once per (candidate, example) pair. Signature: (output_df, example) -> float or (output_df, example) -> (float, side_info_dict). Higher scores are better. Return a side_info dict alongside the score to give the GEPA reflection LLM diagnostic context (e.g. expected vs. actual output, precision/recall breakdown).
valset – Optional held-out validation set (list of examples) for GEPA generalization mode. When provided, GEPA selects the best candidate based on valset performance rather than training performance.
gepa_config – GEPAConfig object controlling max LLM calls, model, temperature, etc. Defaults to GEPA’s built-in defaults when None.
objective – Natural language goal string passed to the reflection LLM. Auto-generated from the LazyFrame structure when None.
background – Domain context / constraints string for the reflection LLM. Auto-generated with LOTUS operator reference when None.
Example:
from lotus.ast.lazyframe import LazyFrame
from lotus.ast.optimizer import GEPAOptimizer

def eval_fn(output_df, example):
    # Score: fraction of reviews kept that are actually positive
    positive_kept = sum("great" in r or "ok" in r for r in output_df["review"])
    return positive_kept / max(len(output_df), 1)

# df is a pandas DataFrame with a "review" column
optimizer = GEPAOptimizer(eval_fn=eval_fn)
lf = LazyFrame(df=df).sem_filter("{review} is a positive product review")
optimized_lf = lf.optimize([optimizer], train_data=df)
# df is bound to the LazyFrame, so no external inputs are passed
result = optimized_lf.execute({})
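The eval_fn parameter also accepts the (float, side_info_dict) return form described above. The following is a minimal sketch of that form under stated assumptions: the structure of `example` is not specified in this reference, so the `positive_reviews` key is hypothetical, as are the side_info keys. A plain dict of lists stands in for the output DataFrame so the snippet runs without pandas or LOTUS.

```python
def eval_fn(output_df, example):
    """Return (score, side_info); the score is the precision of the kept reviews."""
    kept = list(output_df["review"])
    # Assumption: each example carries the reviews known to be positive.
    expected = set(example["positive_reviews"])
    true_positives = [r for r in kept if r in expected]
    precision = len(true_positives) / max(len(kept), 1)
    recall = len(true_positives) / max(len(expected), 1)
    # Diagnostic context a reflection LLM could use to revise the instruction.
    side_info = {
        "precision": precision,
        "recall": recall,
        "unexpected": [r for r in kept if r not in expected],
    }
    return precision, side_info


# Plain dict of lists standing in for a DataFrame with a "review" column.
output_df = {"review": ["great phone", "broke in a week"]}
example = {"positive_reviews": ["great phone", "ok battery"]}
score, info = eval_fn(output_df, example)
```

Returning the unexpected rows alongside the score gives the reflection step concrete failures to reason about, rather than a bare number.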
- optimize(nodes: list[BaseNode], train_data: dict[LazyFrame, pd.DataFrame] | pd.DataFrame | list[Any] | None = None) list[BaseNode]
Optimize LazyFrame node parameters using GEPA.
Returns a new list of nodes with optimized parameter values.
- requires_train_data: bool = True
- class lotus.ast.optimizer.PredicatePushdownOptimizer
Bases: BaseOptimizer
Optimizer that moves pandas filters before sem_filters where safe.
This optimization reduces the number of rows processed by expensive semantic operations by filtering first with cheap pandas predicates.
A pandas filter can be pushed past a sem_filter because sem_filter only removes rows - it doesn’t add or rename columns that the filter might depend on.
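The reordering rule described above can be sketched on a toy node list. This is an illustration of the idea only, not the optimizer's actual code: the `Node` record, its `kind` strings, and `push_filters_down` are hypothetical. The sketch treats every node other than a sem_filter as a barrier, since such a node may add or rename columns the pandas filter depends on.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Hypothetical node record; real LOTUS nodes are richer objects."""

    kind: str  # "pandas_filter", "sem_filter", or anything else
    label: str


def push_filters_down(nodes: list[Node]) -> list[Node]:
    """Move each pandas filter ahead of the sem_filters directly before it."""
    result: list[Node] = []
    for node in nodes:
        result.append(node)
        if node.kind != "pandas_filter":
            continue
        i = len(result) - 1
        # Swap backwards only across sem_filters; any other node stops the move.
        while i > 0 and result[i - 1].kind == "sem_filter":
            result[i - 1], result[i] = result[i], result[i - 1]
            i -= 1
    return result


pipeline = [
    Node("sem_map", "summarize"),
    Node("sem_filter", "is about sports"),
    Node("sem_filter", "is recent"),
    Node("pandas_filter", "views > 100"),
]
optimized = push_filters_down(pipeline)
order = [n.label for n in optimized]
```

Here the pandas filter moves ahead of both sem_filters but stays behind the sem_map, so the semantic filters see fewer rows while the relative order of the sem_filters is preserved.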
- optimize(nodes: list[BaseNode], train_data: DataFrame | dict[LazyFrame, DataFrame] | None = None) list[BaseNode]
Move pandas filter nodes before sem_filter nodes where safe.
- Parameters:
nodes – List of nodes to optimize
train_data – Optional training data (not used by this optimizer)
- Returns:
Optimized list of nodes with filters pushed earlier
- requires_train_data: bool = False