Optimized Execution with LazyFrames =================================== LazyFrame separates pipeline definition from execution. That gives LOTUS a complete logical plan before any expensive LLM calls happen. ``optimize()`` uses that plan to rewrite execution order, tune prompts, and prepare cascades. Why Optimize? ------------- LLM data processing pipelines are sensitive to both cost and prompt quality. A pipeline that is correct on a few examples may become expensive or brittle at dataset scale. Optimization helps in three ways: - **Reduce cost** by doing cheap deterministic work before LM calls. - **Improve accuracy** by tuning semantic instructions against training data. - **Reuse learned state** such as cascade thresholds in future runs. This matters most for multi-step programs: filtering traces before aggregating failure modes, judging many model outputs, running RAG over retrieved evidence, or extracting structured fields from many documents. Optimization Flow ----------------- LazyFrame optimization has three steps: 1. Build the pipeline. Nothing executes yet. 2. Call ``optimize()`` with optional optimizers and training data. 3. Call ``execute()`` on the optimized pipeline. Original pipeline: .. code-block:: python from lotus.ast import LazyFrame pipeline = LazyFrame().sem_filter( "The {issue_title} describes a small, self-contained task that a new " "open source contributor could tackle without deep knowledge of the codebase" ) pipeline.print_tree() Updated pipeline: .. code-block:: python from lotus.ast.optimizer import GEPAOptimizer, CascadeOptimizer optimized = pipeline.optimize( [GEPAOptimizer(eval_fn=eval_fn), CascadeOptimizer()], train_data=training_issues, ) optimized.print_tree() result = optimized.execute(issues) ``pipeline`` is the original logical plan. ``optimized`` is the updated plan returned by LOTUS after applying the selected optimizers plus default predicate pushdown. Printing both trees is the easiest way to inspect what changed before you run the optimized pipeline on the full dataset. ``optimize()`` returns a new LazyFrame by default. Pass ``inplace=True`` only when you want to update the existing object. Predicate Pushdown ------------------ Predicate pushdown moves cheap pandas filters before semantic operators when that rewrite is safe. It is on by default whenever you call ``optimize()``. You do not need to include it in the optimizer list. This helps because pandas filters are local and inexpensive, while semantic filters call an LM. If a pandas filter removes half the rows, pushing it before ``sem_filter`` can remove half the LM calls. .. code-block:: python pipeline = ( LazyFrame() .sem_filter("{issue_title} is a good first issue") .filter(lambda df: df["priority"] != "critical") ) pipeline.print_tree() optimized = pipeline.optimize() # predicate pushdown still runs optimized.print_tree() Output: .. code-block:: text # Original plan Source sem_filter('{issue_title} is a good first issue') filter(...) # Optimized plan Source filter(...) sem_filter('{issue_title} is a good first issue') Disable default optimizers when you need exact original plan order. .. code-block:: python optimized = pipeline.optimize( [], auto_include_default_optimizers=False, ) GEPA Prompt Optimization ------------------------ ``GEPAOptimizer`` uses `GEPA `_ to tune natural language instructions using training data and an evaluation function. This is useful when a high-level prompt is easy to write but not accurate enough for your metric. The evaluation function receives ``(output_df, example)`` and returns either a score or ``(score, side_info)``. Higher scores are better. ``side_info`` gives the optimizer diagnostic context, such as precision and recall. .. code-block:: python from lotus.ast.optimizer import GEPAOptimizer GOOD_FIRST_ISSUE_IDS = {0, 3, 5, 7, 9, 11} def eval_fn(output_df, example): kept = set(output_df.index) true_positive = len(kept & GOOD_FIRST_ISSUE_IDS) precision = true_positive / max(len(kept), 1) recall = true_positive / max(len(GOOD_FIRST_ISSUE_IDS), 1) f1 = 2 * precision * recall / max(precision + recall, 1e-9) return f1, {"precision": precision, "recall": recall} optimizer = GEPAOptimizer( eval_fn=eval_fn, objective="Maximize F1 for identifying good first issues.", ) pipeline = LazyFrame().sem_filter( "{issue_title} is an easy starter task" ) optimized = pipeline.optimize([optimizer], train_data=issues) GEPA can optimize instructions on semantic operators such as ``sem_filter``, ``sem_map``, ``sem_agg``, ``sem_topk``, ``sem_join``, ``sem_search``, and the evaluation operators. The benefit is end-to-end prompt tuning: if a pipeline has multiple semantic steps, the prompts can be improved together instead of tuning each operator in isolation. Choosing What GEPA Can Change ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use ``mark_optimizable`` to restrict which parameters GEPA can rewrite. Use an empty list to pin a node so GEPA leaves it unchanged. .. code-block:: python pipeline = ( LazyFrame() .sem_filter( "{issue_title} is a good first issue", mark_optimizable=["user_instruction"], ) .sem_map( "Rewrite {issue_title} as a task title", suffix="_task", mark_optimizable=[], ) ) Cascade Thresholds ------------------ Cascades reduce cost by routing easy examples to a cheaper proxy model and only sending uncertain examples to the main LM. A cascade needs thresholds to decide what counts as high-confidence. ``CascadeOptimizer`` runs the pipeline once on training data to learn and store those thresholds. Later executions reuse the thresholds and skip the learning pass. .. code-block:: python import lotus from lotus.ast import LazyFrame from lotus.ast.optimizer import CascadeOptimizer from lotus.models import LM from lotus.types import CascadeArgs lotus.settings.configure( lm=LM(model="gpt-4o"), helper_lm=LM(model="gpt-4o-mini"), ) cascade_args = CascadeArgs( recall_target=0.9, precision_target=0.9, sampling_percentage=0.5, failure_probability=0.2, ) pipeline = LazyFrame().sem_filter( "{issue_title} is a good first issue", cascade_args=cascade_args, ) optimized = pipeline.optimize([CascadeOptimizer()], train_data=issues) result = optimized.execute(issues) Use higher ``recall_target`` when missing true positives is costly. Use higher ``precision_target`` when false positives are costly. Higher targets usually increase main-LM calls. Saving Optimized Pipelines -------------------------- Optimized pipelines can be saved and loaded like any LazyFrame. This preserves optimized prompts and learned cascade thresholds. .. code-block:: python optimized.save("optimized_lf.pkl") loaded = LazyFrame.load("optimized_lf.pkl") result = loaded.execute(issues) API Reference ------------- See :doc:`lazyframe_api` for the full LazyFrame and optimizer API reference.