sem_filter

sem_filter keeps rows whose contents satisfy a natural language predicate. Reference DataFrame columns with {column_name}.

Motivation

Semantic filtering evaluates each row against a natural language predicate, and it needs to do so both accurately and efficiently. Traditional filters rely on fixed rules such as exact matches or numeric thresholds. A semantic filter instead uses a language model to reason about each row in context.

Filter Example

import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

courses = pd.DataFrame({
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
    ]
})

math_heavy = courses.sem_filter(
    "{Course Name} requires a lot of math"
)

print(math_heavy)

Output:

                              Course Name
0        Probability and Random Processes
1     Optimization Methods in Engineering
2  Digital Design and Integrated Circuits

The result contains only the rows that the model judged as satisfying the predicate.
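To make the {column_name} reference concrete, the sketch below shows one plausible way a predicate could be combined with a single row before it is sent to a model. It is an illustration only: render_predicate and referenced_columns are hypothetical helpers, not LOTUS functions, and LOTUS may format rows differently internally.

```python
import re

# Hypothetical helpers, not part of LOTUS. They only illustrate how
# {column_name} placeholders relate a predicate to the values in a row.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def referenced_columns(instruction: str) -> list[str]:
    """Return the column names referenced as {column_name}, in order."""
    return PLACEHOLDER.findall(instruction)


def render_predicate(instruction: str, row: dict[str, object]) -> str:
    """Substitute each {column_name} with the row's value for that column."""
    missing = [c for c in referenced_columns(instruction) if c not in row]
    if missing:
        raise KeyError(f"instruction references unknown columns: {missing}")
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), instruction)


instruction = "{Course Name} requires a lot of math"
row = {"Course Name": "Computer Security"}

columns = referenced_columns(instruction)
rendered = render_predicate(instruction, row)
print(columns)
print(rendered)
```

A regular expression is used instead of str.format so that unknown column names can be reported together before any substitution happens.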

Returning Decisions for Every Row

By default, sem_filter drops rows that do not pass. Set return_all=True when you want to keep every row and add the model’s boolean decision as a new column.

judged = courses.sem_filter(
    "{Course Name} requires a lot of math",
    return_all=True,
    suffix="_math_heavy",
)

judged keeps the original rows and adds _math_heavy.
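The difference between the two modes can be sketched without calling a model. The example below is a minimal illustration that uses plain Python dictionaries in place of a DataFrame and a hard-coded stub_judge in place of the LM. Both names, filter_rows, and the "decision" column label are assumptions made for this sketch and are not part of LOTUS.

```python
from typing import Callable

Row = dict[str, object]


def filter_rows(
    rows: list[Row],
    judge: Callable[[Row], bool],
    return_all: bool = False,
    label: str = "decision",  # hypothetical column name for this sketch
) -> list[Row]:
    """Drop failing rows, or keep every row and record the decision."""
    if return_all:
        # Keep all rows; attach the boolean decision as a new field.
        return [{**row, label: judge(row)} for row in rows]
    # Default behaviour: keep only rows the judge accepts.
    return [row for row in rows if judge(row)]


def stub_judge(row: Row) -> bool:
    """Stand-in for the language model's decision (hard-coded)."""
    return row["Course Name"] != "Computer Security"


courses = [
    {"Course Name": "Probability and Random Processes"},
    {"Course Name": "Optimization Methods in Engineering"},
    {"Course Name": "Digital Design and Integrated Circuits"},
    {"Course Name": "Computer Security"},
]

kept = filter_rows(courses, stub_judge)
judged_all = filter_rows(courses, stub_judge, return_all=True)
print(len(kept), len(judged_all))
```

Keeping every row is useful when the decisions feed a later step, for example computing agreement with human labels, because the rejected rows remain available.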

Explanations and Raw Outputs

Use return_explanations=True while developing a predicate or auditing the model’s decisions.

judged = courses.sem_filter(
    "{Course Name} requires a lot of math",
    return_all=True,
    return_explanations=True,
    return_raw_outputs=True,
)

When return_all=False, explanations and raw outputs are returned only for the rows that pass. When return_all=True, they are returned for all rows.

Reasoning and Custom Instructions

Reasoning strategies can improve difficult filters by asking the model to work through the decision before producing True or False.

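from lotus.types import ReasoningStrategy

# Illustrative data; any DataFrame with an issue_title column works.
issues = pd.DataFrame({
    "issue_title": [
        "Fix typo in README",
        "Rewrite the query planner",
    ]
})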

filtered = issues.sem_filter(
    "{issue_title} is a small, self-contained task for a new contributor",
    strategy=ReasoningStrategy.ZS_COT,
    additional_cot_instructions="Focus on codebase knowledge and blast radius.",
)

system_prompt changes the model’s role for the filter. output_tokens sets the positive and negative labels and defaults to ("True", "False").
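The interaction between output_tokens and the default parameter can be pictured with a small parser. This is a sketch under assumptions: parse_decision is a hypothetical function, and the exact matching rules LOTUS applies (case handling, partial matches, reasoning text before the label) are not described here, so the strict comparison below is only one possible choice.

```python
def parse_decision(
    raw_output: str,
    output_tokens: tuple[str, str] = ("True", "False"),
    default: bool = True,
) -> bool:
    """Map raw model text to a boolean, falling back to `default`.

    Hypothetical helper for illustration; not the LOTUS implementation.
    """
    positive, negative = output_tokens
    text = raw_output.strip().lower()
    if text == positive.lower():
        return True
    if text == negative.lower():
        return False
    # Neither label matched, so the decision is uncertain.
    return default


print(parse_decision("True"))
print(parse_decision(" false\n"))
print(parse_decision("maybe", default=False))
print(parse_decision("Yes", output_tokens=("Yes", "No")))
```

Choosing default=False makes unparseable answers drop the row, which favors precision. Choosing default=True keeps the row, which favors recall.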

Cascades

Cascades reduce cost by using a cheaper helper first and routing uncertain rows to the main LM. See Optimized Processing with Approximations for the full details.

from lotus.types import CascadeArgs, ProxyModel

lotus.settings.configure(
    lm=LM(model="gpt-4o"),
    helper_lm=LM(model="gpt-4o-mini"),
)

cascade_args = CascadeArgs(
    recall_target=0.9,
    precision_target=0.9,
    sampling_percentage=0.5,
    failure_probability=0.2,
    proxy_model=ProxyModel.HELPER_LM,
    helper_filter_instruction="{issue_title} is easy for a new contributor",
)

filtered, stats = issues.sem_filter(
    "{issue_title} is a good first issue",
    cascade_args=cascade_args,
    return_stats=True,
)

helper_filter_instruction can be simpler than the main instruction. If it is omitted, the helper LM uses the main instruction.
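The routing idea behind a cascade can be sketched as follows. The example assumes the helper produces a confidence score between 0 and 1 and that two thresholds are already known; in LOTUS the thresholds are learned, and that procedure is not reproduced here. cascade_filter, the stub scorers, and the stats keys are hypothetical names chosen for this illustration.

```python
from typing import Callable


def cascade_filter(
    items: list[str],
    helper_score: Callable[[str], float],
    main_judge: Callable[[str], bool],
    pos_threshold: float,
    neg_threshold: float,
) -> tuple[list[bool], dict[str, int]]:
    """Accept or reject confident items with the helper; defer the rest."""
    if neg_threshold > pos_threshold:
        raise ValueError("neg_threshold must not exceed pos_threshold")
    decisions: list[bool] = []
    stats = {"resolved_by_helper": 0, "resolved_by_main": 0}
    for item in items:
        score = helper_score(item)
        if score >= pos_threshold:
            decisions.append(True)  # helper is confident the item passes
            stats["resolved_by_helper"] += 1
        elif score <= neg_threshold:
            decisions.append(False)  # helper is confident the item fails
            stats["resolved_by_helper"] += 1
        else:
            decisions.append(main_judge(item))  # uncertain: ask the main LM
            stats["resolved_by_main"] += 1
    return decisions, stats


# Stub scores and judgments standing in for real model calls.
helper_scores = {
    "Fix typo in README": 0.95,
    "Rewrite query planner": 0.05,
    "Add missing docstring": 0.50,
}
main_answers = {"Add missing docstring": True}

decisions, stats = cascade_filter(
    list(helper_scores),
    helper_score=helper_scores.__getitem__,
    main_judge=main_answers.__getitem__,
    pos_threshold=0.8,
    neg_threshold=0.2,
)
print(decisions, stats)
```

Only the middle band of scores reaches the expensive model, so the cost saving depends on how many rows the helper can resolve confidently.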

Return Value

Without return_stats, sem_filter returns a DataFrame. With return_stats=True and a cascade, it returns (df, stats). The stats describe learned thresholds and how many rows were resolved by the helper versus the main LM.

Required Parameters

  • user_instruction: Natural language predicate. Rows where the predicate is judged true are kept. Reference columns with {column_name}.

Optional Parameters

  • return_raw_outputs: Add raw model text columns.

  • return_explanations: Add explanation columns when available.

  • return_all: Keep all rows and add the boolean decision column instead of dropping false rows.

  • default: Boolean decision to use when output parsing is uncertain.

  • suffix: Output column suffix when return_all=True.

  • examples: Few-shot examples for the main LM. They must include an Answer column.

  • helper_examples: Few-shot examples for the helper LM in cascade mode.

  • strategy: Optional reasoning strategy.

  • cascade_args: Optional cascade configuration.

  • return_stats: Return (DataFrame, stats) when stats are available.

  • safe_mode: Estimate cost before execution.

  • progress_bar_desc: Progress bar label.

  • additional_cot_instructions: Extra instructions for CoT prompting.

  • system_prompt: Custom system prompt for the LM.

  • output_tokens: Positive and negative output tokens. Defaults to ("True", "False").

  • **model_kwargs: Extra keyword arguments passed to the configured LM.