sem_filter
==========

``sem_filter`` keeps rows whose contents satisfy a natural language predicate.
Reference DataFrame columns with ``{column_name}``.

Motivation
----------
Semantic filtering evaluates each row against a nuanced natural language
predicate, and it has to do so accurately and efficiently. Unlike traditional
filters, which rely on rigid rules, semantic filters use language models to
reason contextually about the data.


Filter Example
--------------
.. code-block:: python

    import pandas as pd
    import lotus
    from lotus.models import LM

    lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

    courses = pd.DataFrame({
        "Course Name": [
            "Probability and Random Processes",
            "Optimization Methods in Engineering",
            "Digital Design and Integrated Circuits",
            "Computer Security",
        ]
    })

    math_heavy = courses.sem_filter(
        "{Course Name} requires a lot of math"
    )

    print(math_heavy)

Output:

+---+----------------------------------------+
|   | Course Name                            |
+===+========================================+
| 0 | Probability and Random Processes       |
+---+----------------------------------------+
| 1 | Optimization Methods in Engineering    |
+---+----------------------------------------+
| 2 | Digital Design and Integrated Circuits |
+---+----------------------------------------+

The result contains only the rows that the model judged as satisfying the
predicate.
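
To make the ``{column_name}`` convention concrete, the sketch below extracts the
column names a predicate refers to and reports any that are absent from a
table. The helpers ``referenced_columns`` and ``missing_columns`` are
hypothetical and written only for illustration; lotus does its own parsing
internally, which may differ.

.. code-block:: python

    import re

    def referenced_columns(instruction):
        """Return the names that appear as {column_name} placeholders."""
        return re.findall(r"\{([^{}]+)\}", instruction)

    def missing_columns(instruction, available):
        """Return referenced columns that are not in the available list."""
        return [c for c in referenced_columns(instruction) if c not in available]

    instruction = "{Course Name} requires a lot of math"
    columns = referenced_columns(instruction)            # ["Course Name"]
    missing = missing_columns(instruction, ["Course Name"])  # []

A check like this can be run before calling ``sem_filter`` to catch a
misspelled column name without spending any model calls.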

Returning Decisions for Every Row
---------------------------------

By default, ``sem_filter`` drops rows that do not pass. Set
``return_all=True`` when you want to keep every row and add the model's boolean
decision as a new column.

.. code-block:: python

    judged = courses.sem_filter(
        "{Course Name} requires a lot of math",
        return_all=True,
        suffix="_math_heavy",
    )

``judged`` keeps the original rows and adds ``_math_heavy``.
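
The difference between the two result shapes can be sketched without calling a
model. The helper ``apply_decisions`` below is hypothetical, the decisions are
hard-coded stand-ins for model output, and rows are plain dictionaries rather
than a DataFrame. The exact column name that lotus produces may differ from
this simplification.

.. code-block:: python

    def apply_decisions(rows, decisions, return_all=False, suffix=None):
        """Mimic dropping rows versus keeping all rows with a decision key."""
        if return_all:
            # Keep every row and record the boolean decision under `suffix`.
            return [{**row, suffix: keep} for row, keep in zip(rows, decisions)]
        # Default behaviour: keep only the rows judged true.
        return [row for row, keep in zip(rows, decisions) if keep]

    rows = [
        {"Course Name": "Probability and Random Processes"},
        {"Course Name": "Optimization Methods in Engineering"},
        {"Course Name": "Digital Design and Integrated Circuits"},
        {"Course Name": "Computer Security"},
    ]
    decisions = [True, True, True, False]  # stand-in for the model's judgments

    kept = apply_decisions(rows, decisions)
    judged_rows = apply_decisions(
        rows, decisions, return_all=True, suffix="_math_heavy"
    )

Here ``kept`` has three rows, while ``judged_rows`` has all four, each carrying
its decision.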

Explanations and Raw Outputs
----------------------------

Use ``return_explanations=True`` while developing a predicate or auditing the
model's decisions.

.. code-block:: python

    judged = courses.sem_filter(
        "{Course Name} requires a lot of math",
        return_all=True,
        return_explanations=True,
        return_raw_outputs=True,
    )

When ``return_all=False``, explanations and raw outputs are returned only for
the rows that pass. When ``return_all=True``, they are returned for all rows.

Reasoning and Custom Instructions
---------------------------------

Reasoning strategies can improve difficult filters by asking the model to work
through the decision before producing ``True`` or ``False``.

.. code-block:: python

    from lotus.types import ReasoningStrategy

    # `issues` is assumed to be a DataFrame with an `issue_title` column.
    filtered = issues.sem_filter(
        "{issue_title} is a small, self-contained task for a new contributor",
        strategy=ReasoningStrategy.ZS_COT,
        additional_cot_instructions="Focus on codebase knowledge and blast radius.",
    )

``system_prompt`` changes the model's role for the filter. ``output_tokens``
changes the positive and negative labels, which default to ``("True",
"False")``.

Cascades
--------

Cascades reduce cost by using a cheaper helper first and routing uncertain
rows to the main LM. See :doc:`approximation_cascades` for the full details.

.. code-block:: python

    from lotus.types import CascadeArgs, ProxyModel

    lotus.settings.configure(
        lm=LM(model="gpt-4o"),
        helper_lm=LM(model="gpt-4o-mini"),
    )

    cascade_args = CascadeArgs(
        recall_target=0.9,
        precision_target=0.9,
        sampling_percentage=0.5,
        failure_probability=0.2,
        proxy_model=ProxyModel.HELPER_LM,
        helper_filter_instruction="{issue_title} is easy for a new contributor",
    )

    # `issues` is assumed to be a DataFrame with an `issue_title` column.
    filtered, stats = issues.sem_filter(
        "{issue_title} is a good first issue",
        cascade_args=cascade_args,
        return_stats=True,
    )

``helper_filter_instruction`` can be simpler than the main instruction. If it
is omitted, the helper LM uses the main instruction.
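
The routing idea can be sketched as a two-threshold rule: confident helper
scores are accepted or rejected directly, and everything in between goes to the
main LM. This is a simplified illustration, not the lotus implementation. The
function ``route_with_cascade``, the fixed threshold values, the stats key
names, and the stand-in answers for the main LM are all assumptions; in lotus
the thresholds are learned from a sample to meet the recall and precision
targets.

.. code-block:: python

    def route_with_cascade(helper_scores, pos_threshold, neg_threshold, main_lm_decide):
        """Resolve confident rows with the helper; defer the rest to the main LM."""
        decisions = []
        stats = {"resolved_by_helper": 0, "resolved_by_main_lm": 0}
        for index, score in enumerate(helper_scores):
            if score >= pos_threshold:
                decisions.append(True)   # helper is confident the row passes
                stats["resolved_by_helper"] += 1
            elif score <= neg_threshold:
                decisions.append(False)  # helper is confident the row fails
                stats["resolved_by_helper"] += 1
            else:
                # Uncertain band: ask the (more expensive) main LM.
                decisions.append(main_lm_decide(index))
                stats["resolved_by_main_lm"] += 1
        return decisions, stats

    helper_scores = [0.95, 0.10, 0.55, 0.80]   # hypothetical helper confidences
    main_lm_answers = {2: True, 3: False}      # stand-in for main LM calls

    decisions, stats = route_with_cascade(
        helper_scores,
        pos_threshold=0.9,
        neg_threshold=0.2,
        main_lm_decide=main_lm_answers.__getitem__,
    )

In this toy run the helper resolves two rows and the main LM resolves the other
two. Widening the gap between the thresholds sends more rows to the main LM,
which raises cost but reduces reliance on the helper.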

Return Value
------------

Without ``return_stats``, ``sem_filter`` returns a DataFrame. With
``return_stats=True`` and a cascade, it returns ``(df, stats)``. The stats
describe learned thresholds and how many rows were resolved by the helper
versus the main LM.
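
Because the return shape depends on the arguments, calling code that handles
both cases may want to normalize it. The helper ``split_result`` below is a
hypothetical convenience, not part of lotus, and the strings and dictionary in
the example stand in for a real DataFrame and stats object.

.. code-block:: python

    def split_result(result):
        """Return (frame, stats), with stats=None when only a frame came back."""
        if isinstance(result, tuple):
            frame, stats = result
            return frame, stats
        return result, None

    # Stand-in values; a real call would pass what sem_filter returned.
    frame_only = split_result("frame")
    frame_and_stats = split_result(("frame", {"example_stat": 1}))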

Required Parameters
-------------------

- ``user_instruction``: Natural language predicate. Rows where the predicate is
  judged true are kept. Reference columns with ``{column_name}``.

Optional Parameters
-------------------

- ``return_raw_outputs``: Add raw model text columns.
- ``return_explanations``: Add explanation columns when available.
- ``return_all``: Keep all rows and add the boolean decision column instead of
  dropping false rows.
- ``default``: Boolean decision to use when output parsing is uncertain.
- ``suffix``: Output column suffix when ``return_all=True``.
- ``examples``: Few-shot examples for the main LM with an ``Answer`` column.
- ``helper_examples``: Few-shot examples for the helper LM in cascade mode.
- ``strategy``: Optional reasoning strategy.
- ``cascade_args``: Optional cascade configuration.
- ``return_stats``: Return ``(DataFrame, stats)`` when stats are available.
- ``safe_mode``: Estimate cost before execution.
- ``progress_bar_desc``: Progress bar label.
- ``additional_cot_instructions``: Extra instructions for CoT prompting.
- ``system_prompt``: Custom system prompt for the LM.
- ``output_tokens``: Positive and negative output tokens. Defaults to
  ``("True", "False")``.
- ``**model_kwargs``: Extra keyword arguments passed to the configured LM.
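
The ``output_tokens`` and ``default`` parameters both affect how a model reply
becomes a boolean. The sketch below shows one plausible way they could
interact; ``parse_decision`` is a hypothetical function, the exact-match rule is
an assumption, and the ``default=True`` used here is chosen for illustration
only. The parsing in lotus may be more lenient.

.. code-block:: python

    def parse_decision(raw_output, output_tokens=("True", "False"), default=True):
        """Map raw model text to a boolean, falling back to `default`."""
        positive, negative = output_tokens
        text = raw_output.strip().lower()
        if text == positive.lower():
            return True
        if text == negative.lower():
            return False
        # Neither label matched, so the parse is uncertain.
        return default

    clear_yes = parse_decision("True")
    clear_no = parse_decision(" false\n")
    uncertain = parse_decision("maybe", default=False)
    custom = parse_decision("yes", output_tokens=("yes", "no"))

Choosing ``default=False`` makes the filter stricter when replies are
ambiguous, while ``default=True`` favors keeping rows.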