sem_join

sem_join joins two DataFrames, or a DataFrame and a named Series, using a natural language predicate instead of an equality condition.

Motivation

Traditional join operations often rely on rigid equality conditions, making them unsuitable for scenarios requiring nuanced, context-aware relationships. The sem_join operator addresses these limitations by enabling semantic matching of rows between datasets based on natural language predicates

Join Example

import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

courses = pd.DataFrame({
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
})

skills = pd.DataFrame({
    "Skill": ["Math", "Computer Science"],
})

joined = courses.sem_join(
    skills,
    "Taking {Course Name:left} will help me learn {Skill:right}",
)

print(joined)

Output:

Course Name

Skill

0

Riemannian Geometry

Math

1

Operating Systems

Computer Science

2

Compilers

Computer Science

3

Intro to computer science

Computer Science

The result contains the matched course-skill pairs.

Column Disambiguation

Use :left and :right when a join instruction references columns from both sides.

joined = left.sem_join(
    right,
    "{title:left} and {title:right} describe the same task",
)

If there is no ambiguity, LOTUS can infer the left and right columns from the DataFrame schemas. If a referenced column exists in both DataFrames, use explicit :left and :right suffixes.

Join Semantics

sem_join currently supports inner joins. other can be a DataFrame or a named Series. For each candidate pair, LOTUS evaluates the natural language predicate and keeps the pairs judged true.

Set return_explanations=True to add an explanation{suffix} column for the pairs that matched.

joined = courses.sem_join(
    skills,
    "Taking {Course Name:left} will help me learn {Skill:right}",
    return_explanations=True,
    suffix="_match",
)

Cascades

Cascades reduce cost by using cheaper helper plans before routing uncertain pairs to the main LM. See Optimized Processing with Approximations for the full details.

from lotus.types import CascadeArgs

cascade_args = CascadeArgs(
    recall_target=0.7,
    precision_target=0.7,
    sampling_percentage=0.2,
    failure_probability=0.2,
)

joined, stats = courses.sem_join(
    skills,
    "Taking {Course Name:left} will help me learn {Skill:right}",
    cascade_args=cascade_args,
    return_stats=True,
)

For join cascades, CascadeArgs can also include map_instruction and map_examples.

Few-Shot Examples

Use examples when the join relationship is domain-specific. The examples DataFrame should contain the referenced left and right columns plus an Answer column with boolean labels.

examples = pd.DataFrame({
    "Course Name": ["Operating Systems"],
    "Skill": ["Computer Science"],
    "Answer": [True],
})

joined = courses.sem_join(
    skills,
    "Taking {Course Name:left} will help me learn {Skill:right}",
    examples=examples,
)

Return Value

sem_join returns an inner-join DataFrame containing the matched left and right rows. Columns that exist on both sides are renamed with :left and :right. With return_stats=True and a cascade, it returns (joined_df, stats).

Required Parameters

  • other: Right-hand DataFrame or named Series.

  • join_instruction: Natural language predicate over left and right rows. Use :left and :right to disambiguate columns when needed.

Optional Parameters

  • return_explanations: Add an explanation{suffix} column for matched pairs.

  • how: Join type. Only "inner" is currently supported.

  • suffix: Suffix for explanation columns.

  • examples: Few-shot examples with referenced columns and an Answer column.

  • strategy: Optional reasoning strategy.

  • default: Boolean decision to use when output parsing is uncertain.

  • cascade_args: Optional join cascade configuration.

  • return_stats: Return (DataFrame, stats) when cascade stats are available.

  • safe_mode: Estimate cost before execution.

  • progress_bar_desc: Progress bar label.