sem_topk

Overview

LOTUS supports a semantic top-k operator, sem_topk, which takes a langex (natural language expression) that specifies the ranking criteria. Programmers can optionally pass a group-by parameter naming a subset of columns to group over during ranking. Groups are defined by standard equality matches over the group-by columns.
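The grouping behavior can be pictured with a small standard-library sketch. This is not the LOTUS implementation: `grouped_topk`, the `Department` column, and the `math_load` scores are all hypothetical, and a simple numeric comparison stands in for the language model's pairwise judgment.

```python
from collections import defaultdict
from functools import cmp_to_key


def grouped_topk(rows, group_by, k, prefer):
    """Return the top-k rows of each group, ranked by a pairwise comparator.

    rows     -- list of dicts, one per record
    group_by -- column names that define the groups
    prefer   -- prefer(a, b) is True when row a should rank above row b
    """
    groups = defaultdict(list)
    for row in rows:
        # Equality match: rows share a group only if every group-by value is equal.
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    def compare(a, b):
        if prefer(a, b):
            return -1
        if prefer(b, a):
            return 1
        return 0

    # Rank each group independently and keep its first k rows.
    return {
        key: sorted(members, key=cmp_to_key(compare))[:k]
        for key, members in groups.items()
    }


# Hypothetical data: "Department" and "math_load" are made up for this sketch.
rows = [
    {"Department": "EE", "Course Name": "Probability and Random Processes", "math_load": 9},
    {"Department": "EE", "Course Name": "Digital Design and Integrated Circuits", "math_load": 4},
    {"Department": "CS", "Course Name": "Optimization Methods in Engineering", "math_load": 8},
    {"Department": "CS", "Course Name": "Computer Security", "math_load": 3},
]


def less_math(a, b):
    # Stand-in for the language model deciding which course needs less math.
    return a["math_load"] < b["math_load"]


top_by_department = grouped_topk(rows, ["Department"], 1, less_math)
for department, winners in top_by_department.items():
    print(department, [row["Course Name"] for row in winners])
```

In LOTUS itself, the equivalent request would presumably be expressed by passing the group-by columns to sem_topk through its group_by parameter, with the langex taking the place of `less_math`.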

Motivation

This operator is useful for re-ordering records based on complex, arbitrary natural language comparators.

Example

import pandas as pd

import lotus
from lotus.models import LM

lm = LM(model="gpt-4o-mini")

lotus.settings.configure(lm=lm)
data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
    ]
}
df = pd.DataFrame(data)

for method in ["quick", "heap", "naive"]:
    sorted_df, stats = df.sem_topk(
        "Which {Course Name} requires the least math?",
        K=2,
        method=method,
        return_stats=True,
    )
    print(sorted_df)
    print(stats)

Output:

                              Course Name
0                       Computer Security
1  Digital Design and Integrated Circuits

Required Parameters

  • user_instruction: The natural language expression (langex) that defines the ranking criteria.

  • K: The number of rows to return.

Optional Parameters

  • method: The method to use for sorting. Options are “quick”, “heap”, “naive”, and “quick-sem”.

  • group_by: The columns to group by before sorting. Each group is sorted separately.

  • cascade_threshold: The confidence threshold for cascading to a larger model.

  • return_stats: Whether to return execution statistics alongside the sorted DataFrame.
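To give a feel for why the choice of method matters, here is a minimal sketch of a quickselect-style top-k driven by a pairwise comparator. It assumes that the "quick" option refers to an algorithm of this family, which this page does not confirm; `topk_quick` and the `MATH_LOAD` scores are hypothetical, and the numeric comparison stands in for a language model call.

```python
def topk_quick(items, k, prefer):
    """Return the k best items, best first, using quickselect-style partitioning.

    prefer(a, b) is True when a should rank above b.
    """
    if k <= 0 or not items:
        return []
    pivot, rest = items[0], items[1:]
    better, worse = [], []
    for item in rest:
        # One comparator call per item; in a semantic operator each call
        # would be a model invocation, so fewer calls means lower cost.
        (better if prefer(item, pivot) else worse).append(item)
    if len(better) >= k:
        # The whole top-k lies among the items that beat the pivot.
        return topk_quick(better, k, prefer)
    # Everything in `better` and the pivot belong to the top-k;
    # fill the remaining slots from `worse`.
    return (
        topk_quick(better, len(better), prefer)
        + [pivot]
        + topk_quick(worse, k - len(better) - 1, prefer)
    )


# Hypothetical scores standing in for the model's judgment of math intensity.
MATH_LOAD = {
    "Probability and Random Processes": 9,
    "Optimization Methods in Engineering": 8,
    "Digital Design and Integrated Circuits": 4,
    "Computer Security": 3,
}
courses = list(MATH_LOAD)


def less_math(a, b):
    return MATH_LOAD[a] < MATH_LOAD[b]


top_two = topk_quick(courses, 2, less_math)
print(top_two)
```

The partitioning step discards every item that cannot reach the top k without ever comparing those items to each other, which is the general reason a selection algorithm can need fewer comparisons than a full sort.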