sem_cluster_by

Overview

The sem_cluster_by operator groups the rows of the input dataframe by semantic similarity and adds a cluster_id column recording the group assigned to each row.

Motivation

Clustering is useful when you want to group together similar records within a dataset.

Example

import pandas as pd

import lotus
from lotus.models import LM, SentenceTransformersRM
from lotus.vector_store import FaissVS

# Configure the language model, the retrieval (embedding) model, and the vector store.
lm = LM(model="gpt-4o-mini")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
vs = FaissVS()

lotus.settings.configure(lm=lm, rm=rm, vs=vs)

data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Cooking",
        "Food Sciences",
    ]
}
df = pd.DataFrame(data)

# Index the column first, then cluster its rows into 2 groups.
df = df.sem_index("Course Name", "course_name_index").sem_cluster_by("Course Name", 2)
print(df)

Output:

                              Course Name  cluster_id
0        Probability and Random Processes           0
1     Optimization Methods in Engineering           0
2  Digital Design and Integrated Circuits           0
3                       Computer Security           1
4                                 Cooking           1
5                           Food Sciences           1
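
Because cluster_id is an ordinary column, the result can be processed with regular pandas operations. The following is a minimal sketch, not part of the LOTUS API: it rebuilds the dataframe by hand from the output shown above (so no models are needed to run it) and lists the course names in each cluster.

```python
import pandas as pd

# Hand-built copy of the documented output; in practice this dataframe
# would come from sem_cluster_by.
df = pd.DataFrame(
    {
        "Course Name": [
            "Probability and Random Processes",
            "Optimization Methods in Engineering",
            "Digital Design and Integrated Circuits",
            "Computer Security",
            "Cooking",
            "Food Sciences",
        ],
        "cluster_id": [0, 0, 0, 1, 1, 1],
    }
)

# Collect the course names belonging to each cluster.
groups = df.groupby("cluster_id")["Course Name"].apply(list).to_dict()

for cluster_id, courses in groups.items():
    print(cluster_id, courses)
```

The same pattern works with any aggregation, for example df.groupby("cluster_id").size() to count the rows in each cluster.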

Required Parameters

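  • col_name : The name of the column to cluster on. In the example above, this column is indexed with sem_index before clustering.

  • ncentroids : The number of centroids, i.e. the number of clusters to form.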

Optional Parameters

  • niter : The number of iterations the clustering algorithm runs.

  • verbose : Whether to print verbose output.
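
To give a feel for what ncentroids and niter control, here is a toy k-means sketch in plain Python. It is an illustration only: this page does not describe the clustering backend, so the k-means algorithm is an assumption suggested by the parameter names, and the kmeans and sq_dist helpers, the 2-D vectors standing in for embeddings, and the deterministic farthest-point initialization are all hypothetical choices made to keep the example small and reproducible.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, ncentroids, niter=20):
    """Toy k-means: returns one cluster id per input vector."""
    # Deterministic farthest-point initialization (a simplification;
    # real implementations usually start from random points).
    centroids = [list(vectors[0])]
    while len(centroids) < ncentroids:
        farthest = max(
            vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    assignments = [0] * len(vectors)
    for _ in range(niter):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(ncentroids), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(ncentroids):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Hypothetical 2-D "embeddings": three points near (1, 0), three near (0, 1).
toy_embeddings = [
    (1.0, 0.0), (0.9, 0.1), (1.0, 0.1),
    (0.0, 1.0), (0.1, 0.9), (0.0, 0.9),
]

cluster_ids = kmeans(toy_embeddings, ncentroids=2)
print(cluster_ids)  # → [0, 0, 0, 1, 1, 1]
```

In this sketch, ncentroids fixes how many groups are produced, while niter bounds how many assignment and update rounds are run; more iterations give the centroids more chances to settle but cost more time.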