sem_cluster_by

Overview

The sem_cluster_by operator groups the rows of the input DataFrame by semantic similarity and adds a cluster_id column recording the group each row belongs to.

Motivation

Clustering is useful when you want to group together similar records within a dataset.

Example

import pandas as pd

import lotus
from lotus.models import LM, SentenceTransformersRM
from lotus.vector_store import FaissVS

lm = LM(model="gpt-4o-mini")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
vs = FaissVS()

lotus.settings.configure(lm=lm, rm=rm, vs=vs)
data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Cooking",
        "Food Sciences",
    ]
}
df = pd.DataFrame(data)
# Build a semantic index over "Course Name", then assign each row to one of 2 clusters.
df = df.sem_index("Course Name", "course_name_index").sem_cluster_by("Course Name", 2)
print(df)

Output:

	Course Name	cluster_id
0	Probability and Random Processes	0
1	Optimization Methods in Engineering	0
2	Digital Design and Integrated Circuits	0
3	Computer Security	1
4	Cooking	1
5	Food Sciences	1
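
Once each row carries a cluster_id, a common next step is to look at the members of each cluster side by side. The sketch below is illustrative only: to stay dependency-free it works on plain (course name, cluster_id) pairs copied from the output above, and the names rows and groups are hypothetical. On the DataFrame itself, df.groupby("cluster_id") would serve the same purpose.

```python
from collections import defaultdict

# (course name, cluster_id) pairs copied from the output table above.
rows = [
    ("Probability and Random Processes", 0),
    ("Optimization Methods in Engineering", 0),
    ("Digital Design and Integrated Circuits", 0),
    ("Computer Security", 1),
    ("Cooking", 1),
    ("Food Sciences", 1),
]

# Collect the course names that share a cluster_id.
groups = defaultdict(list)
for course_name, cluster_id in rows:
    groups[cluster_id].append(course_name)

# Print each cluster with its members, in cluster_id order.
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

The numeric ids are only labels for the groups; they carry no ordering or meaning of their own.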

Required Parameters

col_name : The column name to cluster on.
ncentroids : The number of centroids, which is the number of clusters produced (2 in the example above).

Optional Parameters

niter : The number of clustering iterations to run.
verbose : Whether to print verbose output.
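
The parameter names ncentroids and niter suggest a k-means-style procedure over the embedded column values. This page does not confirm the algorithm, so the following is only a sketch under that assumption, meant to show what each parameter would control. The function kmeans, its seed argument, and toy_embeddings are hypothetical and are not part of the lotus API; real embeddings would have many more dimensions.

```python
import math
import random


def kmeans(vectors, ncentroids, niter=20, verbose=False, seed=0):
    """Toy k-means (illustrative assumption): one cluster id per vector."""
    rng = random.Random(seed)
    # Start from ncentroids distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, ncentroids)]
    assignments = [0] * len(vectors)
    for it in range(niter):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(ncentroids), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(ncentroids):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if verbose:
            print(f"iteration {it + 1}: assignments={assignments}")
    return assignments


# Made-up 2-D "embeddings": three points near (0, 0), three near (5, 5).
toy_embeddings = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.2),
]
cluster_ids = kmeans(toy_embeddings, ncentroids=2, niter=10)
print(cluster_ids)
```

In this sketch, ncentroids fixes how many distinct cluster ids can appear, niter bounds how many assignment and update rounds run, and verbose prints the assignments after each round.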