sem_cluster_by
Overview
The sem_cluster_by operator groups the rows of the input DataFrame by semantic similarity and records each row's group in a cluster_id column.
Motivation
Clustering is useful when you want to group together similar records within a dataset.
Example
import pandas as pd

import lotus
from lotus.models import LM, SentenceTransformersRM
from lotus.vector_store import FaissVS

# Configure the language model, the retrieval (embedding) model, and the vector store.
lm = LM(model="gpt-4o-mini")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
vs = FaissVS()
lotus.settings.configure(lm=lm, rm=rm, vs=vs)

data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Cooking",
        "Food Sciences",
    ]
}
df = pd.DataFrame(data)

# Index the "Course Name" column, then cluster its rows into 2 groups.
df = df.sem_index("Course Name", "course_name_index").sem_cluster_by("Course Name", 2)
print(df)
Output:
|   | Course Name                            | cluster_id |
| 0 | Probability and Random Processes       | 0          |
| 1 | Optimization Methods in Engineering    | 0          |
| 2 | Digital Design and Integrated Circuits | 0          |
| 3 | Computer Security                      | 1          |
| 4 | Cooking                                | 1          |
| 5 | Food Sciences                          | 1          |
Required Parameters
col_name : The column name to cluster on.
ncentroids : The number of centroids, i.e. the number of clusters to form.
Optional Parameters
niter : The number of clustering iterations to run.
verbose : Whether to print verbose output.
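To give a feel for what ncentroids and niter control, here is a minimal pure-Python k-means sketch. It is an illustration only: this page does not say which clustering algorithm LOTUS runs internally, and the kmeans function, its first-points initialisation, and the 2-D embeddings below are all hypothetical stand-ins. With sem_cluster_by, the vectors would come from the configured retrieval model rather than being written by hand.

```python
import math


def kmeans(points, ncentroids, niter=20):
    """Generic k-means sketch (not the LOTUS implementation).

    Returns one cluster id per input point.
    """
    # Assumption: start deterministically from the first `ncentroids` points.
    centroids = [tuple(p) for p in points[:ncentroids]]
    labels = [0] * len(points)
    for _ in range(niter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(ncentroids), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(ncentroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Made-up 2-D "embeddings": three points near (0.8, 0.2), three near (0.15, 0.85).
embeddings = [
    (0.9, 0.1), (0.1, 0.9), (0.8, 0.2),
    (0.7, 0.3), (0.2, 0.8), (0.15, 0.85),
]
cluster_ids = kmeans(embeddings, ncentroids=2, niter=10)
print(cluster_ids)  # prints [0, 1, 0, 0, 1, 1]
```

In this sketch, raising ncentroids splits the points into more groups, while niter only bounds how many assignment/update rounds are run; the toy data above already stabilises after the first round.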