sem_cluster_by
=====================

Overview
---------
The cluster operator creates groups over the input dataframe according
to semantic similarity. 

Motivation
-----------
Clustering is useful when you would like to group togethe similar records within the dataset.

Example
---------
.. code-block:: python

    import pandas as pd

    import lotus
    from lotus.models import LM, SentenceTransformersRM
    from lotus.vector_store import FaissVS

    lm = LM(model="gpt-4o-mini")
    rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
    vs = FaissVS()

    lotus.settings.configure(lm=lm, rm=rm, vs=vs)
    data = {
        "Course Name": [
            "Probability and Random Processes",
            "Optimization Methods in Engineering",
            "Digital Design and Integrated Circuits",
            "Computer Security",
            "Cooking",
            "Food Sciences",
        ]
    }
    df = pd.DataFrame(data)
    df = df.sem_index("Course Name", "course_name_index").sem_cluster_by("Course Name", 2)
    print(df)

Output:

+---+----------------------------------------+------------+
|   |           Course Name                  | cluster_id |
+---+----------------------------------------+------------+
| 0 | Probability and Random Processes       | 0          |
+---+----------------------------------------+------------+
| 1 | Optimization Methods in Engineering    | 0          |
+---+----------------------------------------+------------+
| 2 | Digital Design and Integrated Circuits | 0          |
+---+----------------------------------------+------------+
| 3 | Computer Security                      | 1          |
+---+----------------------------------------+------------+
| 4 | Cooking                                | 1          |
+---+----------------------------------------+------------+
| 5 | Food Sciences                          | 1          |
+---+----------------------------------------+------------+


Required Parameters
--------------------
- **col_name** : The column name to cluster on.
- **ncentroids** : The number of centroids.

Optional Parameters
---------------------
- **niter** : The number of iterations.
- **verbose** : Whether to print verbose output.