sem_sim_join

Overview

The semantic similarity join matches tuples from the left and right tables according to their semantic similarity, rather than an arbitrary natural-language predicate. Akin to an equi-join in standard relational algebra, it is a specialized semantic join that can be heavily optimized using the semantic index.
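
Conceptually, a similarity join of this kind pairs each left row with the K right rows whose embeddings are closest to its own. The following is a minimal sketch of that idea in plain Python, not the LOTUS implementation: the two-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine` and `sim_join_topk` are hypothetical helper names introduced only for illustration. It scans every pair by brute force, which is exactly the work a semantic index is meant to avoid.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sim_join_topk(left, right, k):
    """Pair each left item with its k most similar right items.

    left, right: dicts mapping a label to its (toy) embedding vector.
    Returns a list of (left_label, right_label, score) tuples.
    """
    rows = []
    for l_label, l_vec in left.items():
        scored = [(cosine(l_vec, r_vec), r_label) for r_label, r_vec in right.items()]
        scored.sort(reverse=True)  # highest similarity first
        for score, r_label in scored[:k]:
            rows.append((l_label, r_label, score))
    return rows


# Hand-made vectors, purely illustrative (assumption: not real embeddings).
courses = {"Riemannian Geometry": [0.9, 0.1], "Compilers": [0.2, 0.8]}
skills = {"Math": [1.0, 0.0], "Computer Science": [0.0, 1.0]}

matches = sim_join_topk(courses, skills, k=1)
for left_label, right_label, score in matches:
    print(f"{left_label} -> {right_label} ({score:.3f})")
```

In this sketch, k=1 produces one output row per left item, and a larger k produces up to k rows per left item, each with its own similarity score.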

Motivation

This operator is useful for fast and lightweight fuzzy matching of records in two tables based on their semantic similarity.

Example

import pandas as pd

import lotus
from lotus.models import LM, LiteLLMRM
from lotus.vector_store import FaissVS

# Configure the language model, the retrieval (embedding) model, and the vector store.
lm = LM(model="gpt-4o-mini")
rm = LiteLLMRM(model="text-embedding-3-small")
vs = FaissVS()

lotus.settings.configure(lm=lm, rm=rm, vs=vs)
data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}

data2 = {"Skill": ["Math", "Computer Science"]}

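df1 = pd.DataFrame(data)
# Build a semantic index over the "Skill" column, stored under "skill_index".
df2 = pd.DataFrame(data2).sem_index("Skill", "skill_index")
# For each course, keep the single most similar skill (K=1).
res = df1.sem_sim_join(df2, left_on="Course Name", right_on="Skill", K=1)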
print(res)

Output:

                     Course Name   _scores             Skill
0  History of the Atlantic World  0.107831              Math
1            Riemannian Geometry  0.345694              Math
2              Operating Systems  0.426621  Computer Science
3                   Food Science  0.431801  Computer Science
4                      Compilers  0.345494  Computer Science
5      Intro to computer science  0.676943  Computer Science

Required Parameters

other : The other (right) DataFrame to join with.
left_on : The name of the column to join on in the left DataFrame.
right_on : The name of the column to join on in the right DataFrame.
K : The number of nearest neighbors to search for, for each row of the left DataFrame.

Optional Parameters

lsuffix : The suffix to append to column names from the left DataFrame.
rsuffix : The suffix to append to column names from the right DataFrame.
score_suffix : The suffix to append to the name of the similarity score column.