sem_sim_join
Overview
The similarity join matches tuples from the left and right tables according to their semantic similarity, rather than an arbitrary natural-language predicate. Akin to an equi-join in standard relational algebra, the semantic similarity join is a specialized semantic join that can be heavily optimized using the semantic index.
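To make the idea concrete, the following is a minimal, library-free sketch of a top-K similarity join. It is not the LOTUS implementation: the helper names `cosine` and `sim_join`, the hand-made 2-D vectors, and the choice of cosine similarity as the score are all assumptions made for illustration. It also scans every right-hand row for each left-hand row, which is the brute-force work a semantic index is meant to avoid.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sim_join(left, right, k):
    """Pair each left item with its k most similar right items.

    left, right: lists of (label, vector) tuples (hypothetical layout).
    Returns a list of (left_label, score, right_label) rows.
    """
    rows = []
    for l_label, l_vec in left:
        # Score the left item against every right item (brute force).
        scored = [(cosine(l_vec, r_vec), r_label) for r_label, r_vec in right]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep only the k best matches for this left item.
        for score, r_label in scored[:k]:
            rows.append((l_label, score, r_label))
    return rows


# Toy vectors chosen by hand for illustration; these are not real embeddings.
courses = [("Riemannian Geometry", [0.9, 0.1]), ("Compilers", [0.2, 0.8])]
skills = [("Math", [1.0, 0.0]), ("Computer Science", [0.0, 1.0])]

for row in sim_join(courses, skills, k=1):
    print(row)
```

With `k=1` each left item produces exactly one output row; raising `k` produces `k` rows per left item, up to the number of right items.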
Motivation
This operator is useful for fast and lightweight fuzzy matching of records in two tables based on their semantic similarity.
Example
import pandas as pd

import lotus
from lotus.models import LM, LiteLLMRM
from lotus.vector_store import FaissVS

# Configure the language model, the retrieval (embedding) model, and the vector store.
lm = LM(model="gpt-4o-mini")
rm = LiteLLMRM(model="text-embedding-3-small")
vs = FaissVS()
lotus.settings.configure(lm=lm, rm=rm, vs=vs)

# Left table: course names.
data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}
# Right table: skills to match against.
data2 = {"Skill": ["Math", "Computer Science"]}

df1 = pd.DataFrame(data)
# Build a semantic index over the right table's join column.
df2 = pd.DataFrame(data2).sem_index("Skill", "skill_index")

# Match each course to its single most similar skill (K=1).
res = df1.sem_sim_join(df2, left_on="Course Name", right_on="Skill", K=1)
print(res)
Output:
  | Course Name                   | _scores  | Skill            |
0 | History of the Atlantic World | 0.107831 | Math             |
1 | Riemannian Geometry           | 0.345694 | Math             |
2 | Operating Systems             | 0.426621 | Computer Science |
3 | Food Science                  | 0.431801 | Computer Science |
4 | Compilers                     | 0.345494 | Computer Science |
5 | Intro to computer science     | 0.676943 | Computer Science |
Required Parameters
other : The other DataFrame to join with.
left_on : The column name to join on in the left DataFrame.
right_on : The column name to join on in the right DataFrame.
K : The number of nearest neighbors to retrieve from the right DataFrame for each row of the left DataFrame.
Optional Parameters
lsuffix : The suffix to append to the left DataFrame.
rsuffix : The suffix to append to the right DataFrame.
score_suffix : The suffix to append to the similarity score column.