sem_sim_join

Overview

The similarity join matches tuples from the left and right tables according to their semantic similarity, rather than an arbitrary natural-language predicate. Akin to an equi-join in standard relational algebra, the semantic similarity join is a specialized semantic join that can be heavily optimized using the semantic index.
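To make the idea concrete, here is a minimal sketch of a top-K similarity join written with plain NumPy and pandas. It is not how LOTUS implements the operator: the helper name top_k_sim_join is hypothetical, the two-dimensional vectors are hand-made stand-ins for real embeddings, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np
import pandas as pd


def top_k_sim_join(left, right, left_vecs, right_vecs, k=1):
    """Pair each left row with its k most similar right rows (cosine similarity)."""
    # Normalize each row so that a dot product equals cosine similarity.
    l = left_vecs / np.linalg.norm(left_vecs, axis=1, keepdims=True)
    r = right_vecs / np.linalg.norm(right_vecs, axis=1, keepdims=True)
    sims = l @ r.T  # shape: (len(left), len(right))

    # Indices of the k highest-scoring right rows for every left row.
    top = np.argsort(-sims, axis=1)[:, :k]

    rows = []
    for i, neighbours in enumerate(top):
        for j in neighbours:
            row = {**left.iloc[i].to_dict(), **right.iloc[j].to_dict()}
            row["_scores"] = float(sims[i, j])
            rows.append(row)
    return pd.DataFrame(rows)


# Hand-made 2-d vectors stand in for real embeddings (illustrative values only).
left = pd.DataFrame({"Course Name": ["Riemannian Geometry", "Compilers"]})
right = pd.DataFrame({"Skill": ["Math", "Computer Science"]})
left_vecs = np.array([[0.9, 0.1], [0.2, 0.8]])
right_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

joined = top_k_sim_join(left, right, left_vecs, right_vecs, k=1)
print(joined)
```

The sketch compares every left row with every right row, which is quadratic in the table sizes. A semantic index over the right table avoids that brute-force scan, which is where the optimization mentioned above comes from.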

Motivation

This operator is useful for fast and lightweight fuzzy matching of records in two tables based on their semantic similarity.

Example

import pandas as pd

import lotus
from lotus.models import LM, LiteLLMRM
from lotus.vector_store import FaissVS

# Configure the language model, the retrieval (embedding) model, and the vector store.
lm = LM(model="gpt-4o-mini")
rm = LiteLLMRM(model="text-embedding-3-small")
vs = FaissVS()

lotus.settings.configure(lm=lm, rm=rm, vs=vs)

data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}

data2 = {"Skill": ["Math", "Computer Science"]}

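df1 = pd.DataFrame(data)
# Build a semantic index over the "Skill" column and store it under "skill_index".
df2 = pd.DataFrame(data2).sem_index("Skill", "skill_index")
# For each course, find the single most similar skill (K=1).
res = df1.sem_sim_join(df2, left_on="Course Name", right_on="Skill", K=1)
print(res)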

Output:

                     Course Name   _scores             Skill
0  History of the Atlantic World  0.107831              Math
1            Riemannian Geometry  0.345694              Math
2              Operating Systems  0.426621  Computer Science
3                   Food Science  0.431801  Computer Science
4                      Compilers  0.345494  Computer Science
5      Intro to computer science  0.676943  Computer Science
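With K=1, every left row receives a match, even when its best score is low (for example, "History of the Atlantic World" is paired with "Math" at about 0.11). One way to handle this is to filter the result on the score column afterwards. The sketch below rebuilds the output above as a plain pandas DataFrame so that it runs without any model calls; the MIN_SCORE cutoff of 0.3 is a hypothetical value chosen for illustration, not a LOTUS default.

```python
import pandas as pd

# Rows copied from the output above.
res = pd.DataFrame(
    {
        "Course Name": [
            "History of the Atlantic World",
            "Riemannian Geometry",
            "Operating Systems",
            "Food Science",
            "Compilers",
            "Intro to computer science",
        ],
        "_scores": [0.107831, 0.345694, 0.426621, 0.431801, 0.345494, 0.676943],
        "Skill": [
            "Math",
            "Math",
            "Computer Science",
            "Computer Science",
            "Computer Science",
            "Computer Science",
        ],
    }
)

MIN_SCORE = 0.3  # hypothetical cutoff; tune it for your data and embedding model

# Keep only the pairs whose similarity score reaches the cutoff.
confident = res[res["_scores"] >= MIN_SCORE].reset_index(drop=True)
print(confident)
```

A score cutoff is a coarse filter: "Food Science" still passes it here despite being a doubtful match for "Computer Science", so results that matter may need a further check.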

Required Parameters

  • other : The other DataFrame to join with.

  • left_on : The column name to join on in the left DataFrame.

  • right_on : The column name to join on in the right DataFrame.

  • K : The number of nearest neighbors in the right DataFrame to return for each row of the left DataFrame.

Optional Parameters

  • lsuffix : The suffix to append to the left DataFrame.

  • rsuffix : The suffix to append to the right DataFrame.

  • score_suffix : The suffix to append to the name of the similarity score column (_scores in the example above).