sem_dedup

Overview

Semantic deduplication identifies and removes semantically redundant entries from a dataset, matching on meaning rather than on exact text. Entity deduplication can also be implemented as a semantic self-join, but sem_dedup is provided as a dedicated utility function for it.

Motivation

Unlike traditional deduplication techniques, which rely on exact or near-exact string comparisons, semantic deduplication uses language models to compare the underlying meaning of text entries. This allows paraphrased or contextually similar items to be identified as duplicates even when their wording differs.

Example

import pandas as pd

import lotus
from lotus.models import SentenceTransformersRM
from lotus.vector_store import FaissVS

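# Retrieval model used to embed the text column
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
# FAISS-backed vector store that holds the embeddings
vs = FaissVS()

lotus.settings.configure(rm=rm, vs=vs)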
data = {
    "Text": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "I don't know what day it is",
        "I don't know what time it is",
        "Harry Potter and the Sorcerer's Stone",
    ]
}
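df = pd.DataFrame(data)
# Build a semantic index on "Text" (stored in "index_dir"), then drop rows
# that are semantic duplicates of another row at the given threshold
df = df.sem_index("Text", "index_dir").sem_dedup("Text", threshold=0.815)
print(df)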

Output:

	Text
0	Probability and Random Processes
5	I don't know what time it is
6	Harry Potter and the Sorcerer's Stone

Required Parameters

col_name : The name of the column to deduplicate on
threshold : The similarity score threshold; entries whose similarity to each other exceeds it are treated as duplicates
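
To give a feel for how the threshold shapes the result, here is a minimal, dependency-free sketch of threshold-based deduplication over embedding vectors. It is an illustration under stated assumptions, not the LOTUS implementation: the helper names cosine and dedup_indices, the hand-made 2-D vectors, the rule that linked rows are merged transitively, and the choice to keep the lowest index of each group are all hypothetical. In particular, which member of a duplicate group survives in sem_dedup may differ from this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dedup_indices(vectors, threshold):
    """Return the indices to keep (hypothetical helper).

    Two rows are linked when their cosine similarity is above `threshold`.
    Linked rows are merged transitively into groups with a union-find
    structure, and the lowest index of each group is kept (an assumption
    of this sketch).
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent pointers to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                root_i, root_j = find(i), find(j)
                # Attach the larger root to the smaller one, so the root
                # of every group is always its lowest index.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [i for i in range(len(vectors)) if find(i) == i]


# Toy "embeddings": two tight pairs plus one vector lying between them.
vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.7, 0.7],
]

strict = dedup_indices(vectors, threshold=0.9)   # only the tight pairs merge
loose = dedup_indices(vectors, threshold=0.75)   # the middle vector bridges both pairs
print(strict, loose)
```

With the stricter threshold only the two tight pairs collapse, while the looser one lets the in-between vector chain every row into a single group. This is why a threshold that is too low can remove entries that are merely related rather than redundant, and it is worth inspecting the output when tuning the value for a new dataset.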