sem_dedup
Overview
Semantic deduplication identifies and removes semantically redundant entries from a dataset, comparing entries by meaning rather than by exact text. Entity deduplication can be implemented as a semantic self-join, but we also provide sem_dedup as a dedicated utility function.
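The self-join framing can be illustrated without LOTUS. The following is a minimal sketch, not the library's implementation: it assumes each entry already has an embedding vector (the three-dimensional vectors here are made up for illustration), compares every pair of rows by cosine similarity, links pairs that score above a threshold, and keeps one row per linked group. The helper names `cosine` and `dedup_by_similarity` are hypothetical, and keeping the first row of each group is a choice of this sketch; sem_dedup may keep a different representative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dedup_by_similarity(texts, vectors, threshold):
    """Keep the lowest-index entry of every group of linked near-duplicates."""
    parent = list(range(len(texts)))  # union-find forest over row positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Self-join: compare every row with every later row.
    for i, j in combinations(range(len(texts)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            ri, rj = find(i), find(j)
            # Attach the higher root to the lower one so the
            # lowest index in a group always remains its root.
            parent[max(ri, rj)] = min(ri, rj)

    # A row survives only if it is the root of its group.
    return [text for i, text in enumerate(texts) if find(i) == i]


texts = [
    "I don't know what day it is",
    "I don't know what time it is",
    "Computer Security",
]
# Hypothetical embeddings; a real retrieval model would produce these.
vectors = [(0.90, 0.10, 0.00), (0.88, 0.15, 0.00), (0.00, 0.20, 0.95)]

kept = dedup_by_similarity(texts, vectors, threshold=0.9)
print(kept)
```

With these made-up vectors the two "I don't know" sentences score well above 0.9 and collapse into one entry, while "Computer Security" is unrelated to both and is kept.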
Motivation
Unlike traditional deduplication techniques, which rely on exact or near-exact string comparisons, semantic deduplication uses language model embeddings to compare the underlying meaning of text entries. As a result, paraphrased or contextually similar items can be identified as duplicates even when their wording differs.
Example
import pandas as pd

import lotus
from lotus.models import SentenceTransformersRM
from lotus.vector_store import FaissVS

# Configure the retrieval model and the vector store used for indexing.
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
vs = FaissVS()
lotus.settings.configure(rm=rm, vs=vs)

data = {
    "Text": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "I don't know what day it is",
        "I don't know what time it is",
        "Harry Potter and the Sorcerer's Stone",
    ]
}
df = pd.DataFrame(data)

# Index the "Text" column, then deduplicate it by semantic similarity.
df = df.sem_index("Text", "index_dir").sem_dedup("Text", threshold=0.815)
print(df)
Output:
                                    Text
0       Probability and Random Processes
5           I don't know what time it is
6  Harry Potter and the Sorcerer's Stone
Required Parameters
col_name : The name of the column to deduplicate on.
threshold : The similarity score threshold; entries whose similarity score exceeds it are treated as duplicates.
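To get a feel for how the threshold behaves, the sketch below filters a handful of pairwise similarity scores at several cut-offs. The scores are invented for illustration and do not come from any model, `flagged_pairs` is a hypothetical helper, and the strict "greater than" comparison is an assumption rather than a documented detail of sem_dedup.

```python
# Hypothetical pairwise similarity scores (not produced by a real model).
scores = {
    ("I don't know what day it is", "I don't know what time it is"): 0.93,
    ("Computer Security", "Digital Design and Integrated Circuits"): 0.80,
    ("Probability and Random Processes", "Optimization Methods in Engineering"): 0.78,
}


def flagged_pairs(scores, threshold):
    """Return the pairs whose score is strictly above the threshold."""
    return sorted(pair for pair, score in scores.items() if score > threshold)


for threshold in (0.75, 0.815, 0.95):
    pairs = flagged_pairs(scores, threshold)
    print(f"threshold={threshold}: {len(pairs)} pair(s) flagged as duplicates")
```

A lower threshold merges more aggressively (all three pairs are flagged at 0.75), while a higher one keeps more rows (none are flagged at 0.95). In practice the right value depends on the embedding model and the data, so it is worth inspecting the rows that get removed at a few candidate thresholds.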