sem_dedup
========================

Overview
---------
Semantic deduplication identifies and removes semantically redundant entries
from a dataset by comparing meaning rather than exact text. Entity
de-duplication can be implemented as a semantic self-join, but we provide
``sem_dedup`` as an additional utility function.

Motivation
-----------
Unlike traditional deduplication techniques, which rely on exact or near-exact
string comparisons, semantic deduplication uses language models to compare the
underlying meaning of text entries. As a result, paraphrased or contextually
similar items can also be identified as duplicates.

Example
--------
.. code-block:: python

    import pandas as pd

    import lotus
    from lotus.models import SentenceTransformersRM
    from lotus.vector_store import FaissVS

    # Retrieval model that embeds the text, and the vector store for the index
    rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
    vs = FaissVS()

    lotus.settings.configure(rm=rm, vs=vs)

    data = {
        "Text": [
            "Probability and Random Processes",
            "Optimization Methods in Engineering",
            "Digital Design and Integrated Circuits",
            "Computer Security",
            "I don't know what day it is",
            "I don't know what time it is",
            "Harry Potter and the Sorcerer's Stone",
        ]
    }
    df = pd.DataFrame(data)

    # Index the "Text" column, then deduplicate it at the given threshold
    df = df.sem_index("Text", "index_dir").sem_dedup("Text", threshold=0.815)
    print(df)

Output:

+---+------------------------------------------+
|   | Text                                     |
+---+------------------------------------------+
| 0 | Probability and Random Processes         |
+---+------------------------------------------+
| 5 | I don't know what time it is             |
+---+------------------------------------------+
| 6 | Harry Potter and the Sorcerer's Stone    |
+---+------------------------------------------+

Required Parameters
--------------------
- **col_name** : The name of the column to deduplicate on
- **threshold** : The similarity-score threshold above which two entries are treated as duplicates
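To make the role of ``threshold`` more concrete, the following is a minimal sketch of threshold-based deduplication over embedding vectors. It is an illustration under stated assumptions, not the ``sem_dedup`` implementation: the function name ``dedup_by_similarity``, the use of cosine similarity, the grouping by union-find, and the rule of keeping the lowest index in each group are all hypothetical choices made for this example, and the toy 2-D vectors stand in for real model embeddings.

```python
import numpy as np


def dedup_by_similarity(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Return the row indices to keep, one per group of near-duplicates.

    Hypothetical helper for illustration only. Assumes every row of
    `embeddings` is a non-zero vector.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T

    n = len(embeddings)
    parent = list(range(n))  # union-find forest, one node per row

    def find(i: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity exceeds the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # The smaller index becomes the root of the merged group.
                    parent[max(ri, rj)] = min(ri, rj)

    # Keep the root (lowest index) of each group.
    return sorted({find(i) for i in range(n)})


# Toy 2-D vectors standing in for embeddings (assumed values).
toy = np.array(
    [
        [1.0, 0.0],
        [0.99, 0.1],  # nearly parallel to row 0
        [0.0, 1.0],
        [0.1, 0.99],  # nearly parallel to row 2
        [0.7, 0.7],   # between the two directions, close to neither
    ]
)
kept = dedup_by_similarity(toy, threshold=0.95)
print(kept)
```

Raising the threshold makes the check stricter, so fewer rows are merged; lowering it merges more. Which row survives from each group is a design choice: this sketch keeps the lowest index, whereas the output above shows that ``sem_dedup`` may keep a different representative.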