sem_search
Overview
Semantic search performs similarity-based search over an indexed column. LOTUS also supports re-ranking the search results through the n_rerank parameter. When n_rerank is specified, sem_search first retrieves the top K most relevant documents, then re-ranks those K documents and returns the top n_rerank.
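The retrieve-then-re-rank flow described above can be illustrated with a small, library-free sketch. This is not how LOTUS implements sem_search; the function search_then_rerank, the toy 2-d embeddings, and the rerank_scores table are all hypothetical stand-ins for the retrieval model and the cross-encoder reranker.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_then_rerank(query_vec, doc_vecs, rerank_score, K, n_rerank=None):
    """Return document ids: top K by similarity, optionally re-ranked.

    rerank_score is a hypothetical callable mapping a document id to a
    relevance score; it stands in for a cross-encoder reranker.
    """
    # Stage 1: rank every document by vector similarity and keep the top K.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:K]
    if n_rerank is None:
        return candidates
    # Stage 2: re-score only the K candidates and keep the best n_rerank.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:n_rerank]


# Toy embeddings and reranker scores (illustrative values only).
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.0]
rerank_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.8}

top_k = search_then_rerank(query_vec, doc_vecs, rerank_scores.get, K=3)
final = search_then_rerank(query_vec, doc_vecs, rerank_scores.get, K=3, n_rerank=2)
print(top_k, final)
```

In this sketch the reranker only ever sees the K candidates from the first stage, so a document that misses the top K cannot be recovered by re-ranking. That is why K is usually chosen larger than n_rerank.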
Motivation
The sem_search operator is useful for fast, lightweight filtering over your data.
Example
import pandas as pd

import lotus
from lotus.models import LM, CrossEncoderReranker, SentenceTransformersRM

# Configure the language model, the retrieval (embedding) model, and the reranker.
lm = LM(model="gpt-4o-mini")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
reranker = CrossEncoderReranker(model="mixedbread-ai/mxbai-rerank-large-v1")
lotus.settings.configure(lm=lm, rm=rm, reranker=reranker)

data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Introduction to Computer Science",
        "Introduction to Data Science",
        "Introduction to Machine Learning",
        "Introduction to Artificial Intelligence",
        "Introduction to Robotics",
        "Introduction to Computer Vision",
        "Introduction to Natural Language Processing",
        "Introduction to Reinforcement Learning",
        "Introduction to Deep Learning",
        "Introduction to Computer Networks",
    ]
}
df = pd.DataFrame(data)

# Index the column, retrieve the top 8 matches, then re-rank them and keep the top 4.
df = df.sem_index("Course Name", "index_dir").sem_search(
    "Course Name",
    "Which course name is most related to computer security?",
    K=8,
    n_rerank=4,
)
print(df)
Output
                          Course Name
3                   Computer Security
13  Introduction to Computer Networks
4    Introduction to Computer Science
5        Introduction to Data Science
Required Parameters
col_name : The column name to search on.
query : The query string.
Optional Parameters
K : The number of documents to retrieve.
n_rerank : The number of documents to return after re-ranking the top K retrieved documents.
return_scores : Whether to return the similarity scores.
suffix : The suffix to append to the new column containing the similarity scores.