
sem_search

Overview

Semantic search performs similarity-based search over an indexed column. LOTUS also supports re-ranking the search results through the n_rerank parameter. When n_rerank is set, the search first retrieves the top-K most relevant documents, then re-ranks those K documents and returns the top n_rerank.
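The two-stage flow described above (retrieve the top K, then re-rank and keep n_rerank) can be sketched in plain Python. This is only an illustration of the idea, not how LOTUS implements it: the helpers embed, cosine, rerank_score, and search are hypothetical names, and the bag-of-words and token-overlap scores are toy stand-ins for a real retrieval model and re-ranker.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a retrieval model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, doc):
    """Toy stand-in for a re-ranker: Jaccard overlap of token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def search(docs, query, K, n_rerank=None):
    """Retrieve the top-K docs by similarity, then optionally re-rank them."""
    q_vec = embed(query)
    # Stage 1: score every document and keep the K most similar (index, text) pairs.
    ranked = sorted(
        enumerate(docs),
        key=lambda pair: cosine(q_vec, embed(pair[1])),
        reverse=True,
    )
    candidates = ranked[:K]
    if n_rerank is None:
        return candidates
    # Stage 2: re-score only the K candidates and keep the best n_rerank.
    reranked = sorted(
        candidates,
        key=lambda pair: rerank_score(query, pair[1]),
        reverse=True,
    )
    return reranked[:n_rerank]


docs = [
    "Computer Security",
    "Introduction to Robotics",
    "Introduction to Computer Networks",
    "Probability and Random Processes",
]
results = search(docs, "computer security", K=3, n_rerank=2)
for idx, text in results:
    print(idx, text)
```

The point of the design is that the more expensive second-stage scorer only ever sees K candidates rather than the whole column, so K bounds the re-ranking cost while n_rerank controls how many rows come back.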

Motivation

The sem_search operator is useful for fast, lightweight filtering over your data.

Example

import pandas as pd

import lotus
from lotus.models import LM, CrossEncoderReranker, SentenceTransformersRM

# Configure the language model, the retrieval (embedding) model, and the re-ranker.
lm = LM(model="gpt-4o-mini")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
reranker = CrossEncoderReranker(model="mixedbread-ai/mxbai-rerank-large-v1")

lotus.settings.configure(lm=lm, rm=rm, reranker=reranker)
data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Introduction to Computer Science",
        "Introduction to Data Science",
        "Introduction to Machine Learning",
        "Introduction to Artificial Intelligence",
        "Introduction to Robotics",
        "Introduction to Computer Vision",
        "Introduction to Natural Language Processing",
        "Introduction to Reinforcement Learning",
        "Introduction to Deep Learning",
        "Introduction to Computer Networks",
    ]
}
df = pd.DataFrame(data)

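# Index the "Course Name" column, then search it: retrieve the top 8 matches
# and keep the 4 best after re-ranking.
df = df.sem_index("Course Name", "index_dir").sem_search(
    "Course Name",
    "Which course name is most related to computer security?",
    K=8,
    n_rerank=4,
)
print(df)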

Output

                          Course Name
3                   Computer Security
13  Introduction to Computer Networks
4    Introduction to Computer Science
5        Introduction to Data Science

Required Parameters

  • col_name : The column name to search on.

  • query : The query string.

Optional Parameters

  • K : The number of documents to retrieve.

  • n_rerank : The number of documents to return after re-ranking the K retrieved documents.

  • return_scores : Whether to return the similarity scores.

  • suffix : The suffix to append to the new column containing the similarity scores.


© Copyright 2024, Liana Patel, Siddharth Jha, Carlos Guestrin, Matei Zaharia.
