web_search

web_search loads web search results into a pandas DataFrame. Use it when you need a tabular set of search results before applying semantic operators, pandas transformations, or a LazyFrame pipeline.

Use web_extract when you already have URLs or corpus-specific document IDs and want the full text.

Supported corpora are:

  • WebSearchCorpus.GOOGLE

  • WebSearchCorpus.GOOGLE_SCHOLAR

  • WebSearchCorpus.ARXIV

  • WebSearchCorpus.YOU

  • WebSearchCorpus.TAVILY

  • WebSearchCorpus.PUBMED

  • WebSearchCorpus.BING (Bing is discontinued; selecting it raises a deprecation warning in the current implementation)

Basic Search

web_search accepts one query or a list of queries and returns one DataFrame with a query column.

from lotus import WebSearchCorpus, web_search

df = web_search(
    WebSearchCorpus.ARXIV,
    query="lazy dataframe query optimization",
    K=5,
)

print(df[["title", "abstract", "query"]])

Search Multiple Queries

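Pass a list of strings as query. K applies to each query separately.

from lotus import WebSearchCorpus, web_search

# One call, two queries; up to K=3 results are returned for each query.
df = web_search(
    WebSearchCorpus.PUBMED,
    query=[
        "large language models clinical summarization",
        "retrieval augmented generation medicine",
    ],
    K=3,
)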
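
Because the rows for every query land in one DataFrame, it can be handy to split them back out by the query column. The following is a minimal sketch, assuming the rows have first been converted to plain dicts (for example with pandas' df.to_dict("records")). group_by_query is a hypothetical helper, not part of LOTUS, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_by_query(records):
    """Group result rows (dicts) by the value of their "query" key."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["query"]].append(row)
    return dict(grouped)


# Made-up rows standing in for df.to_dict("records").
sample_rows = [
    {"title": "Paper A", "query": "q1"},
    {"title": "Paper B", "query": "q2"},
    {"title": "Paper C", "query": "q1"},
]

by_query = group_by_query(sample_rows)
for query, rows in by_query.items():
    print(query, [row["title"] for row in rows])
```

If you would rather stay in pandas, df.groupby("query") gives the same per-query view without leaving the DataFrame.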

Date Filtering

start_date and end_date filter results for Google, Google Scholar, arXiv, You.com, Tavily, and PubMed. sort_by_date is supported for arXiv.

from datetime import datetime
from lotus import WebSearchCorpus, web_search

df = web_search(
    WebSearchCorpus.ARXIV,
    "transformer architecture",
    10,
    sort_by_date=True,
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 12, 31),
)
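
For rolling windows such as "the last 30 days", the two datetime values can be computed rather than hard-coded. This is a small sketch under that assumption. date_window is a hypothetical helper, not part of LOTUS; it only builds the start_date and end_date keyword arguments described above.

```python
from datetime import datetime, timedelta


def date_window(days_back, now=None):
    """Return start_date/end_date keyword arguments covering the last `days_back` days."""
    if days_back <= 0:
        raise ValueError("days_back must be positive")
    end = now if now is not None else datetime.now()
    start = end - timedelta(days=days_back)
    return {"start_date": start, "end_date": end}


# A fixed "now" keeps the example reproducible.
window = date_window(30, now=datetime(2024, 12, 31))
print(window["start_date"].date(), window["end_date"].date())
```

The resulting dict can be unpacked into the call, for example web_search(WebSearchCorpus.ARXIV, "transformer architecture", 10, **window).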

Select Columns

Use cols to request a subset of result fields.

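from lotus import WebSearchCorpus, web_search

# Keep only the title, url, and content fields of each Tavily result.
df = web_search(
    WebSearchCorpus.TAVILY,
    "AI safety evaluations",
    5,
    cols=["title", "url", "content"],
)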

Common default columns include:

  • arXiv: id, title, link, abstract, published, authors, categories

  • Google and Google Scholar: title, link, snippet, date, publication_info

  • You.com: title, url, snippets, description

  • Tavily: title, url, content

  • PubMed: id, title, link, abstract, published, authors, journal, doi, methods, results, conclusions
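
Column names differ between corpora (for example, link for arXiv but url for Tavily), so a quick check before calling web_search can catch a mismatched cols list. The sketch below copies the lists above into a lookup table. DEFAULT_COLUMNS and unlisted_columns are hypothetical names, not part of LOTUS, and because the lists are described as common defaults, a column flagged here may still be valid.

```python
# Copied from the list above; a convenience table, not an exhaustive schema.
DEFAULT_COLUMNS = {
    "ARXIV": ["id", "title", "link", "abstract", "published", "authors", "categories"],
    "GOOGLE": ["title", "link", "snippet", "date", "publication_info"],
    "GOOGLE_SCHOLAR": ["title", "link", "snippet", "date", "publication_info"],
    "YOU": ["title", "url", "snippets", "description"],
    "TAVILY": ["title", "url", "content"],
    "PUBMED": [
        "id", "title", "link", "abstract", "published", "authors",
        "journal", "doi", "methods", "results", "conclusions",
    ],
}


def unlisted_columns(corpus_name, cols):
    """Return the requested columns that do not appear in the table for this corpus."""
    known = set(DEFAULT_COLUMNS.get(corpus_name, []))
    return [col for col in cols if col not in known]


# "url" is a Tavily/You.com field name; arXiv results use "link" instead.
print(unlisted_columns("ARXIV", ["title", "url"]))
```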

Required Setup

  • Google and Google Scholar require SERPAPI_API_KEY and the serpapi extra.

  • arXiv requires the arxiv extra.

  • PubMed requires the pubmed extra.

  • You.com requires YOU_API_KEY and the web_search extra.

  • Tavily requires TAVILY_API_KEY and the web_search extra.

$ pip install "lotus-ai[serpapi]"
$ pip install "lotus-ai[arxiv]"
$ pip install "lotus-ai[pubmed]"
$ pip install "lotus-ai[web_search]"
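
A missing API key is easier to diagnose before the search runs. The following sketch checks the environment variables named above. REQUIRED_ENV and missing_key are hypothetical names, not part of LOTUS, and the mapping only reflects the setup list in this section.

```python
import os

# Environment variable required per corpus, taken from the setup list above.
# arXiv and PubMed are absent because no API key is listed for them.
REQUIRED_ENV = {
    "GOOGLE": "SERPAPI_API_KEY",
    "GOOGLE_SCHOLAR": "SERPAPI_API_KEY",
    "YOU": "YOU_API_KEY",
    "TAVILY": "TAVILY_API_KEY",
}


def missing_key(corpus_name, env=None):
    """Return the name of the required variable if it is unset or empty, else None."""
    env = os.environ if env is None else env
    var = REQUIRED_ENV.get(corpus_name)
    if var is not None and not env.get(var):
        return var
    return None


# Passing an explicit dict makes the check easy to try without touching os.environ.
print(missing_key("TAVILY", env={}))
```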

Parameters

web_search(
    corpus,
    query,
    K,
    cols=None,
    sort_by_date=False,
    start_date=None,
    end_date=None,
    delay=0.1,
)

API Reference

class lotus.web_search.WebSearchCorpus(value)

An enumeration.

lotus.web_search.web_search(corpus: WebSearchCorpus, query: str | list[str], K: int, cols: list[str] | None = None, sort_by_date: bool = False, start_date: datetime | None = None, end_date: datetime | None = None, delay: float = 0.1) → DataFrame

Perform web search across different search engines.

Parameters:
  • corpus – The search engine to use (GOOGLE, GOOGLE_SCHOLAR, ARXIV, YOU, BING, TAVILY, PUBMED)

  • query – The search query string, or a list of query strings.

  • K – Maximum number of results to return per query.

  • cols – Optional list of columns to include in the results.

  • sort_by_date – Whether to sort results by date (currently only supported for ARXIV).

  • start_date – Optional start date for filtering results (as a datetime object). Supported for GOOGLE, GOOGLE_SCHOLAR, ARXIV, TAVILY, and YOU.

  • end_date – Optional end date for filtering results (as a datetime object). Supported for GOOGLE, GOOGLE_SCHOLAR, ARXIV, TAVILY, and YOU.

Returns:

A pandas DataFrame containing the search results with a query column.

Raises:

ValueError – If the date format is invalid or required API keys are not set.
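
Since configuration problems surface as ValueError, a pipeline may prefer to catch that exception and continue. This is a minimal sketch of that pattern. search_or_none is a hypothetical wrapper, not part of LOTUS, and fake_search with its error message is a made-up stand-in for web_search so the example runs without network access or API keys.

```python
def search_or_none(search_fn, *args, **kwargs):
    """Call search_fn and return its result, or None if it raises ValueError."""
    try:
        return search_fn(*args, **kwargs)
    except ValueError as exc:
        print(f"web search skipped: {exc}")
        return None


# Stand-in for web_search; the message text is illustrative only.
def fake_search(corpus, query, K):
    raise ValueError("required API key is not set")


result = search_or_none(fake_search, "TAVILY", "AI safety evaluations", 5)
```

In real use the wrapper would receive the actual function, for example search_or_none(web_search, WebSearchCorpus.TAVILY, "AI safety evaluations", 5).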


© Copyright 2024, Liana Patel, Siddharth Jha, Carlos Guestrin, Matei Zaharia.
