web_search
web_search loads web search results into a pandas DataFrame. Use it when
you need a tabular set of search results before applying semantic operators,
pandas transformations, or a LazyFrame pipeline.
Use web_extract when you already have URLs or corpus-specific document IDs and want the full text.
Supported corpora are:
- WebSearchCorpus.GOOGLE
- WebSearchCorpus.GOOGLE_SCHOLAR
- WebSearchCorpus.ARXIV
- WebSearchCorpus.YOU
- WebSearchCorpus.TAVILY
- WebSearchCorpus.PUBMED
- WebSearchCorpus.BING (discontinued; raises a deprecation warning in the current implementation)
Basic Search
web_search accepts one query or a list of queries and returns one DataFrame
with a query column.
from lotus import WebSearchCorpus, web_search

# Fetch up to five arXiv results for a single query.
df = web_search(
    WebSearchCorpus.ARXIV,
    query="lazy dataframe query optimization",
    K=5,
)
print(df[["title", "abstract", "query"]])
Search Multiple Queries
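from lotus import WebSearchCorpus, web_search

# Up to K results are returned per query; the query column records
# which query produced each row.
df = web_search(
    WebSearchCorpus.PUBMED,
    query=[
        "large language models clinical summarization",
        "retrieval augmented generation medicine",
    ],
    K=3,
)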
Date Filtering
start_date and end_date filter results for Google, Google Scholar,
arXiv, You.com, Tavily, and PubMed. sort_by_date is supported for arXiv.
from datetime import datetime
from lotus import WebSearchCorpus, web_search

# Positional arguments are corpus, query, K.
df = web_search(
    WebSearchCorpus.ARXIV,
    "transformer architecture",
    10,
    sort_by_date=True,
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 12, 31),
)
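Because date filters are only applied for some corpora, it can help to check the arguments before issuing a search. The following is a minimal sketch of such a check and is not part of LOTUS: `check_date_args` and `DATE_FILTER_CORPORA` are hypothetical names, the set mirrors the corpora named in the prose above, and corpora are written as plain strings (the WebSearchCorpus member names) so the snippet runs without LOTUS installed.

```python
from datetime import datetime

# Corpora for which this section says start_date / end_date are applied.
# Hypothetical constant for illustration; LOTUS does not define it.
DATE_FILTER_CORPORA = {
    "GOOGLE", "GOOGLE_SCHOLAR", "ARXIV", "YOU", "TAVILY", "PUBMED",
}


def check_date_args(corpus_name, start_date=None, end_date=None):
    """Return a list of problems found; an empty list means none."""
    problems = []
    has_filter = start_date is not None or end_date is not None
    if has_filter and corpus_name not in DATE_FILTER_CORPORA:
        problems.append(f"{corpus_name} is not listed as supporting date filters")
    if start_date is not None and end_date is not None and start_date > end_date:
        problems.append("start_date is after end_date")
    return problems


# Example: a valid arXiv range, then a corpus outside the listed set.
print(check_date_args("ARXIV", datetime(2024, 1, 1), datetime(2024, 12, 31)))
print(check_date_args("BING", start_date=datetime(2024, 1, 1)))
```

Returning a list of messages rather than raising lets the caller decide whether an unsupported filter should be an error or only a warning.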
Select Columns
Use cols to request a subset of result fields.
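from lotus import WebSearchCorpus, web_search

# Keep only the title, url, and content fields of each Tavily result.
df = web_search(
    WebSearchCorpus.TAVILY,
    "AI safety evaluations",
    5,
    cols=["title", "url", "content"],
)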
Common default columns include:
- arXiv: id, title, link, abstract, published, authors, categories
- Google and Google Scholar: title, link, snippet, date, publication_info
- You.com: title, url, snippets, description
- Tavily: title, url, content
- PubMed: id, title, link, abstract, published, authors, journal, doi, methods, results, conclusions
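The defaults listed above can be captured in a small lookup to sanity-check a cols argument before searching. This is a sketch under assumptions: `DEFAULT_COLUMNS` and `unknown_columns` are hypothetical names rather than LOTUS APIs, and the keys are WebSearchCorpus member names written as strings. Since the list describes common defaults rather than an exhaustive schema, a column flagged here may still be valid.

```python
# Default columns per corpus, copied from the list in this section.
# Hypothetical lookup for illustration; LOTUS does not export this mapping.
_SERP_COLUMNS = ["title", "link", "snippet", "date", "publication_info"]

DEFAULT_COLUMNS = {
    "ARXIV": ["id", "title", "link", "abstract", "published", "authors", "categories"],
    "GOOGLE": _SERP_COLUMNS,
    "GOOGLE_SCHOLAR": _SERP_COLUMNS,
    "YOU": ["title", "url", "snippets", "description"],
    "TAVILY": ["title", "url", "content"],
    "PUBMED": [
        "id", "title", "link", "abstract", "published", "authors",
        "journal", "doi", "methods", "results", "conclusions",
    ],
}


def unknown_columns(corpus_name, cols):
    """Return the requested columns that are not among the listed defaults."""
    known = set(DEFAULT_COLUMNS.get(corpus_name, []))
    return [col for col in cols if col not in known]


# Example: "abstract" is listed for arXiv and PubMed but not for Tavily.
print(unknown_columns("TAVILY", ["title", "abstract"]))
```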
Required Setup
- Google and Google Scholar require SERPAPI_API_KEY and the serpapi extra.
- arXiv requires the arxiv extra.
- PubMed requires the pubmed extra.
- You.com requires YOU_API_KEY and the web_search extra.
- Tavily requires TAVILY_API_KEY and the web_search extra.
$ pip install "lotus-ai[serpapi]"
$ pip install "lotus-ai[arxiv]"
$ pip install "lotus-ai[pubmed]"
$ pip install "lotus-ai[web_search]"
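The requirements above can be checked up front so that a missing key surfaces before any search is attempted. The sketch below is illustrative only: `REQUIREMENTS`, `missing_api_key`, and `install_command` are hypothetical names, not LOTUS functions, and corpora are written as plain strings matching the WebSearchCorpus member names. It does not verify that an extra is actually installed.

```python
import os
from typing import Optional

# (environment variable or None, pip extra) per corpus, as listed in this section.
# Hypothetical table for illustration; LOTUS does not export it.
REQUIREMENTS = {
    "GOOGLE": ("SERPAPI_API_KEY", "serpapi"),
    "GOOGLE_SCHOLAR": ("SERPAPI_API_KEY", "serpapi"),
    "ARXIV": (None, "arxiv"),
    "PUBMED": (None, "pubmed"),
    "YOU": ("YOU_API_KEY", "web_search"),
    "TAVILY": ("TAVILY_API_KEY", "web_search"),
}


def missing_api_key(corpus_name: str, environ=None) -> Optional[str]:
    """Return the name of the required but unset API key, or None."""
    environ = os.environ if environ is None else environ
    key, _extra = REQUIREMENTS[corpus_name]
    if key is not None and not environ.get(key):
        return key
    return None


def install_command(corpus_name: str) -> str:
    """Return the pip command that installs the extra for this corpus."""
    _key, extra = REQUIREMENTS[corpus_name]
    return f'pip install "lotus-ai[{extra}]"'


# Example: report what is still needed for a Tavily search in this environment.
print(install_command("TAVILY"), missing_api_key("TAVILY"))
```

Passing `environ` explicitly makes the check easy to exercise in tests without modifying the real process environment.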
Parameters
web_search(
    corpus,
    query,
    K,
    cols=None,
    sort_by_date=False,
    start_date=None,
    end_date=None,
    delay=0.1,
)
API Reference
- class lotus.web_search.WebSearchCorpus(value)
An enumeration.
- lotus.web_search.web_search(corpus: WebSearchCorpus, query: str | list[str], K: int, cols: list[str] | None = None, sort_by_date: bool = False, start_date: datetime | None = None, end_date: datetime | None = None, delay: float = 0.1) → DataFrame
Perform web search across different search engines.
- Parameters:
corpus – The search engine to use (GOOGLE, GOOGLE_SCHOLAR, ARXIV, YOU, BING, TAVILY, PUBMED)
query – The search query string, or a list of query strings.
K – Maximum number of results to return per query.
cols – Optional list of columns to include in the results.
sort_by_date – Whether to sort results by date (currently only supported for ARXIV).
start_date – Optional start date for filtering results (as a datetime object). Supported for GOOGLE, GOOGLE_SCHOLAR, ARXIV, TAVILY, and YOU.
end_date – Optional end date for filtering results (as a datetime object). Supported for GOOGLE, GOOGLE_SCHOLAR, ARXIV, TAVILY, and YOU.
- Returns:
A pandas DataFrame containing the search results with a query column.
- Raises:
ValueError – If the date format is invalid or required API keys are not set.