web_extract
web_extract extracts full text from URLs or corpus-specific document IDs
and returns the results as a pandas DataFrame. Use it after web_search
when search results point to documents you want to process, or use it directly
when you already know the document IDs or URLs.
Basic Extraction
doc_ids and urls each accept either a string or a list of strings. The
result has id, url, and full_text columns.
from lotus import WebSearchCorpus, web_extract

df = web_extract(
    WebSearchCorpus.ARXIV,
    doc_ids="2303.08774",
)
print(df[["id", "url", "full_text"]])
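Because doc_ids and urls accept either a single string or a list, a caller-side wrapper may want to normalize both shapes before doing its own bookkeeping. The following is a minimal sketch of that normalization. The helper name normalize_identifiers is hypothetical and is not part of LOTUS; how LOTUS handles this internally is not shown here.

```python
from typing import Union

# Accepted input shapes, mirroring the documented parameter types.
StrOrList = Union[str, list[str], None]


def normalize_identifiers(value: StrOrList) -> list[str]:
    """Return a list of identifiers from a string, a list, or None.

    Hypothetical helper for illustration; not a LOTUS function.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one identifier, not a sequence of characters.
        return [value]
    return list(value)


single = normalize_identifiers("2303.08774")
many = normalize_identifiers(["2303.08774", "2407.11418"])
empty = normalize_identifiers(None)
```

Checking for str before iterating matters: a string is itself iterable, so skipping that check would split one ID into characters.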
Extract Multiple Documents
df = web_extract(
    WebSearchCorpus.TAVILY,
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Machine_learning",
    ],
    max_length=20_000,
)
When the provider supports batching, LOTUS sends one batched request.
Otherwise it fetches each identifier separately. delay controls the pause
between non-batched fetches.
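The batched versus per-identifier behavior described above can be pictured with a small stand-alone sketch. Everything here is illustrative: fetch_all, fake_fetch_one, and fake_fetch_batch are hypothetical names, the fetchers return placeholder strings instead of making network calls, and placing the pause only between fetches (not before the first) is an assumption rather than confirmed LOTUS behavior.

```python
import time
from typing import Callable, Optional


def fetch_all(
    identifiers: list[str],
    fetch_one: Callable[[str], str],
    fetch_batch: Optional[Callable[[list[str]], list[str]]] = None,
    delay: float = 0.1,
) -> list[str]:
    """Fetch texts in one batched call if possible, else one by one."""
    if fetch_batch is not None:
        # Provider supports batching: a single request covers every identifier.
        return fetch_batch(identifiers)

    texts = []
    for index, identifier in enumerate(identifiers):
        if index > 0:
            # Assumed placement: pause between fetches, not before the first.
            time.sleep(delay)
        texts.append(fetch_one(identifier))
    return texts


# Stand-in fetchers that record each "request" instead of using the network.
calls: list[list[str]] = []


def fake_fetch_one(identifier: str) -> str:
    calls.append([identifier])
    return f"text for {identifier}"


def fake_fetch_batch(identifiers: list[str]) -> list[str]:
    calls.append(list(identifiers))
    return [f"text for {identifier}" for identifier in identifiers]


batched = fetch_all(["a", "b"], fake_fetch_one, fake_fetch_batch)
sequential = fetch_all(["a", "b"], fake_fetch_one, delay=0.01)
```

In this sketch the batched path records one request for both identifiers, while the sequential path records one request per identifier with a pause in between.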
Document IDs and URLs
For arXiv and PubMed, doc_ids are converted to canonical document URLs.
For other corpora, doc_ids are treated as URLs. Passing urls always
uses the given URL directly.
pubmed = web_extract(
    WebSearchCorpus.PUBMED,
    doc_ids=["12345678", "23456789"],
)

page = web_extract(
    WebSearchCorpus.YOU,
    urls="https://example.com/article",
)
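To make the ID-to-URL rule concrete, here is a rough sketch of how such a resolution step could look. The function resolve_url and the plain-string corpus names are hypothetical, and the URL templates are the public arXiv abstract and PubMed record formats, assumed here for illustration; the exact canonical URLs LOTUS builds may differ.

```python
# Assumed URL templates (public arXiv and PubMed page formats); the URLs
# LOTUS actually constructs are not confirmed by this page.
ARXIV_URL = "https://arxiv.org/abs/{doc_id}"
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{doc_id}/"

URL_TEMPLATES = {"arxiv": ARXIV_URL, "pubmed": PUBMED_URL}


def resolve_url(corpus: str, doc_id: str) -> str:
    """Map a corpus-specific document ID to a URL.

    Hypothetical helper: arXiv and PubMed IDs are expanded through a
    template, and IDs for any other corpus are returned unchanged because
    they are treated as URLs.
    """
    template = URL_TEMPLATES.get(corpus.lower())
    if template is None:
        return doc_id
    return template.format(doc_id=doc_id)


arxiv_url = resolve_url("arxiv", "2303.08774")
pubmed_url = resolve_url("pubmed", "12345678")
other_url = resolve_url("you", "https://example.com/article")
```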
Using Extracted Text
The returned DataFrame works with semantic operators and LazyFrames. For example, you can extract papers and then summarize their full text.
papers = web_extract(
    WebSearchCorpus.ARXIV,
    doc_ids=["2407.11418", "2309.06180"],
    max_length=40_000,
)

summary = papers.sem_agg(
    "Summarize the shared technical themes across {full_text}."
)
Parameters
web_extract(
    corpus,
    doc_ids=None,
    urls=None,
    max_length=None,
    delay=0.1,
)
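As a rough illustration of what a character cap such as max_length implies, the sketch below truncates text by simple slicing. The helper truncate_text is hypothetical, and plain prefix slicing is an assumption; LOTUS may truncate differently (for example, at a word or sentence boundary).

```python
from typing import Optional


def truncate_text(full_text: str, max_length: Optional[int] = None) -> str:
    """Cap text at max_length characters; None leaves it unchanged.

    Hypothetical helper for illustration; not a LOTUS function.
    """
    if max_length is None:
        return full_text
    # Assumed behavior: keep the first max_length characters.
    return full_text[:max_length]


capped = truncate_text("full text of a long document", max_length=9)
uncapped = truncate_text("full text of a long document")
```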
API Reference
- lotus.web_search.web_extract(corpus: WebSearchCorpus, doc_ids: str | list[str] | None = None, urls: str | list[str] | None = None, max_length: int | None = None, delay: float = 0.1) -> DataFrame
Extract full text from specific ids/urls across different search engines.
Accepts a single value or a list of values for doc_ids/urls. When the underlying API supports batching (e.g. Tavily), a single request is made; otherwise each identifier is fetched individually.
- Parameters:
corpus – The search engine to use (GOOGLE, GOOGLE_SCHOLAR, ARXIV, YOU, BING, TAVILY, PUBMED)
doc_ids – Corpus-specific identifier(s). Required for ARXIV/PUBMED when urls is not provided.
urls – URL(s) to fetch. For non-ARXIV/PUBMED corpora, doc_ids is treated as urls.
max_length – Optional maximum character length for extracted full text.
delay – Pause between non-batched fetches. Defaults to 0.1.
- Returns:
A pandas DataFrame with columns id, url, and full_text.
- Return type:
DataFrame