web_extract

web_extract extracts full text from URLs or corpus-specific document IDs and returns the results as a pandas DataFrame. Use it after web_search when search results point to documents you want to process, or use it directly when you already know the document IDs or URLs.

Basic Extraction

doc_ids and urls each accept either a string or a list of strings. The result has id, url, and full_text columns.

from lotus import WebSearchCorpus, web_extract

df = web_extract(
    WebSearchCorpus.ARXIV,
    doc_ids="2303.08774",
)

print(df[["id", "url", "full_text"]])

Extract Multiple Documents

df = web_extract(
    WebSearchCorpus.TAVILY,
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Machine_learning",
    ],
    max_length=20_000,
)

When the provider supports batching, LOTUS sends one batched request. Otherwise it fetches each identifier separately. delay controls the pause between non-batched fetches.

Document IDs and URLs

For arXiv and PubMed, doc_ids are converted to canonical document URLs. For other corpora, doc_ids are treated as URLs. Passing urls always uses the given URL directly.

pubmed = web_extract(
    WebSearchCorpus.PUBMED,
    doc_ids=["12345678", "23456789"],
)

page = web_extract(
    WebSearchCorpus.YOU,
    urls="https://example.com/article",
)

Using Extracted Text

The returned DataFrame works with semantic operators and LazyFrames. For example, you can extract papers and then summarize their full text.

papers = web_extract(
    WebSearchCorpus.ARXIV,
    doc_ids=["2407.11418", "2309.06180"],
    max_length=40_000,
)

summary = papers.sem_agg(
    "Summarize the shared technical themes across {full_text}."
)

Parameters

web_extract(
    corpus,
    doc_ids=None,
    urls=None,
    max_length=None,
    delay=0.1,
)

API Reference

Extract full text from specific ids/urls across different search engines.

Accepts a single value or a list of values for doc_ids / urls. When the underlying API supports batching (e.g. Tavily), a single request is made; otherwise each identifier is fetched individually.

Parameters:

corpus – The search engine to use (GOOGLE, GOOGLE_SCHOLAR, ARXIV, YOU, BING, TAVILY, PUBMED)
doc_ids – Corpus-specific identifier(s). Required for ARXIV/PUBMED when url is not provided.
urls – URL(s) to fetch. For non-ARXIV/PUBMED corpora, doc_ids is treated as urls.
max_length – Optional maximum character length for extracted full text.

Returns:

id, url, and full_text.

Return type:

A pandas DataFrame with columns