web_extract ============ ``web_extract`` extracts full text from URLs or corpus-specific document IDs and returns the results as a pandas DataFrame. Use it after :doc:`web_search` when search results point to documents you want to process, or use it directly when you already know the document IDs or URLs. Basic Extraction ---------------- ``doc_ids`` and ``urls`` each accept either a string or a list of strings. The result has ``id``, ``url``, and ``full_text`` columns. .. code-block:: python from lotus import WebSearchCorpus, web_extract df = web_extract( WebSearchCorpus.ARXIV, doc_ids="2303.08774", ) print(df[["id", "url", "full_text"]]) Extract Multiple Documents -------------------------- .. code-block:: python df = web_extract( WebSearchCorpus.TAVILY, urls=[ "https://en.wikipedia.org/wiki/Artificial_intelligence", "https://en.wikipedia.org/wiki/Machine_learning", ], max_length=20_000, ) When the provider supports batching, LOTUS sends one batched request. Otherwise it fetches each identifier separately. ``delay`` controls the pause between non-batched fetches. Document IDs and URLs --------------------- For arXiv and PubMed, ``doc_ids`` are converted to canonical document URLs. For other corpora, ``doc_ids`` are treated as URLs. Passing ``urls`` always uses the given URL directly. .. code-block:: python pubmed = web_extract( WebSearchCorpus.PUBMED, doc_ids=["12345678", "23456789"], ) page = web_extract( WebSearchCorpus.YOU, urls="https://example.com/article", ) Using Extracted Text -------------------- The returned DataFrame works with semantic operators and LazyFrames. For example, you can extract papers and then summarize their full text. .. code-block:: python papers = web_extract( WebSearchCorpus.ARXIV, doc_ids=["2407.11418", "2309.06180"], max_length=40_000, ) summary = papers.sem_agg( "Summarize the shared technical themes across {full_text}." ) Parameters ---------- .. code-block:: python web_extract( corpus, doc_ids=None, urls=None, max_length=None, delay=0.1, ) API Reference ------------- .. autofunction:: lotus.web_search.web_extract