web_search
===========

``web_search`` loads web search results into a pandas DataFrame. Use it when
you need a tabular set of search results before applying semantic operators,
pandas transformations, or a LazyFrame pipeline.

Use :doc:`web_extract` when you already have URLs or corpus-specific document
IDs and want the full text.

Supported corpora are:

- ``WebSearchCorpus.GOOGLE``
- ``WebSearchCorpus.GOOGLE_SCHOLAR``
- ``WebSearchCorpus.ARXIV``
- ``WebSearchCorpus.YOU``
- ``WebSearchCorpus.TAVILY``
- ``WebSearchCorpus.PUBMED``
- ``WebSearchCorpus.BING``; Bing is discontinued and raises a deprecation
  warning in the current implementation.

Basic Search
------------

``web_search`` accepts one query or a list of queries and returns one DataFrame
with a ``query`` column.

.. code-block:: python

    from lotus import WebSearchCorpus, web_search

    df = web_search(
        WebSearchCorpus.ARXIV,
        query="lazy dataframe query optimization",
        K=5,
    )

    print(df[["title", "abstract", "query"]])

Search Multiple Queries
-----------------------

.. code-block:: python

    df = web_search(
        WebSearchCorpus.PUBMED,
        query=[
            "large language models clinical summarization",
            "retrieval augmented generation medicine",
        ],
        K=3,
    )

Date Filtering
--------------

``start_date`` and ``end_date`` filter results for Google, Google Scholar,
arXiv, You.com, Tavily, and PubMed. ``sort_by_date`` is supported for arXiv.

.. code-block:: python

    from datetime import datetime
    from lotus import WebSearchCorpus, web_search

    df = web_search(
        WebSearchCorpus.ARXIV,
        "transformer architecture",
        10,
        sort_by_date=True,
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 12, 31),
    )

Select Columns
--------------

Use ``cols`` to request a subset of result fields.

.. code-block:: python

    df = web_search(
        WebSearchCorpus.TAVILY,
        "AI safety evaluations",
        5,
        cols=["title", "url", "content"],
    )

Common default columns include:

- arXiv: ``id``, ``title``, ``link``, ``abstract``, ``published``,
  ``authors``, ``categories``
- Google and Google Scholar: ``title``, ``link``, ``snippet``, ``date``,
  ``publication_info``
- You.com: ``title``, ``url``, ``snippets``, ``description``
- Tavily: ``title``, ``url``, ``content``
- PubMed: ``id``, ``title``, ``link``, ``abstract``, ``published``,
  ``authors``, ``journal``, ``doi``, ``methods``, ``results``,
  ``conclusions``

Required Setup
--------------

- Google and Google Scholar require ``SERPAPI_API_KEY`` and the ``serpapi``
  extra.
- arXiv requires the ``arxiv`` extra.
- PubMed requires the ``pubmed`` extra.
- You.com requires ``YOU_API_KEY`` and the ``web_search`` extra.
- Tavily requires ``TAVILY_API_KEY`` and the ``web_search`` extra.

.. code-block:: console

    $ pip install "lotus-ai[serpapi]"
    $ pip install "lotus-ai[arxiv]"
    $ pip install "lotus-ai[pubmed]"
    $ pip install "lotus-ai[web_search]"

Parameters
----------

.. code-block:: python

    web_search(
        corpus,
        query,
        K,
        cols=None,
        sort_by_date=False,
        start_date=None,
        end_date=None,
        delay=0.1,
    )

API Reference
-------------

.. autoclass:: lotus.web_search.WebSearchCorpus
   :members:

.. autofunction:: lotus.web_search.web_search