sem_extract

Overview

The sem_extract operator generates one or more new columns from the input columns. Each output column is specified by a natural language projection. Optionally, it can also extract direct quotes from the source text to support each output value.

Motivation

Semantic extraction is useful for deriving a structured schema from a column of unstructured documents, which gives a simplified view of the data. The quoting functionality is useful for tasks such as entity extraction or fact-checking, where verbatim snippets from the source may be preferable to synthesized answers.

Example

import pandas as pd

import lotus
from lotus.models import LM

lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)

df = pd.DataFrame(
    {
        "description": [
            "Yoshi is 25 years old",
            "Bowser is 45 years old",
            "Luigi is 15 years old",
        ]
    }
)
input_cols = ["description"]

# A description can be specified for each output column
output_cols = {
    "masked_col_1": "The name of the person",
    "masked_col_2": "The age of the person",
}

# Optionally, set extract_quotes=True to also return quotes that support each output
new_df = df.sem_extract(input_cols, output_cols, extract_quotes=True)
print(new_df)

# The description can also be omitted by mapping an output column name to None
output_cols = {
    "name": None,
    "age": None,
}
new_df = df.sem_extract(input_cols, output_cols)
print(new_df)

Output:

              description  masked_col_1  masked_col_2  masked_col_1_quote  masked_col_2_quote
0   Yoshi is 25 years old         Yoshi            25               Yoshi        25 years old
1  Bowser is 45 years old        Bowser            45              Bowser        45 years old
2   Luigi is 15 years old         Luigi            15               Luigi        15 years old

              description    name  age
0   Yoshi is 25 years old   Yoshi   25
1  Bowser is 45 years old  Bowser   45
2   Luigi is 15 years old   Luigi   15
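
For fact-checking workflows, it can help to confirm that each returned quote really occurs in its source text. The following is a minimal sketch in plain Python, not part of the library: the helper names quote_is_verbatim and find_unsupported_quotes are hypothetical, and the rows are hand-written to mirror the first output above, with the last quote deliberately altered. With a real result, rows of this shape can be obtained with new_df.to_dict("records").

```python
def quote_is_verbatim(source: str, quote: str) -> bool:
    """Return True if the quote appears word-for-word in the source text."""
    # Collapse whitespace runs so line wraps do not cause false mismatches.
    def normalize(text: str) -> str:
        return " ".join(text.split())

    quote = normalize(quote)
    return bool(quote) and quote in normalize(source)


def find_unsupported_quotes(records, source_col, quote_cols):
    """List (row_index, quote_column) pairs whose quote is not in the source."""
    unsupported = []
    for i, row in enumerate(records):
        for col in quote_cols:
            if not quote_is_verbatim(str(row[source_col]), str(row[col])):
                unsupported.append((i, col))
    return unsupported


# Hand-written rows shaped like the first output above. The last quote is
# altered on purpose so that it no longer matches its description.
records = [
    {"description": "Yoshi is 25 years old",
     "masked_col_1_quote": "Yoshi", "masked_col_2_quote": "25 years old"},
    {"description": "Bowser is 45 years old",
     "masked_col_1_quote": "Bowser", "masked_col_2_quote": "45 years old"},
    {"description": "Luigi is 15 years old",
     "masked_col_1_quote": "Luigi", "masked_col_2_quote": "16 years old"},
]

quote_cols = ["masked_col_1_quote", "masked_col_2_quote"]
print(find_unsupported_quotes(records, "description", quote_cols))
# → [(2, 'masked_col_2_quote')]
```

An exact substring match is a deliberately strict choice here: it flags paraphrased quotes as unsupported, which is usually the safer default when quotes are meant to serve as evidence.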

Required Parameters

  • input_cols : The columns that a model should extract from.

  • output_cols : A mapping from desired output column names to optional descriptions.

Optional Parameters

  • extract_quotes : Whether to extract quotes for the output columns. Defaults to False.

  • postprocessor : The postprocessor for the model outputs. Defaults to extract_postprocess.

  • return_raw_outputs : Whether to return raw outputs. Defaults to False.
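
To make the expected argument shapes concrete, the sketch below checks a set of arguments before they are passed to sem_extract. It is an illustration only: check_extract_args is a hypothetical helper, not a library function, and whether the library performs similar checks itself is not stated here. The "<name>_quote" naming is an assumption taken from the example output above.

```python
def check_extract_args(df_columns, input_cols, output_cols, extract_quotes=False):
    """Return a list of problems found in the arguments (empty if none)."""
    problems = []
    existing = set(df_columns)

    # input_cols: every listed column must already exist in the DataFrame.
    for col in input_cols:
        if col not in existing:
            problems.append(f"input column {col!r} not found")

    # output_cols: maps each new column name to a description or None.
    for name, description in output_cols.items():
        if description is not None and not isinstance(description, str):
            problems.append(f"description for {name!r} must be a string or None")
        new_names = [name]
        if extract_quotes:
            # Assumed naming, based on the example output: "<name>_quote".
            new_names.append(f"{name}_quote")
        for new_name in new_names:
            if new_name in existing:
                problems.append(f"output column {new_name!r} already exists")
    return problems


# Valid arguments: one existing input column, two new output columns.
print(check_extract_args(
    ["description"],
    ["description"],
    {"name": None, "age": "The age of the person"},
))
# → []

# Invalid arguments: unknown input column and a clashing output name.
print(check_extract_args(["description", "name"], ["text"], {"name": None}))
# → ["input column 'text' not found", "output column 'name' already exists"]
```

With a real DataFrame, list(df.columns) can be passed as df_columns.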