sem_extract
Overview
The sem_extract operator generates one or more output columns from the input columns. Each output column is specified by a natural language projection. Optionally, the operator can also extract direct quotes from the source text to support each output.
Motivation
Semantic extraction is useful for deriving a structured schema, and with it a simplified view of the data, from a column of unstructured documents. The quoting functionality is useful for tasks such as entity extraction or fact-checking, where verbatim snippets or verified quotes may be preferable to synthesized answers.
Example
import pandas as pd

import lotus
from lotus.models import LM

lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)

df = pd.DataFrame(
    {
        "description": [
            "Yoshi is 25 years old",
            "Bowser is 45 years old",
            "Luigi is 15 years old",
        ]
    }
)
input_cols = ["description"]

# A description can be specified for each output column
output_cols = {
    "masked_col_1": "The name of the person",
    "masked_col_2": "The age of the person",
}

# you can optionally set extract_quotes=True to return quotes that support each output
new_df = df.sem_extract(input_cols, output_cols, extract_quotes=True)
print(new_df)

# A description can also be omitted for each output column
output_cols = {
    "name": None,
    "age": None,
}
new_df = df.sem_extract(input_cols, output_cols)
print(new_df)
Output:
| | description | masked_col_1 | masked_col_2 | masked_col_1_quote | masked_col_2_quote |
|---|---|---|---|---|---|
| 0 | Yoshi is 25 years old | Yoshi | 25 | Yoshi | 25 years old |
| 1 | Bowser is 45 years old | Bowser | 45 | Bowser | 45 years old |
| 2 | Luigi is 15 years old | Luigi | 15 | Luigi | 15 years old |

| | description | name | age |
|---|---|---|---|
| 0 | Yoshi is 25 years old | Yoshi | 25 |
| 1 | Bowser is 45 years old | Bowser | 45 |
| 2 | Luigi is 15 years old | Luigi | 15 |
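
For fact-checking workflows, it can help to confirm that each returned quote really appears verbatim in the source text. The sketch below is one possible way to do that and is not part of LOTUS: `find_unsupported_quotes` is a hypothetical helper, and it assumes that quote columns carry the `_quote` suffix shown in the first table and hold a single string per row. The rows are hand-copied from that table, so no model call is made.

```python
from typing import Any


def find_unsupported_quotes(
    rows: list[dict[str, Any]],
    source_col: str,
    quote_suffix: str = "_quote",
) -> list[tuple[int, str, Any]]:
    """Return (row_index, column, quote) for each quote not found verbatim in the source."""
    problems = []
    for i, row in enumerate(rows):
        source = str(row[source_col])
        for col, quote in row.items():
            # Assumption: every column ending in quote_suffix holds one quote string.
            if col.endswith(quote_suffix) and str(quote) not in source:
                problems.append((i, col, quote))
    return problems


# Rows hand-copied from the first output table; no model call is made here.
rows = [
    {
        "description": "Yoshi is 25 years old",
        "masked_col_1_quote": "Yoshi",
        "masked_col_2_quote": "25 years old",
    },
    {
        "description": "Bowser is 45 years old",
        "masked_col_1_quote": "Bowser",
        "masked_col_2_quote": "45 years old",
    },
    {
        "description": "Luigi is 15 years old",
        "masked_col_1_quote": "Luigi",
        "masked_col_2_quote": "15 years old",
    },
]

# An empty list means every quote was found in its source row.
print(find_unsupported_quotes(rows, "description"))
```

With a real result, the same rows could be obtained from the DataFrame via `new_df.to_dict("records")`. A plain substring test is deliberately strict: a quote that differs from the source only in whitespace or casing would be flagged.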
Required Parameters
input_cols : The columns that a model should extract from.
output_cols : A mapping from desired output column names to optional descriptions.
Optional Parameters
extract_quotes : Whether to extract quotes for the output columns. Defaults to False.
postprocessor : The postprocessor for the model outputs. Defaults to extract_postprocess.
return_raw_outputs : Whether to return raw outputs. Defaults to False.
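
The source does not show what a postprocessor looks like, so the following is only a rough illustration of the idea: turning raw model text into one dictionary of extracted fields per row. `parse_extractions` and its signature are hypothetical and should not be read as the interface of `extract_postprocess`; the sketch also assumes the model emits one JSON object per row, which may not match the actual response format.

```python
import json
from typing import Any


def parse_extractions(
    raw_outputs: list[str],
    output_names: list[str],
) -> list[dict[str, Any]]:
    """Parse one JSON object per raw output, keeping only the requested fields."""
    parsed = []
    for raw in raw_outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = {}
        # Anything that is not a JSON object is treated as an empty extraction.
        if not isinstance(data, dict):
            data = {}
        # Missing fields become None so every row has the same keys.
        parsed.append({name: data.get(name) for name in output_names})
    return parsed


# Hypothetical raw outputs; real model responses may be formatted differently.
raw_outputs = ['{"name": "Yoshi", "age": "25"}', "not valid json"]
print(parse_extractions(raw_outputs, ["name", "age"]))
```

Filling unparseable rows with None rather than raising keeps the output aligned with the input rows, which matters when the results are written back as DataFrame columns.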