sem_partition_by
Overview
The sem_partition_by utility in LOTUS exposes finer-grained control over how data is processed for sem_agg. This operator let’s you assign a partition number to each row in a DataFrame. During semantic aggregation, LOTUS, will aggregate over each partition separately, before combining intermediate aggregations across partitions. Additionally, the order in which each partition aggregates is combined will follow the order of the partition numbers in increasing order. By default, LOTUS implements a hierarchical reduce strategy, assuming that all record belong to the same partition.
Motivation
Since LLMs are sensitive to the ordering of inputs, specifying an aggregation ordering using sem_partition_by can provide fine-grained control to achieve high quality results for tasks like summarization.
Example
import pandas as pd
import lotus
from lotus.models import LM, SentenceTransformersRM
lm = LM(max_tokens=2048)
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
lotus.settings.configure(lm=lm, rm=rm)
data = {
"Course Name": [
"Probability and Random Processes",
"Optimization Methods in Engineering",
"Digital Design and Integrated Circuits",
"Computer Security",
"Cooking",
"Food Sciences",
]
}
df = pd.DataFrame(data)
df = df.sem_index("Course Name", "course_name_index").sem_partition_by(lotus.utils.cluster("Course Name", 2))
out = df.sem_agg("Summarize all {Course Name}")._output[0]
print(out)
Required Parameters
partition_fn : The partitioning function, which returns a list[int], indicating the partition-id of each row.