sem_filter
Overview
sem_filter takes a langex (natural language expression) predicate and returns the data records that pass the predicate.
Motivation
Semantic filtering is a complex yet vital operation in modern data processing, requiring accurate and efficient evaluation of data rows against nuanced, natural language predicates. Unlike traditional filtering techniques, which rely on rigid and often simplistic rules, semantic filters must leverage language models to reason contextually about the data.
Filter Example
import pandas as pd
import lotus
from lotus.models import LM
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)
data = {
"Course Name": [
"Probability and Random Processes",
"Optimization Methods in Engineering",
"Digital Design and Integrated Circuits",
"Computer Security",
]
}
df = pd.DataFrame(data)
user_instruction = "{Course Name} requires a lot of math"
df = df.sem_filter(user_instruction)
print(df)
Output:
|   | Course Name |
|---|---|
| 0 | Probability and Random Processes |
| 1 | Optimization Methods in Engineering |
| 2 | Digital Design and Integrated Circuits |
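The instruction in this example names a DataFrame column in curly braces (`{Course Name}`). A minimal sketch of how such a placeholder can be tied to each row is shown here, using only the standard library. The helpers `referenced_columns` and `render_instruction` are hypothetical names for illustration, and the way LOTUS actually builds its prompts may differ.

```python
import re

# Matches a column name wrapped in curly braces, e.g. {Course Name}.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def referenced_columns(instruction):
    """Return the column names mentioned in the instruction (hypothetical helper)."""
    return PLACEHOLDER.findall(instruction)


def render_instruction(instruction, row):
    """Substitute each placeholder with the row's value (hypothetical helper)."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), instruction)


# Rows are plain dicts here; with pandas they would come from the DataFrame.
rows = [{"Course Name": "Computer Security"}, {"Course Name": "Graph Theory"}]
instruction = "{Course Name} requires a lot of math"

columns = referenced_columns(instruction)
rendered = [render_instruction(instruction, row) for row in rows]
print(columns)
print(rendered)
```

Each rendered string is the predicate a language model would be asked to judge as true or false for that row.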
Example of Filter with Approximation
import pandas as pd
import lotus
from lotus.models import LM
from lotus.types import CascadeArgs
gpt_4o_mini = LM("gpt-4o-mini")
gpt_4o = LM("gpt-4o")
lotus.settings.configure(lm=gpt_4o, helper_lm=gpt_4o_mini)
data = {
"Course Name": [
"Probability and Random Processes", "Optimization Methods in Engineering", "Digital Design and Integrated Circuits",
"Computer Security", "Data Structures and Algorithms", "Machine Learning", "Artificial Intelligence", "Natural Language Processing",
"Introduction to Robotics", "Control Systems", "Linear Algebra and Differential Equations", "Database Systems", "Cloud Computing",
"Software Engineering", "Operating Systems", "Discrete Mathematics", "Numerical Methods", "Wireless Communication Systems",
"Embedded Systems", "Advanced Computer Architecture", "Graph Theory", "Cryptography and Network Security",
"Big Data Analytics", "Deep Learning", "Organic Chemistry", "Molecular Biology", "Environmental Science",
"Genetics and Evolution", "Human Physiology", "Introduction to Anthropology", "Cultural Studies", "Political Theory",
"Macroeconomics", "Microeconomics", "Introduction to Sociology", "Developmental Psychology", "Cognitive Science",
"Introduction to Philosophy", "Ethics and Moral Philosophy", "History of Western Civilization", "Art History: Renaissance to Modern",
"World Literature", "Introduction to Journalism", "Public Speaking and Communication", "Creative Writing", "Music Theory",
"Introduction to Theater", "Film Studies", "Environmental Policy and Law", "Sustainability and Renewable Energy",
"Urban Planning and Design", "International Relations", "Marketing Principles", "Organizational Behavior",
"Financial Accounting", "Corporate Finance", "Business Law", "Supply Chain Management", "Operations Research",
"Entrepreneurship and Innovation", "Introduction to Psychology", "Health Economics", "Biostatistics",
"Social Work Practice", "Public Health Policy", "Environmental Ethics", "History of Political Thought", "Quantitative Research Methods",
"Comparative Politics", "Urban Economics", "Behavioral Economics", "Sociology of Education", "Social Psychology",
"Gender Studies", "Media and Communication Studies", "Advertising and Brand Strategy",
"Sports Management", "Introduction to Archaeology", "Ecology and Conservation Biology", "Marine Biology",
"Geology and Earth Science", "Astronomy and Astrophysics", "Introduction to Meteorology",
"Introduction to Oceanography", "Quantum Physics", "Thermodynamics", "Fluid Mechanics", "Solid State Physics",
"Classical Mechanics", "Introduction to Civil Engineering", "Material Science and Engineering", "Structural Engineering",
"Environmental Engineering", "Energy Systems Engineering", "Aerodynamics", "Heat Transfer",
"Renewable Energy Systems", "Transportation Engineering", "Water Resources Management", "Principles of Accounting",
"Project Management", "International Business", "Business Analytics",
]
}
df = pd.DataFrame(data)
user_instruction = "{Course Name} requires a lot of math"
cascade_args = CascadeArgs(recall_target=0.9, precision_target=0.9, sampling_percentage=0.5, failure_probability=0.2)
df, stats = df.sem_filter(user_instruction=user_instruction, cascade_args=cascade_args, return_stats=True)
print(df)
print(stats)
Output:
|     | Course Name |
|---|---|
| 0 | Probability and Random Processes |
| 1 | Optimization Methods in Engineering |
| 2 | Digital Design and Integrated Circuits |
| 5 | Machine Learning |
| 6 | Artificial Intelligence |
| 7 | Natural Language Processing |
| 8 | Introduction to Robotics |
| 9 | Control Systems |
| 10 | Linear Algebra and Differential Equations |
| 15 | Discrete Mathematics |
| 16 | Numerical Methods |
| 17 | Wireless Communication Systems |
| 19 | Advanced Computer Architecture |
| 20 | Graph Theory |
| 21 | Cryptography and Network Security |
| 22 | Big Data Analytics |
| 23 | Deep Learning |
| 33 | Microeconomics |
| 55 | Corporate Finance |
| 58 | Operations Research |
| 61 | Health Economics |
| 62 | Biostatistics |
| 67 | Quantitative Research Methods |
| 69 | Urban Economics |
| 81 | Astronomy and Astrophysics |
| 84 | Quantum Physics |
| 85 | Thermodynamics |
| 86 | Fluid Mechanics |
| 87 | Solid State Physics |
| 88 | Classical Mechanics |
| 89 | Introduction to Civil Engineering |
| 90 | Material Science and Engineering |
| 91 | Structural Engineering |
| 92 | Environmental Engineering |
| 93 | Energy Systems Engineering |
| 94 | Aerodynamics |
| 95 | Heat Transfer |
| 96 | Renewable Energy Systems |
| 97 | Transportation Engineering |
| 102 | Business Analytics |
Output Statistics:
{'pos_cascade_threshold': 0.62, 'neg_cascade_threshold': 0.58, 'filters_resolved_by_helper_model': 101, 'filters_resolved_by_large_model': 2, 'num_routed_to_helper_model': 101}
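One way to read these statistics is as the result of a two-threshold routing rule. The sketch below assumes the helper model produces a confidence score between 0 and 1 for each row: scores at or above `pos_cascade_threshold` are accepted by the helper, scores at or below `neg_cascade_threshold` are rejected by the helper, and anything in between is sent to the large model. The function `route`, the sample scores, and the inclusive boundaries are all assumptions for illustration, not the library's implementation.

```python
def route(score, pos_threshold, neg_threshold):
    """Decide which model resolves a row, given the helper's confidence score."""
    if score >= pos_threshold:
        return "helper:pass"   # helper is confident the predicate holds
    if score <= neg_threshold:
        return "helper:fail"   # helper is confident the predicate does not hold
    return "large"             # uncertain band: defer to the large model


# Thresholds taken from the statistics above; the scores are made up.
pos_threshold, neg_threshold = 0.62, 0.58
scores = [0.95, 0.60, 0.10, 0.59, 0.70]

decisions = [route(s, pos_threshold, neg_threshold) for s in scores]
resolved_by_helper = sum(d.startswith("helper") for d in decisions)
resolved_by_large = decisions.count("large")
print(decisions, resolved_by_helper, resolved_by_large)
```

Under this reading, a narrow gap between the two thresholds (0.58 to 0.62 here) means few rows fall in the uncertain band, which fits with only 2 of the 103 rows being resolved by the large model.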
Required Parameters
user_instruction : The user instruction for filtering.
Optional Parameters
return_raw_outputs : Whether to return raw outputs. Defaults to False.
default : The default value for filtering in case of parsing errors. Defaults to True.
suffix : The suffix for the new columns. Defaults to "_filter".
examples : The examples dataframe. Defaults to None.
helper_examples : The helper examples dataframe. Defaults to None.
strategy : The reasoning strategy. Defaults to None.
cascade_args : The arguments for the cascade. Defaults to None. Its fields are:
- recall_target : The target recall. Defaults to None.
- precision_target : The target precision when cascading. Defaults to None.
- sampling_percentage : The percentage of the data to sample when cascading. Defaults to 0.1.
- failure_probability : The failure probability when cascading. Defaults to 0.2.
return_stats : Whether to return statistics. Defaults to False.
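The `default` parameter matters when a model response cannot be read as a clear yes or no. The sketch below illustrates that fallback with a hypothetical parser, `parse_filter_output`. The parsing rule (a case-insensitive check for a leading "true" or "false") is an assumption for illustration and not necessarily how LOTUS parses responses.

```python
def parse_filter_output(raw, default=True):
    """Map a raw model response to a boolean, falling back to `default`."""
    text = raw.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    # Unparseable response: use the configured default.
    return default


responses = ["True", "false.", "I am not sure"]
kept_default_true = [parse_filter_output(r) for r in responses]
kept_default_false = [parse_filter_output(r, default=False) for r in responses]
print(kept_default_true)
print(kept_default_false)
```

With `default=True`, rows with unparseable responses are kept, which favors recall. Setting `default=False` drops them instead, which favors precision.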