docetl2/docs/python/examples.md

212 lines
6.3 KiB
Markdown

# Python API Examples
This document provides two examples of how to use DocETL's Python API. Each example demonstrates a single pipeline step with multiple operations.
## Example 1: Extract and Summarize Product Review Themes
This example extracts themes and quotes from product reviews, then summarizes the top themes across ALL documents using the special `_all` reduce key:
```python
from docetl.api import Pipeline, Dataset, MapOp, ReduceOp, PipelineStep, PipelineOutput
# Define dataset - A CSV of product reviews
dataset = Dataset(
type="file",
path="product_reviews.csv" # Contains columns: review_id, product_id, rating, review_text
)
# Define operations
operations = [
# Extract themes and quotes from each review
MapOp(
name="extract_themes",
type="map",
prompt="""
Analyze this product review and extract the key themes and representative quotes:
Review: {{ input.review_text }}
Rating: {{ input.rating }}
Identify 2-3 major themes (e.g., usability, quality, value) and extract direct quotes that best represent each theme.
""",
output={
"schema": {
"themes": "list[string]",
"quotes": "list[string]",
"sentiment": "string"
}
}
),
# Summarize all themes across reviews using _all key
ReduceOp(
name="summarize_themes",
type="reduce",
reduce_key="_all", # Special key to reduce across all items
prompt="""
Analyze and synthesize the themes and quotes from these product reviews:
{% for item in inputs %}
Review ID: {{ item.review_id }}
Product: {{ item.product_id }}
Rating: {{ item.rating }}
Themes: {{ item.themes | join(", ") }}
Quotes:
{% for quote in item.quotes %}
- "{{ quote }}"
{% endfor %}
Sentiment: {{ item.sentiment }}
{% endfor %}
Summarize the most frequent themes, the most representative quotes for each theme, and the overall sentiment.
""",
output={
"schema": {
"summary": "string"
}
}
)
]
# Define pipeline step (can consist of multiple operations)
step = PipelineStep(
name="review_analysis",
input="product_reviews",
operations=["extract_themes", "summarize_themes"]
)
# Define output
output = PipelineOutput(type="file", path="review_analysis_summary.json")
# Create and run pipeline
pipeline = Pipeline(
name="review_analysis_pipeline",
datasets={"product_reviews": dataset},
operations=operations,
steps=[step],
output=output,
default_model="gpt-4o-mini"
)
# Run the pipeline
cost = pipeline.run()
print(f"Pipeline execution completed. Total cost: ${cost:.2f}")
```
## Example 2: Map-Unnest-Resolve-Reduce on Theme Keys
This example extracts theme-quote pairs from reviews, unnests them, resolves similar themes, and then reduces on the theme key. Here we will also optimize the pipeline.
```python
from docetl.api import Pipeline, Dataset, MapOp, UnnestOp, ResolveOp, ReduceOp, PipelineStep, PipelineOutput
# Define dataset - A JSON file with product reviews
dataset = Dataset(
type="file",
path="product_reviews.csv" # Same csv as previously
)
# Define operations
operations = [
# Extract theme-quote pairs from each review
MapOp(
name="extract_theme_quotes",
type="map",
prompt="""
Extract theme and quote pairs from this product review:
Review: {{ input.review_text }}
Product: {{ input.product_name }}
Rating: {{ input.rating }}
For each distinct theme in the review, extract a direct quote that best represents that theme.
Return each theme and its representative quote as a separate object in the "theme_quotes" array.
""",
output={
"schema": {
"theme_quotes": "array" # Array of objects with theme and quote properties
}
}
),
# Unnest to create separate items for each theme-quote pair
UnnestOp(
name="unnest_theme_quotes",
type="unnest",
array_path="theme_quotes"
),
# Resolve similar themes using fuzzy matching
ResolveOp(
name="resolve_themes",
type="resolve",
comparison_prompt="""
Determine if these two themes are the same or closely related:
Theme 1: {{ input1.theme }}
Theme 2: {{ input2.theme }}
Consider semantic similarity, synonyms, and conceptual overlap.
""",
resolution_prompt="""
Given the following list of similar themes, determine a canonical name that best represents all of them:
{% for item in inputs %}
Theme: {{ item.theme }}
{% endfor %}
Choose a clear, concise name that accurately captures the core concept shared across all these related themes.
"""
),
# Reduce by theme to aggregate quotes and insights
ReduceOp(
name="aggregate_by_theme",
type="reduce",
reduce_key="theme",
prompt="""
Analyze all quotes related to the theme "{{ reduce_key }}":
{% for item in inputs %}
Product: {{ item.product_name }}
Rating: {{ item.rating }}
Quote: "{{ item.quote }}"
{% endfor %}
Summarize the key insights about this theme across all products and ratings.
""",
output={
"schema": {
"summary": "string"
}
}
)
]
# Define pipeline with a single step
step = PipelineStep(
name="theme_analysis",
input="product_reviews",
operations=["extract_theme_quotes", "unnest_theme_quotes", "resolve_themes", "aggregate_by_theme"]
)
# Define output
output = PipelineOutput(type="file", path="theme_analysis_results.json")
# Create the pipeline
pipeline = Pipeline(
name="theme_analysis_pipeline",
datasets={"product_reviews": dataset},
operations=operations,
steps=[step],
output=output,
default_model="gpt-4o"
)
# Optimize the pipeline before running
optimized_pipeline = pipeline.optimize()
# Run the optimized pipeline
cost = optimized_pipeline.run()
print(f"Pipeline execution completed. Total cost: ${cost:.2f}")
```
Note that datasets can be json or CSV.