Compare commits

...

3 Commits

Author SHA1 Message Date
Shreya Shankar d1bd0a000e get ready to bump up version 2025-12-27 22:22:02 -06:00
Shreya Shankar 366a118458 feat: add claude-code skill 2025-12-27 22:16:55 -06:00
Shreya Shankar 9b6bd41508 feat: add claude-code skill 2025-12-27 22:01:07 -06:00
10 changed files with 1131 additions and 142 deletions

View File

@@ -0,0 +1,647 @@
---
name: docetl
description: Build and run LLM-powered data processing pipelines with DocETL. Use when users say "docetl", want to analyze unstructured data, process documents, extract information, or run ETL tasks on text. Helps with data collection, pipeline creation, execution, and optimization.
---
# DocETL Pipeline Development
DocETL is a system for creating LLM-powered data processing pipelines. This skill helps you build end-to-end pipelines: from data preparation to execution and optimization.
## Workflow Overview
1. **Understand the task** - What data? What processing?
2. **Prepare data** - Transform into DocETL dataset (JSON/CSV)
3. **Read and understand the data** - Examine actual documents to write specific prompts
4. **Author pipeline** - Create YAML configuration with detailed, data-specific prompts
5. **Verify environment** - Check API keys in `.env`
6. **Execute pipeline** - Run with `docetl run`
7. **Optimize (optional)** - Use MOAR optimizer for cost/accuracy tradeoffs
## Step 1: Data Preparation
DocETL datasets must be **JSON arrays** or **CSV files**.
### JSON Format
```json
[
{"id": 1, "text": "First document content...", "metadata": "value"},
{"id": 2, "text": "Second document content...", "metadata": "value"}
]
```
### CSV Format
```csv
id,text,metadata
1,"First document content...","value"
2,"Second document content...","value"
```
### Data Collection Scripts
If user needs to collect data, write a Python script:
```python
import json

# Collect/transform data ("sources" is a placeholder for whatever you iterate over:
# files, API results, database rows, ...)
documents = []
for source in sources:
    documents.append({
        "id": source.id,
        "text": source.content,  # DO NOT truncate text
        # Add any other relevant fields
    })

# Save as DocETL dataset
with open("dataset.json", "w") as f:
    json.dump(documents, f, indent=2)
```
**Important:** Never truncate document text in collection scripts. DocETL operations like `split` handle long documents properly. Truncation loses information.
## Step 2: Read and Understand the Data
**CRITICAL**: Before writing any prompts, READ the actual input data to understand:
- The structure and format of documents
- The vocabulary and terminology used
- What information is present vs. absent
- Edge cases and variations
```python
import json
with open("dataset.json") as f:
data = json.load(f)
# Examine several examples
for doc in data[:5]:
print(doc)
```
This understanding is essential for writing specific, effective prompts.
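A quick profile of field names and text lengths also helps you pick which fields to reference in prompts and decide whether a `split` operation is needed. A minimal sketch using only the standard library (adjust the `"text"` field name to match your data):
```python
import json
from collections import Counter

with open("dataset.json") as f:
    data = json.load(f)

# Which fields appear across documents, and how often
field_counts = Counter(key for doc in data for key in doc)
print("Fields:", dict(field_counts))

# Rough length stats for the main text field
lengths = [len(str(doc.get("text", ""))) for doc in data]
print(f"{len(data)} docs; text length min/avg/max: "
      f"{min(lengths)}/{sum(lengths) // len(lengths)}/{max(lengths)} chars")
```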
## Step 3: Pipeline Structure
Create a YAML file with this structure:
```yaml
default_model: gpt-5-nano
system_prompt:
dataset_description: <describe the data based on what you observed>
persona: <role for the LLM to adopt>
datasets:
input_data:
type: file
path: "dataset.json" # or dataset.csv
operations:
- name: <operation_name>
type: <operation_type>
prompt: |
<Detailed, specific prompt based on the actual data>
output:
schema:
<field_name>: <type>
pipeline:
steps:
- name: process
input: input_data
operations:
- <operation_name>
output:
type: file
path: "output.json"
intermediate_dir: "intermediates" # ALWAYS set this for debugging
```
### Key Configuration
- **default_model**: Use `gpt-5-nano` unless user specifies otherwise
- **intermediate_dir**: Always set to log intermediate results
- **system_prompt**: Describe the data based on what you actually observed
## Step 4: Writing Effective Prompts
**Prompts must be specific to the data, not generic.** After reading the input data, compare these two approaches:
### Bad (Generic) Prompt
```yaml
prompt: |
Extract key information from this document.
{{ input.text }}
```
### Good (Specific) Prompt
```yaml
prompt: |
You are analyzing a medical transcript from a doctor-patient visit.
The transcript follows this format:
- Doctor statements are prefixed with "DR:"
- Patient statements are prefixed with "PT:"
- Timestamps appear in brackets like [00:05:23]
From the following transcript, extract:
1. All medications mentioned (brand names or generic)
2. Dosages if specified
3. Patient-reported side effects or concerns
Transcript:
{{ input.transcript }}
Be thorough - patients often mention medication names informally.
If a medication is unclear, include it with a note.
```
### Prompt Writing Guidelines
1. **Describe the data format** you observed
2. **Be specific about what to extract** - list exact fields
3. **Mention edge cases** you noticed in the data
4. **Provide examples** if the task is ambiguous
5. **Set expectations** for handling missing/unclear information
## Step 5: Choosing Operations
Many tasks only need a **single map operation**. Use good judgement:
| Task | Recommended Approach |
|------|---------------------|
| Extract info from each doc | Single `map` |
| Multiple extractions | Multiple `map` operations chained |
| Extract then summarize | `map` → `reduce` |
| Filter then process | `filter` → `map` |
| Split long docs | `split` → `map` → `reduce` |
| Deduplicate entities | `map` → `unnest` → `resolve` |
## Operation Reference
### Map Operation
Applies an LLM transformation to each document independently.
```yaml
- name: extract_info
type: map
prompt: |
Analyze this document:
{{ input.text }}
Extract the main topic and 3 key points.
output:
schema:
topic: string
key_points: list[string]
model: gpt-5-nano # optional, uses default_model if not set
skip_on_error: true # recommended for large-scale runs
validate: # optional
- len(output["key_points"]) == 3
num_retries_on_validate_failure: 2 # optional
```
**Key parameters:**
- `prompt`: Jinja2 template, use `{{ input.field }}` to reference fields
- `output.schema`: Define output structure
- `skip_on_error`: Set `true` to continue on LLM errors (recommended at scale)
- `validate`: Python expressions to validate output
- `sample`: Process only N documents (for testing)
- `limit`: Stop after producing N outputs
### Filter Operation
Keeps or removes documents based on LLM criteria. Output schema must have exactly one boolean field.
```yaml
- name: filter_relevant
type: filter
skip_on_error: true
prompt: |
Document: {{ input.text }}
Is this document relevant to climate change?
Respond true or false.
output:
schema:
is_relevant: boolean
```
### Reduce Operation
Aggregates documents by a key using an LLM.
**Always include `fold_prompt` and `fold_batch_size`** for reduce operations. This handles cases where the group is too large to fit in context.
```yaml
- name: summarize_by_category
type: reduce
reduce_key: category # use "_all" to aggregate everything
skip_on_error: true
prompt: |
Summarize these {{ inputs | length }} items for category "{{ inputs[0].category }}":
{% for item in inputs %}
- {{ item.title }}: {{ item.description }}
{% endfor %}
fold_prompt: |
You have an existing summary and new items to incorporate.
Existing summary:
{{ output.summary }}
New items to add:
{% for item in inputs %}
- {{ item.title }}: {{ item.description }}
{% endfor %}
Update the summary to include the new information.
fold_batch_size: 10 # Estimate based on doc size and model context window
output:
schema:
summary: string
validate:
- len(output["summary"].strip()) > 0
num_retries_on_validate_failure: 2
```
**Estimating `fold_batch_size`** (a rough estimation sketch follows this list):
- Consider document size and model context window
- For gpt-4o-mini (128k context): ~50-100 small docs, ~10-20 medium docs
- For gpt-4o (128k context): similar to gpt-4o-mini
- Use WebSearch to look up context window sizes for unfamiliar models
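As a back-of-envelope check (assumptions: the average document size you measured during data exploration, and roughly 4 characters per token), you can estimate a starting value like this:
```python
# Rough fold_batch_size estimate -- every number here is an assumption to replace
context_window = 128_000              # tokens available for the chosen model
avg_doc_chars = 4_000                 # measured from your dataset
avg_doc_tokens = avg_doc_chars / 4    # ~4 characters per token is a common rule of thumb

# Reserve roughly half the window for the prompt, the running summary, and the output
usable_tokens = context_window * 0.5
fold_batch_size = max(1, int(usable_tokens // avg_doc_tokens))
print(fold_batch_size)                # 64 with these assumed numbers
```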
**Key parameters:**
- `reduce_key`: Field to group by (or list of fields, or `_all`)
- `fold_prompt`: Template for incrementally adding items to existing output (required)
- `fold_batch_size`: Number of items per fold iteration (required)
- `associative`: Set to `false` if order matters
### Split Operation
Divides long text into smaller chunks. No LLM call.
```yaml
- name: split_document
type: split
split_key: content
method: token_count # or "delimiter"
method_kwargs:
num_tokens: 500
model: gpt-5-nano
```
**Output adds:**
- `{split_key}_chunk`: The chunk content
- `{op_name}_id`: Original document ID
- `{op_name}_chunk_num`: Chunk number
### Unnest Operation
Flattens list fields into separate rows. No LLM call.
```yaml
- name: unnest_items
type: unnest
unnest_key: items # field containing the list
keep_empty: false # optional
```
**Example:** If a document has `items: ["a", "b", "c"]`, unnest creates 3 documents, each with `items: "a"`, `items: "b"`, `items: "c"`.
### Resolve Operation
Deduplicates and canonicalizes entities. Uses pairwise comparison.
```yaml
- name: dedupe_names
type: resolve
optimize: true # let optimizer find blocking rules
skip_on_error: true
comparison_prompt: |
Are these the same person?
Person 1: {{ input1.name }} ({{ input1.email }})
Person 2: {{ input2.name }} ({{ input2.email }})
Respond true or false.
resolution_prompt: |
Standardize this person's name:
{% for entry in inputs %}
- {{ entry.name }}
{% endfor %}
Return the canonical name.
output:
schema:
name: string
```
**Important:** Set `optimize: true` and run `docetl build` to generate efficient blocking rules. Without blocking, this is O(n²).
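To see why blocking matters, here is the worked arithmetic for a modest dataset (the per-comparison cost is an assumed illustrative figure):
```python
# Without blocking, resolve compares every pair of entities: n * (n - 1) / 2
n = 10_000
pairs = n * (n - 1) // 2
print(pairs)  # 49,995,000 pairwise LLM comparisons

# At an assumed $0.0001 per comparison, that is roughly $5,000
print(f"~${pairs * 0.0001:,.0f}")
```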
### Code Operations
Deterministic Python transformations without LLM calls.
**code_map:**
```yaml
- name: compute_stats
type: code_map
code: |
def transform(doc) -> dict:
return {
"word_count": len(doc["text"].split()),
"char_count": len(doc["text"])
}
```
**code_reduce:**
```yaml
- name: aggregate
type: code_reduce
reduce_key: category
code: |
def transform(items) -> dict:
total = sum(item["value"] for item in items)
return {"total": total, "count": len(items)}
```
**code_filter:**
```yaml
- name: filter_long
type: code_filter
code: |
def transform(doc) -> bool:
return len(doc["text"]) > 100
```
### Retrievers (LanceDB)
Augment LLM operations with retrieved context from a LanceDB index. Useful for:
- Finding related documents to compare against
- Providing additional context for extraction/classification
- Cross-referencing facts across a dataset
**Define a retriever:**
```yaml
retrievers:
facts_index:
type: lancedb
dataset: extracted_facts # dataset to index
index_dir: workloads/wiki/lance_index
build_index: if_missing # if_missing | always | never
index_types: ["fts", "embedding"] # or "hybrid"
fts:
index_phrase: "{{ input.fact }}: {{ input.source }}"
query_phrase: "{{ input.fact }}"
embedding:
model: openai/text-embedding-3-small
index_phrase: "{{ input.fact }}"
query_phrase: "{{ input.fact }}"
query:
mode: hybrid
top_k: 5
```
**Use in operations:**
```yaml
- name: find_conflicts
type: map
retriever: facts_index
prompt: |
Check if this fact conflicts with any retrieved facts:
Current fact: {{ input.fact }} (from {{ input.source }})
Related facts from other articles:
{{ retrieval_context }}
Return whether there's a genuine conflict.
output:
schema:
has_conflict: boolean
```
**Key points:**
- `{{ retrieval_context }}` is injected into prompts automatically
- Index is built on first use (when `build_index: if_missing`)
- Supports full-text (`fts`), vector (`embedding`), or `hybrid` search
- Use `save_retriever_output: true` to debug what was retrieved
- **Can index intermediate outputs**: Retriever can index the output of a previous pipeline step, enabling patterns like "extract facts → index facts → retrieve similar facts for each"
## Documentation Reference
For detailed parameters, advanced features, and more examples, read the docs:
- **Operations**: `docs/operators/` folder (map.md, reduce.md, filter.md, etc.)
- **Concepts**: `docs/concepts/` folder (pipelines.md, operators.md, schemas.md)
- **Examples**: `docs/examples/` folder
- **Optimization**: `docs/optimization/` folder
## Step 6: Environment Setup
Before running, verify API keys exist:
```bash
# Check for .env file
cat .env
```
Required keys depend on the model:
- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Google: `GEMINI_API_KEY`
If missing, prompt user to create `.env`:
```
OPENAI_API_KEY=sk-...
```
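To check programmatically rather than by eye, a minimal sketch (assuming the `python-dotenv` package is available; adjust the key names to the provider in use):
```python
import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # load variables from .env into the environment

required = ["OPENAI_API_KEY"]  # adjust per provider (e.g. ANTHROPIC_API_KEY, GEMINI_API_KEY)
missing = [key for key in required if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing keys in .env: {', '.join(missing)}")
print("All required API keys found.")
```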
## Step 7: Execution
**Always ask user before running** - LLM calls cost money.
```bash
docetl run pipeline.yaml
```
Options:
- `--max_threads N` - Control parallelism
Check intermediate results in the `intermediate_dir` folder to debug.
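The exact file layout under `intermediate_dir` depends on the pipeline, but a generic sketch for spot-checking whatever JSON files were written is:
```python
import json
from pathlib import Path

# Assumption: intermediate results are written as JSON files under intermediate_dir
for path in sorted(Path("intermediates").rglob("*.json")):
    with open(path) as f:
        records = json.load(f)
    count = len(records) if isinstance(records, list) else type(records).__name__
    print(path, "-", count)
    if isinstance(records, list) and records:
        # Print one record per file to spot-check that step's output
        print(json.dumps(records[0], indent=2)[:500])
```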
## Step 8: Optimization (Optional)
Use MOAR optimizer to find the Pareto frontier of **cost vs. accuracy** tradeoffs. MOAR experiments with different pipeline rewrites and models to find optimal configurations.
Add to pipeline YAML:
```yaml
optimizer_config:
type: moar
save_dir: ./optimization_results
available_models:
- gpt-5-nano
- gpt-4o-mini
- gpt-4o
evaluation_file: evaluate.py # User must provide
metric_key: score
max_iterations: 20
model: gpt-5-nano
```
Create evaluation file (`evaluate.py`):
```python
def evaluate(outputs: list[dict]) -> dict:
    # Score the outputs (0-1 scale recommended);
    # is_correct is a task-specific check you must define in this file
    if not outputs:
        return {"score": 0.0}
    correct = sum(1 for o in outputs if is_correct(o))
    return {"score": correct / len(outputs)}
```
Run optimization:
```bash
docetl build pipeline.yaml --optimizer moar
```
MOAR will produce multiple pipeline variants on the Pareto frontier - user can choose based on their cost/accuracy preferences.
## Output Schemas
**Keep schemas minimal and simple.** Default to 1-3 output fields per operation, and add more only if the user explicitly asks for them.
**Nesting limit:** Maximum 2 levels deep (e.g., `list[{field: str}]` is allowed, but no deeper).
```yaml
# Good - minimal, focused on the core task
output:
schema:
summary: string
# Good - a few fields when task requires it
output:
schema:
topic: string
keywords: list[string]
# Acceptable - 2 levels of nesting (list of objects)
output:
schema:
items: "list[{name: str, value: int}]"
# Bad - too many fields (unless user explicitly requested all of these)
output:
schema:
conflicts_found: bool
num_conflicts: int
conflicts: "list[{claim_a: str, source_a: str, claim_b: str, source_b: str}]"
analysis_summary: str
# Bad - more than 2 levels of nesting (not supported)
output:
schema:
data: "list[{nested: {too: {deep: str}}}]"
```
**Guidelines:**
- Start with the minimum fields needed to answer the user's question
- Avoid complex nested objects unless explicitly requested
- If you need structured data, prefer multiple simple operations over one complex schema
- Complex schemas increase LLM failures and costs
Supported types: `string`, `int`, `float`, `bool`, `list[type]`, `enum`
## Validation
**Always add validation to LLM-powered operations** (map, reduce, filter, resolve). Validation catches malformed outputs and retries automatically.
```yaml
- name: extract_keywords
type: map
prompt: |
Extract 3-5 keywords from: {{ input.text }}
output:
schema:
keywords: list[string]
validate:
- len(output["keywords"]) >= 3
- len(output["keywords"]) <= 5
num_retries_on_validate_failure: 2
```
Common validation patterns:
```yaml
# List length constraints
- len(output["items"]) >= 1
- len(output["items"]) <= 10
# Enum/allowed values
- output["sentiment"] in ["positive", "negative", "neutral"]
# String not empty
- len(output["summary"].strip()) > 0
# Numeric ranges
- output["score"] >= 0
- output["score"] <= 100
```
## Jinja2 Templating
For map operations, use `input`:
```yaml
prompt: |
Document: {{ input.text }}
{% if input.metadata %}
Context: {{ input.metadata }}
{% endif %}
```
For reduce operations, use `inputs` (list):
```yaml
prompt: |
Summarize these {{ inputs | length }} items:
{% for item in inputs %}
- {{ item.summary }}
{% endfor %}
```
## Troubleshooting
### Pipeline won't run
- Check `.env` has correct API keys
- Verify dataset file exists and is valid JSON/CSV
- Check YAML syntax
### Bad outputs
- Read more input data examples to improve prompt specificity
- Add `validate` rules with retries
- Simplify output schema
- Add concrete examples to prompt
### High costs
- Use `gpt-5-nano` or `gpt-4o-mini`
- Add `sample: 10` to test on subset first
- Run MOAR optimizer to find cost-efficient rewrites
### Check intermediate results
Look in `intermediate_dir` folder to debug each step.
## Quick Reference
```bash
# Run pipeline
docetl run pipeline.yaml
# Run with more parallelism
docetl run pipeline.yaml --max_threads 16
# Optimize pipeline (cost/accuracy tradeoff)
docetl build pipeline.yaml --optimizer moar
# Clear LLM cache
docetl clear-cache
# Check version
docetl version
```

View File

@@ -1,4 +1,4 @@
__version__ = "0.2.5"
__version__ = "0.2.6"
import warnings

View File

@@ -244,5 +244,104 @@ def version():
typer.echo(f"DocETL version: {docetl.__version__}")
@app.command("install-skill")
def install_skill(
uninstall: bool = typer.Option(
False, "--uninstall", "-u", help="Remove the installed skill instead"
),
):
"""
Install the DocETL Claude Code skill to your personal skills directory.
This makes the DocETL skill available in Claude Code for any project.
The skill helps you build and run DocETL pipelines.
"""
import shutil
# Find the skill source - try multiple locations
# 1. Installed package location (via importlib.resources)
# 2. Development location (relative to this file)
skill_source = None
# Try to find via package resources first
try:
import importlib.resources as pkg_resources
# For Python 3.9+, use files()
try:
package_root = Path(pkg_resources.files("docetl")).parent
potential_source = package_root / ".claude" / "skills" / "docetl"
if potential_source.exists():
skill_source = potential_source
except (TypeError, AttributeError):
pass
except ImportError:
pass
# Fallback: try relative to this file (development mode)
if skill_source is None:
dev_source = Path(__file__).parent.parent / ".claude" / "skills" / "docetl"
if dev_source.exists():
skill_source = dev_source
if skill_source is None or not skill_source.exists():
console.print(
Panel(
"[bold red]Error:[/bold red] Could not find the DocETL skill files.\n\n"
"This may happen if the package was not installed correctly.\n"
"Try reinstalling: [bold]pip install --force-reinstall docetl[/bold]",
title="[bold red]Skill Not Found[/bold red]",
border_style="red",
)
)
raise typer.Exit(1)
# Target directory
skill_target = Path.home() / ".claude" / "skills" / "docetl"
if uninstall:
if skill_target.exists():
shutil.rmtree(skill_target)
console.print(
Panel(
f"[bold green]Success![/bold green] DocETL skill removed from:\n"
f"[dim]{skill_target}[/dim]",
title="[bold green]Skill Uninstalled[/bold green]",
border_style="green",
)
)
else:
console.print(
Panel(
"[yellow]The DocETL skill is not currently installed.[/yellow]",
title="[yellow]Nothing to Uninstall[/yellow]",
border_style="yellow",
)
)
return
# Create parent directories if needed
skill_target.parent.mkdir(parents=True, exist_ok=True)
# Copy the skill
if skill_target.exists():
shutil.rmtree(skill_target)
shutil.copytree(skill_source, skill_target)
console.print(
Panel(
f"[bold green]Success![/bold green] DocETL skill installed to:\n"
f"[dim]{skill_target}[/dim]\n\n"
"[bold]Next steps:[/bold]\n"
"1. Restart Claude Code if it's running\n"
"2. The skill will automatically activate when you work on DocETL tasks\n\n"
"[dim]To uninstall: docetl install-skill --uninstall[/dim]",
title="[bold green]Skill Installed[/bold green]",
border_style="green",
)
)
if __name__ == "__main__":
app()

View File

@@ -34,6 +34,9 @@ To get started with DocETL:
2. Define your pipeline in a YAML file. Want to use an LLM like ChatGPT or Claude to help you write your pipeline? See [docetl.org/llms.txt](https://docetl.org/llms.txt) for a big prompt you can copy paste into ChatGPT or Claude, before describing your task.
3. Run your pipeline using the DocETL command-line interface
!!! tip "Fastest Way: Claude Code"
Clone this repo and run `claude` to use the built-in DocETL skill. Just describe your data processing task and Claude will create and run the pipeline for you. See [Quick Start (Claude Code)](quickstart-claude-code.md) for details.
## 🏛️ Project Origin
DocETL was created by members of the EPIC Data Lab and Data Systems and Foundations group at UC Berkeley. The EPIC (Effective Programming, Interaction, and Computation with Data) Lab focuses on developing low-code and no-code interfaces for data work, powered by next-generation predictive programming techniques. DocETL is one of the projects that emerged from our research efforts to streamline complex document processing tasks.

View File

@@ -55,14 +55,6 @@ If you want to use only the parsing extra:
uv sync --extra parsing
```
If you want to use the parsing tools, you need to install the `parsing` extra:
```bash
poetry install --extras "parsing"
```
This will create a virtual environment and install all the required dependencies.
4. Set up your OpenAI API key:
Create a .env file in the project root and add your OpenAI API key:

View File

@@ -0,0 +1,45 @@
# Quick Start with Claude Code
The fastest way to build DocETL pipelines is with [Claude Code](https://claude.ai/download), Anthropic's agentic coding tool. DocETL includes a built-in Claude Code skill that helps you create, run, and debug pipelines interactively.
## Option 1: Clone the Repository (Recommended)
This gives you the full development environment with the skill already configured.
1. Follow the [Installation from Source](installation.md#installation-from-source) instructions
2. Run `claude` in the repository directory
The skill is located at `.claude/skills/docetl/SKILL.md`.
## Option 2: Install via pip
If you already have DocETL installed via pip, you can install the skill separately:
```bash
pip install docetl
docetl install-skill
```
This copies the skill to `~/.claude/skills/docetl/`. Then run `claude` in any directory.
To uninstall: `docetl install-skill --uninstall`
## Usage
Simply describe what you want to do with your data. The skill activates automatically when you mention "docetl" or describe unstructured data processing tasks:
```
> I have a folder of customer support tickets in JSON format.
> I want to extract the main issue, sentiment, and suggested resolution for each.
```
Claude will:
1. **Read your data** to understand its structure
2. **Write a tailored pipeline** with prompts specific to your documents
3. **Run the pipeline** and show you the results
4. **Debug issues** if any operations fail
## Alternative: Manual Pipeline Authoring
If you prefer not to use Claude Code, see the [Quick Start Tutorial](tutorial.md) for writing pipelines by hand.

View File

@@ -1,29 +1,33 @@
## Retrievers (LanceDB OSS)
Retrievers let you augment LLM operations with retrieved context from a LanceDB index built over one of your DocETL datasets. You define retrievers once at the top-level, then attach them to any LLM-powered operation using `retriever: <name>`. At runtime, DocETL performs fulltext, vector, or hybrid search and injects the results into your prompt as `{{ retrieval_context }}`.
Retrievers let you augment LLM operations with retrieved context from a LanceDB index built over a DocETL dataset. You define retrievers once at the top-level, then attach them to any LLM-powered operation using `retriever: <name>`. At runtime, DocETL performs full-text, vector, or hybrid search and injects the results into your prompt as `{{ retrieval_context }}`.
LanceDB supports built-in full-text search, vector search, and hybrid with RRF reranking. See the official docs: [LanceDB Hybrid Search docs](https://lancedb.com/docs/search/hybrid-search/).
### Key points
- Always OSS LanceDB (local `index_dir`).
- A retriever references an existing dataset from the pipeline config.
- Operations do not override retriever settings. One source of truth = consistency.
- `{{ retrieval_context }}` is available to your prompt; if not used, DocETL prepends a short “extra context” section automatically.
## Configuration (clear separation of index vs query)
- Always OSS LanceDB (local `index_dir`).
- A retriever references a dataset from the pipeline config, or the output of a previous pipeline step.
- Operations do not override retriever settings. One source of truth = consistency.
- `{{ retrieval_context }}` is available to your prompt; if not used, DocETL prepends a short "extra context" section automatically.
## Configuration
Add a top-level `retrievers` section. Each retriever has:
- `dataset`: dataset name to index
- `dataset`: dataset name to index (can be a dataset or output of a previous pipeline step)
- `index_dir`: LanceDB path
- `index_types`: which indexes to build: `fts`, `embedding`, or `hybrid` (interpreted as both `fts` and `embedding`)
- `fts.index_phrase`: Jinja for how to index each dataset row for FTS (context: `input`)
- `fts.query_phrase`: Jinja for how to build the FTS query (context: operation context)
- `embedding.model`: embedding model used for the vector index and for query vectors
- `embedding.index_phrase`: Jinja for how to index each dataset row for embedding (context: `input`)
- `embedding.query_phrase`: Jinja for how to build the embedding query text (context: operation context)
- `index_types`: which indexes to build: `fts`, `embedding`, or `hybrid` (both)
- `fts.index_phrase`: Jinja template for indexing each row for full-text search
- `fts.query_phrase`: Jinja template for building the FTS query at runtime
- `embedding.model`: embedding model for vector index and queries
- `embedding.index_phrase`: Jinja template for indexing each row for embeddings
- `embedding.query_phrase`: Jinja template for building the embedding query
- `query.mode`: `fts` | `embedding` | `hybrid` (defaults to `hybrid` when both indexes exist)
- `query.top_k`: number of results to retrieve
### Basic example
```yaml
datasets:
transcripts:
@@ -37,158 +41,356 @@ retrievers:
type: lancedb
dataset: transcripts
index_dir: workloads/medical/lance_index
build_index: if_missing # if_missing | always | never
index_types: ["fts", "embedding"] # or "hybrid"
build_index: if_missing # if_missing | always | never
index_types: ["fts", "embedding"]
fts:
# How to index each row (context: input == dataset row)
index_phrase: >
{{ input.src }}
# How to build the query (map/filter/extract context: input; reduce: reduce_key & inputs)
query_phrase: >
{{ input.get("src","")[:1000] if input else "" }}
index_phrase: "{{ input.src }}"
query_phrase: "{{ input.src[:1000] }}"
embedding:
model: openai/text-embedding-3-small
# How to index each row for embedding (context: input == dataset row)
index_phrase: >
{{ input.src }}
# How to build the query text to embed (op context)
query_phrase: >
{{ input.get("src","")[:1000] if input else "" }}
index_phrase: "{{ input.src }}"
query_phrase: "{{ input.src[:1000] }}"
query:
mode: hybrid
top_k: 8
```
Notes:
- Index build is automatic the first time a retriever is used (when `build_index: if_missing`).
- `fts.index_phrase` and `embedding.index_phrase` are evaluated with `input` for each dataset record (here `input` is the dataset row).
- `fts.query_phrase` and `embedding.query_phrase` are evaluated with the operation context.
## Multi-step pipelines with retrieval
Most pipelines have a single step, but you can define multiple steps where **the output of one step becomes the input (and retriever source) for the next**. This is powerful for patterns like:
1. Extract structured data from documents
2. Build a retrieval index on that extracted data
3. Use retrieval to find related items and process them
### Example: Extract facts, then find conflicts
```yaml
datasets:
articles:
type: file
path: workloads/wiki/articles.json
default_model: gpt-4o-mini
# Retriever indexes output of step 1 (extract_facts_step)
retrievers:
facts_index:
type: lancedb
dataset: extract_facts_step # References output of a pipeline step!
index_dir: workloads/wiki/facts_lance_index
build_index: if_missing
index_types: ["fts", "embedding"]
fts:
index_phrase: "{{ input.fact }} from {{ input.title }}"
query_phrase: "{{ input.fact }}"
embedding:
model: openai/text-embedding-3-small
index_phrase: "{{ input.fact }}"
query_phrase: "{{ input.fact }}"
query:
mode: hybrid
top_k: 5
operations:
- name: extract_facts
type: map
prompt: |
Extract factual claims from this article.
Article: {{ input.title }}
Text: {{ input.text }}
output:
schema:
facts: list[string]
- name: unnest_facts
type: unnest
unnest_key: facts
- name: find_conflicts
type: map
retriever: facts_index # Uses the retriever
prompt: |
Check if this fact conflicts with similar facts from other articles.
Current fact: {{ input.facts }} (from {{ input.title }})
Similar facts from other articles:
{{ retrieval_context }}
Return true only if there's a genuine contradiction.
output:
schema:
has_conflict: boolean
pipeline:
steps:
# Step 1: Extract and unnest facts
- name: extract_facts_step
input: articles
operations:
- extract_facts
- unnest_facts
# Step 2: Use retrieval to find conflicts
- name: find_conflicts_step
input: extract_facts_step # Input is output of step 1
operations:
- find_conflicts
output:
type: file
path: workloads/wiki/conflicts.json
intermediate_dir: workloads/wiki/intermediates
```
In this example:
- **Step 1** (`extract_facts_step`) extracts facts from articles
- The **retriever** (`facts_index`) indexes the output of step 1
- **Step 2** (`find_conflicts_step`) processes each fact, using retrieval to find similar facts from other articles
## Configuration reference
Top-level (retrievers.<name>):
### Minimal example
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | yes | - | Must be `lancedb`. |
| dataset | string | yes | - | Name of an existing dataset in the pipeline config. |
| index_dir | string | yes | - | Filesystem path for the LanceDB database. Created if missing. |
| build_index | enum | no | `if_missing` | `if_missing` \| `always` \| `never`. Controls when to build the index. |
| index_types | list[string] \| string | yes | - | Which indexes to build: `fts`, `embedding`, or `"hybrid"` (interpreted as both). |
Here's the simplest possible retriever config (FTS only):
FTS section (retrievers.<name>.fts):
```yaml
retrievers:
my_search: # name can be anything you want
type: lancedb
dataset: my_dataset # must match a dataset name or pipeline step
index_dir: ./my_lance_index
index_types: ["fts"]
fts:
index_phrase: "{{ input.text }}" # what to index from each row
query_phrase: "{{ input.query }}" # what to search for at runtime
```
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| index_phrase | jinja string | required if `fts` in index_types | - | How to index each dataset row. Context: `row`. |
| query_phrase | jinja string | recommended for FTS/hybrid queries | - | How to construct the FTS query. Context: op context (see below). |
### Full example with all options
Embedding section (retrievers.<name>.embedding):
```yaml
retrievers:
my_search:
type: lancedb
dataset: my_dataset
index_dir: ./my_lance_index
build_index: if_missing # optional, default: if_missing
index_types: ["fts", "embedding"] # can be ["fts"], ["embedding"], or both
fts:
index_phrase: "{{ input.text }}"
query_phrase: "{{ input.query }}"
embedding:
model: openai/text-embedding-3-small
index_phrase: "{{ input.text }}" # optional, falls back to fts.index_phrase
query_phrase: "{{ input.query }}"
query: # optional section
mode: hybrid # optional, auto-selects based on index_types
top_k: 10 # optional, default: 5
```
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | required if `embedding` in index_types | - | Embedding model used for both index vectors and query vectors. |
| index_phrase | jinja string | no | falls back to `fts.index_phrase` if present | How to index each dataset row for embedding. Context: `row`. |
| query_phrase | jinja string | recommended for embedding/hybrid queries | - | How to construct the text to embed at query time. Context: op context. |
---
Query section (retrievers.<name>.query):
### Required fields
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | enum | no | auto | `fts` \| `embedding` \| `hybrid`. If omitted: `hybrid` when both indexes exist, else whichever index exists. |
| top_k | int | no | 5 | Number of results to return. |
| Field | Description |
| --- | --- |
| `type` | Must be `lancedb` |
| `dataset` | Name of a dataset or pipeline step to index |
| `index_dir` | Path where LanceDB stores the index (created if missing) |
| `index_types` | List of index types: `["fts"]`, `["embedding"]`, or `["fts", "embedding"]` |
Notes:
- Hybrid search uses LanceDB's built-in reranking (RRF) by default.
- Jinja contexts:
- Map / Filter / Extract: `{"input": <current item>}`
- Reduce: `{"reduce_key": {...}, "inputs": [items]}`
- Jinja for indexing uses `{"input": <dataset row>}`
- Keep query phrases concise; slice long fields, e.g. `{{ input.src[:1000] }}`.
- The injected `retrieval_context` is truncated conservatively (~1000 chars per doc).
---
### Optional fields
| Field | Default | Description |
| --- | --- | --- |
| `build_index` | `if_missing` | When to build: `if_missing`, `always`, or `never` |
| `query.mode` | auto | `fts`, `embedding`, or `hybrid`. Auto-selects based on what indexes exist |
| `query.top_k` | 5 | Number of results to return |
---
### The `fts` section
Required if `"fts"` is in `index_types`. Configures full-text search.
| Field | Required | Description |
| --- | --- | --- |
| `index_phrase` | yes | Jinja template: what text to index from each dataset row |
| `query_phrase` | yes | Jinja template: what text to search for at query time |
**Jinja variables available:**
| Template | Variables | When it runs |
| --- | --- | --- |
| `index_phrase` | `input` = the dataset row | Once per row when building the index |
| `query_phrase` | `input` = current item (map/filter/extract) | At query time for each item processed |
| `query_phrase` | `reduce_key`, `inputs` (reduce operations) | At query time for each group |
**Example - Medical knowledge base:**
```yaml
datasets:
drugs:
type: file
path: drugs.json # [{"name": "Aspirin", "uses": "pain, fever"}, ...]
patient_notes:
type: file
path: notes.json # [{"symptoms": "headache and fever"}, ...]
retrievers:
drug_lookup:
type: lancedb
dataset: drugs # index the drugs dataset
index_dir: ./drug_index
index_types: ["fts"]
fts:
index_phrase: "{{ input.name }}: {{ input.uses }}" # index: "Aspirin: pain, fever"
query_phrase: "{{ input.symptoms }}" # search with patient symptoms
operations:
- name: find_treatment
type: map
retriever: drug_lookup # attach the retriever
prompt: |
Patient symptoms: {{ input.symptoms }}
Relevant drugs from knowledge base:
{{ retrieval_context }}
Recommend a treatment.
output:
schema:
recommendation: string
```
When processing `{"symptoms": "headache and fever"}`:
1. `query_phrase` renders to `"headache and fever"`
2. FTS searches the index and finds `"Aspirin: pain, fever"` as a match
3. `{{ retrieval_context }}` in your prompt contains the matched results
---
### The `embedding` section
Required if `"embedding"` is in `index_types`. Configures vector/semantic search.
| Field | Required | Description |
| --- | --- | --- |
| `model` | yes | Embedding model, e.g. `openai/text-embedding-3-small` |
| `index_phrase` | no | Jinja template for text to embed. Falls back to `fts.index_phrase` |
| `query_phrase` | yes | Jinja template for query text to embed |
**Jinja variables:** Same as FTS section.
**Example - Semantic search:**
```yaml
retrievers:
semantic_docs:
type: lancedb
dataset: documentation
index_dir: ./docs_index
index_types: ["embedding"]
embedding:
model: openai/text-embedding-3-small
index_phrase: "{{ input.content }}"
query_phrase: "{{ input.question }}"
```
---
### The `query` section (optional)
Controls search behavior. You can omit this entire section.
| Field | Default | Description |
| --- | --- | --- |
| `mode` | auto | `fts`, `embedding`, or `hybrid`. Auto-selects `hybrid` if both indexes exist |
| `top_k` | 5 | Number of results to retrieve |
**Example - Override defaults:**
```yaml
retrievers:
my_search:
# ... other config ...
query:
mode: fts # force FTS even if embedding index exists
top_k: 20 # return more results
```
---
## Using a retriever in operations
Attach the retriever to any LLM-powered op with `retriever: <name>`. Include `{{ retrieval_context }}` in your prompt or let DocETL prepend it automatically.
### Operation Parameters
When using a retriever with an operation, the following additional parameters are available:
Attach a retriever to any LLM operation (map, filter, reduce, extract) with `retriever: <retriever_name>`. The retrieved results are available as `{{ retrieval_context }}` in your prompt.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| retriever | string | - | Name of the retriever to use (must be defined in the `retrievers` section). |
| save_retriever_output | bool | false | If true, saves the retrieved context to `_<operation_name>_retrieved_context` in the output. Useful for debugging and verifying retrieval quality. |
| retriever | string | - | Name of the retriever to use (must match a key in `retrievers`). |
| save_retriever_output | bool | false | If true, saves retrieved context to `_<operation_name>_retrieved_context` in output. |
### Map example
### Map
```yaml
operations:
- name: tag_visit
type: map
retriever: medical_r
save_retriever_output: true # Optional: save retrieved context to output
output:
schema:
tag: string
confidence: float
prompt: |
Classify the medical visit. Use the extra context if helpful:
{{ retrieval_context }}
Transcript:
{{ input.src }}
- name: tag_visit
type: map
retriever: medical_r
save_retriever_output: true
output:
schema:
tag: string
prompt: |
Classify this medical visit. Related context:
{{ retrieval_context }}
Transcript: {{ input.src }}
```
When `save_retriever_output: true`, each output document will include a `_tag_visit_retrieved_context` field containing the exact context that was retrieved and used for that document.
### Filter example
### Extract
```yaml
- name: extract_side_effects
type: extract
retriever: medical_r
document_keys: ["src"]
prompt: "Extract side effects mentioned in the text."
- name: filter_relevant
type: filter
retriever: medical_r
prompt: |
Is this transcript relevant to medication counseling?
Context: {{ retrieval_context }}
Transcript: {{ input.src }}
output:
schema:
is_relevant: boolean
```
### Filter
### Reduce example
When using reduce, the retrieval context is computed per group.
```yaml
- name: filter_relevant
type: filter
retriever: medical_r
prompt: "Is this transcript relevant to medication counseling? Return is_relevant: boolean."
output:
schema:
is_relevant: bool
_short_explanation: string
- name: summarize_by_medication
type: reduce
retriever: medical_r
reduce_key: medication
output:
schema:
summary: string
prompt: |
Summarize key points for medication '{{ reduce_key.medication }}'.
Related context: {{ retrieval_context }}
Inputs:
{% for item in inputs %}
- {{ item.src }}
{% endfor %}
```
### Reduce
When using reduce, the retrieval context is computed per group. The Jinja context provides both `reduce_key` and `inputs`.
```yaml
- name: summarize_by_medication
type: reduce
retriever: medical_r
reduce_key: "medication"
output:
schema:
summary: string
prompt: |
Summarize key points for medication '{{ reduce_key.medication }}'.
Use the extra context if helpful:
{{ retrieval_context }}
Inputs:
{{ inputs }}
```
## Jinja template contexts
- Map / Filter / Extract: `{"input": current_item}`
- Reduce: `{"reduce_key": {...}, "inputs": [items]}`
## Token budget and truncation
- DocETL uses a conservative default to limit the size of `retrieval_context` by truncating each retrieved text to ~1000 characters.
## Troubleshooting
- No results: the retriever injects “No extra context available.” and continues.
- Index issues: set `build_index: always` to rebuild; ensure `index_dir` exists and is writable.
- Embeddings: DocETL uses its embedding router and caches results where possible.
- **No results**: the retriever injects "No extra context available." and continues.
- **Index issues**: set `build_index: always` to rebuild; ensure `index_dir` exists and is writable.
- **Token limits**: `retrieval_context` is truncated to ~1000 chars per retrieved doc.

View File

@@ -14,6 +14,7 @@ nav:
- Getting Started:
- Overview: index.md
- Installation: installation.md
- Quick Start (Claude Code): quickstart-claude-code.md
- Quick Start Tutorial: tutorial.md
- Quick Start Tutorial (Python API): tutorial-pythonapi.md
- Best Practices: best-practices.md

View File

@@ -1,6 +1,6 @@
[project]
name = "docetl"
version = "0.2.5"
version = "0.2.6"
description = "ETL with LLM operations."
readme = "README.md"
requires-python = ">=3.10"
@@ -122,7 +122,7 @@ build-backend = "hatchling.build"
[tool.hatch.build]
packages = ["docetl"]
include = ["docetl/**", "server/**", "README.md", "LICENSE"]
include = ["docetl/**", "server/**", "README.md", "LICENSE", ".claude/skills/**"]
exclude = ["website/**/*"]
[tool.pytest.ini_options]

View File

@@ -698,7 +698,7 @@ wheels = [
[[package]]
name = "docetl"
version = "0.2.5"
version = "0.2.6"
source = { editable = "." }
dependencies = [
{ name = "asteval" },