Compare commits: main...claudeskil (3 commits)

| SHA1 |
| --- |
| d1bd0a000e |
| 366a118458 |
| 9b6bd41508 |

@ -0,0 +1,647 @@
---
name: docetl
description: Build and run LLM-powered data processing pipelines with DocETL. Use when users say "docetl", want to analyze unstructured data, process documents, extract information, or run ETL tasks on text. Helps with data collection, pipeline creation, execution, and optimization.
---

# DocETL Pipeline Development

DocETL is a system for creating LLM-powered data processing pipelines. This skill helps you build end-to-end pipelines: from data preparation to execution and optimization.

## Workflow Overview

1. **Understand the task** - What data? What processing?
2. **Prepare data** - Transform into a DocETL dataset (JSON/CSV)
3. **Read and understand the data** - Examine actual documents to write specific prompts
4. **Author pipeline** - Create a YAML configuration with detailed, data-specific prompts
5. **Verify environment** - Check API keys in `.env`
6. **Execute pipeline** - Run with `docetl run`
7. **Optimize (optional)** - Use the MOAR optimizer for cost/accuracy tradeoffs

## Step 1: Data Preparation

DocETL datasets must be **JSON arrays** or **CSV files**.

### JSON Format
```json
[
  {"id": 1, "text": "First document content...", "metadata": "value"},
  {"id": 2, "text": "Second document content...", "metadata": "value"}
]
```

### CSV Format
```csv
id,text,metadata
1,"First document content...","value"
2,"Second document content...","value"
```

### Data Collection Scripts

If the user needs to collect data, write a Python script:

```python
import json

# Collect/transform data
documents = []
for source in sources:
    documents.append({
        "id": source.id,
        "text": source.content,  # DO NOT truncate text
        # Add relevant fields
    })

# Save as a DocETL dataset
with open("dataset.json", "w") as f:
    json.dump(documents, f, indent=2)
```

**Important:** Never truncate document text in collection scripts. DocETL operations like `split` handle long documents properly. Truncation loses information.

## Step 2: Read and Understand the Data

**CRITICAL**: Before writing any prompts, READ the actual input data to understand:
- The structure and format of documents
- The vocabulary and terminology used
- What information is present vs. absent
- Edge cases and variations

```python
import json

with open("dataset.json") as f:
    data = json.load(f)

# Examine several examples
for doc in data[:5]:
    print(doc)
```

This understanding is essential for writing specific, effective prompts.

## Step 3: Pipeline Structure

Create a YAML file with this structure:

```yaml
default_model: gpt-5-nano

system_prompt:
  dataset_description: <describe the data based on what you observed>
  persona: <role for the LLM to adopt>

datasets:
  input_data:
    type: file
    path: "dataset.json"  # or dataset.csv

operations:
  - name: <operation_name>
    type: <operation_type>
    prompt: |
      <Detailed, specific prompt based on the actual data>
    output:
      schema:
        <field_name>: <type>

pipeline:
  steps:
    - name: process
      input: input_data
      operations:
        - <operation_name>
  output:
    type: file
    path: "output.json"
    intermediate_dir: "intermediates"  # ALWAYS set this for debugging
```

### Key Configuration

- **default_model**: Use `gpt-5-nano` unless the user specifies otherwise
- **intermediate_dir**: Always set it to log intermediate results
- **system_prompt**: Describe the data based on what you actually observed

## Step 4: Writing Effective Prompts

**Prompts must be specific to the data, not generic.** After reading the input data:

### Bad (Generic) Prompt
```yaml
prompt: |
  Extract key information from this document.
  {{ input.text }}
```

### Good (Specific) Prompt
```yaml
prompt: |
  You are analyzing a medical transcript from a doctor-patient visit.

  The transcript follows this format:
  - Doctor statements are prefixed with "DR:"
  - Patient statements are prefixed with "PT:"
  - Timestamps appear in brackets like [00:05:23]

  From the following transcript, extract:
  1. All medications mentioned (brand names or generic)
  2. Dosages if specified
  3. Patient-reported side effects or concerns

  Transcript:
  {{ input.transcript }}

  Be thorough - patients often mention medication names informally.
  If a medication is unclear, include it with a note.
```

### Prompt Writing Guidelines

1. **Describe the data format** you observed
2. **Be specific about what to extract** - list exact fields
3. **Mention edge cases** you noticed in the data
4. **Provide examples** if the task is ambiguous
5. **Set expectations** for handling missing/unclear information

## Step 5: Choosing Operations

Many tasks need only a **single map operation**. Use good judgment:

| Task | Recommended Approach |
|------|---------------------|
| Extract info from each doc | Single `map` |
| Multiple extractions | Multiple `map` operations chained |
| Extract then summarize | `map` → `reduce` |
| Filter then process | `filter` → `map` |
| Split long docs | `split` → `map` → `reduce` |
| Deduplicate entities | `map` → `unnest` → `resolve` |

## Operation Reference

### Map Operation

Applies an LLM transformation to each document independently.

```yaml
- name: extract_info
  type: map
  prompt: |
    Analyze this document:
    {{ input.text }}

    Extract the main topic and 3 key points.
  output:
    schema:
      topic: string
      key_points: list[string]
  model: gpt-5-nano  # optional, uses default_model if not set
  skip_on_error: true  # recommended for large-scale runs
  validate:  # optional
    - len(output["key_points"]) == 3
  num_retries_on_validate_failure: 2  # optional
```

**Key parameters:**
- `prompt`: Jinja2 template; use `{{ input.field }}` to reference fields
- `output.schema`: Define the output structure
- `skip_on_error`: Set `true` to continue on LLM errors (recommended at scale)
- `validate`: Python expressions to validate output
- `sample`: Process only N documents (for testing)
- `limit`: Stop after producing N outputs

### Filter Operation

Keeps or removes documents based on LLM criteria. The output schema must have exactly one boolean field.

```yaml
- name: filter_relevant
  type: filter
  skip_on_error: true
  prompt: |
    Document: {{ input.text }}

    Is this document relevant to climate change?
    Respond true or false.
  output:
    schema:
      is_relevant: boolean
```

### Reduce Operation

Aggregates documents by a key using an LLM.

**Always include `fold_prompt` and `fold_batch_size`** for reduce operations. This handles cases where the group is too large to fit in context.

```yaml
- name: summarize_by_category
  type: reduce
  reduce_key: category  # use "_all" to aggregate everything
  skip_on_error: true
  prompt: |
    Summarize these {{ inputs | length }} items for category "{{ inputs[0].category }}":

    {% for item in inputs %}
    - {{ item.title }}: {{ item.description }}
    {% endfor %}
  fold_prompt: |
    You have an existing summary and new items to incorporate.

    Existing summary:
    {{ output.summary }}

    New items to add:
    {% for item in inputs %}
    - {{ item.title }}: {{ item.description }}
    {% endfor %}

    Update the summary to include the new information.
  fold_batch_size: 10  # Estimate based on doc size and model context window
  output:
    schema:
      summary: string
  validate:
    - len(output["summary"].strip()) > 0
  num_retries_on_validate_failure: 2
```

**Estimating `fold_batch_size`** (a rough sketch follows this list):
- Consider document size and model context window
- For gpt-4o-mini (128k context): ~50-100 small docs, ~10-20 medium docs
- For gpt-4o (128k context): similar to gpt-4o-mini
- Use WebSearch to look up context window sizes for unfamiliar models
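
To turn those guidelines into a number, here is a rough sketch (not part of DocETL; the 4-characters-per-token ratio and the half-context budget are assumptions you should adjust for your data and model):

```python
import json

def estimate_fold_batch_size(
    dataset_path: str,
    context_window: int = 128_000,
    field: str = "text",
    budget_fraction: float = 0.5,
) -> int:
    """Rough fold_batch_size estimate: fill a fraction of the context
    window with documents, leaving room for the prompt, the running
    summary, and a safety margin."""
    with open(dataset_path) as f:
        docs = json.load(f)
    # ~4 characters per token is a common rule of thumb for English text
    avg_tokens_per_doc = (
        sum(len(d.get(field, "")) for d in docs) / max(len(docs), 1) / 4
    )
    return max(1, int(context_window * budget_fraction / max(avg_tokens_per_doc, 1)))

print(estimate_fold_batch_size("dataset.json"))
```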

**Key parameters:**
- `reduce_key`: Field to group by (or a list of fields, or `_all`)
- `fold_prompt`: Template for incrementally adding items to the existing output (required)
- `fold_batch_size`: Number of items per fold iteration (required)
- `associative`: Set to `false` if order matters

### Split Operation

Divides long text into smaller chunks. No LLM call.

```yaml
- name: split_document
  type: split
  split_key: content
  method: token_count  # or "delimiter"
  method_kwargs:
    num_tokens: 500
  model: gpt-5-nano
```

**Output adds:**
- `{split_key}_chunk`: The chunk content
- `{op_name}_id`: Original document ID
- `{op_name}_chunk_num`: Chunk number

### Unnest Operation

Flattens list fields into separate rows. No LLM call.

```yaml
- name: unnest_items
  type: unnest
  unnest_key: items  # field containing the list
  keep_empty: false  # optional
```

**Example:** If a document has `items: ["a", "b", "c"]`, unnest creates 3 documents, with `items: "a"`, `items: "b"`, and `items: "c"` respectively, as the sketch below illustrates.
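
The transformation is equivalent to this plain-Python sketch (illustrative only, not DocETL's implementation):

```python
def unnest(docs: list[dict], unnest_key: str) -> list[dict]:
    # Each list element becomes its own row; all other fields are copied
    return [
        {**doc, unnest_key: value}
        for doc in docs
        for value in doc[unnest_key]
    ]

print(unnest([{"id": 1, "items": ["a", "b", "c"]}], "items"))
# [{'id': 1, 'items': 'a'}, {'id': 1, 'items': 'b'}, {'id': 1, 'items': 'c'}]
```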

### Resolve Operation

Deduplicates and canonicalizes entities. Uses pairwise comparison.

```yaml
- name: dedupe_names
  type: resolve
  optimize: true  # let the optimizer find blocking rules
  skip_on_error: true
  comparison_prompt: |
    Are these the same person?

    Person 1: {{ input1.name }} ({{ input1.email }})
    Person 2: {{ input2.name }} ({{ input2.email }})

    Respond true or false.
  resolution_prompt: |
    Standardize this person's name:

    {% for entry in inputs %}
    - {{ entry.name }}
    {% endfor %}

    Return the canonical name.
  output:
    schema:
      name: string
```

**Important:** Set `optimize: true` and run `docetl build` to generate efficient blocking rules. Without blocking, this is O(n²).

### Code Operations

Deterministic Python transformations without LLM calls.

**code_map:**
```yaml
- name: compute_stats
  type: code_map
  code: |
    def transform(doc) -> dict:
        return {
            "word_count": len(doc["text"].split()),
            "char_count": len(doc["text"])
        }
```

**code_reduce:**
```yaml
- name: aggregate
  type: code_reduce
  reduce_key: category
  code: |
    def transform(items) -> dict:
        total = sum(item["value"] for item in items)
        return {"total": total, "count": len(items)}
```

**code_filter:**
```yaml
- name: filter_long
  type: code_filter
  code: |
    def transform(doc) -> bool:
        return len(doc["text"]) > 100
```

### Retrievers (LanceDB)

Augment LLM operations with retrieved context from a LanceDB index. Useful for:
- Finding related documents to compare against
- Providing additional context for extraction/classification
- Cross-referencing facts across a dataset

**Define a retriever:**
```yaml
retrievers:
  facts_index:
    type: lancedb
    dataset: extracted_facts  # dataset to index
    index_dir: workloads/wiki/lance_index
    build_index: if_missing  # if_missing | always | never
    index_types: ["fts", "embedding"]  # or "hybrid"
    fts:
      index_phrase: "{{ input.fact }}: {{ input.source }}"
      query_phrase: "{{ input.fact }}"
    embedding:
      model: openai/text-embedding-3-small
      index_phrase: "{{ input.fact }}"
      query_phrase: "{{ input.fact }}"
    query:
      mode: hybrid
      top_k: 5
```

**Use in operations:**
```yaml
- name: find_conflicts
  type: map
  retriever: facts_index
  prompt: |
    Check if this fact conflicts with any retrieved facts:

    Current fact: {{ input.fact }} (from {{ input.source }})

    Related facts from other articles:
    {{ retrieval_context }}

    Return whether there's a genuine conflict.
  output:
    schema:
      has_conflict: boolean
```

**Key points:**
- `{{ retrieval_context }}` is injected into prompts automatically
- The index is built on first use (when `build_index: if_missing`)
- Supports full-text (`fts`), vector (`embedding`), or `hybrid` search
- Use `save_retriever_output: true` to debug what was retrieved
- **Can index intermediate outputs**: A retriever can index the output of a previous pipeline step, enabling patterns like "extract facts → index facts → retrieve similar facts for each"

## Documentation Reference

For detailed parameters, advanced features, and more examples, read the docs:
- **Operations**: `docs/operators/` folder (map.md, reduce.md, filter.md, etc.)
- **Concepts**: `docs/concepts/` folder (pipelines.md, operators.md, schemas.md)
- **Examples**: `docs/examples/` folder
- **Optimization**: `docs/optimization/` folder

## Step 6: Environment Setup

Before running, verify API keys exist:

```bash
# Check for a .env file
cat .env
```

Required keys depend on the model:
- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Google: `GEMINI_API_KEY`

If missing, prompt the user to create `.env`:
```
OPENAI_API_KEY=sk-...
```
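
To check keys programmatically instead of by eye, a small sketch (assumes the `python-dotenv` package is installed; extend the key tuple to match the models you use):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"):
    print(f"{key}: {'found' if os.getenv(key) else 'MISSING'}")
```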

## Step 7: Execution

**Always ask the user before running** - LLM calls cost money.

```bash
docetl run pipeline.yaml
```

Options:
- `--max_threads N` - Control parallelism

Check intermediate results in the `intermediate_dir` folder to debug.
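
For example, a quick way to peek at what each operation produced (the per-step file layout inside `intermediate_dir` is an assumption; adjust the path to what you see on disk):

```python
import json
from pathlib import Path

# Print a preview of the first record of every JSON file under intermediate_dir
for path in Path("intermediates").rglob("*.json"):
    records = json.loads(path.read_text())
    first = records[0] if isinstance(records, list) and records else records
    print(path, "->", json.dumps(first, indent=2)[:300])
```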

## Step 8: Optimization (Optional)

Use the MOAR optimizer to find the Pareto frontier of **cost vs. accuracy** tradeoffs. MOAR experiments with different pipeline rewrites and models to find optimal configurations.

Add to the pipeline YAML:

```yaml
optimizer_config:
  type: moar
  save_dir: ./optimization_results
  available_models:
    - gpt-5-nano
    - gpt-4o-mini
    - gpt-4o
  evaluation_file: evaluate.py  # User must provide
  metric_key: score
  max_iterations: 20
  model: gpt-5-nano
```

Create an evaluation file (`evaluate.py`):
```python
def evaluate(outputs: list[dict]) -> dict:
    # Score the outputs (0-1 scale recommended).
    # is_correct is a task-specific check you define yourself.
    correct = sum(1 for o in outputs if is_correct(o))
    return {"score": correct / len(outputs)}
```

Run optimization:
```bash
docetl build pipeline.yaml --optimizer moar
```

MOAR will produce multiple pipeline variants on the Pareto frontier - the user can choose based on their cost/accuracy preferences.

## Output Schemas

**Keep schemas minimal and simple** unless the user explicitly requests more fields. Default to 1-3 output fields per operation. Only add more fields if the user specifically asks for them.

**Nesting limit:** Maximum 2 levels deep (e.g., `list[{field: str}]` is allowed, but no deeper).

```yaml
# Good - minimal, focused on the core task
output:
  schema:
    summary: string

# Good - a few fields when the task requires it
output:
  schema:
    topic: string
    keywords: list[string]

# Acceptable - 2 levels of nesting (list of objects)
output:
  schema:
    items: "list[{name: str, value: int}]"

# Bad - too many fields (unless the user explicitly requested all of these)
output:
  schema:
    conflicts_found: bool
    num_conflicts: int
    conflicts: "list[{claim_a: str, source_a: str, claim_b: str, source_b: str}]"
    analysis_summary: str

# Bad - more than 2 levels of nesting (not supported)
output:
  schema:
    data: "list[{nested: {too: {deep: str}}}]"
```

**Guidelines:**
- Start with the minimum fields needed to answer the user's question
- Avoid complex nested objects unless explicitly requested
- If you need structured data, prefer multiple simple operations over one complex schema
- Complex schemas increase LLM failures and costs

Supported types: `string`, `int`, `float`, `bool`, `list[type]`, `enum`

## Validation

**Always add validation to LLM-powered operations** (map, reduce, filter, resolve). Validation catches malformed outputs and retries automatically.

```yaml
- name: extract_keywords
  type: map
  prompt: |
    Extract 3-5 keywords from: {{ input.text }}
  output:
    schema:
      keywords: list[string]
  validate:
    - len(output["keywords"]) >= 3
    - len(output["keywords"]) <= 5
  num_retries_on_validate_failure: 2
```

Common validation patterns:
```yaml
# List length constraints
- len(output["items"]) >= 1
- len(output["items"]) <= 10

# Enum/allowed values
- output["sentiment"] in ["positive", "negative", "neutral"]

# String not empty
- len(output["summary"].strip()) > 0

# Numeric ranges
- output["score"] >= 0
- output["score"] <= 100
```

## Jinja2 Templating

For map operations, use `input`:
```yaml
prompt: |
  Document: {{ input.text }}
  {% if input.metadata %}
  Context: {{ input.metadata }}
  {% endif %}
```

For reduce operations, use `inputs` (a list):
```yaml
prompt: |
  Summarize these {{ inputs | length }} items:
  {% for item in inputs %}
  - {{ item.summary }}
  {% endfor %}
```

## Troubleshooting

### Pipeline won't run
- Check that `.env` has the correct API keys
- Verify the dataset file exists and is valid JSON/CSV
- Check YAML syntax

### Bad outputs
- Read more input data examples to improve prompt specificity
- Add `validate` rules with retries
- Simplify the output schema
- Add concrete examples to the prompt

### High costs
- Use `gpt-5-nano` or `gpt-4o-mini`
- Add `sample: 10` to test on a subset first
- Run the MOAR optimizer to find cost-efficient rewrites

### Check intermediate results
Look in the `intermediate_dir` folder to debug each step.

## Quick Reference

```bash
# Run pipeline
docetl run pipeline.yaml

# Run with more parallelism
docetl run pipeline.yaml --max_threads 16

# Optimize pipeline (cost/accuracy tradeoff)
docetl build pipeline.yaml --optimizer moar

# Clear LLM cache
docetl clear-cache

# Check version
docetl version
```

@ -1,4 +1,4 @@
-__version__ = "0.2.5"
+__version__ = "0.2.6"

import warnings

@ -244,5 +244,104 @@ def version():
    typer.echo(f"DocETL version: {docetl.__version__}")


@app.command("install-skill")
def install_skill(
    uninstall: bool = typer.Option(
        False, "--uninstall", "-u", help="Remove the installed skill instead"
    ),
):
    """
    Install the DocETL Claude Code skill to your personal skills directory.

    This makes the DocETL skill available in Claude Code for any project.
    The skill helps you build and run DocETL pipelines.
    """
    import shutil

    # Find the skill source - try multiple locations
    # 1. Installed package location (via importlib.resources)
    # 2. Development location (relative to this file)
    skill_source = None

    # Try to find via package resources first
    try:
        import importlib.resources as pkg_resources

        # For Python 3.9+, use files()
        try:
            package_root = Path(pkg_resources.files("docetl")).parent
            potential_source = package_root / ".claude" / "skills" / "docetl"
            if potential_source.exists():
                skill_source = potential_source
        except (TypeError, AttributeError):
            pass
    except ImportError:
        pass

    # Fallback: try relative to this file (development mode)
    if skill_source is None:
        dev_source = Path(__file__).parent.parent / ".claude" / "skills" / "docetl"
        if dev_source.exists():
            skill_source = dev_source

    if skill_source is None or not skill_source.exists():
        console.print(
            Panel(
                "[bold red]Error:[/bold red] Could not find the DocETL skill files.\n\n"
                "This may happen if the package was not installed correctly.\n"
                "Try reinstalling: [bold]pip install --force-reinstall docetl[/bold]",
                title="[bold red]Skill Not Found[/bold red]",
                border_style="red",
            )
        )
        raise typer.Exit(1)

    # Target directory
    skill_target = Path.home() / ".claude" / "skills" / "docetl"

    if uninstall:
        if skill_target.exists():
            shutil.rmtree(skill_target)
            console.print(
                Panel(
                    f"[bold green]Success![/bold green] DocETL skill removed from:\n"
                    f"[dim]{skill_target}[/dim]",
                    title="[bold green]Skill Uninstalled[/bold green]",
                    border_style="green",
                )
            )
        else:
            console.print(
                Panel(
                    "[yellow]The DocETL skill is not currently installed.[/yellow]",
                    title="[yellow]Nothing to Uninstall[/yellow]",
                    border_style="yellow",
                )
            )
        return

    # Create parent directories if needed
    skill_target.parent.mkdir(parents=True, exist_ok=True)

    # Copy the skill
    if skill_target.exists():
        shutil.rmtree(skill_target)

    shutil.copytree(skill_source, skill_target)

    console.print(
        Panel(
            f"[bold green]Success![/bold green] DocETL skill installed to:\n"
            f"[dim]{skill_target}[/dim]\n\n"
            "[bold]Next steps:[/bold]\n"
            "1. Restart Claude Code if it's running\n"
            "2. The skill will automatically activate when you work on DocETL tasks\n\n"
            "[dim]To uninstall: docetl install-skill --uninstall[/dim]",
            title="[bold green]Skill Installed[/bold green]",
            border_style="green",
        )
    )


if __name__ == "__main__":
    app()

@ -34,6 +34,9 @@ To get started with DocETL:
2. Define your pipeline in a YAML file. Want to use an LLM like ChatGPT or Claude to help you write your pipeline? See [docetl.org/llms.txt](https://docetl.org/llms.txt) for a big prompt you can copy-paste into ChatGPT or Claude before describing your task.
3. Run your pipeline using the DocETL command-line interface

+!!! tip "Fastest Way: Claude Code"
+    Clone this repo and run `claude` to use the built-in DocETL skill. Just describe your data processing task and Claude will create and run the pipeline for you. See [Quick Start (Claude Code)](quickstart-claude-code.md) for details.

## 🏛️ Project Origin

DocETL was created by members of the EPIC Data Lab and the Data Systems and Foundations group at UC Berkeley. The EPIC (Effective Programming, Interaction, and Computation with Data) Lab focuses on developing low-code and no-code interfaces for data work, powered by next-generation predictive programming techniques. DocETL is one of the projects that emerged from our research efforts to streamline complex document processing tasks.

@ -55,14 +55,6 @@ If you want to use only the parsing extra:
uv sync --extra parsing
```

-If you want to use the parsing tools, you need to install the `parsing` extra:
-
-```bash
-poetry install --extras "parsing"
-```
-
-This will create a virtual environment and install all the required dependencies.

4. Set up your OpenAI API key:

Create a .env file in the project root and add your OpenAI API key:

@ -0,0 +1,45 @@
# Quick Start with Claude Code

The fastest way to build DocETL pipelines is with [Claude Code](https://claude.ai/download), Anthropic's agentic coding tool. DocETL includes a built-in Claude Code skill that helps you create, run, and debug pipelines interactively.

## Option 1: Clone the Repository (Recommended)

This gives you the full development environment with the skill already configured.

1. Follow the [Installation from Source](installation.md#installation-from-source) instructions
2. Run `claude` in the repository directory

The skill is located at `.claude/skills/docetl/SKILL.md`.

## Option 2: Install via pip

If you already have DocETL installed via pip, you can install the skill separately:

```bash
pip install docetl
docetl install-skill
```

This copies the skill to `~/.claude/skills/docetl/`. Then run `claude` in any directory.

To uninstall: `docetl install-skill --uninstall`

## Usage

Simply describe what you want to do with your data. The skill activates automatically when you mention "docetl" or describe unstructured data processing tasks:

```
> I have a folder of customer support tickets in JSON format.
> I want to extract the main issue, sentiment, and suggested resolution for each.
```

Claude will:

1. **Read your data** to understand its structure
2. **Write a tailored pipeline** with prompts specific to your documents
3. **Run the pipeline** and show you the results
4. **Debug issues** if any operations fail

## Alternative: Manual Pipeline Authoring

If you prefer not to use Claude Code, see the [Quick Start Tutorial](tutorial.md) for writing pipelines by hand.

@ -1,29 +1,33 @@
## Retrievers (LanceDB OSS)

-Retrievers let you augment LLM operations with retrieved context from a LanceDB index built over one of your DocETL datasets. You define retrievers once at the top-level, then attach them to any LLM-powered operation using `retriever: <name>`. At runtime, DocETL performs full‑text, vector, or hybrid search and injects the results into your prompt as `{{ retrieval_context }}`.
+Retrievers let you augment LLM operations with retrieved context from a LanceDB index built over a DocETL dataset. You define retrievers once at the top-level, then attach them to any LLM-powered operation using `retriever: <name>`. At runtime, DocETL performs full-text, vector, or hybrid search and injects the results into your prompt as `{{ retrieval_context }}`.

LanceDB supports built-in full-text search, vector search, and hybrid with RRF reranking. See the official docs: [LanceDB Hybrid Search docs](https://lancedb.com/docs/search/hybrid-search/).

### Key points
-- Always OSS LanceDB (local `index_dir`).
-- A retriever references an existing dataset from the pipeline config.
-- Operations do not override retriever settings. One source of truth = consistency.
-- `{{ retrieval_context }}` is available to your prompt; if not used, DocETL prepends a short “extra context” section automatically.
-
-## Configuration (clear separation of index vs query)
+- Always OSS LanceDB (local `index_dir`).
+- A retriever references a dataset from the pipeline config, or the output of a previous pipeline step.
+- Operations do not override retriever settings. One source of truth = consistency.
+- `{{ retrieval_context }}` is available to your prompt; if not used, DocETL prepends a short "extra context" section automatically.
+
+## Configuration

Add a top-level `retrievers` section. Each retriever has:
-- `dataset`: dataset name to index
+- `dataset`: dataset name to index (can be a dataset or output of a previous pipeline step)
- `index_dir`: LanceDB path
-- `index_types`: which indexes to build: `fts`, `embedding`, or `hybrid` (interpreted as both `fts` and `embedding`)
-- `fts.index_phrase`: Jinja for how to index each dataset row for FTS (context: `input`)
-- `fts.query_phrase`: Jinja for how to build the FTS query (context: operation context)
-- `embedding.model`: embedding model used for the vector index and for query vectors
-- `embedding.index_phrase`: Jinja for how to index each dataset row for embedding (context: `input`)
-- `embedding.query_phrase`: Jinja for how to build the embedding query text (context: operation context)
+- `index_types`: which indexes to build: `fts`, `embedding`, or `hybrid` (both)
+- `fts.index_phrase`: Jinja template for indexing each row for full-text search
+- `fts.query_phrase`: Jinja template for building the FTS query at runtime
+- `embedding.model`: embedding model for vector index and queries
+- `embedding.index_phrase`: Jinja template for indexing each row for embeddings
+- `embedding.query_phrase`: Jinja template for building the embedding query
- `query.mode`: `fts` | `embedding` | `hybrid` (defaults to `hybrid` when both indexes exist)
- `query.top_k`: number of results to retrieve

### Basic example

```yaml
datasets:
  transcripts:

@ -37,158 +41,356 @@ retrievers:
    type: lancedb
    dataset: transcripts
    index_dir: workloads/medical/lance_index
-    build_index: if_missing # if_missing | always | never
-    index_types: ["fts", "embedding"] # or "hybrid"
+    build_index: if_missing # if_missing | always | never
+    index_types: ["fts", "embedding"]
    fts:
-      # How to index each row (context: input == dataset row)
-      index_phrase: >
-        {{ input.src }}
-      # How to build the query (map/filter/extract context: input; reduce: reduce_key & inputs)
-      query_phrase: >
-        {{ input.get("src","")[:1000] if input else "" }}
+      index_phrase: "{{ input.src }}"
+      query_phrase: "{{ input.src[:1000] }}"
    embedding:
      model: openai/text-embedding-3-small
-      # How to index each row for embedding (context: input == dataset row)
-      index_phrase: >
-        {{ input.src }}
-      # How to build the query text to embed (op context)
-      query_phrase: >
-        {{ input.get("src","")[:1000] if input else "" }}
+      index_phrase: "{{ input.src }}"
+      query_phrase: "{{ input.src[:1000] }}"
    query:
      mode: hybrid
      top_k: 8
```

-Notes:
-- Index build is automatic the first time a retriever is used (when `build_index: if_missing`).
-- `fts.index_phrase` and `embedding.index_phrase` are evaluated with `input` for each dataset record (here `input` is the dataset row).
-- `fts.query_phrase` and `embedding.query_phrase` are evaluated with the operation context.
+## Multi-step pipelines with retrieval
+
+Most pipelines have a single step, but you can define multiple steps where **the output of one step becomes the input (and retriever source) for the next**. This is powerful for patterns like:
+
+1. Extract structured data from documents
+2. Build a retrieval index on that extracted data
+3. Use retrieval to find related items and process them
+
+### Example: Extract facts, then find conflicts
+
+```yaml
+datasets:
+  articles:
+    type: file
+    path: workloads/wiki/articles.json
+
+default_model: gpt-4o-mini
+
+# Retriever indexes output of step 1 (extract_facts_step)
+retrievers:
+  facts_index:
+    type: lancedb
+    dataset: extract_facts_step  # References output of a pipeline step!
+    index_dir: workloads/wiki/facts_lance_index
+    build_index: if_missing
+    index_types: ["fts", "embedding"]
+    fts:
+      index_phrase: "{{ input.fact }} from {{ input.title }}"
+      query_phrase: "{{ input.fact }}"
+    embedding:
+      model: openai/text-embedding-3-small
+      index_phrase: "{{ input.fact }}"
+      query_phrase: "{{ input.fact }}"
+    query:
+      mode: hybrid
+      top_k: 5
+
+operations:
+  - name: extract_facts
+    type: map
+    prompt: |
+      Extract factual claims from this article.
+      Article: {{ input.title }}
+      Text: {{ input.text }}
+    output:
+      schema:
+        facts: list[string]
+
+  - name: unnest_facts
+    type: unnest
+    unnest_key: facts
+
+  - name: find_conflicts
+    type: map
+    retriever: facts_index  # Uses the retriever
+    prompt: |
+      Check if this fact conflicts with similar facts from other articles.
+
+      Current fact: {{ input.facts }} (from {{ input.title }})
+
+      Similar facts from other articles:
+      {{ retrieval_context }}
+
+      Return true only if there's a genuine contradiction.
+    output:
+      schema:
+        has_conflict: boolean
+
+pipeline:
+  steps:
+    # Step 1: Extract and unnest facts
+    - name: extract_facts_step
+      input: articles
+      operations:
+        - extract_facts
+        - unnest_facts
+
+    # Step 2: Use retrieval to find conflicts
+    - name: find_conflicts_step
+      input: extract_facts_step  # Input is output of step 1
+      operations:
+        - find_conflicts
+
+  output:
+    type: file
+    path: workloads/wiki/conflicts.json
+    intermediate_dir: workloads/wiki/intermediates
+```
+
+In this example:
+- **Step 1** (`extract_facts_step`) extracts facts from articles
+- The **retriever** (`facts_index`) indexes the output of step 1
+- **Step 2** (`find_conflicts_step`) processes each fact, using retrieval to find similar facts from other articles

## Configuration reference

-Top-level (retrievers.<name>):
-
-| Parameter | Type | Required | Default | Description |
-| --- | --- | --- | --- | --- |
-| type | string | yes | - | Must be `lancedb`. |
-| dataset | string | yes | - | Name of an existing dataset in the pipeline config. |
-| index_dir | string | yes | - | Filesystem path for the LanceDB database. Created if missing. |
-| build_index | enum | no | `if_missing` | `if_missing` \| `always` \| `never`. Controls when to build the index. |
-| index_types | list[string] \| string | yes | - | Which indexes to build: `fts`, `embedding`, or `"hybrid"` (interpreted as both). |
-
-FTS section (retrievers.<name>.fts):
-
-| Parameter | Type | Required | Default | Description |
-| --- | --- | --- | --- | --- |
-| index_phrase | jinja string | required if `fts` in index_types | - | How to index each dataset row. Context: `row`. |
-| query_phrase | jinja string | recommended for FTS/hybrid queries | - | How to construct the FTS query. Context: op context (see below). |
-
-Embedding section (retrievers.<name>.embedding):
-
-| Parameter | Type | Required | Default | Description |
-| --- | --- | --- | --- | --- |
-| model | string | required if `embedding` in index_types | - | Embedding model used for both index vectors and query vectors. |
-| index_phrase | jinja string | no | falls back to `fts.index_phrase` if present | How to index each dataset row for embedding. Context: `row`. |
-| query_phrase | jinja string | recommended for embedding/hybrid queries | - | How to construct the text to embed at query time. Context: op context. |
-
-Query section (retrievers.<name>.query):
-
-| Parameter | Type | Required | Default | Description |
-| --- | --- | --- | --- | --- |
-| mode | enum | no | auto | `fts` \| `embedding` \| `hybrid`. If omitted: `hybrid` when both indexes exist, else whichever index exists. |
-| top_k | int | no | 5 | Number of results to return. |
-
-Notes:
-- Hybrid search uses LanceDB’s built-in reranking (RRF) by default.
-- Jinja contexts:
-  - Map / Filter / Extract: `{"input": <current item>}`
-  - Reduce: `{"reduce_key": {...}, "inputs": [items]}`
-  - Jinja for indexing uses `{"input": <dataset row>}`
-- Keep query phrases concise; slice long fields, e.g. `{{ input.src[:1000] }}`.
-- The injected `retrieval_context` is truncated conservatively (~1000 chars per doc).
+### Minimal example
+
+Here's the simplest possible retriever config (FTS only):
+
+```yaml
+retrievers:
+  my_search:  # name can be anything you want
+    type: lancedb
+    dataset: my_dataset  # must match a dataset name or pipeline step
+    index_dir: ./my_lance_index
+    index_types: ["fts"]
+    fts:
+      index_phrase: "{{ input.text }}"  # what to index from each row
+      query_phrase: "{{ input.query }}"  # what to search for at runtime
+```
+
+### Full example with all options
+
+```yaml
+retrievers:
+  my_search:
+    type: lancedb
+    dataset: my_dataset
+    index_dir: ./my_lance_index
+    build_index: if_missing  # optional, default: if_missing
+    index_types: ["fts", "embedding"]  # can be ["fts"], ["embedding"], or both
+    fts:
+      index_phrase: "{{ input.text }}"
+      query_phrase: "{{ input.query }}"
+    embedding:
+      model: openai/text-embedding-3-small
+      index_phrase: "{{ input.text }}"  # optional, falls back to fts.index_phrase
+      query_phrase: "{{ input.query }}"
+    query:  # optional section
+      mode: hybrid  # optional, auto-selects based on index_types
+      top_k: 10  # optional, default: 5
+```
+
+---
+
+### Required fields
+
+| Field | Description |
+| --- | --- |
+| `type` | Must be `lancedb` |
+| `dataset` | Name of a dataset or pipeline step to index |
+| `index_dir` | Path where LanceDB stores the index (created if missing) |
+| `index_types` | List of index types: `["fts"]`, `["embedding"]`, or `["fts", "embedding"]` |
+
+---
+
+### Optional fields
+
+| Field | Default | Description |
+| --- | --- | --- |
+| `build_index` | `if_missing` | When to build: `if_missing`, `always`, or `never` |
+| `query.mode` | auto | `fts`, `embedding`, or `hybrid`. Auto-selects based on what indexes exist |
+| `query.top_k` | 5 | Number of results to return |
+
+---
+
+### The `fts` section
+
+Required if `"fts"` is in `index_types`. Configures full-text search.
+
+| Field | Required | Description |
+| --- | --- | --- |
+| `index_phrase` | yes | Jinja template: what text to index from each dataset row |
+| `query_phrase` | yes | Jinja template: what text to search for at query time |
+
+**Jinja variables available:**
+
+| Template | Variables | When it runs |
+| --- | --- | --- |
+| `index_phrase` | `input` = the dataset row | Once per row when building the index |
+| `query_phrase` | `input` = current item (map/filter/extract) | At query time for each item processed |
+| `query_phrase` | `reduce_key`, `inputs` (reduce operations) | At query time for each group |
+
+**Example - Medical knowledge base:**
+
+```yaml
+datasets:
+  drugs:
+    type: file
+    path: drugs.json  # [{"name": "Aspirin", "uses": "pain, fever"}, ...]
+
+  patient_notes:
+    type: file
+    path: notes.json  # [{"symptoms": "headache and fever"}, ...]
+
+retrievers:
+  drug_lookup:
+    type: lancedb
+    dataset: drugs  # index the drugs dataset
+    index_dir: ./drug_index
+    index_types: ["fts"]
+    fts:
+      index_phrase: "{{ input.name }}: {{ input.uses }}"  # index: "Aspirin: pain, fever"
+      query_phrase: "{{ input.symptoms }}"  # search with patient symptoms
+
+operations:
+  - name: find_treatment
+    type: map
+    retriever: drug_lookup  # attach the retriever
+    prompt: |
+      Patient symptoms: {{ input.symptoms }}
+
+      Relevant drugs from knowledge base:
+      {{ retrieval_context }}
+
+      Recommend a treatment.
+    output:
+      schema:
+        recommendation: string
+```
+
+When processing `{"symptoms": "headache and fever"}`:
+
+1. `query_phrase` renders to `"headache and fever"`
+2. FTS searches the index and finds `"Aspirin: pain, fever"` as a match
+3. `{{ retrieval_context }}` in your prompt contains the matched results
+
+---
+
+### The `embedding` section
+
+Required if `"embedding"` is in `index_types`. Configures vector/semantic search.
+
+| Field | Required | Description |
+| --- | --- | --- |
+| `model` | yes | Embedding model, e.g. `openai/text-embedding-3-small` |
+| `index_phrase` | no | Jinja template for text to embed. Falls back to `fts.index_phrase` |
+| `query_phrase` | yes | Jinja template for query text to embed |
+
+**Jinja variables:** Same as the FTS section.
+
+**Example - Semantic search:**
+
+```yaml
+retrievers:
+  semantic_docs:
+    type: lancedb
+    dataset: documentation
+    index_dir: ./docs_index
+    index_types: ["embedding"]
+    embedding:
+      model: openai/text-embedding-3-small
+      index_phrase: "{{ input.content }}"
+      query_phrase: "{{ input.question }}"
+```
+
+---
+
+### The `query` section (optional)
+
+Controls search behavior. You can omit this entire section.
+
+| Field | Default | Description |
+| --- | --- | --- |
+| `mode` | auto | `fts`, `embedding`, or `hybrid`. Auto-selects `hybrid` if both indexes exist |
+| `top_k` | 5 | Number of results to retrieve |
+
+**Example - Override defaults:**
+
+```yaml
+retrievers:
+  my_search:
+    # ... other config ...
+    query:
+      mode: fts  # force FTS even if embedding index exists
+      top_k: 20  # return more results
+```
+
+---

## Using a retriever in operations

-Attach the retriever to any LLM-powered op with `retriever: <name>`. Include `{{ retrieval_context }}` in your prompt or let DocETL prepend it automatically.
-
-### Operation Parameters
-
-When using a retriever with an operation, the following additional parameters are available:
+Attach a retriever to any LLM operation (map, filter, reduce, extract) with `retriever: <retriever_name>`. The retrieved results are available as `{{ retrieval_context }}` in your prompt.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
-| retriever | string | - | Name of the retriever to use (must be defined in the `retrievers` section). |
-| save_retriever_output | bool | false | If true, saves the retrieved context to `_<operation_name>_retrieved_context` in the output. Useful for debugging and verifying retrieval quality. |
+| retriever | string | - | Name of the retriever to use (must match a key in `retrievers`). |
+| save_retriever_output | bool | false | If true, saves retrieved context to `_<operation_name>_retrieved_context` in output. |

-### Map example
-
-```yaml
-operations:
-  - name: tag_visit
-    type: map
-    retriever: medical_r
-    save_retriever_output: true  # Optional: save retrieved context to output
-    output:
-      schema:
-        tag: string
-        confidence: float
-    prompt: |
-      Classify the medical visit. Use the extra context if helpful:
-      {{ retrieval_context }}
-      Transcript:
-      {{ input.src }}
-```
-
-When `save_retriever_output: true`, each output document will include a `_tag_visit_retrieved_context` field containing the exact context that was retrieved and used for that document.
+### Map
+
+```yaml
+- name: tag_visit
+  type: map
+  retriever: medical_r
+  save_retriever_output: true
+  output:
+    schema:
+      tag: string
+  prompt: |
+    Classify this medical visit. Related context:
+    {{ retrieval_context }}
+
+    Transcript: {{ input.src }}
+```
+
+### Extract
+
+```yaml
+- name: extract_side_effects
+  type: extract
+  retriever: medical_r
+  document_keys: ["src"]
+  prompt: "Extract side effects mentioned in the text."
+```

-### Filter example
-
-```yaml
-- name: filter_relevant
-  type: filter
-  retriever: medical_r
-  prompt: "Is this transcript relevant to medication counseling? Return is_relevant: boolean."
-  output:
-    schema:
-      is_relevant: bool
-      _short_explanation: string
-```
+### Filter
+
+```yaml
+- name: filter_relevant
+  type: filter
+  retriever: medical_r
+  prompt: |
+    Is this transcript relevant to medication counseling?
+    Context: {{ retrieval_context }}
+    Transcript: {{ input.src }}
+  output:
+    schema:
+      is_relevant: boolean
+```

-### Reduce example
-
-When using reduce, the retrieval context is computed per group.
-
-```yaml
-- name: summarize_by_medication
-  type: reduce
-  retriever: medical_r
-  reduce_key: "medication"
-  output:
-    schema:
-      summary: string
-  prompt: |
-    Summarize key points for medication '{{ reduce_key.medication }}'.
-    Use the extra context if helpful:
-    {{ retrieval_context }}
-    Inputs:
-    {{ inputs }}
-```
-
-## Jinja template contexts
-- Map / Filter / Extract: `{"input": current_item}`
-- Reduce: `{"reduce_key": {...}, "inputs": [items]}`
-
-## Token budget and truncation
-- DocETL uses a conservative default to limit the size of `retrieval_context` by truncating each retrieved text to ~1000 characters.
+### Reduce
+
+When using reduce, the retrieval context is computed per group. The Jinja context provides both `reduce_key` and `inputs`.
+
+```yaml
+- name: summarize_by_medication
+  type: reduce
+  retriever: medical_r
+  reduce_key: medication
+  output:
+    schema:
+      summary: string
+  prompt: |
+    Summarize key points for medication '{{ reduce_key.medication }}'.
+    Related context: {{ retrieval_context }}
+
+    Inputs:
+    {% for item in inputs %}
+    - {{ item.src }}
+    {% endfor %}
+```

## Troubleshooting

-- No results: the retriever injects “No extra context available.” and continues.
-- Index issues: set `build_index: always` to rebuild; ensure `index_dir` exists and is writable.
-- Embeddings: DocETL uses its embedding router and caches results where possible.
+- **No results**: the retriever injects "No extra context available." and continues.
+- **Index issues**: set `build_index: always` to rebuild; ensure `index_dir` exists and is writable.
+- **Token limits**: `retrieval_context` is truncated to ~1000 chars per retrieved doc.

@ -14,6 +14,7 @@ nav:
  - Getting Started:
      - Overview: index.md
      - Installation: installation.md
+     - Quick Start (Claude Code): quickstart-claude-code.md
      - Quick Start Tutorial: tutorial.md
      - Quick Start Tutorial (Python API): tutorial-pythonapi.md
      - Best Practices: best-practices.md

@ -1,6 +1,6 @@
[project]
name = "docetl"
-version = "0.2.5"
+version = "0.2.6"
description = "ETL with LLM operations."
readme = "README.md"
requires-python = ">=3.10"

@ -122,7 +122,7 @@ build-backend = "hatchling.build"

[tool.hatch.build]
packages = ["docetl"]
-include = ["docetl/**", "server/**", "README.md", "LICENSE"]
+include = ["docetl/**", "server/**", "README.md", "LICENSE", ".claude/skills/**"]
exclude = ["website/**/*"]

[tool.pytest.ini_options]