📜 DocETL: Powering Complex Document Processing Pipelines
DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:
- An interactive UI playground for iterative prompt engineering and pipeline development
- A Python package for running production pipelines from the command line or Python code
💡 Need Help Writing Your Pipeline?
You can use Claude Code (recommended) to help you write your pipeline—see the quickstart: https://ucbepic.github.io/docetl/quickstart-claude-code/
If you’d rather use ChatGPT or the Claude app, see docetl.org/llms.txt for a big prompt you can copy/paste before describing your task.
🌟 Community Projects
📚 Educational Resources
🚀 Getting Started
There are two main ways to use DocETL:
1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)
DocWrangler helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real-time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use
DocWrangler is hosted at docetl.org/playground. To run the playground locally, you can either:
- Use Docker (recommended for a quick start):
  make docker
- Set up the development environment manually

See the Playground Setup Guide for detailed instructions.
2. 📦 Python Package (For Production Use)
If you want to use DocETL as a Python package:
Prerequisites
- Python 3.10 or later
- OpenAI API key
pip install docetl
Create a .env file in your project directory:
OPENAI_API_KEY=your_api_key_here # Required for LLM operations (or the key for the LLM of your choice)
⚠️ Important: Two Different .env Files
- Root .env: Used by the backend Python server that executes DocETL pipelines
- website/.env.local: Used by the frontend TypeScript code in DocWrangler (UI features like improve prompt and chatbot)
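If you are working from a clone of the repository, a minimal sketch of creating both files from the repository root might look like the following (the key values are placeholders; the full variable lists for each file are shown in the DocWrangler Setup section below):

```bash
# Minimal sketch (placeholder keys); run from the root of a cloned repository
echo "OPENAI_API_KEY=your_api_key_here" > .env            # backend: DocETL pipeline execution
echo "OPENAI_API_KEY=sk-xxx" > website/.env.local         # frontend: DocWrangler UI features
```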
To see examples of how to use DocETL, check out the tutorial.
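For a rough sense of what a pipeline looks like, the sketch below writes a one-operation map pipeline and runs it with the docetl CLI. The YAML schema, file names, and prompt here are illustrative only, not copied from the tutorial, which remains the authoritative reference:

```bash
# Illustrative sketch only: field names and files are placeholders; see the
# tutorial for the authoritative pipeline schema.
cat > pipeline.yaml <<'EOF'
default_model: gpt-4o-mini
datasets:
  docs:
    type: file
    path: documents.json        # a JSON list of objects, e.g. [{"text": "..."}]
operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following document in one sentence: {{ input.text }}
    output:
      schema:
        summary: string
pipeline:
  steps:
    - name: summarize_docs
      input: docs
      operations:
        - summarize
  output:
    type: file
    path: summaries.json
EOF
docetl run pipeline.yaml
```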
2. 🎮 DocWrangler Setup
To run DocWrangler locally, you have two options:
Option A: Using Docker (Recommended for Quick Start)
The easiest way to get the DocWrangler playground running:
- Create the required environment files:
Create .env in the root directory (for the backend Python server that executes pipelines):
OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081
# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
Create .env.local in the website directory (for DocWrangler UI features like improve prompt and chatbot):
OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini # Model used by the UI assistant
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
- Run Docker:
make docker
This will:
- Create a Docker volume for persistent data
- Build the DocETL image
- Run the container with the UI accessible at http://localhost:3000
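Once the container is up, a quick way to confirm the UI is being served (assuming the default port above; adjust if you changed FRONTEND_PORT or the compose port mappings):

```bash
# Expect an HTTP 200 once the frontend has finished starting
curl -I http://localhost:3000
```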
To clean up Docker resources (note that this will delete the Docker volume):
make docker-clean
AWS Bedrock
This framework supports integration with AWS Bedrock. To enable:
- Configure AWS credentials:
aws configure
- Test your AWS credentials:
make test-aws
- Run with AWS support:
AWS_PROFILE=your-profile AWS_REGION=your-region make docker
Or using Docker Compose:
AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up
Environment variables:
- AWS_PROFILE: Your AWS CLI profile (default: 'default')
- AWS_REGION: AWS region (default: 'us-west-2')
Bedrock models are prefixed with bedrock/. See the LiteLLM docs for more details.
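For example, a quick sanity check that your profile can see Bedrock models, and the shape of a prefixed model name (the model ID below is only an example; availability varies by account and region):

```bash
# List the Bedrock foundation model IDs visible to your profile/region
aws bedrock list-foundation-models --profile your-profile --region us-west-2 \
  --query 'modelSummaries[].modelId' --output table

# A model is then referenced with the bedrock/ prefix, e.g.:
#   bedrock/anthropic.claude-3-haiku-20240307-v1:0
```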
Option B: Manual Setup (Development)
For development or if you prefer not to use Docker:
- Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
- Set up environment variables in .env in the root (top-level) directory (for the backend Python server):
OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081
# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
And create a .env.local file in the website directory (for DocWrangler UI features):
OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini # Model used by the UI assistant
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
- Install dependencies:
make install # Install Python deps with uv and set up pre-commit
make install-ui # Install UI dependencies
If you prefer using uv directly instead of Make:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras
- Start the development server:
make run-ui-dev
- Visit http://localhost:3000/playground to access the interactive UI.
🛠️ Development Setup
If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
For detailed documentation and tutorials, visit our documentation.

