📜 DocETL: Powering Complex Document Processing Pipelines
DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:
- An interactive UI playground for iterative prompt engineering and pipeline development
- A Python package for running production pipelines from the command line or Python code
💡 Need Help Writing Your Pipeline?
You can use Claude Code (recommended) to help you write your pipeline—see the quickstart: https://ucbepic.github.io/docetl/quickstart-claude-code/
If you’d rather use ChatGPT or the Claude app, see docetl.org/llms.txt for a big prompt you can copy/paste before describing your task.
🌟 Community Projects
📚 Educational Resources
🚀 Getting Started
There are two main ways to use DocETL:
1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)
DocWrangler helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real-time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use
DocWrangler is hosted at docetl.org/playground. To run the playground locally, you can either:
- Use Docker (recommended for a quick start):
  make docker
- Set up the development environment manually

See the Playground Setup Guide for detailed instructions.
2. 📦 Python Package (For Production Use)
If you want to use DocETL as a Python package:
Prerequisites
- Python 3.10 or later
- OpenAI API key
pip install docetl
Create a .env file in your project directory:
OPENAI_API_KEY=your_api_key_here # Required for LLM operations (or the key for the LLM of your choice)
⚠️ Important: Two Different .env Files
- Root .env: Used by the backend Python server that executes DocETL pipelines
- website/.env.local: Used by the frontend TypeScript code in DocWrangler (UI features like improve prompt and chatbot)
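If you are working from a clone of the repository, a minimal sketch of creating both files from the repository root might look like the following (the key values are placeholders; the full variable lists for each file are shown in the DocWrangler Setup section below):

```bash
# Minimal sketch (placeholder keys); run from the root of a cloned repository
echo "OPENAI_API_KEY=your_api_key_here" > .env            # backend: DocETL pipeline execution
echo "OPENAI_API_KEY=sk-xxx" > website/.env.local         # frontend: DocWrangler UI features
```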
To see examples of how to use DocETL, check out the tutorial.
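For a rough sense of what a pipeline looks like, the sketch below writes a one-operation map pipeline and runs it with the docetl CLI. The YAML schema, file names, and prompt here are illustrative only, not copied from the tutorial, which remains the authoritative reference:

```bash
# Illustrative sketch only: field names and files are placeholders; see the
# tutorial for the authoritative pipeline schema.
cat > pipeline.yaml <<'EOF'
default_model: gpt-4o-mini
datasets:
  docs:
    type: file
    path: documents.json        # a JSON list of objects, e.g. [{"text": "..."}]
operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following document in one sentence: {{ input.text }}
    output:
      schema:
        summary: string
pipeline:
  steps:
    - name: summarize_docs
      input: docs
      operations:
        - summarize
  output:
    type: file
    path: summaries.json
EOF
docetl run pipeline.yaml
```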
2. 🎮 DocWrangler Setup
To run DocWrangler locally, you have two options:
Option A: Using Docker (Recommended for Quick Start)
The easiest way to get the DocWrangler playground running:
- Create the required environment files:
Create .env in the root directory (for the backend Python server that executes pipelines):
OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081
# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
Create .env.local in the website directory (for DocWrangler UI features like improve prompt and chatbot):
OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini # Model used by the UI assistant
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
- Run Docker:
make docker
This will:
- Create a Docker volume for persistent data
- Build the DocETL image
- Run the container with the UI accessible at http://localhost:3000
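Once the container is up, a quick way to confirm the UI is being served (assuming the default port above; adjust if you changed FRONTEND_PORT or the compose port mappings):

```bash
# Expect an HTTP 200 once the frontend has finished starting
curl -I http://localhost:3000
```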
To clean up Docker resources (note that this will delete the Docker volume):
make docker-clean
AWS Bedrock
This framework supports integration with AWS Bedrock. To enable:
- Configure AWS credentials:
aws configure
- Test your AWS credentials:
make test-aws
- Run with AWS support:
AWS_PROFILE=your-profile AWS_REGION=your-region make docker
Or using Docker Compose:
AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up
Environment variables:
- AWS_PROFILE: Your AWS CLI profile (default: 'default')
- AWS_REGION: AWS region (default: 'us-west-2')
Bedrock models are prefixed with bedrock/. See the LiteLLM docs for more details.
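For example, a quick sanity check that your profile can see Bedrock models, and the shape of a prefixed model name (the model ID below is only an example; availability varies by account and region):

```bash
# List the Bedrock foundation model IDs visible to your profile/region
aws bedrock list-foundation-models --profile your-profile --region us-west-2 \
  --query 'modelSummaries[].modelId' --output table

# A model is then referenced with the bedrock/ prefix, e.g.:
#   bedrock/anthropic.claude-3-haiku-20240307-v1:0
```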
Option B: Manual Setup (Development)
For development or if you prefer not to use Docker:
- Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
- Set up environment variables in .env in the root (top-level) directory (for the backend Python server):
OPENAI_API_KEY=your_api_key_here # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081
# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
And create a .env.local file in the website directory (for DocWrangler UI features):
OPENAI_API_KEY=sk-xxx # Used by TypeScript features: improve prompt, chatbot, etc.
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini # Model used by the UI assistant
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
- Install dependencies:
make install # Install Python deps with uv and set up pre-commit
make install-ui # Install UI dependencies
If you prefer using uv directly instead of Make:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras
- Start the development server:
make run-ui-dev
- Visit http://localhost:3000/playground to access the interactive UI.
🛠️ Development Setup
If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
For detailed documentation and tutorials, visit our documentation.

