docetl/docs/concepts
Lindsey Wei 81d110404d
Add MOAR optimizer to docetl (#464)
* feat: adding conditional gleaning (#375)

* fix: improve caching and don't raise error for bad gather configs

* feat: adding conditional gleaning

* chore: bump up fastapi and python multipart (#376)

* merge

* chore: bump up fastapi and python multipart

* pipeline for chaining

* chaining + gleaning(map)

* Clean up and reorganize pytest tests (#377)

* Replace api_wrapper with runner in test fixtures and configurations

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Refactor test fixtures and reorganize configuration in test files

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* Refactor api.py for structured output (#378)

* Refactor API wrapper with modular design for LLM calls and output handling

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Refactor APIWrapper: Simplify LLM call logic and improve modularity

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Refactor output mode handling in APIWrapper with flexible configuration

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Add comprehensive tests for DocETL output modes with synthetic data

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Refactor output modes tests with improved pytest structure and DSLRunner

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Fix runtime errors

* Add nested JSON parsing for string values in API response

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Handle nested JSON parsing by extracting matching key values

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Simplify JSON parsing logic in API utility functions

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Add to tests

* Add documentation for DocETL output modes and configuration options

Co-authored-by: ss.shankar505 <ss.shankar505@gmail.com>

* Add docs

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* Add operators to pandas API (#379)

* baseline w/ 3 rewrites

* added sample

* 3 choices

* feat: add global bypass cache (#383)

* baseline experiments

* update directives

* MCTS random expand

* MCTS v1

* MCTS V1 w/ instantiation check on chaining

* MCTS V2

* remove key

* delete key

* update evaluation code

* >1 instantiation for chaining & true f1 score

* true f1 score as acc

* hypervolume as value algo

* Add directives folder from PR #1

* tests for directives

* updates on instantiate_schemas.py

* adding experiment folder from P1

* added init for mcts

* partially merged PR1

* fix instantiate schema

* added chunking directives

* passed directive tests

* fix bug in doc_chunking instantiate

* fix doc_chunking bug

* update prompt for chunk_header_summary

* update gitignore

* fix mcts bugs when rebasing

* add chunk sampling directive

* print AUC at end

* reward update
merged

* fixed bug in doc_compression and avoid using chunk_header_summary twice

* fixed chunk sampling

* adding blackvault dataset and refactoring hard coded cuad statistics

* scaled cost, progressive widening

* feat: add head tail directive

* merge chunk sampling + doc chunking into one directive. also modify sample operator to stratify by multiple keys

* add game reviews experiment and fix doc chunking

* trim down some fat in game reviews eval logic

* syntax check for stratify key

* added split_key check and fixed bug in take_head_tail

* added **kwargs

* remove reduce in op fusion

* adding medec workload

* document_key check for doc_compression & load schema data

* add sustainability expt

* add reduce chaining

* update expand retry & op fusion

* remove hard coding of blackvault

* adding agent for instantiation (that is allowed to read docs), and a clarify instructions directive

* add swap with code directive

* optimizing cost in first 5 iterations

* added mcts log and remove plot

* add memory along tree path and action reward in log

* add biodex and have train and test splits

* added map reduce fusion, small changes to mcts

* added reference point for calculating hypervolume in evaluation

* fix bug in calculating hypervolume

* integrate with modal

* add run_all.py script

* fix bug in baseline applying ops

* update mcts chat history

* control message length to fit context window

* fix agent baseline loop

* remove print statement

* update run baseline

* adding extremely simple agent baseline that just gives us entirely new pipelines

* update readme with simple / naive agent baseline

* update

* change modal image

* edit gitignore to operators doc

* adding retrieval based rewrite directive

* debug simple baseline

* fix baseline agent and top k chunking directive

* fix chunking topk directive and add hierarchical reduce

* merge the original plan execution of three methods

* add plot and hypervolume calculation

* feat: add cascade filtering directive

* feat: add arbitrary rewrite directive

* remove unnecessary field

* fixed fall back method for baseline & mcts

* adding some validation for arbitrary rewrite

* update sample operation

* change model choice

* updates on plot

* fix gpt-5 temperature in query engine

* update model choice in base

* use rp@5 for biodex eval

* separate cost & acc optimize change model directive

* add abacus to plot

* added memo table for each node and provide it to the agent for directive selection

* add facility dataset (we don't need to use for paper) and lotus baselines

* edit prompts in change model directive

* change models

* added rules on compositions of directives. clean up mcts using utils

* add multiple instantiation for selected directives

* fix gemini errors

* adding PZ baselines (still in progress)

* add all PZ baselines to main

* fix bugs about multiple instantiations

* add utility to plot the test set

* fix linter errors

* fix PZ script for medec dataset

* add original pipeline on test set to plot

* change color of mcts

* test all models on the input query before search

* add search cost calculation

* fixed bug in json file path

* add concurrency control; start searching from pareto model plans

* set reward to be vertical distance to the step frontier

* newest mcts version

* small changes during exp

* change to run lotus, add bootstrap

* Add needs_code_filter variable for operator fusion

* add validation step; randomly select when values have ties

* eval function change

* clean up run test frontier

* Delete bootstrap_evaluation.py

* Delete BioDEX_evaluate.py

* Delete CUAD_evaluate.py

* Delete CUAD_sample50.py

* Delete bootstrapping.py

* Delete evaluate_blackvault.py

* Delete exp_graph.py

* Delete exp_graph_max.py

* Delete re_evaluate_zero_scores.py

* Delete split_json_data.py

* Delete test_evaluation.py

* Delete test_gemini_models.py

* Delete test_operations_modal.py

* Delete test_validate_frontier.py

* Delete validate_pareto_frontier.py

* Delete docetl/reasoning_optimizer/Untitled-1.py

* Delete docetl/BioDEX_test.py

* Delete docetl/mcts/execute_res_HV directory

* Delete docetl/mcts/graph_baseline.py

* Delete docetl/mcts/graph.py

* remove acc comparator

* delete graph

* delete irrelevant files

* mcts clean up

* update naming

* clean up simple agent

* update readme

* add user specified eval function

* Delete experiments/reasoning/run_tests.py

* Delete experiments/reasoning/run_test_frontier.py

* Delete experiments/reasoning/run_baseline.py

* Delete experiments/reasoning/run_all.py

* Delete experiments/reasoning/plot_result.py

* Delete experiments/reasoning/plot_matrix.py

* Delete experiments/reasoning/combine_biodex_test_results.py

* Delete experiments/reasoning/create_biodex_test_summary.py

* Delete experiments/reasoning/generate_biodex_summary.py

* Delete experiments/reasoning/TEST_FRONTIER_README.md

* Delete experiments/reasoning/utils.py

* Delete experiments/reasoning/README.md

* Delete experiments/reasoning/othersystems directory

* Delete experiments/reasoning/outputs/blackvault_baseline

* Delete experiments/reasoning/outputs/blackvault_mcts

* Delete experiments/reasoning/outputs/cuad_lotus_evaluation.json

* Delete compute_words_per_document.py

* Delete docetl/moar/acc_comparator.py

* Delete docetl/moar/acc_comparator.py.backup

* Delete docetl/moar/instantiation_check.py

* Delete docetl/reasoning_optimizer/build_optimization.py

* Delete docetl/reasoning_optimizer/generate_rewrite_plan.py

* remove auc

* simplify util

* Delete experiments/reasoning/data directory

* requirements

* eval function cleaning

* update readme

* remove facility

* update readme

* Update api to be consistent with main

* remove unused files

* clean up relative imports and model choices

* change all printing to use the rich console

* add documentation for moar and CLI

* move evaluation code from experiments dir to moar dir

* fix documentation

---------

Co-authored-by: Shreya Shankar <ss.shankar505@gmail.com>
Co-authored-by: linxiwei <lindseywei@visitor-10-57-110-173.wifi.berkeley.edu>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: linxiwei <lindseywei@visitor-10-57-109-39.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@lindseys-mbp-7.lan>
Co-authored-by: linxiwei <lindseywei@visitor-10-57-111-186.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@wifi-10-41-110-112.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@wifi-10-44-111-72.wifi.berkeley.edu>
Co-authored-by: Lindsey Wei <152750390+LindseyyyW@users.noreply.github.com>
Co-authored-by: linxiwei <lindseywei@wifi-10-44-110-253.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@wifi-10-41-110-92.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@wifi-10-41-108-154.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@wifi-10-41-109-198.wifi.berkeley.edu>
Co-authored-by: linxiwei <lindseywei@Lindseys-MacBook-Pro-7.local>
Co-authored-by: linxiwei <lindseywei@192.168.0.110>
2025-11-28 13:47:42 -06:00
operators.md
optimization.md Add MOAR optimizer to docetl (#464) 2025-11-28 13:47:42 -06:00
pipelines.md feat: add global bypass cache (#383) 2025-07-08 16:53:40 -07:00
schemas.md Refactor api.py for structured output (#378) 2025-07-04 10:58:15 -07:00