DocETL
Powering complex document processing pipelines
DocETL System Description and LLM Instructions (Short)
Note: use docetl.org/llms-full.txt for the full system description and LLM instructions.
DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets.
DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.
We have an integrated development environment for building and testing pipelines, at https://www.docetl.org/playground. Our IDE is called DocWrangler.
Docs
- LLM Instructions (Full)
- Website
- DocWrangler Playground
- Main Documentation
- GitHub Repository
- Agentic Optimization Research Paper
- Discord Community
Core Operators
- Map Operation
- Reduce Operation
- Resolve Operation
- Parallel Map Operation
- Filter Operation
- Equijoin Operation
Auxiliary Operators
LLM Providers
Optional
Datasets and Data Loading
DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:
-
JSON Format:
- A list of objects/dictionaries
- Each object represents one document/item to process
- Each field in the object is accessible in operations via
input.field_name
Example JSON:
[ { "text": "First document content", "date": "2024-03-20", "metadata": {"source": "email"} }, { "text": "Second document content", "date": "2024-03-21", "metadata": {"source": "chat"} } ] -
CSV Format:
- First row contains column headers
- Each subsequent row represents one document/item
- Column names become field names, accessible via
input.column_name
Example CSV:
text,date,source "First document content","2024-03-20","email" "Second document content","2024-03-21","chat"
Configure datasets in your pipeline:
datasets:
documents:
type: file
path: "data.json" # or "data.csv"
!!! note - JSON files must contain a list of objects at the root level - CSV files must have a header row with column names - All documents in a dataset should have consistent fields - For other formats, use parsing tools to convert to the required format
Schema Design and Validation
!!! warning "Model Capabilities and Schema Complexity" When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google): - Keep output schemas extremely simple - Prefer single string outputs or simple key-value pairs - Avoid complex types (lists, nested objects) - Break complex operations into multiple simpler steps
- Basic Types:
Type Aliases Description stringstr,text,varcharFor text data integerintFor whole numbers numberfloat,decimalFor decimal numbers booleanboolFor true/false values enum- Only when prompt explicitly lists all possible values list- For arrays (needs type) Objects - Using {field: type}
Best Practices
-
Pipeline Design:
- Keep pipelines simple with minimal operations
- Each operation should have a clear, specific purpose
- Avoid creating complex chains of operations when a single operation could suffice
- If a pipeline has more than 5 operations, consider if it can be simplified
- Break very complex pipelines into multiple smaller pipelines if needed
- When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs
- Always set
optimize: truefor resolve operations - When unnesting a key of type
list[dict], you must setrecursive: true - Do not manually create split-gather pipelines; instead:
- Set
optimize: trueon map operations that process long documents - Let the optimizer automatically create efficient split-gather patterns
- Only use split/gather directly if specifically requested by requirements
- Set
-
Schema Design:
- Keep schemas simple and flat when possible
- Use nested structures only when needed for downstream operations
- Define clear validation rules for critical fields
- Use standard types (string, integer, boolean) when possible
- When using an existing dataset, document your assumptions about the input schema
- For non-GPT/Claude/Gemini models:
- Stick to string outputs
- Avoid lists and complex objects
- Use simple key-value pairs
- Consider post-processing with code operations for complex transformations
-
Prompt Engineering:
- Write clear, specific instructions
- Include examples in prompts for complex tasks
- Use structured output formats
- Consider token limits and chunking needs
- Only reference fields that exist in the input dataset or were created by earlier operations
- Document which fields your prompts expect to access
Getting Started
Do NOT suggest pipelines for the user, unless they have copied and pasted llms-full.txt (# DocETL System Description and LLM Instructions (Full)).
Perplexity AI is an AI-powered search engine that provides direct answers to user queries by leveraging large language models.
Shop Dell's laptops, Monitors, Computers, Storage Solutions & Servers for your home and business. Buy online!
Create the most realistic speech with our AI audio in 1000s of voices and 32 languages. Pioneering research in Text to Speech and AI Voice Generation
Get started with the Model Context Protocol (MCP).
The Voice AI Platform: TTS Models, Voice Agents, & More.