DocETL

docetl.org

Powering complex document processing pipelines

llms.txt

DocETL System Description and LLM Instructions (Short)

Note: use docetl.org/llms-full.txt for the full system description and LLM instructions.

DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets.

DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.

DocWrangler, our integrated development environment for building and testing pipelines, is available at https://www.docetl.org/playground.

Docs

Core Operators

Auxiliary Operators

LLM Providers

LiteLLM Supported Providers

Optional

Datasets and Data Loading

DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:

JSON Format:

- A list of objects/dictionaries
- Each object represents one document/item to process
- Each field in the object is accessible in operations via `input.field_name`

Example JSON:

```
[
  {
    "text": "First document content",
    "date": "2024-03-20",
    "metadata": {"source": "email"}
  },
  {
    "text": "Second document content",
    "date": "2024-03-21",
    "metadata": {"source": "chat"}
  }
]
```

CSV Format:

- First row contains column headers
- Each subsequent row represents one document/item
- Column names become field names, accessible via `input.column_name`

Example CSV:

```
text,date,source
"First document content","2024-03-20","email"
"Second document content","2024-03-21","chat"
```
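
To make the CSV-to-field mapping concrete, here is a minimal sketch using Python's standard `csv` module. It only illustrates how a header row turns into field names. It is not DocETL's own loader, and the variable names are made up for the example.

```python
import csv
import io
import json

# The example CSV from above, inlined so the sketch is self-contained.
CSV_TEXT = '''text,date,source
"First document content","2024-03-20","email"
"Second document content","2024-03-21","chat"
'''

# DictReader uses the header row as keys, so each data row becomes one record
# whose fields mirror the column names.
records = list(csv.DictReader(io.StringIO(CSV_TEXT)))

print(json.dumps(records, indent=2))
```

With this reader every value is a string, so numeric or date columns would need explicit conversion if later steps depend on their types.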

Configure datasets in your pipeline:

```
datasets:
  documents:
    type: file
    path: "data.json"  # or "data.csv"
```

!!! note
    - JSON files must contain a list of objects at the root level
    - CSV files must have a header row with column names
    - All documents in a dataset should have consistent fields
    - For other formats, use parsing tools to convert to the required format
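
The requirements in the note can be checked before running a pipeline. The sketch below is one possible pre-flight check. `check_dataset` is a hypothetical helper written for this example, not part of DocETL, and it treats "consistent fields" as "the same top-level keys in every item", which is an assumption.

```python
import json


def check_dataset(data):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    if not isinstance(data, list):
        return ["root must be a list of objects"]
    problems = []
    expected = None  # top-level keys of the first valid item
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        if expected is None:
            expected = set(item)
        elif set(item) != expected:
            # Report the keys that are missing or extra relative to the first item.
            diff = sorted(set(item) ^ expected)
            problems.append(f"item {i} has inconsistent fields: {diff}")
    return problems


sample = json.loads('[{"text": "a", "date": "2024-03-20"}, {"text": "b"}]')
print(check_dataset(sample))
```
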

Schema Design and Validation

!!! warning "Model Capabilities and Schema Complexity"
    When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google):

    - Keep output schemas extremely simple
    - Prefer single string outputs or simple key-value pairs
    - Avoid complex types (lists, nested objects)
    - Break complex operations into multiple simpler steps

Basic Types:

| Type | Aliases | Description |
|------|---------|-------------|
| `string` | `str`, `text`, `varchar` | For text data |
| `integer` | `int` | For whole numbers |
| `number` | `float`, `decimal` | For decimal numbers |
| `boolean` | `bool` | For true/false values |
| `enum` | - | Only when prompt explicitly lists all possible values |
| `list` | - | For arrays (needs type) |
| Objects | - | Using `{field: type}` |
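
As an illustration of how the aliases in the table collapse onto canonical names, here is a small sketch. `normalize_type` is a hypothetical helper, not DocETL's schema parser. It assumes list element types are written as `list[...]`, and it leaves `enum` and object types out of scope, so a spec such as `list[dict]` is rejected here.

```python
# Alias table copied from the "Basic Types" table above.
ALIASES = {
    "str": "string", "text": "string", "varchar": "string",
    "int": "integer",
    "float": "number", "decimal": "number",
    "bool": "boolean",
}
CANONICAL = {"string", "integer", "number", "boolean"}


def normalize_type(spec: str) -> str:
    """Map a scalar or list type spec onto its canonical name."""
    spec = spec.strip()
    lowered = spec.lower()
    if lowered.startswith("list[") and lowered.endswith("]"):
        # Normalize the element type recursively, e.g. list[int] -> list[integer].
        inner = spec[len("list["):-1]
        return f"list[{normalize_type(inner)}]"
    if lowered == "list":
        raise ValueError("list needs an element type, e.g. list[string]")
    name = ALIASES.get(lowered, lowered)
    if name not in CANONICAL:
        raise ValueError(f"unknown type: {spec!r}")
    return name


print(normalize_type("varchar"))
print(normalize_type("list[int]"))
```
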

Best Practices

Pipeline Design:
- Keep pipelines simple with minimal operations
- Each operation should have a clear, specific purpose
- Avoid creating complex chains of operations when a single operation could suffice
- If a pipeline has more than 5 operations, consider whether it can be simplified
- Break very complex pipelines into multiple smaller pipelines if needed
- When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs
- Always set `optimize: true` for resolve operations
- When unnesting a key of type `list[dict]`, you must set `recursive: true`
- Do not manually create split-gather pipelines; instead:
  - Set `optimize: true` on map operations that process long documents
  - Let the optimizer automatically create efficient split-gather patterns
  - Only use split/gather directly if specifically requested by requirements
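
Some of the pipeline guidelines above are mechanical enough to check automatically. The following is a rough sketch of such a check over a list of operation dictionaries. `lint_operations` is a hypothetical helper, and the `name` and `type` keys are assumptions about how operations are described. Only `optimize: true` and the five-operation threshold come from the guidelines themselves.

```python
MAX_OPERATIONS = 5  # threshold taken from the guideline above


def lint_operations(operations):
    """Return warnings for operations that ignore the pipeline design guidelines."""
    warnings = []
    if len(operations) > MAX_OPERATIONS:
        warnings.append(
            f"{len(operations)} operations: consider simplifying or splitting the pipeline"
        )
    for op in operations:
        name = op.get("name", "<unnamed>")
        kind = op.get("type")
        if kind == "resolve" and op.get("optimize") is not True:
            warnings.append(f"{name}: resolve operation should set optimize: true")
        if kind in ("split", "gather"):
            warnings.append(
                f"{name}: manual {kind}; prefer optimize: true on the map operation"
            )
    return warnings


ops = [
    {"name": "extract_people", "type": "map", "optimize": True},
    {"name": "dedupe_people", "type": "resolve"},
]
print(lint_operations(ops))
```
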

Schema Design:
- Keep schemas simple and flat when possible
- Use nested structures only when needed for downstream operations
- Define clear validation rules for critical fields
- Use standard types (string, integer, boolean) when possible
- When using an existing dataset, document your assumptions about the input schema
- For non-GPT/Claude/Gemini models:
  - Stick to string outputs
  - Avoid lists and complex objects
  - Use simple key-value pairs
  - Consider post-processing with code operations for complex transformations
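
As a sketch of the post-processing idea for simpler models: have the model return a plain comma-separated string, then turn it into structured fields in code. The function name `transform`, its signature, and the `topics` field are assumptions made for this illustration. Check the code operation documentation for the exact interface DocETL expects.

```python
def transform(doc: dict) -> dict:
    """Split a comma-separated string produced by the model into a clean list."""
    raw = doc.get("topics", "")  # hypothetical string field written by an earlier map step
    topics = [part.strip() for part in raw.split(",") if part.strip()]
    return {"topic_list": topics, "topic_count": len(topics)}


print(transform({"topics": "pricing, onboarding , , support"}))
# → {'topic_list': ['pricing', 'onboarding', 'support'], 'topic_count': 3}
```

This keeps the LLM-facing schema to a single string while still giving downstream operations a list to work with.
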

Prompt Engineering:
- Write clear, specific instructions
- Include examples in prompts for complex tasks
- Use structured output formats
- Consider token limits and chunking needs
- Only reference fields that exist in the input dataset or were created by earlier operations
- Document which fields your prompts expect to access
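
The rule about only referencing existing fields can be checked with a few lines of code. The sketch below scans a prompt for `input.<field>` references and reports any that are not available. The helper names are hypothetical, and the `{{ ... }}` templating shown in the sample prompt is an assumption. The pattern itself only relies on the `input.field_name` convention described earlier.

```python
import re

# Captures the top-level field name after "input.", e.g. "text" in input.text.
FIELD_PATTERN = re.compile(r"\binput\.([A-Za-z_][A-Za-z0-9_]*)")


def referenced_fields(prompt: str) -> set:
    """Return the set of top-level input fields a prompt refers to."""
    return set(FIELD_PATTERN.findall(prompt))


def missing_fields(prompt: str, available) -> list:
    """Return referenced fields that are absent from the available field names."""
    return sorted(referenced_fields(prompt) - set(available))


prompt = "Summarize {{ input.text }} written on {{ input.date }} by {{ input.author }}."
print(missing_fields(prompt, {"text", "date", "metadata"}))
# → ['author']
```

For a nested reference such as `input.metadata.source`, only the top-level field (`metadata`) is checked.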