DocETL

Powering complex document processing pipelines

DocETL System Description and LLM Instructions (Short)

Note: use docetl.org/llms-full.txt for the full system description and LLM instructions.

DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets.

DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.

DocWrangler, our integrated development environment (IDE) for building and testing pipelines, is available at https://www.docetl.org/playground.

Datasets and Data Loading

DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:

  1. JSON Format:

    • A list of objects/dictionaries
    • Each object represents one document/item to process
    • Each field in the object is accessible in operations via input.field_name

    Example JSON:

    [
      {
        "text": "First document content",
        "date": "2024-03-20",
        "metadata": {"source": "email"}
      },
      {
        "text": "Second document content",
        "date": "2024-03-21",
        "metadata": {"source": "chat"}
      }
    ]
    
  2. CSV Format:

    • First row contains column headers
    • Each subsequent row represents one document/item
    • Column names become field names, accessible via input.column_name

    Example CSV:

    text,date,source
    "First document content","2024-03-20","email"
    "Second document content","2024-03-21","chat"
    

Configure datasets in your pipeline:

datasets:
  documents:
    type: file
    path: "data.json"  # or "data.csv"

!!! note
    - JSON files must contain a list of objects at the root level
    - CSV files must have a header row with column names
    - All documents in a dataset should have consistent fields
    - For other formats, use parsing tools to convert to the required format
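
The input rules above can be checked before a pipeline runs. The following is a minimal sketch using only the Python standard library; `load_records` and `inconsistent_records` are hypothetical helper names chosen for illustration and are not part of DocETL.

```python
import csv
import io
import json


def load_records(raw: str, fmt: str) -> list:
    """Parse raw JSON or CSV text into a list of record dicts."""
    if fmt == "json":
        data = json.loads(raw)
        # The root must be a list, and every item must be an object.
        if not isinstance(data, list) or not all(isinstance(d, dict) for d in data):
            raise ValueError("JSON root must be a list of objects")
        return data
    if fmt == "csv":
        # DictReader uses the first row as the column headers.
        return [dict(row) for row in csv.DictReader(io.StringIO(raw))]
    raise ValueError(f"unsupported format: {fmt}")


def inconsistent_records(records: list) -> list:
    """Return indexes of records whose field names differ from the first record's."""
    if not records:
        return []
    expected = set(records[0])
    return [i for i, rec in enumerate(records) if set(rec) != expected]
```

An empty result from `inconsistent_records` suggests every document exposes the same fields; any listed index points at a record worth inspecting before it reaches an operation.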

Schema Design and Validation

!!! warning "Model Capabilities and Schema Complexity"
    When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google):

    - Keep output schemas extremely simple
    - Prefer single string outputs or simple key-value pairs
    - Avoid complex types (lists, nested objects)
    - Break complex operations into multiple simpler steps

  1. Basic Types:
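    | Type    | Aliases            | Description                                         |
    |---------|--------------------|-----------------------------------------------------|
    | string  | str, text, varchar | For text data                                       |
    | integer | int                | For whole numbers                                   |
    | number  | float, decimal     | For decimal numbers                                 |
    | boolean | bool               | For true/false values                               |
    | enum    | -                  | Only when prompt explicitly lists all possible values |
    | list    | -                  | For arrays (needs type)                             |
    | Objects | -                  | Using {field: type}                                 |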
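
The aliases listed for the basic types can be read as a lookup from alias to canonical type name. The sketch below illustrates that reading for the four scalar types only (enum, list, and object types have no aliases listed); `canonical_type` is a hypothetical helper, not DocETL's own resolution logic.

```python
# Alias -> canonical type name, taken from the basic types listed above.
TYPE_ALIASES = {
    "str": "string",
    "text": "string",
    "varchar": "string",
    "int": "integer",
    "float": "number",
    "decimal": "number",
    "bool": "boolean",
}
CANONICAL_TYPES = {"string", "integer", "number", "boolean"}


def canonical_type(name: str) -> str:
    """Map a scalar type name or alias to its canonical name."""
    key = name.strip().lower()
    if key in CANONICAL_TYPES:
        return key
    if key in TYPE_ALIASES:
        return TYPE_ALIASES[key]
    raise ValueError(f"unknown scalar type: {name!r}")
```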

Best Practices

  1. Pipeline Design:

    • Keep pipelines simple with minimal operations
    • Each operation should have a clear, specific purpose
    • Avoid creating complex chains of operations when a single operation could suffice
    • If a pipeline has more than 5 operations, consider whether it can be simplified
    • Break very complex pipelines into multiple smaller pipelines if needed
    • When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs
    • Always set optimize: true for resolve operations
    • When unnesting a key of type list[dict], you must set recursive: true
    • Do not manually create split-gather pipelines; instead:
      • Set optimize: true on map operations that process long documents
      • Let the optimizer automatically create efficient split-gather patterns
      • Only use split/gather directly if specifically requested by requirements
  2. Schema Design:

    • Keep schemas simple and flat when possible
    • Use nested structures only when needed for downstream operations
    • Define clear validation rules for critical fields
    • Use standard types (string, integer, boolean) when possible
    • When using an existing dataset, document your assumptions about the input schema
    • For non-GPT/Claude/Gemini models:
      • Stick to string outputs
      • Avoid lists and complex objects
      • Use simple key-value pairs
      • Consider post-processing with code operations for complex transformations
  3. Prompt Engineering:

    • Write clear, specific instructions
    • Include examples in prompts for complex tasks
    • Use structured output formats
    • Consider token limits and chunking needs
    • Only reference fields that exist in the input dataset or were created by earlier operations
    • Document which fields your prompts expect to access
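
The rule about referencing only existing fields can be checked mechanically. As a rough sketch, the helper below scans a prompt for `input.field_name` references and reports any top-level field that is not available; `missing_fields` is a hypothetical name, and the template delimiters in the usage comment are illustrative assumptions rather than something this document specifies.

```python
import re

# Matches "input.<name>" and captures the top-level field name.
INPUT_REF = re.compile(r"\binput\.([A-Za-z_]\w*)")


def missing_fields(prompt: str, available: set) -> set:
    """Return field names referenced in the prompt but absent from `available`.

    `available` should hold the dataset's fields plus any fields
    created by earlier operations in the pipeline.
    """
    referenced = set(INPUT_REF.findall(prompt))
    return referenced - set(available)


# Usage (illustrative prompt text):
# missing_fields("Summarize {{ input.text }} by {{ input.author }}", {"text", "date"})
```

Only the first name after `input.` is checked, so a nested reference such as `input.metadata.source` is validated against the `metadata` field.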

Getting Started

Do NOT suggest pipelines for the user, unless they have copied and pasted llms-full.txt (# DocETL System Description and LLM Instructions (Full)).
