Unstructured

docs.unstructured.io

Unstructured helps you get your data ready for AI by transforming it into a format that large language models can understand. Easily connect your data to LLMs.

llms.txt

Unstructured

Docs

About Unstructured
Create destination connector: Create a new destination connector using the provided configuration.
Create destination connection check: Test a destination connection by creating a connection check for the specified destination connector.
Delete destination connector: Delete a specific destination connector by its ID.
Destination endpoints
Get destination connector: Retrieve detailed information for a specific destination connector.
Get destination connection check: Retrieve the most recent connection check for the destination connector.
List destination connectors: Retrieve a list of available destination connectors.
Update destination connector: Update the configuration of an existing destination connector.
Cancel job: Cancel the specified job.
Create job: Create a new on-demand workflow job using either a template or custom DAG.
Download job output: Download the output of a job from a workflow.
Get job: Retrieve basic information for a specific job.
Get job details: Retrieve current processing details for a specific job.
Get job failed files: Retrieve the list of failed files for a specific job, along with the reason each file failed.
Job endpoints
List jobs: Retrieve a list of jobs with optional filtering.
Create notification channel: Create a new delivery channel for receiving platform event notifications.
Delete notification channel: Delete a workspace-scoped notification channel.
Get notification: Get a single notification event by ID.
Get notification channel: Get a workspace-scoped notification channel by ID.
Get notifications unread count: Get the count of unread notification events for the current user.
List notification channels: List workspace-scoped notification channels.
List notifications: List notification events for the authenticated user, ordered by creation date with cursor-based pagination.
Mark notifications read: Mark notification events as read for the current user.
Notification endpoints: Read and manage workspace-level notifications and the delivery channels that route them. Configure email or webhook targets, verify them, and track unread activity across your workflows.
Update notification channel: Update a workspace-scoped notification channel. Only provided fields are updated.
Verify notification channel: Verify a notification channel using a verification code.
Create source connector: Create a new source connector using the provided configuration.
Create source connection check: Test a source connection by creating a connection check for the specified source connector.
Delete source connector: Delete a specific source connector identified by its ID.
Get source connector: Retrieve detailed information for a specific source connector.
Get source connection check: Retrieve the most recent connection check for the source connector.
List source connectors: Retrieve a list of available source connectors.
Source endpoints
Update source connector: Update the configuration of an existing source connector.
Get template: Retrieve detailed information and DAG for a specific template.
List templates: Retrieve a list of available templates with their metadata.
Template endpoints
Create workflow: Create a new workflow, either custom or auto, and configure its settings.
Create workflow notification channel: Create a new delivery channel for receiving platform event notifications scoped to the specified workflow.
Delete workflow: Delete a workflow by its ID.
Delete workflow notification channel: Delete a workflow-scoped notification channel.
Get workflow: Retrieve detailed information for a specific workflow by its ID.
Get workflow notification channel: Get a workflow-scoped notification channel by ID.
List workflow notification channels: List workflow-scoped notification channels.
List workflows: Retrieve a list of workflows, optionally filtered by source, destination, state, name, or date range, with support for pagination and sorting.
Run workflow: Run a workflow manually by triggering a new job if none is currently active.
Update workflow: Update an existing workflow's name, connectors, schedule, or workflow type.
Update workflow notification channel: Update a workflow-scoped notification channel. Only provided fields are updated.
Verify workflow notification channel: Verify a workflow-scoped notification channel using a verification code.
Workflow endpoints
Email Notifications: Learn how to create and verify email notification channels to receive automated emails for Unstructured workflow job events.
Unstructured API on AWS: Follow these steps to deploy the Unstructured API service into your AWS account.
Unstructured API on Azure
Overview
Partition Endpoint parameters
Endpoint errors
Chunking strategies
Document elements and metadata
Examples: This page provides some examples of accessing the Unstructured Partition Endpoint via different methods.
Extract images and tables from documents
Generate a JSON schema for a file
Get chunked elements
Get element contents
Overview
Partitioning strategies
Process an individual file by making a direct POST request
Unstructured API Quickstart
JavaScript/TypeScript SDK
Python SDK
Speed up processing of large files and batches
Extract tables as HTML
Transform a JSON file into a different schema
Overview
Embed quickstart: Create an on-demand job to partition, chunk, and embed local files into AI-ready JSON with vector embeddings for semantic search.
Enrich quickstart: Create an on-demand job to partition and enrich local files into AI-ready JSON with image descriptions, table HTML, named entities, and generative OCR output.
Extract quickstart: Create an on-demand job to partition local files and extract specified data as JSON object data records using the Unstructured API.
Unstructured API quickstart: Use curl or the Unstructured Python SDK to turn local documents into AI-ready JSON.
Transform quickstart: Create an on-demand job that uses Auto partitioning to transform local files into standard AI-ready JSON.
Supported file types
Troubleshooting Unstructured API keys and URLs
Webhooks
Astra DB
Azure AI Search
Azure Blob Storage
Couchbase
Delta Tables in Databricks: Send processed data from Unstructured to a Delta Table in Databricks.
Databricks Volumes
Delta Tables in Amazon S3
Elasticsearch
IBM FileNet: Connect Unstructured to IBM FileNet as a destination to send processed document data into your FileNet content repository.
Google Cloud Storage
IBM watsonx.data
Kafka
Local
Milvus
MongoDB
MotherDuck
Neo4j
OneDrive
OpenSearch
Overview
Pinecone
PostgreSQL
Qdrant
Redis
S3
Snowflake
Teradata
Teradata Enterprise Vector Store: Send processed data from Unstructured to a Teradata Enterprise Vector Store.
Weaviate
Errors
Jobs: Manage Unstructured workflow jobs with the Python SDK or REST API. List jobs, monitor processing status, inspect failed files, and cancel running jobs.
Migration
Available models: The following models are available for use in Unstructured workflow nodes, organized by model functionality and provider.
Chunker node: Configure a Chunker node to split partitioned document elements into chunks sized for embedding models, improving retrieval precision in RAG applications.
Chunk by Character strategy: The Chunk by Character strategy combines sequential elements to fill each chunk within hard and soft character limits, without using section or page boundaries.
Chunk by Page strategy: The Chunk by Page strategy preserves page boundaries, ensuring no chunk contains content from more than one page.
Chunk by Similarity strategy: The Chunk by Similarity strategy uses an embedding model to identify topically similar consecutive elements and combines them into chunks.
Chunk by Title strategy: The Chunk by Title strategy preserves section boundaries by starting a new chunk whenever a Title element is encountered.
Embedder node: Configure an Embedder node to convert chunked document elements into vector embeddings using a supported provider and model, enabling similarity search in vector stores.
Enrichment node: Configure an Enrichment node to add AI-powered enhancements to partitioned document elements, including image descriptions, table summaries, HTML representations, named entity recognition, and generative OCR.
Generative OCR task: The Generative OCR task uses a vision language model to improve the fidelity of text blocks that Unstructured initially processed during partitioning.
Image Description task: The Image Description task uses a vision language model to generate text-based summaries of detected images in partitioned documents.
Named Entity Recognition (NER) task: The Named Entity Recognition (NER) task identifies entities and their types — such as people, organizations, and locations — and the relationships between them.
Table Description task: The Table Description task uses a vision language model to generate text-based summaries of detected tables in partitioned documents.
Table to HTML task: The Table to HTML task converts detected tables into HTML markup using agentic AI or a vision language model, improving accuracy and downstream rendering.
Extract node: Configure an Extract node to automatically pull structured data from partitioned documents into a consistent JSON format based on a schema or guidance prompt you provide.
LLM: The LLM method uses a language model to extract structured data fields from partitioned documents based on a JSON schema or plain-language extraction guidance.
Regex: The Regex method extracts structured data fields from partitioned documents using named regular-expression patterns, without requiring a language model.
Custom workflow DAG nodes
Partitioner node: Configure a Partitioner node to extract content from raw files and output it as structured document elements. Choose from Auto, VLM, High Res, and Fast strategies to balance output quality, speed, and cost.
Auto strategy: The Auto strategy dynamically routes each page to the Fast, High Res, or VLM partitioning strategy at runtime, optimizing for the highest quality at the lowest cost.
Fast strategy: The Fast strategy uses rule-based text extraction optimized for text-only documents, delivering the fastest processing at the lowest cost.
High Res strategy: The High Res strategy uses advanced image-to-text processing for all supported file types and supports generation of bounding box coordinates for document elements.
VLM strategy: The VLM strategy uses a vision language model to deliver the highest-quality partitioning for image-heavy file types including PDFs, images, and scanned documents.
Overview
Retries
Azure Blob Storage
Box: Ingest your files into Unstructured from Box.
Confluence
Couchbase
Databricks Volumes
Dropbox
Elasticsearch
IBM FileNet: Connect Unstructured to IBM FileNet as a source to ingest documents and content from your FileNet content repository into Unstructured.
Google Cloud Storage
Google Drive
Jira
Kafka
Local
MongoDB
OneDrive
OpenSearch
Outlook
Overview
PostgreSQL
S3
Salesforce
SharePoint
Slack
Snowflake
Teradata
Zendesk
Workflows: Build and manage Unstructured workflows with the Python SDK or REST API. Create, update, run, and delete workflows that process and transform documents from source to destination.
AI providers
Connecting to customer-managed services on AWS
Connecting to AWS-managed services
Connector support for AWS PrivateLink
Configure DNS for AWS PrivateLink
Onboarding your dedicated instance + AWS PrivateLink
Enterprise Connect for AWS: Enterprise Connect lets Unstructured connectors access AWS resources by assuming a customer-configured IAM role with temporary, scoped credentials.
Onboarding in-VPC on your AWS platform
Unstructured Business on Amazon Web Services (AWS)
Connecting to customer-managed services on Azure
Connecting to Azure-managed services
Connector support for Azure Private Link
Configure DNS for Azure Private Link
Onboarding Dedicated Instance + Azure Private Link
Enterprise Connect for Azure: Enterprise Connect lets Unstructured connectors access Azure resources by authenticating as a customer-configured App Registration using short-lived access tokens.
Onboarding in-VPC on your Microsoft Azure platform
Unstructured Business on Microsoft Azure (Azure)
Concepts and architecture
Before you begin
Dedicated instance overview
Requirements and limitations
Security FAQ
Security model
Shared responsibility model
Onboarding in-VPC on your Google Cloud Platform (GCP)
Unstructured Business in-VPC on Google Cloud Platform (GCP) - overview
Configure an identity provider
IdP group mappings
IdP overview
Unstructured Business
Security and compliance overview
Chunking
Document elements and metadata
Embedding
Generative OCR optimization
Image descriptions
Named entity recognition (NER)
Overview
Table descriptions
Tables to HTML
More examples
Unstructured concepts overview: Learn how Unstructured processes documents: partitioning, chunking, enriching, embedding, and structured data extraction.
Partitioning
Additional examples
Choose an extraction method: LLM or Regex
Structured data extraction
Structured Extraction with LLM
Structured Extraction with Regex
Saving extracted data
Using the structured data extractor
Query processed PDF with HuggingChat
Table extraction from PDF: This section describes two methods for extracting tables from PDF files.
Multi-file API processing
Delta Table source connector
Vector database ingestion
Notebooks: Notebooks contain complete working sample code for end-to-end solutions.
Azure Blob Storage event triggers
Claude Desktop
CrewAI
Databricks Volumes event triggers
Firecrawl
Google Cloud Storage event triggers
Google Drive event triggers
IBM watsonx Orchestrate
Query JSON with jq
Working with Unstructured JSON output
Langflow
Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured API's workflow operations
Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured Partition Endpoint
Chatbot with Neo4j
OneDrive event triggers
PII detection
Amazon S3 event triggers
Amazon S3 Vectors
SharePoint event triggers
Streamlit in Snowflake
VectorShift
Frequently asked questions
Chunking
Embedding
Document elements and metadata
Models: Depending on your needs, Unstructured provides OCR-based and Transformer-based models to detect elements in documents. The models are useful for detecting complex document layouts and predicting element types.
Partitioning strategies
Chunking: Chunking functions in unstructured use metadata and document elements detected with partition functions to post-process elements into more useful "chunks" for use cases such as retrieval-augmented generation (RAG).
Cleaning: As part of data preparation for an NLP model, it’s common to need to clean up your data prior to passing it into the model. If there’s unwanted content in your output, for example, it could impact the quality of your NLP model. To help with this, the unstructured library includes cleaning function…
Embedding
Extracting
Core Functionality: The unstructured library includes functions to partition, chunk, clean, and stage raw source documents. These functions serve as the primary public interfaces within the library.
Partitioning: Partitioning functions in unstructured allow users to extract structured content from a raw unstructured document. These functions break a document down into elements such as Title, NarrativeText, and ListItem, enabling users to decide what content they’d like to keep for their particular ap…
Staging
Set embedding behavior
Examples: This page provides some examples of accessing Unstructured by using the Unstructured Ingest CLI and the Unstructured Ingest Python library.
Extract images and tables from documents
Process a subset of files
Get chunked elements
Get element contents
Set the OCR agent
Speed up processing of large files and batches
Extract tables as HTML
Astra DB
Azure
Azure AI Search
Box
Chroma
Couchbase
Delta Tables in Databricks
Databricks Volumes
Delta Tables in Amazon S3
Dropbox
DuckDB
Elasticsearch
Google Cloud Storage
IBM watsonx.data
Kafka
KDB.AI
LanceDB
Local
Milvus
MongoDB
MotherDuck
Neo4j
OneDrive
OpenSearch
Overview
Pinecone
PostgreSQL
Qdrant
Redis
S3
SFTP
SingleStore
Snowflake
SQLite
Teradata
Vectara
Weaviate
Process files in batches by using the Unstructured Ingest CLI
Chunking configuration
Embedding configuration
Overview
Partition configuration
Processor configuration
Ingest dependencies
Overview
Process files in batches by using the Unstructured Ingest Python library
Airtable
Astra DB
Azure
Box
Confluence
Couchbase
Databricks Volumes
Delta Table
Discord
Dropbox
Elasticsearch
GitHub
GitLab
Google Cloud Storage
Google Drive
Jira
Kafka
Local
MongoDB
Notion
OneDrive
OpenSearch
Outlook
Overview
PostgreSQL
S3
Salesforce
SFTP
SharePoint
SingleStore
Slack
Snowflake
SQLite
Teradata
Zendesk
Supported file types
Docker installation
Full installation
Overview: The Unstructured open source library offers flexible installation options.
Integrations: Integrate your model development pipeline with your favorite machine learning frameworks and libraries, and prepare your data for ingestion into downstream systems. Most of our integrations come in the form of staging functions, which take a list of `Elemen…
Overview
Quickstart
Supported file types
Account API keys and API URLs
Account billing
Organizational accounts
Accounts overview
Role-based access control (RBAC)
Account usage
Workspaces
Supported connectors
Astra DB
Azure AI Search
Azure Blob Storage
Couchbase
Delta Tables in Databricks: Send processed data from Unstructured to a Delta Table in Databricks.
Databricks Volumes
Delta Tables in Amazon S3
Elasticsearch
IBM FileNet: Connect Unstructured to IBM FileNet as a destination to send processed document data into your FileNet content repository.
Google Cloud Storage
IBM watsonx.data
Kafka
Milvus
MongoDB
MotherDuck
Neo4j
OneDrive
OpenSearch
Overview
Pinecone
Pinecone destination connector quickstart
PostgreSQL
Qdrant
Redis
S3
Snowflake
Teradata
Teradata Enterprise Vector Store: Send processed data from Unstructured to Teradata Enterprise Vector Store.
Weaviate
Jobs
Overview
Unstructured Pipelines quickstarts
Azure Blob Storage
Box: Ingest your files into Unstructured from Box.
Confluence
Couchbase
Databricks Volumes
Dropbox
Dropbox source connector quickstart
Elasticsearch
IBM FileNet: Connect Unstructured to IBM FileNet as a source to ingest documents and content from your FileNet content repository into Unstructured.
Google Cloud Storage
Google Drive
Jira
Kafka
MongoDB
OneDrive
OpenSearch
Outlook
Overview
PostgreSQL
S3
Salesforce
SharePoint
Slack
Snowflake
Teradata
Zendesk
Supported file types
Unstructured Pipelines walkthrough
Webhooks
Workflows
Access to services
Delete or export personal data stored by Unstructured
Raise an invoice issue or request a refund
Overview
Google Drive source connectors cannot be created by using the API
Authorization and permissions issues
S3 destination connector - Unable to locate credentials
Configuration and resource issues
Data format, schema, and validation issues
Document processing issues
Unable to sign up with personal email address
Get authenticated user error
Google Drive schema validation failure
Internal issues
Network, connection, and timeout issues
The Fast strategy is not available for image files
Overview
Quota, billing, and rate limiting issues
Workflow already has a job scheduled or in progress
Requesting support
Shared Responsibility Model
Check service status
Welcome to Unstructured!
