Unstructured
Unstructured helps you get your data ready for AI by transforming it into a format that large language models can understand. Easily connect your data to LLMs.
Docs
- About Unstructured
- Create destination connector: Create a new destination connector using the provided configuration.
- Create destination connection check: Test a destination connection by creating a connection check for the specified destination connector.
- Delete destination connector: Delete a specific destination connector by its ID.
- Destination endpoints
- Get destination connector: Retrieve detailed information for a specific destination connector.
- Get destination connection check: Retrieve the most recent connection check for the destination connector.
- List destination connectors: Retrieve a list of available destination connectors.
- Update destination connector: Update the configuration of an existing destination connector.
- Cancel job: Cancel the specified job.
- Create job: Create a new on-demand workflow job using either a template or custom DAG.
- Download job output: Download the output of a job from a workflow.
- Get job: Retrieve basic information for a specific job.
- Get job details: Retrieve current processing details for a specific job.
- Get job failed files: Retrieve the list of any failed files for a specific job, and why those files failed.
- Job endpoints
- List jobs: Retrieve a list of jobs with optional filtering.
- Create notification channel: Create a new delivery channel for receiving platform event notifications.
- Delete notification channel: Delete workspace-scoped notification channel.
- Get notification: Get a single notification event by ID.
- Get notification channel: Get workspace-scoped notification channel by ID.
- Get notifications unread count: Get count of unread notification events for the current user.
- List notification channels: List workspace-scoped notification channels.
- List notifications: List notification events for the authenticated user, ordered by creation date with cursor-based pagination.
- Mark notifications read: Mark notification events as read for the current user.
- Notification endpoints: Read and manage workspace-level notifications and the delivery channels that route them. Configure email or webhook targets, verify them, and track unread activity across your workflows.
- Update notification channel: Update workspace-scoped notification channel. Only provided fields are updated.
- Verify notification channel: Verify a notification channel using a verification code.
- Create source connector: Create a new source connector using the provided configuration.
- Create source connection check: Test a source connection by creating a connection check for the specified source connector.
- Delete source connector: Delete a specific source connector identified by its ID.
- Get source connector: Retrieve detailed information for a specific source connector.
- Get source connection check: Retrieve the most recent connection check for the source connector.
- List source connectors: Retrieve a list of available source connectors.
- Source endpoints
- Update source connector: Update the configuration of an existing source connector.
- Get template: Retrieve detailed information and DAG for a specific template.
- List templates: Retrieve a list of available templates with their metadata.
- Template endpoints
- Create workflow: Create a new workflow, either custom or auto, and configure its settings.
- Create workflow notification channel: Create a new delivery channel for receiving platform event notifications scoped to the specified workflow.
- Delete workflow: Delete a workflow by its ID.
- Delete workflow notification channel: Delete workflow-scoped notification channel.
- Get workflow: Retrieve detailed information for a specific workflow by its ID.
- Get workflow notification channel: Get workflow-scoped notification channel by ID.
- List workflow notification channels: List workflow-scoped notification channels.
- List workflows: Retrieve a list of workflows, optionally filtered by source, destination, state, name, or date range, with support for pagination and sorting.
- Run workflow: Run a workflow manually by triggering a new job if none is currently active.
- Update workflow: Update an existing workflow's name, connectors, schedule, or workflow type.
- Update workflow notification channel: Update workflow-scoped notification channel. Only provided fields are updated.
- Verify workflow notification channel: Verify a workflow-scoped notification channel using a verification code.
- Workflow endpoints
- Email Notifications: Learn how to create and verify email notification channels to receive automated emails for Unstructured workflow job events.
- Unstructured API on AWS: Follow these steps to deploy the Unstructured API service into your AWS account.
- Unstructured API on Azure
- Overview
- Partition Endpoint parameters
- Endpoint errors
- Chunking strategies
- Document elements and metadata
- Examples: This page provides some examples of accessing the Unstructured Partition Endpoint via different methods.
- Extract images and tables from documents
- Generate a JSON schema for a file
- Get chunked elements
- Get element contents
- Overview
- Partitioning strategies
- Process an individual file by making a direct POST request
- Unstructured API Quickstart
- JavaScript/TypeScript SDK
- Python SDK
- Speed up processing of large files and batches
- Extract tables as HTML
- Transform a JSON file into a different schema
- Overview
- Embed quickstart: Create an on-demand job to partition, chunk, and embed local files into AI-ready JSON with vector embeddings for semantic search.
- Enrich quickstart: Create an on-demand job to partition and enrich local files into AI-ready JSON with image descriptions, table HTML, named entities, and generative OCR output.
- Extract quickstart: Create an on-demand job to partition local files and extract specified data as JSON object data records using the Unstructured API.
- Unstructured API quickstart: Use curl or the Unstructured Python SDK to turn local documents into AI-ready JSON.
- Transform quickstart: Create an on-demand job that uses Auto partitioning to transform local files into standard AI-ready JSON.
- Supported file types
- Troubleshooting Unstructured API keys and URLs
- Webhooks
- Astra DB
- Azure AI Search
- Azure Blob Storage
- Couchbase
- Delta Tables in Databricks: Send processed data from Unstructured to a Delta Table in Databricks.
- Databricks Volumes
- Delta Tables in Amazon S3
- Elasticsearch
- IBM FileNet: Connect Unstructured to IBM FileNet as a destination to send processed document data into your FileNet content repository.
- Google Cloud Storage
- IBM watsonx.data
- Kafka
- Local
- Milvus
- MongoDB
- MotherDuck
- Neo4j
- OneDrive
- OpenSearch
- Overview
- Pinecone
- PostgreSQL
- Qdrant
- Redis
- S3
- Snowflake
- Teradata
- Teradata Enterprise Vector Store: Send processed data from Unstructured to a Teradata Enterprise Vector Store.
- Weaviate
- Errors
- Jobs: Manage Unstructured workflow jobs with the Python SDK or REST API. List jobs, monitor processing status, inspect failed files, and cancel running jobs.
- Migration
- Available models: The following models are available for use in Unstructured workflow nodes, organized by model functionality and provider.
- Chunker node: Configure a Chunker node to split partitioned document elements into chunks sized for embedding models, improving retrieval precision in RAG applications.
- Chunk by Character strategy: The Chunk by Character strategy combines sequential elements to fill each chunk within hard and soft character limits, without using section or page boundaries.
- Chunk by Page strategy: The Chunk by Page strategy preserves page boundaries, ensuring no chunk contains content from more than one page.
- Chunk by Similarity strategy: The Chunk by Similarity strategy uses an embedding model to identify topically similar consecutive elements and combines them into chunks.
- Chunk by Title strategy: The Chunk by Title strategy preserves section boundaries by starting a new chunk whenever a Title element is encountered.
- Embedder node: Configure an Embedder node to convert chunked document elements into vector embeddings using a supported provider and model, enabling similarity search in vector stores.
- Enrichment node: Configure an Enrichment node to add AI-powered enhancements to partitioned document elements, including image descriptions, table summaries, HTML representations, named entity recognition, and generative OCR.
- Generative OCR task: The Generative OCR task uses a vision language model to improve the fidelity of text blocks that Unstructured initially processed during partitioning.
- Image Description task: The Image Description task uses a vision language model to generate text-based summaries of detected images in partitioned documents.
- Named Entity Recognition (NER) task: The Named Entity Recognition (NER) task identifies entities and their types — such as people, organizations, and locations — and the relationships between them.
- Table Description task: The Table Description task uses a vision language model to generate text-based summaries of detected tables in partitioned documents.
- Table to HTML task: The Table to HTML task converts detected tables into HTML markup using agentic AI or a vision language model, improving accuracy and downstream rendering.
- Extract node: Configure an Extract node to automatically pull structured data from partitioned documents into a consistent JSON format based on a schema or guidance prompt you provide.
- LLM: The LLM method uses a language model to extract structured data fields from partitioned documents based on a JSON schema or plain-language extraction guidance.
- Regex: The Regex method extracts structured data fields from partitioned documents using named regular-expression patterns, without requiring a language model.
- Custom workflow DAG nodes
- Partitioner node: Configure a Partitioner node to extract content from raw files and output it as structured document elements. Choose from Auto, VLM, High Res, and Fast strategies to balance output quality, speed, and cost.
- Auto strategy: The Auto strategy dynamically routes each page to the Fast, High Res, or VLM partitioning strategy at runtime, optimizing for the highest quality at the lowest cost.
- Fast strategy: The Fast strategy uses rule-based text extraction optimized for text-only documents, delivering the fastest processing at the lowest cost.
- High Res strategy: The High Res strategy uses advanced image-to-text processing for all supported file types and supports generation of bounding box coordinates for document elements.
- VLM strategy: The VLM strategy uses a vision language model to deliver the highest-quality partitioning for image-heavy file types including PDFs, images, and scanned documents.
- Overview
- Retries
- Azure Blob Storage
- Box: Ingest your files into Unstructured from Box.
- Confluence
- Couchbase
- Databricks Volumes
- Dropbox
- Elasticsearch
- IBM FileNet: Connect Unstructured to IBM FileNet as a source to ingest documents and content from your FileNet content repository into Unstructured.
- Google Cloud Storage
- Google Drive
- Jira
- Kafka
- Local
- MongoDB
- OneDrive
- OpenSearch
- Outlook
- Overview
- PostgreSQL
- S3
- Salesforce
- SharePoint
- Slack
- Snowflake
- Teradata
- Zendesk
- Workflows: Build and manage Unstructured workflows with the Python SDK or REST API. Create, update, run, and delete workflows that process and transform documents from source to destination.
- AI providers
- Connecting to customer-managed services on AWS
- Connecting to AWS-managed services
- Connector support for AWS PrivateLink
- Configure DNS for AWS PrivateLink
- Onboarding your dedicated instance + AWS PrivateLink
- Enterprise Connect for AWS: Enterprise Connect lets Unstructured connectors access AWS resources by assuming a customer-configured IAM role with temporary, scoped credentials.
- Onboarding in-VPC on your AWS platform
- Unstructured Business on Amazon Web Services (AWS)
- Connecting to customer-managed services on Azure
- Connecting to Azure-managed services
- Connector support for Azure Private Link
- Configure DNS for Azure Private Link
- Onboarding Dedicated Instance + Azure Private Link
- Enterprise Connect for Azure: Enterprise Connect lets Unstructured connectors access Azure resources by authenticating as a customer-configured App Registration using short-lived access tokens.
- Onboarding in-VPC on your Microsoft Azure platform
- Unstructured Business on Microsoft Azure (Azure)
- Concepts and architecture
- Before you begin
- Dedicated instance overview
- Requirements and limitations
- Security FAQ
- Security model
- Shared responsibility model
- Onboarding in-VPC on your Google Cloud Platform (GCP)
- Unstructured Business in-VPC on Google Cloud Platform (GCP) - overview
- Configure an identity provider
- IdP group mappings
- IdP overview
- Unstructured Business
- Security and compliance overview
- Chunking
- Document elements and metadata
- Embedding
- Generative OCR optimization
- Image descriptions
- Named entity recognition (NER)
- Overview
- Table descriptions
- Tables to HTML
- More examples
- Unstructured concepts overview: Learn how Unstructured processes documents: partitioning, chunking, enriching, embedding, and structured data extraction.
- Partitioning
- Additional examples
- Choose an extraction method: LLM or Regex
- Structured data extraction
- Structured Extraction with LLM
- Structured Extraction with Regex
- Saving extracted data
- Using the structured data extractor
- Query processed PDF with HuggingChat
- Table extraction from PDF: This section describes two methods for extracting tables from PDF files.
- Multi-file API processing
- Delta Table source connector
- Vector database ingestion
- Notebooks: Notebooks contain complete working sample code for end-to-end solutions.
- Azure Blob Storage event triggers
- Claude Desktop
- CrewAI
- Databricks Volumes event triggers
- Firecrawl
- Google Cloud Storage event triggers
- Google Drive event triggers
- IBM watsonx Orchestrate
- Query JSON with jq
- Working with Unstructured JSON output
- Langflow
- Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured API's workflow operations
- Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured Partition Endpoint
- Chatbot with Neo4j
- OneDrive event triggers
- PII detection
- Amazon S3 event triggers
- Amazon S3 Vectors
- SharePoint event triggers
- Streamlit in Snowflake
- VectorShift
- Frequently asked questions
- Chunking
- Embedding
- Document elements and metadata
- Models: Depending on your needs, Unstructured provides OCR-based and Transformer-based models to detect elements in documents. The models are useful for detecting complex layouts in documents and predicting element types.
- Partitioning strategies
- Chunking: Chunking functions in `unstructured` use metadata and document elements detected with `partition` functions to post-process elements into more useful "chunks" for use cases such as retrieval-augmented generation (RAG).
- Cleaning: As part of data preparation for an NLP model, it’s common to need to clean up your data prior to passing it into the model. If there’s unwanted content in your output, for example, it could impact the quality of your NLP model. To help with this, the `unstructured` library includes cleaning function…
- Embedding
- Extracting
- Core Functionality: The `unstructured` library includes functions to partition, chunk, clean, and stage raw source documents. These functions serve as the primary public interfaces within the library.
- Partitioning: Partitioning functions in `unstructured` allow users to extract structured content from a raw unstructured document. These functions break a document down into elements such as `Title`, `NarrativeText`, and `ListItem`, enabling users to decide what content they’d like to keep for their particular ap…
- Staging
- Set embedding behavior
- Examples: This page provides some examples of accessing Unstructured by using the Unstructured Ingest CLI and the Unstructured Ingest Python library.
- Extract images and tables from documents
- Process a subset of files
- Get chunked elements
- Get element contents
- Set the OCR agent
- Speed up processing of large files and batches
- Extract tables as HTML
- Astra DB
- Azure
- Azure AI Search
- Box
- Chroma
- Couchbase
- Delta Tables in Databricks
- Databricks Volumes
- Delta Tables in Amazon S3
- Dropbox
- DuckDB
- Elasticsearch
- Google Cloud Storage
- IBM watsonx.data
- Kafka
- KDB.AI
- LanceDB
- Local
- Milvus
- MongoDB
- MotherDuck
- Neo4j
- OneDrive
- OpenSearch
- Overview
- Pinecone
- PostgreSQL
- Qdrant
- Redis
- S3
- SFTP
- SingleStore
- Snowflake
- SQLite
- Teradata
- Vectara
- Weaviate
- Process files in batches by using the Unstructured Ingest CLI
- Chunking configuration
- Embedding configuration
- Overview
- Partition configuration
- Processor configuration
- Ingest dependencies
- Overview
- Process files in batches by using the Unstructured Ingest Python library
- Airtable
- Astra DB
- Azure
- Box
- Confluence
- Couchbase
- Databricks Volumes
- Delta Table
- Discord
- Dropbox
- Elasticsearch
- GitHub
- GitLab
- Google Cloud Storage
- Google Drive
- Jira
- Kafka
- Local
- MongoDB
- Notion
- OneDrive
- OpenSearch
- Outlook
- Overview
- PostgreSQL
- S3
- Salesforce
- SFTP
- SharePoint
- SingleStore
- Slack
- Snowflake
- SQLite
- Teradata
- Zendesk
- Supported file types
- Docker installation
- Full installation
- Overview: The Unstructured open source library offers flexible options for installation.
- Integrations: Integrate your model development pipeline with your favorite machine learning frameworks and libraries, and prepare your data for ingestion into downstream systems. Most of our integrations come in the form of staging functions, which take a list of `Elemen…
- Overview
- Quickstart
- Supported file types
- Account API keys and API URLs
- Account billing
- Organizational accounts
- Accounts overview
- Role-based access control (RBAC)
- Account usage
- Workspaces
- Supported connectors
- Astra DB
- Azure AI Search
- Azure Blob Storage
- Couchbase
- Delta Tables in Databricks: Send processed data from Unstructured to a Delta Table in Databricks.
- Databricks Volumes
- Delta Tables in Amazon S3
- Elasticsearch
- IBM FileNet: Connect Unstructured to IBM FileNet as a destination to send processed document data into your FileNet content repository.
- Google Cloud Storage
- IBM watsonx.data
- Kafka
- Milvus
- MongoDB
- MotherDuck
- Neo4j
- OneDrive
- OpenSearch
- Overview
- Pinecone
- Pinecone destination connector quickstart
- PostgreSQL
- Qdrant
- Redis
- S3
- Snowflake
- Teradata
- Teradata Enterprise Vector Store: Send processed data from Unstructured to Teradata Enterprise Vector Store.
- Weaviate
- Jobs
- Overview
- Unstructured Pipelines quickstarts
- Azure Blob Storage
- Box: Ingest your files into Unstructured from Box.
- Confluence
- Couchbase
- Databricks Volumes
- Dropbox
- Dropbox source connector quickstart
- Elasticsearch
- IBM FileNet: Connect Unstructured to IBM FileNet as a source to ingest documents and content from your FileNet content repository into Unstructured.
- Google Cloud Storage
- Google Drive
- Jira
- Kafka
- MongoDB
- OneDrive
- OpenSearch
- Outlook
- Overview
- PostgreSQL
- S3
- Salesforce
- SharePoint
- Slack
- Snowflake
- Teradata
- Zendesk
- Supported file types
- Unstructured Pipelines walkthrough
- Webhooks
- Workflows
- Access to services
- Delete or export personal data stored by Unstructured
- Raise an invoice issue or request a refund
- Overview
- Google Drive source connectors cannot be created by using the API
- Authorization and permissions issues
- S3 destination connector - Unable to locate credentials
- Configuration and resource issues
- Data format, schema, and validation issues
- Document processing issues
- Unable to sign up with personal email address
- Get authenticated user error
- Google Drive schema validation failure
- Internal issues
- Network, connection, and timeout issues
- The Fast strategy is not available for image files
- Overview
- Quota, billing, and rate limiting issues
- Workflow already has a job scheduled or in progress
- Requesting support
- Shared Responsibility Model
- Check service status
- Welcome to Unstructured!