Create an Azure Blob Storage Connector

An Azure Blob Storage connector enables you to ingest documents stored in an Azure Blob Storage container into the Zeta Alpha platform. This guide shows you how to create and configure an Azure Blob Storage connector for your data ingestion workflows.

Info: This guide presents an example configuration for an Azure Blob Storage connector. For a complete set of configuration options, see the Azure Blob Storage Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. An Azure Storage Account with a container containing the documents to ingest
  5. The storage account name and an access key (refer to the Connecting Azure Blob Storage to Zeta Alpha guide for detailed instructions on retrieving these credentials)

Step 1: Create the Basic Configuration

To create an Azure Blob Storage connector, define a configuration object with the following fields:

  • container_name: (string, required) Name of the Azure Blob Storage container to crawl.
  • access_credentials: (object, required) Authentication credentials.
    • storage_account_name: (string, required) The name of the Azure Storage Account.
    • storage_account_key: (string, required) An access key for the storage account.
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
  "name": "My Azure Blob Connector",
  "description": "Documents from our Azure Blob Storage container",
  "is_indexable": true,
  "connector": "azure_blob",
  "connector_configuration": {
    "container_name": "my-documents-container",
    "access_credentials": {
      "storage_account_name": "mycompanydocs",
      "storage_account_key": "YOUR_STORAGE_ACCOUNT_KEY"
    },
    "logo_url": "https://example.com/logo.png"
  }
}
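Before pasting a configuration into the UI, it can help to check locally that the required fields listed above are present. The sketch below is illustrative only (it is not a Zeta Alpha API); the field names follow the example configuration, and the validation logic is a local sanity check.

```python
# Illustrative check that a connector configuration contains the required
# fields described above. Not a Zeta Alpha API -- just a local sanity
# check before pasting the JSON into the platform UI.
import json

REQUIRED_TOP_LEVEL = ["name", "connector", "connector_configuration"]
REQUIRED_CONNECTOR = ["container_name", "access_credentials"]
REQUIRED_CREDENTIALS = ["storage_account_name", "storage_account_key"]

def missing_fields(config: dict) -> list:
    """Return dotted paths for any required fields that are missing."""
    missing = [k for k in REQUIRED_TOP_LEVEL if k not in config]
    cc = config.get("connector_configuration", {})
    missing += [f"connector_configuration.{k}"
                for k in REQUIRED_CONNECTOR if k not in cc]
    creds = cc.get("access_credentials", {})
    missing += [f"connector_configuration.access_credentials.{k}"
                for k in REQUIRED_CREDENTIALS if k not in creds]
    return missing

config = json.loads("""{
  "name": "My Azure Blob Connector",
  "connector": "azure_blob",
  "connector_configuration": {
    "container_name": "my-documents-container",
    "access_credentials": {
      "storage_account_name": "mycompanydocs",
      "storage_account_key": "YOUR_STORAGE_ACCOUNT_KEY"
    }
  }
}""")
print(missing_fields(config))  # -> [] when all required fields are present
```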

Step 2: Add Path Filtering (Optional)

You can limit which blobs are crawled using regular expression patterns. Patterns are matched against the full blob path (e.g. research/2024/paper.pdf).

  • path_inclusion_regex_patterns: Only blobs whose path matches at least one of these patterns are crawled.
  • path_exclusion_regex_patterns: Blobs whose path matches any of these patterns are skipped. Exclusions take precedence over inclusions.

Example: Crawl only the research/ prefix, skipping drafts and temporary files

{
  ...
  "connector_configuration": {
    ...
    "path_inclusion_regex_patterns": ["^research/"],
    "path_exclusion_regex_patterns": ["\\.tmp$", "^research/drafts/"]
  }
}
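The filter semantics described above (exclusions take precedence, and with inclusions set a path must match at least one of them) can be sketched as follows. This assumes Python's re.search semantics; it is an illustration of the rules, not the connector's actual implementation.

```python
# Illustrative sketch of the path-filter semantics described above,
# assuming Python `re.search` matching. Not the connector's own code.
import re

def is_crawled(path, include, exclude):
    if any(re.search(p, path) for p in exclude):
        return False  # exclusions take precedence over inclusions
    if include and not any(re.search(p, path) for p in include):
        return False  # with inclusions set, the path must match at least one
    return True

include = ["^research/"]
exclude = ["\\.tmp$", "^research/drafts/"]

print(is_crawled("research/2024/paper.pdf", include, exclude))  # True
print(is_crawled("research/drafts/idea.md", include, exclude))  # False (excluded)
print(is_crawled("marketing/deck.pdf", include, exclude))       # False (not included)
```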

Step 3: Add Field Mapping Configuration (Optional)

You can map blob metadata fields to your index fields using the field_mappings configuration.

Example Field Mappings

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "document_title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "document_abstract",
        "index_field_name": "DCMI.abstract"
      }
    ]
  }
}
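Conceptually, each mapping copies a value from a content-source metadata field to an index field. The sketch below illustrates that idea in plain Python; the example metadata values are hypothetical, and this is not the platform's ingestion code.

```python
# Illustrative application of field_mappings: copy values from
# content-source metadata fields to index fields. A sketch of the
# concept only, not the platform's ingestion pipeline.
def apply_field_mappings(metadata, field_mappings):
    indexed = {}
    for mapping in field_mappings:
        source = mapping["content_source_field_name"]
        if source in metadata:  # fields absent from the metadata are skipped
            indexed[mapping["index_field_name"]] = metadata[source]
    return indexed

field_mappings = [
    {"content_source_field_name": "document_title", "index_field_name": "DCMI.title"},
    {"content_source_field_name": "document_abstract", "index_field_name": "DCMI.abstract"},
]
metadata = {"document_title": "An Example Paper"}  # hypothetical blob metadata
print(apply_field_mappings(metadata, field_mappings))
# -> {'DCMI.title': 'An Example Paper'}
```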

Step 4: Create the Azure Blob Storage Content Source

To create your Azure Blob Storage connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit

Crawling Behaviour

The connector lists all blobs in the specified container and downloads each one individually. The crawling process:

  1. Applies date filters: If since_crawl_date or until_crawl_date are set, only blobs last modified within that range are crawled.
  2. Applies path filters: Blobs whose path does not match the inclusion patterns (if set) or that match any exclusion pattern are skipped.
  3. Downloads and processes: Each blob is downloaded and its MIME type is detected. Blobs with unsupported MIME types are skipped.
  4. Applies crawl limit: If crawl_limit is set, the connector stops after yielding that many documents.

A summary log entry is written at the end of each crawl run showing the number of documents yielded and how many were skipped (and why).
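The four-stage pipeline above can be sketched in pure Python. Blobs are simulated here as (path, last_modified) tuples, and download plus MIME detection is reduced to a comment; the real connector lists and downloads blobs from Azure, so this is a model of the control flow only.

```python
# Pure-Python sketch of the crawl pipeline described above: date filter,
# path filter, then a crawl limit. Blobs are simulated as
# (path, last_modified) tuples; the real connector lists and downloads
# blobs from the Azure container instead.
import re
from datetime import datetime, timezone

def crawl(blobs, since=None, until=None, include=(), exclude=(), crawl_limit=None):
    yielded = 0
    for path, last_modified in blobs:
        if since and last_modified < since:
            continue  # modified before the crawl window
        if until and last_modified > until:
            continue  # modified after the crawl window
        if any(re.search(p, path) for p in exclude):
            continue  # exclusions take precedence
        if include and not any(re.search(p, path) for p in include):
            continue  # inclusion patterns set, but none matched
        yield path  # real connector: download blob + detect MIME type here
        yielded += 1
        if crawl_limit is not None and yielded >= crawl_limit:
            return  # stop after yielding crawl_limit documents

blobs = [
    ("research/2024/paper.pdf", datetime(2024, 3, 1, tzinfo=timezone.utc)),
    ("research/old/notes.pdf", datetime(2020, 1, 1, tzinfo=timezone.utc)),
    ("scratch/file.tmp", datetime(2024, 5, 1, tzinfo=timezone.utc)),
]
print(list(crawl(blobs,
                 since=datetime(2023, 1, 1, tzinfo=timezone.utc),
                 include=["^research/"],
                 exclude=["\\.tmp$"])))
# -> ['research/2024/paper.pdf']
```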