Create an S3 Connector

An S3 connector enables you to ingest documents stored in AWS S3 buckets (or S3-compatible storage) into the Zeta Alpha platform. This guide shows you how to create and configure an S3 connector for your data ingestion workflows.

Info: This guide presents an example configuration for an S3 connector. For a complete set of configuration options, see the S3 Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. An AWS S3 bucket with documents to ingest
  5. AWS credentials or IAM role configuration (refer to the PDF tutorial "Connecting S3 to Zeta Alpha.pdf" for detailed instructions)

Step 1: Prepare Your S3 Bucket

The S3 connector requires documents to be stored with accompanying metadata files:

Document Structure

For each document file, you must provide:

  1. Content file: The actual document (supported extensions: .pdf, .txt, .html, .md, .csv, .json, .doc, .docx, .odt, .xls, .xlsx, .ods, .ppt, .pptx, .odp, .vsd, .odg)

  2. Metadata file: A JSON file named after the content file's base name (the file name without its extension), followed by .metadata.json

Example:

  • Content file: my_document.pdf
  • Metadata file: my_document.metadata.json (note: omit the .pdf extension)
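
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper name metadata_key_for is hypothetical, and the sketch assumes that only the final extension of the content file is stripped.

```python
import posixpath


def metadata_key_for(content_key: str) -> str:
    """Return the metadata object key expected for a content object key.

    Assumption: only the last extension (e.g. ".pdf") is removed before
    ".metadata.json" is appended.
    """
    base, _extension = posixpath.splitext(content_key)
    return base + ".metadata.json"


if __name__ == "__main__":
    print(metadata_key_for("my_document.pdf"))
    print(metadata_key_for("documents/2024/report.docx"))
```

A helper like this can be handy in an upload script to check that every content file has a matching metadata object before a crawl starts.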

Metadata File Format

The metadata JSON should contain fields matching your index configuration or field mappings. Here's an example for the default index fields:

{
  "DCMI.title": "Title of the document",
  "DCMI.abstract": "Summary of the document",
  "DCMI.creator": [
    {
      "full_name": "Author Name"
    }
  ],
  "DCMI.created": "2025",
  "DCMI.date": "2025-01-12",
  "DCMI.source": "source",
  "DCMI.language": "en",
  "DCMI.identifier": [
    "https://example.com/123"
  ],
  "access_rights": [
    "user_role_id:XYZ"
  ],
  "uri": "https://example.com/123"
}
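
If you generate metadata files from a script, a small builder can keep the structure consistent. The following is a minimal sketch, not part of the platform: the function build_metadata and its parameters are hypothetical, it covers only a subset of the fields shown above, and it assumes (as in the example) that the document URI doubles as the single DCMI.identifier entry.

```python
import json
from typing import Dict, List


def build_metadata(
    title: str,
    abstract: str,
    authors: List[str],
    date: str,
    language: str,
    uri: str,
    access_rights: List[str],
) -> Dict[str, object]:
    """Build a metadata dictionary shaped like the example above."""
    return {
        "DCMI.title": title,
        "DCMI.abstract": abstract,
        # Each creator is an object with a "full_name" key.
        "DCMI.creator": [{"full_name": name} for name in authors],
        "DCMI.date": date,
        "DCMI.language": language,
        # Assumption: the URI is reused as the only identifier.
        "DCMI.identifier": [uri],
        "access_rights": list(access_rights),
        "uri": uri,
    }


if __name__ == "__main__":
    metadata = build_metadata(
        title="Title of the document",
        abstract="Summary of the document",
        authors=["Author Name"],
        date="2025-01-12",
        language="en",
        uri="https://example.com/123",
        access_rights=["user_role_id:XYZ"],
    )
    # The serialized text is what would be stored as my_document.metadata.json.
    print(json.dumps(metadata, indent=2, ensure_ascii=False))
```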

Step 2: Create the S3 Basic Configuration

To create an S3 connector, define a configuration file with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • bucket_name: (string) Name of the S3 bucket where documents are stored.
  • prefix: (string, optional) Bucket prefix to limit crawling to a specific folder.
  • aws_access_key_id: (string, optional) AWS access key ID for authentication. For better security, use IAM roles (IRSA) instead of static credentials.
  • aws_secret_access_key: (string, optional) AWS secret access key for authentication. For better security, use IAM roles (IRSA) instead of static credentials.
  • aws_region: (string, optional) AWS region for S3 operations.
  • aws_endpoint_url: (string, optional) Custom endpoint URL for S3-compatible services (e.g., MinIO, LocalStack).
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration with AWS Credentials

{
  "name": "My S3 Connector",
  "description": "My S3 connector",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-documents-bucket",
    "prefix": "documents/2024/",
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    "aws_region": "us-east-1",
    "logo_url": "https://example.com/logo.png"
  }
}

Example Configuration with IAM Role

{
  "name": "My S3 Connector",
  "description": "My S3 connector using IAM role",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-documents-bucket",
    "prefix": "documents/2024/",
    "aws_region": "us-east-1",
    "logo_url": "https://example.com/logo.png"
  }
}

Example Configuration for S3-Compatible Storage

{
  "name": "My MinIO Connector",
  "description": "Connector for MinIO storage",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-minio-bucket",
    "aws_access_key_id": "YOUR_MINIO_ACCESS_KEY",
    "aws_secret_access_key": "YOUR_MINIO_SECRET_KEY",
    "aws_endpoint_url": "https://minio.example.com",
    "logo_url": "https://example.com/logo.png"
  }
}
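
The example configurations in this step differ only in which optional fields are present. If you maintain several connectors, you might generate the JSON from a script. The sketch below is one possible approach: build_s3_connector_config is a hypothetical helper, and the check that both credential fields are supplied together is an assumption made for this example, not documented platform behavior.

```python
import json
from typing import Any, Dict, Optional


def build_s3_connector_config(
    name: str,
    description: str,
    bucket_name: str,
    *,
    is_document_owner: bool = True,
    prefix: Optional[str] = None,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    aws_region: Optional[str] = None,
    aws_endpoint_url: Optional[str] = None,
    logo_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a connector configuration, leaving out unset optional fields."""
    # Assumption: static credentials only make sense as a pair.
    if (aws_access_key_id is None) != (aws_secret_access_key is None):
        raise ValueError(
            "Provide both aws_access_key_id and aws_secret_access_key, or neither."
        )

    optional_fields = {
        "prefix": prefix,
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
        "aws_region": aws_region,
        "aws_endpoint_url": aws_endpoint_url,
        "logo_url": logo_url,
    }
    connector_configuration: Dict[str, Any] = {
        "is_document_owner": is_document_owner,
        "bucket_name": bucket_name,
    }
    connector_configuration.update(
        {key: value for key, value in optional_fields.items() if value is not None}
    )
    return {
        "name": name,
        "description": description,
        "is_indexable": True,
        "connector": "s3",
        "connector_configuration": connector_configuration,
    }


if __name__ == "__main__":
    # No static credentials: corresponds to the IAM role style of configuration.
    config = build_s3_connector_config(
        "My S3 Connector",
        "My S3 connector using IAM role",
        "my-documents-bucket",
        prefix="documents/2024/",
        aws_region="us-east-1",
    )
    print(json.dumps(config, indent=2))
```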

Step 3: Add Field Mapping Configuration

You can map the fields from your metadata JSON to your index fields using the field_mappings configuration. This is useful when your metadata structure differs from your index structure.

Example Field Mappings

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "document_title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "document_summary",
        "index_field_name": "DCMI.abstract"
      },
      {
        "content_source_field_name": "authors.name",
        "index_field_name": "DCMI.creator"
      }
    ],
    ...
  }
}
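
To illustrate the intended effect of a mapping, the sketch below renames top-level metadata keys according to a field_mappings list. The platform performs the mapping itself; this code only models the idea. The function apply_field_mappings is hypothetical, it assumes unmapped fields pass through unchanged, and it does not attempt to model nested source names such as authors.name.

```python
from typing import Any, Dict, List


def apply_field_mappings(
    metadata: Dict[str, Any], field_mappings: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Rename top-level metadata keys to their mapped index field names."""
    renames = {
        mapping["content_source_field_name"]: mapping["index_field_name"]
        for mapping in field_mappings
    }
    # Assumption: keys without a mapping keep their original name.
    return {renames.get(key, key): value for key, value in metadata.items()}


if __name__ == "__main__":
    mappings = [
        {"content_source_field_name": "document_title", "index_field_name": "DCMI.title"},
        {"content_source_field_name": "document_summary", "index_field_name": "DCMI.abstract"},
    ]
    source = {
        "document_title": "Title of the document",
        "document_summary": "Summary of the document",
        "uri": "https://example.com/123",
    }
    print(apply_field_mappings(source, mappings))
```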

Step 4: Create the S3 Content Source

To create your S3 connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit

Crawling Behavior

The S3 connector crawls objects in the specified bucket (or prefix) in alphabetical order by file path. The crawling process:

  1. Groups files by base name: Files are grouped by their base name (the file path without extensions). For example, document.pdf and document.metadata.json share the base name document.

  2. Processes each group: For each group:

    • Identifies the content file (with supported extension)
    • Looks for the corresponding .metadata.json file
    • If metadata exists, combines it with the content file
    • If metadata is missing, skips the document with a warning
  3. Ingests complete documents: Only documents with both content and metadata are ingested into the platform.
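
The grouping and pairing logic described above can be approximated locally, for example to check a bucket listing before a crawl. This is a rough sketch and not the connector's actual implementation: pair_documents is a hypothetical function, the extension list is copied from Step 1, and the handling of edge cases (such as several content files sharing one base name) is an assumption.

```python
import logging
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

logger = logging.getLogger("s3_pairing_sketch")

# Supported content extensions, as listed in Step 1.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".html", ".md", ".csv", ".json", ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".ods", ".ppt", ".pptx", ".odp", ".vsd", ".odg",
}
METADATA_SUFFIX = ".metadata.json"


def pair_documents(keys: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (content_key, metadata_key) pairs for complete documents."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for key in sorted(keys):
        # Check the metadata suffix first, because ".json" is also a
        # supported content extension.
        if key.endswith(METADATA_SUFFIX):
            groups[key[: -len(METADATA_SUFFIX)]]["metadata"] = key
            continue
        base, extension = posixpath.splitext(key)
        if extension in SUPPORTED_EXTENSIONS:
            # Assumption: the first content file per base name wins.
            groups[base].setdefault("content", key)

    pairs: List[Tuple[str, str]] = []
    for base in sorted(groups):
        group = groups[base]
        if "content" not in group:
            continue  # Metadata without a content file: nothing to ingest.
        if "metadata" not in group:
            logger.warning("Skipping %s: no metadata file found", group["content"])
            continue
        pairs.append((group["content"], group["metadata"]))
    return pairs


if __name__ == "__main__":
    listing = ["b.pdf", "a.pdf", "a.metadata.json", "notes.xyz"]
    print(pair_documents(listing))
```

In this sample listing, b.pdf is skipped with a warning because it has no metadata file, and notes.xyz is ignored because its extension is not in the supported list.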