Create an S3 Connector
An S3 connector enables you to ingest documents stored in AWS S3 buckets (or S3-compatible storage) into the Zeta Alpha platform. This guide shows you how to create and configure an S3 connector for your data ingestion workflows.
Info: This guide presents an example configuration for an S3 connector. For a complete set of configuration options, see the S3 Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- An AWS S3 bucket with documents to ingest
- AWS credentials or IAM role configuration (refer to the PDF tutorial "Connecting S3 to Zeta Alpha.pdf" for detailed instructions)
Step 1: Prepare Your S3 Bucket
The S3 connector requires documents to be stored with accompanying metadata files:
Document Structure
For each document file, you must provide:
- Content file: The actual document (supported extensions: .pdf, .txt, .html, .md, .csv, .json, .doc, .docx, .odt, .xls, .xlsx, .ods, .ppt, .pptx, .odp, .vsd, .odg)
- Metadata file: A JSON file named after the content file's base name (the file name without its extension), followed by .metadata.json
Example:
- Content file: my_document.pdf
- Metadata file: my_document.metadata.json (note: the .pdf extension is omitted)
Metadata File Format
The metadata JSON should contain fields matching your index configuration or field mappings. Here's an example for the default index fields:
{
  "DCMI.title": "Title of the document",
  "DCMI.abstract": "Summary of the document",
  "DCMI.creator": [
    {
      "full_name": "Author Name"
    }
  ],
  "DCMI.created": "2025",
  "DCMI.date": "2025-01-12",
  "DCMI.source": "source",
  "DCMI.language": "en",
  "DCMI.identifier": [
    "https://example.com/123"
  ],
  "access_rights": [
    "user_role_id:XYZ"
  ],
  "uri": "https://example.com/123"
}
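The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Zeta Alpha tooling: the helper names metadata_path_for and write_metadata are hypothetical, and the sketch assumes that only the final extension of the content file is dropped.

```python
import json
from pathlib import Path


def metadata_path_for(content_path: Path) -> Path:
    """Return the sidecar path: content file name minus its last extension, plus .metadata.json."""
    # Assumption: only the final extension (e.g. ".pdf") is removed.
    return content_path.with_name(content_path.stem + ".metadata.json")


def write_metadata(content_path: Path, metadata: dict) -> Path:
    """Write the metadata JSON next to the content file and return the sidecar path."""
    target = metadata_path_for(content_path)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target
```

For example, metadata_path_for(Path("my_document.pdf")) yields my_document.metadata.json, which you would then upload to the bucket alongside my_document.pdf.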
Step 2: Create the S3 Basic Configuration
To create an S3 connector, define a configuration file with the following basic fields:
- is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
- bucket_name: (string) Name of the S3 bucket where documents are stored.
- prefix: (string, optional) Bucket prefix to limit crawling to a specific folder.
- aws_access_key_id: (string, optional) AWS access key ID for authentication. For better security, use IAM roles (IRSA) instead of static credentials.
- aws_secret_access_key: (string, optional) AWS secret access key for authentication. For better security, use IAM roles (IRSA) instead of static credentials.
- aws_region: (string, optional) AWS region for S3 operations.
- aws_endpoint_url: (string, optional) Custom endpoint URL for S3-compatible services (e.g., MinIO, LocalStack).
- logo_url: (string, optional) The URL of a logo to display on document cards.
Example Configuration with AWS Credentials
{
  "name": "My S3 Connector",
  "description": "My S3 connector",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-documents-bucket",
    "prefix": "documents/2024/",
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    "aws_region": "us-east-1",
    "logo_url": "https://example.com/logo.png"
  }
}
Example Configuration with IAM Role (Recommended)
{
  "name": "My S3 Connector",
  "description": "My S3 connector using IAM role",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-documents-bucket",
    "prefix": "documents/2024/",
    "aws_region": "us-east-1",
    "logo_url": "https://example.com/logo.png"
  }
}
Example Configuration for S3-Compatible Storage
{
  "name": "My MinIO Connector",
  "description": "Connector for MinIO storage",
  "is_indexable": true,
  "connector": "s3",
  "connector_configuration": {
    "is_document_owner": true,
    "bucket_name": "my-minio-bucket",
    "aws_access_key_id": "YOUR_MINIO_ACCESS_KEY",
    "aws_secret_access_key": "YOUR_MINIO_SECRET_KEY",
    "aws_endpoint_url": "https://minio.example.com",
    "logo_url": "https://example.com/logo.png"
  }
}
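If you generate these configurations from a script rather than writing them by hand, a small builder keeps the optional fields consistent across the three variants. The sketch below is only an illustration: build_s3_connector_config is a hypothetical helper, and the check that both credential fields are supplied together is a local sanity check assumed here, not a documented platform rule. The field names are the ones listed in this step.

```python
import json
from typing import Optional


def build_s3_connector_config(
    name: str,
    bucket_name: str,
    description: str = "",
    is_document_owner: bool = True,
    prefix: Optional[str] = None,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    aws_region: Optional[str] = None,
    aws_endpoint_url: Optional[str] = None,
    logo_url: Optional[str] = None,
) -> dict:
    """Build a content source configuration, leaving out optional fields that are not set."""
    # Assumed sanity check: static credentials only make sense as a pair.
    if (aws_access_key_id is None) != (aws_secret_access_key is None):
        raise ValueError("Provide both aws_access_key_id and aws_secret_access_key, or neither.")

    optional_fields = {
        "prefix": prefix,
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
        "aws_region": aws_region,
        "aws_endpoint_url": aws_endpoint_url,
        "logo_url": logo_url,
    }
    connector_configuration = {
        "is_document_owner": is_document_owner,
        "bucket_name": bucket_name,
    }
    # Only include optional fields that were explicitly provided.
    connector_configuration.update(
        {key: value for key, value in optional_fields.items() if value is not None}
    )
    return {
        "name": name,
        "description": description,
        "is_indexable": True,
        "connector": "s3",
        "connector_configuration": connector_configuration,
    }


if __name__ == "__main__":
    # IAM role variant: no static credentials in the configuration.
    config = build_s3_connector_config(
        name="My S3 Connector",
        bucket_name="my-documents-bucket",
        prefix="documents/2024/",
        aws_region="us-east-1",
    )
    print(json.dumps(config, indent=2))
```

Omitting both credential arguments produces the IAM role variant; adding aws_endpoint_url produces the S3-compatible variant.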
Step 3: Add Field Mapping Configuration
You can map the fields from your metadata JSON to your index fields using the field_mappings configuration. This is useful when your metadata structure differs from your index structure.
Example Field Mappings
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "document_title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "document_summary",
        "index_field_name": "DCMI.abstract"
      },
      {
        "content_source_field_name": "authors.name",
        "index_field_name": "DCMI.creator"
      }
    ],
    ...
  }
}
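To make the intent of a mapping concrete, the following sketch renames metadata keys according to a field_mappings list. It illustrates the idea only and is not the connector's implementation: apply_field_mappings is a hypothetical helper, it handles only top-level field names, and it makes no claim about how the platform resolves dotted source names such as authors.name or treats unmapped fields.

```python
def apply_field_mappings(metadata: dict, field_mappings: list) -> dict:
    """Return a new dict keyed by index field names, using top-level source fields only."""
    mapped = {}
    for mapping in field_mappings:
        source = mapping["content_source_field_name"]
        target = mapping["index_field_name"]
        # Assumption: a source field missing from the metadata is simply skipped.
        if source in metadata:
            mapped[target] = metadata[source]
    return mapped


if __name__ == "__main__":
    metadata = {
        "document_title": "Title of the document",
        "document_summary": "Summary of the document",
    }
    field_mappings = [
        {"content_source_field_name": "document_title", "index_field_name": "DCMI.title"},
        {"content_source_field_name": "document_summary", "index_field_name": "DCMI.abstract"},
    ]
    print(apply_field_mappings(metadata, field_mappings))
```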
Step 4: Create the S3 Content Source
To create your S3 connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The S3 connector crawls objects in the specified bucket (or prefix) in alphabetical order by file path. The crawling process:
- Groups files by prefix: Files are grouped by their base name (without extensions). For example, document.pdf and document.metadata.json share the prefix document.
- Processes each group: For each file prefix group:
  - Identifies the content file (with supported extension)
  - Looks for the corresponding .metadata.json file
  - If metadata exists, combines it with the content file
  - If metadata is missing, skips the document with a warning
- Ingests complete documents: Only documents with both content and metadata are ingested into the platform.
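The crawling steps described above can be approximated locally to check which objects in a bucket listing would form complete documents. This is a rough sketch of the documented behavior, not the connector's code: pair_documents is a hypothetical helper, it works on a plain list of object keys, and it assumes that extension matching is case-insensitive and that only the last extension is stripped to find the base name.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

# Supported content extensions, as listed in Step 1.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".html", ".md", ".csv", ".json", ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".ods", ".ppt", ".pptx", ".odp", ".vsd", ".odg",
}
METADATA_SUFFIX = ".metadata.json"


def pair_documents(keys):
    """Return (content_key, metadata_key) pairs for keys that form complete documents."""
    groups = defaultdict(dict)
    for key in sorted(keys):  # alphabetical order by file path
        if key.endswith(METADATA_SUFFIX):
            # Check the metadata suffix first, since ".json" is also a content extension.
            groups[key[: -len(METADATA_SUFFIX)]]["metadata"] = key
            continue
        stem, dot, ext = key.rpartition(".")
        # Assumption: extensions are matched case-insensitively.
        if dot and f".{ext.lower()}" in SUPPORTED_EXTENSIONS:
            groups[stem].setdefault("content", key)

    complete = []
    for base in sorted(groups):
        group = groups[base]
        if "content" not in group:
            continue  # metadata file without a content file
        if "metadata" not in group:
            logger.warning("Skipping %s: no %s file found", group["content"], METADATA_SUFFIX)
            continue
        complete.append((group["content"], group["metadata"]))
    return complete
```

Running pair_documents over the keys under your prefix before creating the content source gives a quick preview of which documents are likely to be skipped for missing metadata.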