Create a Bluesky Connector
A Bluesky connector enables you to ingest posts from Bluesky into the Zeta Alpha platform. This guide shows you how to create and configure a Bluesky connector for your data ingestion workflows.
Info: This guide presents an example configuration for a Bluesky connector. For a complete set of configuration options, see the Bluesky Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- Bluesky account credentials (refer to the PDF tutorial "Connecting Bluesky to Zeta Alpha.pdf" for detailed instructions)
Step 1: Create the Bluesky Basic Configuration
To create a Bluesky connector, define a configuration file with the following basic fields:
- is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
- search_query: (string) Search query string. Lucene query syntax is recommended for advanced filtering.
- credentials: (object) Account credentials for accessing Bluesky:
  - username: Bluesky handle
  - password: Bluesky account password
- crawl_limit: (integer, optional) Maximum number of posts to crawl per run.
- since_crawl_date: (date, optional) Filter posts after the indicated date (inclusive). Defaults to today.
- until_crawl_date: (date, optional) Filter posts before the indicated date (exclusive).
- mentioned_url_domain: (string, optional) Filter to posts with URLs linking to the given domain (hostname).
- excluded_users: (array of strings, optional) List of user handles to exclude from crawling.
- logo_url: (string, optional) The URL of a logo to display on document cards.
Example Configuration
{
  "name": "My Bluesky Connector",
  "description": "My Bluesky connector",
  "is_indexable": true,
  "connector": "bluesky",
  "connector_configuration": {
    "is_document_owner": true,
    "search_query": "machine learning OR artificial intelligence",
    "credentials": {
      "username": "your-handle.bsky.social",
      "password": "your-password"
    },
    "crawl_limit": 1000,
    "since_crawl_date": "2024-01-01",
    "mentioned_url_domain": "arxiv.org",
    "excluded_users": [
      "spam-account.bsky.social",
      "bot-account.bsky.social"
    ],
    "logo_url": "https://example.com/logo.png"
  }
}
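Note that since_crawl_date is inclusive while until_crawl_date is exclusive. As an illustration (the dates below are placeholder values), the following window crawls posts created from 2024-01-01 through 2024-01-31:
{
  ...
  "since_crawl_date": "2024-01-01",
  "until_crawl_date": "2024-02-01",
  ...
}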
Step 2: Add Field Mapping Configuration
When crawling Bluesky, the connector extracts document metadata and content as described in the Bluesky Connector Configuration Reference. You can map these Bluesky fields to your index fields using the field_mappings configuration.
Example Field Mappings
The following example shows field mappings for the default index fields:
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "user_screen_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "lang",
        "index_field_name": "DCMI.language"
      },
      {
        "content_source_field_name": "mentioned_urls",
        "index_field_name": "DCMI.relation"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "document_content",
        "index_field_name": "document_content"
      }
    ],
    ...
  }
}
Step 3: Configure Search and Filtering
The Bluesky connector provides several options to refine what content is crawled:
- Search Query: Use Lucene query syntax for advanced filtering. For example:
  - "machine learning" (exact phrase)
  - ML OR "artificial intelligence" (multiple terms)
  - AI -cryptocurrency (exclude terms)
- Date Filtering: Use since_crawl_date and until_crawl_date to limit the time range
- URL Filtering: Use mentioned_url_domain to only crawl posts that link to specific domains (e.g., "arxiv.org" for academic papers)
- User Filtering: Use excluded_users to skip posts from specific accounts
Example Advanced Configuration
{
  ...
  "connector_configuration": {
    ...
    "search_query": "(research OR paper) AND (AI OR ML) -cryptocurrency",
    "since_crawl_date": "2024-01-01",
    "until_crawl_date": "2024-12-31",
    "mentioned_url_domain": "arxiv.org",
    "crawl_limit": 5000,
    ...
  }
}
Step 4: Create the Bluesky Content Source
To create your Bluesky connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The connector crawls posts from Bluesky based on your search query and filters, extracting:
- Post content (plain text)
- Author information (handle, profile URL, profile image)
- Engagement metrics (likes, reposts, replies)
- Creation timestamp
- Detected language
- Mentioned URLs (optionally filtered by domain)
The connector processes posts in chronological order and respects the configured crawl limits to avoid overwhelming the Bluesky API.
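For illustration, assuming the field mappings from Step 2, a single crawled post might produce an index document along these lines. All values are hypothetical, and the exact document shape depends on your index configuration:
{
  "DCMI.creator": "some-researcher.bsky.social",
  "DCMI.created": "2024-03-15T09:12:00Z",
  "DCMI.language": "en",
  "DCMI.relation": ["https://arxiv.org/abs/0000.00000"],
  "uri": "at://did:plc:exampleuser/app.bsky.feed.post/examplepostid",
  "document_content_type": "text/plain",
  "document_content": "New preprint on retrieval for scientific papers ..."
}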