Create a Bluesky Connector

A Bluesky connector enables you to ingest posts from Bluesky into the Zeta Alpha platform. This guide shows you how to create and configure a Bluesky connector for your data ingestion workflows.

Info: This guide presents an example configuration for a Bluesky connector. For a complete set of configuration options, see the Bluesky Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. Bluesky account credentials (refer to the PDF tutorial "Connecting Bluesky to Zeta Alpha.pdf" for detailed instructions)

Step 1: Create the Bluesky Basic Configuration

To create a Bluesky connector, define a JSON configuration with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • search_query: (string) The search query used to select posts. Lucene query syntax is recommended for advanced filtering.
  • credentials: (object) Account credentials for accessing Bluesky:
    • username: Bluesky handle
    • password: Bluesky account password
  • crawl_limit: (integer, optional) Maximum number of posts to crawl per run.
  • since_crawl_date: (date, optional) Only crawl posts created on or after this date (inclusive). Defaults to the current date.
  • until_crawl_date: (date, optional) Only crawl posts created before this date (exclusive).
  • mentioned_url_domain: (string, optional) Only crawl posts containing a URL that links to the given domain (hostname).
  • excluded_users: (array of strings, optional) List of user handles to exclude from crawling.
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
  "name": "My Bluesky Connector",
  "description": "My Bluesky connector",
  "is_indexable": true,
  "connector": "bluesky",
  "connector_configuration": {
    "is_document_owner": true,
    "search_query": "machine learning OR artificial intelligence",
    "credentials": {
      "username": "your-handle.bsky.social",
      "password": "your-password"
    },
    "crawl_limit": 1000,
    "since_crawl_date": "2024-01-01",
    "mentioned_url_domain": "arxiv.org",
    "excluded_users": [
      "spam-account.bsky.social",
      "bot-account.bsky.social"
    ],
    "logo_url": "https://example.com/logo.png"
  }
}

Step 2: Add Field Mapping Configuration

When crawling Bluesky, the connector extracts document metadata and content as described in the Bluesky Connector Configuration Reference. You can map these Bluesky fields to your index fields using the field_mappings configuration.

Example Field Mappings

The following example shows field mappings for the default index fields:

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "user_screen_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "lang",
        "index_field_name": "DCMI.language"
      },
      {
        "content_source_field_name": "mentioned_urls",
        "index_field_name": "DCMI.relation"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "document_content",
        "index_field_name": "document_content"
      }
    ],
    ...
  }
}

Step 3: Configure Search and Filtering

The Bluesky connector provides several options to refine what content is crawled:

  • Search Query: Use Lucene query syntax for advanced filtering. For example:
    • "machine learning" - exact phrase
    • ML OR "artificial intelligence" - multiple terms
    • AI -cryptocurrency - exclude terms
  • Date Filtering: Use since_crawl_date and until_crawl_date to limit the time range
  • URL Filtering: Use mentioned_url_domain to only crawl posts that link to specific domains (e.g., "arxiv.org" for academic papers)
  • User Filtering: Use excluded_users to skip posts from specific accounts

Example Advanced Configuration

{
  ...
  "connector_configuration": {
    ...
    "search_query": "(research OR paper) AND (AI OR ML) -cryptocurrency",
    "since_crawl_date": "2024-01-01",
    "until_crawl_date": "2024-12-31",
    "mentioned_url_domain": "arxiv.org",
    "crawl_limit": 5000,
    ...
  }
}

Step 4: Create the Bluesky Content Source

To create your Bluesky connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration (a complete example follows these steps)
  5. Click Submit
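
The JSON you paste in step 4 combines the basic fields from Step 1, the field mappings from Step 2, and any filtering options from Step 3 in a single configuration. Putting the earlier examples together (substitute your own credentials and adjust the query, dates, and filters for your use case):

{
  "name": "My Bluesky Connector",
  "description": "My Bluesky connector",
  "is_indexable": true,
  "connector": "bluesky",
  "connector_configuration": {
    "is_document_owner": true,
    "search_query": "(research OR paper) AND (AI OR ML) -cryptocurrency",
    "credentials": {
      "username": "your-handle.bsky.social",
      "password": "your-password"
    },
    "crawl_limit": 5000,
    "since_crawl_date": "2024-01-01",
    "until_crawl_date": "2024-12-31",
    "mentioned_url_domain": "arxiv.org",
    "excluded_users": [
      "spam-account.bsky.social",
      "bot-account.bsky.social"
    ],
    "logo_url": "https://example.com/logo.png",
    "field_mappings": [
      {
        "content_source_field_name": "user_screen_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "lang",
        "index_field_name": "DCMI.language"
      },
      {
        "content_source_field_name": "mentioned_urls",
        "index_field_name": "DCMI.relation"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "document_content",
        "index_field_name": "document_content"
      }
    ]
  }
}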

Crawling Behavior

The connector crawls posts from Bluesky based on your search query and filters, extracting:

  • Post content (plain text)
  • Author information (handle, profile URL, profile image)
  • Engagement metrics (likes, reposts, replies)
  • Creation timestamp
  • Detected language
  • Mentioned URLs (optionally filtered by domain)

The connector processes posts in chronological order and respects the configured crawl limit to avoid overwhelming the Bluesky API.
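
As an illustration, a single crawled post mapped through the default field mappings from Step 2 could be indexed roughly as follows. This is a sketch only: the values shown are hypothetical, and the exact document shape depends on your index schema.

{
  "DCMI.creator": "some-author.bsky.social",
  "DCMI.created": "2024-03-15T09:42:00Z",
  "DCMI.language": "en",
  "DCMI.relation": ["https://arxiv.org/abs/2403.01234"],
  "uri": "at://did:plc:exampleuser/app.bsky.feed.post/3kexamplepost",
  "document_content_type": "text/plain",
  "document_content": "New preprint on retrieval-augmented generation: https://arxiv.org/abs/2403.01234"
}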