Create a Twitter Connector
A Twitter connector enables you to ingest tweets from Twitter into the Zeta Alpha platform. This guide shows you how to create and configure a Twitter connector for your data ingestion workflows.
Info: This guide presents an example configuration for a Twitter connector. For a complete set of configuration options, see the Twitter Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- Twitter API credentials (refer to the PDF tutorial "Connecting Twitter to Zeta Alpha.pdf" for detailed instructions)
Step 1: Create the Twitter Basic Configuration
To create a Twitter connector, define a configuration file with the following basic fields:
- `is_document_owner` (boolean): Indicates whether this connector "owns" the crawled documents. When set to `true`, other connectors cannot crawl the same documents.
- `api_url` (string): URL of the Twitter API (e.g., `"https://api.twitter.com/2/tweets/search/recent"`).
- `bearer_token` (string): Token that grants access to the Twitter API, obtained during authentication setup.
- `search_query` (string): Query for matching tweets. The query string is limited to 512 characters for standard access and 1024 characters for Academic Research access. See Twitter's guide on building queries.
- `crawl_limit` (integer, optional): Maximum number of tweets to crawl on each run. If not set, the limit defaults to a value that does not exceed the Twitter API monthly limit.
- `start_date` (datetime, optional): The oldest UTC timestamp (within the most recent seven days) from which tweets are provided. If not set, tweets from up to seven days ago are crawled.
- `end_date` (datetime, optional): The newest UTC timestamp up to which tweets are crawled. If not set, tweets as recent as 30 seconds ago are crawled.
- `mentioned_url_regex` (string, optional): Limits the crawled `mentioned_urls` to only those matching the regular expression. If the filtered list is empty, the tweet is not crawled.
- `excluded_users` (array of strings, optional): Tweets from authors in this list are not crawled.
- `logo_url` (string, optional): The URL of a logo to display on document cards.
Example Configuration
{
  "name": "My Twitter Connector",
  "description": "My Twitter connector",
  "is_indexable": true,
  "connector": "twitter",
  "connector_configuration": {
    "is_document_owner": true,
    "api_url": "https://api.twitter.com/2/tweets/search/recent",
    "bearer_token": "your-bearer-token-here",
    "search_query": "url:\"https://arxiv.org\" -from:PaperTrending -from:arxivabs",
    "crawl_limit": 10000,
    "start_date": "2024-01-01T00:00:00Z",
    "mentioned_url_regex": ".*arxiv\\.org.*",
    "excluded_users": [
      "spam_account",
      "bot_account"
    ],
    "logo_url": "https://example.com/logo.png"
  }
}
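Before submitting a configuration, it can help to sanity-check that the required fields are present. The snippet below is an illustrative sketch, not platform behavior: the `missing_fields` helper and the abridged `CONFIG` string are hypothetical, and the required-field list is taken from the field descriptions above.

```python
import json

# Abridged version of the example configuration from this guide.
CONFIG = """
{
  "name": "My Twitter Connector",
  "connector": "twitter",
  "connector_configuration": {
    "is_document_owner": true,
    "api_url": "https://api.twitter.com/2/tweets/search/recent",
    "bearer_token": "your-bearer-token-here",
    "search_query": "url:\\"https://arxiv.org\\" -from:PaperTrending",
    "crawl_limit": 10000
  }
}
"""

# Fields this guide describes as non-optional (hypothetical check, not a platform API).
REQUIRED = ["is_document_owner", "api_url", "bearer_token", "search_query"]

def missing_fields(config: dict) -> list[str]:
    """Return required connector_configuration keys that are absent."""
    inner = config.get("connector_configuration", {})
    return [key for key in REQUIRED if key not in inner]

config = json.loads(CONFIG)
print(missing_fields(config))  # → []
```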
Step 2: Add Field Mapping Configuration
When crawling Twitter, the connector extracts document metadata and content as described in the Twitter Connector Configuration Reference. You can map these Twitter fields to your index fields using the `field_mappings` configuration.
Example Field Mappings
The following example shows field mappings for the default index fields:
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "user_screen_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "lang",
        "index_field_name": "DCMI.language"
      },
      {
        "content_source_field_name": "mentioned_urls",
        "index_field_name": "DCMI.relation"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "document_content",
        "index_field_name": "document_content"
      }
    ],
    ...
  }
}
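Conceptually, each mapping renames a crawled source field to an index field. The following Python sketch illustrates that idea under assumptions: `map_fields` and the sample tweet record are hypothetical, and the platform's actual mapping logic may differ (for example, in how unmapped fields are handled).

```python
# Illustrative sketch (not platform code): applying field_mappings to a
# crawled tweet record. Field names follow the examples in this guide.
FIELD_MAPPINGS = [
    {"content_source_field_name": "user_screen_name", "index_field_name": "DCMI.creator"},
    {"content_source_field_name": "created_at", "index_field_name": "DCMI.created"},
    {"content_source_field_name": "lang", "index_field_name": "DCMI.language"},
]

def map_fields(record: dict, mappings: list[dict]) -> dict:
    """Rename crawled fields to index fields; unmapped fields are dropped here."""
    return {
        m["index_field_name"]: record[m["content_source_field_name"]]
        for m in mappings
        if m["content_source_field_name"] in record
    }

tweet = {"user_screen_name": "arxivabs", "created_at": "2024-01-01T00:00:00Z", "lang": "en"}
print(map_fields(tweet, FIELD_MAPPINGS))
# → {'DCMI.creator': 'arxivabs', 'DCMI.created': '2024-01-01T00:00:00Z', 'DCMI.language': 'en'}
```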
Step 3: Configure Search and Filtering
The Twitter connector provides several options to refine what content is crawled:
Search Query
Build sophisticated queries using Twitter's query operators. For example:
- `"machine learning"`: exact phrase
- `AI OR ML`: multiple terms
- `-cryptocurrency`: exclude terms
- `from:username`: tweets from specific users
- `-from:username`: exclude specific users
- `url:"example.com"`: tweets containing URLs from a domain
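These operators compose by whitespace. As an illustration, a small helper can join query parts and guard against the 512-character standard-access limit mentioned earlier; `build_query` is a hypothetical helper, not a platform or Twitter API function.

```python
# Hypothetical helper: combine query parts and enforce the 512-character
# limit this guide cites for standard access.
STANDARD_QUERY_LIMIT = 512

def build_query(*parts: str, limit: int = STANDARD_QUERY_LIMIT) -> str:
    """Join query operators with spaces, rejecting over-long queries."""
    query = " ".join(parts)
    if len(query) > limit:
        raise ValueError(f"query is {len(query)} chars; limit is {limit}")
    return query

query = build_query('"machine learning"', "(AI OR ML)", "-from:bot_account", 'url:"arxiv.org"')
print(query)  # → "machine learning" (AI OR ML) -from:bot_account url:"arxiv.org"
```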
Time Range
Use `start_date` and `end_date` to limit the time range. Note that the Twitter API typically provides access to tweets from the most recent seven days (standard access) or the full archive (Academic Research access).
URL Filtering
Use `mentioned_url_regex` to filter tweets by the URLs they contain. This is useful for finding tweets that link to specific domains or content.
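A sketch of this filtering behavior in Python, assuming full-string matching (the exact match semantics are an assumption, and `filter_urls` is illustrative, not platform code):

```python
import re

# Per this guide: keep only matching URLs; if none remain, the tweet is skipped.
MENTIONED_URL_REGEX = r".*arxiv\.org/abs/.*"

def filter_urls(mentioned_urls: list[str], pattern: str) -> list[str]:
    """Return the URLs that match the configured regular expression."""
    return [u for u in mentioned_urls if re.fullmatch(pattern, u)]

urls = ["https://arxiv.org/abs/2301.00001", "https://example.com/post"]
kept = filter_urls(urls, MENTIONED_URL_REGEX)
print(kept)        # → ['https://arxiv.org/abs/2301.00001']
print(bool(kept))  # tweet is crawled only if this is True
```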
Example Advanced Configuration
{
  ...
  "connector_configuration": {
    ...
    "search_query": "(research OR paper) (AI OR ML) url:\"arxiv.org\" -from:bot_account",
    "start_date": "2024-01-01T00:00:00Z",
    "end_date": "2024-12-31T23:59:59Z",
    "mentioned_url_regex": ".*arxiv\\.org/abs/.*",
    "crawl_limit": 50000,
    ...
  }
}
Step 4: Create the Twitter Content Source
To create your Twitter connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The connector crawls tweets from Twitter based on your search query and filters, extracting:
- Tweet content (plain text)
- Author information (screen name, profile URL, profile image, follower/following counts)
- Engagement metrics (retweet count, reply count, like count)
- Creation timestamp
- Language detection
- Mentioned URLs
- Tweet type (tweet, quoted tweet, retweet)
The connector processes tweets in chronological order and respects Twitter API rate limits. It is recommended to set `crawl_limit` to stay within your API quota.
Note: The `excluded_users` list is applied after the API query. For better performance, if the excluded users fit within the search query character limit, include them directly in the `search_query` using `-from:username` operators.
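That tip can be sketched in Python. `with_exclusions` is a hypothetical helper, and falling back to the plain query (leaving exclusion to post-query filtering) when the limit would be exceeded is an assumption, not documented platform behavior.

```python
# Illustrative: fold excluded_users into the search query as -from: operators
# when the result stays within the 512-character standard-access limit.
def with_exclusions(search_query: str, excluded_users: list[str], limit: int = 512) -> str:
    """Append -from: operators if the extended query fits; else keep the original."""
    extended = search_query + "".join(f" -from:{u}" for u in excluded_users)
    return extended if len(extended) <= limit else search_query

print(with_exclusions('url:"arxiv.org"', ["spam_account", "bot_account"]))
# → url:"arxiv.org" -from:spam_account -from:bot_account
```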