Create a Twitter Connector
A Twitter connector enables you to ingest tweets from Twitter into the Zeta Alpha platform. This guide shows you how to create and configure a Twitter connector for your data ingestion workflows.
Info: This guide presents an example configuration for a Twitter connector. For a complete set of configuration options, see the Twitter Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- Twitter API credentials (refer to the PDF tutorial "Connecting Twitter to Zeta Alpha.pdf" for detailed instructions)
Step 1: Create the Twitter Basic Configuration
To create a Twitter connector, define a configuration file with the following basic fields:
- `is_document_owner` (boolean): Indicates whether this connector "owns" the crawled documents. When set to `true`, other connectors cannot crawl the same documents.
- `api_url` (string): URL of the Twitter API (e.g., `"https://api.twitter.com/2/tweets/search/recent"`).
- `bearer_token` (string): Token that grants access to the Twitter API, obtained during authentication setup.
- `search_query` (string): Query for matching tweets. The query string is limited to 512 characters for standard access and 1024 characters for Academic Research access. See Twitter's guide on building queries.
- `crawl_limit` (integer, optional): Maximum number of tweets to crawl on each run. If not set, the limit defaults to a value that does not exceed the Twitter API monthly limit.
- `start_date` (datetime, optional): The oldest UTC timestamp (within the most recent seven days) from which tweets are provided. If not set, tweets from up to seven days ago are crawled.
- `end_date` (datetime, optional): The newest UTC timestamp up to which tweets are crawled. If not set, tweets as recent as 30 seconds ago are crawled.
- `mentioned_url_regex` (string, optional): Limits the crawled `mentioned_urls` to only those matching the regular expression. If the filtered list is empty, the tweet is not crawled.
- `excluded_users` (array of strings, optional): Tweets from authors in this list are not crawled.
- `logo_url` (string, optional): The URL of a logo to display on document cards.
Example Configuration
{
  "name": "My Twitter Connector",
  "description": "My Twitter connector",
  "is_indexable": true,
  "connector": "twitter",
  "connector_configuration": {
    "is_document_owner": true,
    "api_url": "https://api.twitter.com/2/tweets/search/recent",
    "bearer_token": "your-bearer-token-here",
    "search_query": "url:\"https://arxiv.org\" -from:PaperTrending -from:arxivabs",
    "crawl_limit": 10000,
    "start_date": "2024-01-01T00:00:00Z",
    "mentioned_url_regex": ".*arxiv\\.org.*",
    "excluded_users": [
      "spam_account",
      "bot_account"
    ],
    "logo_url": "https://example.com/logo.png"
  }
}
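Before submitting a configuration, it can help to sanity-check that the required fields are present. The snippet below is an illustrative sketch, not platform behavior: the `missing_fields` helper and the abridged `CONFIG` string are hypothetical, and the required-field list is taken from the field descriptions above.

```python
import json

# Abridged version of the example configuration from this guide.
CONFIG = """
{
  "name": "My Twitter Connector",
  "connector": "twitter",
  "connector_configuration": {
    "is_document_owner": true,
    "api_url": "https://api.twitter.com/2/tweets/search/recent",
    "bearer_token": "your-bearer-token-here",
    "search_query": "url:\\"https://arxiv.org\\" -from:PaperTrending",
    "crawl_limit": 10000
  }
}
"""

# Fields this guide describes as non-optional (hypothetical check, not a platform API).
REQUIRED = ["is_document_owner", "api_url", "bearer_token", "search_query"]

def missing_fields(config: dict) -> list[str]:
    """Return required connector_configuration keys that are absent."""
    inner = config.get("connector_configuration", {})
    return [key for key in REQUIRED if key not in inner]

config = json.loads(CONFIG)
print(missing_fields(config))  # → []
```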
Step 2: Add Field Mapping Configuration
When crawling Twitter, the connector extracts document metadata and content as described in the Twitter Connector Configuration Reference. You can map these Twitter fields to your index fields using the `field_mappings` configuration.
Example Field Mappings
The following example shows field mappings for the default index fields:
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "user_screen_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "lang",
        "index_field_name": "DCMI.language"
      },
      {
        "content_source_field_name": "mentioned_urls",
        "index_field_name": "DCMI.relation"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "document_content",
        "index_field_name": "document_content"
      }
    ],
    ...
  }
}
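Conceptually, each mapping renames a crawled source field to an index field. The following Python sketch illustrates that idea under assumptions: `map_fields` and the sample tweet record are hypothetical, and the platform's actual mapping logic may differ (for example, in how unmapped fields are handled).

```python
# Illustrative sketch (not platform code): applying field_mappings to a
# crawled tweet record. Field names follow the examples in this guide.
FIELD_MAPPINGS = [
    {"content_source_field_name": "user_screen_name", "index_field_name": "DCMI.creator"},
    {"content_source_field_name": "created_at", "index_field_name": "DCMI.created"},
    {"content_source_field_name": "lang", "index_field_name": "DCMI.language"},
]

def map_fields(record: dict, mappings: list[dict]) -> dict:
    """Rename crawled fields to index fields; unmapped fields are dropped here."""
    return {
        m["index_field_name"]: record[m["content_source_field_name"]]
        for m in mappings
        if m["content_source_field_name"] in record
    }

tweet = {"user_screen_name": "arxivabs", "created_at": "2024-01-01T00:00:00Z", "lang": "en"}
print(map_fields(tweet, FIELD_MAPPINGS))
# → {'DCMI.creator': 'arxivabs', 'DCMI.created': '2024-01-01T00:00:00Z', 'DCMI.language': 'en'}
```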
Step 3: Configure Search and Filtering
The Twitter connector provides several options to refine what content is crawled:
Search Query
Build sophisticated queries using Twitter's query operators. For example:
- `"machine learning"`: exact phrase
- `AI OR ML`: multiple terms
- `-cryptocurrency`: exclude terms
- `from:username`: tweets from specific users
- `-from:username`: exclude specific users
- `url:"example.com"`: tweets containing URLs from a domain
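These operators compose by whitespace. As an illustration, a small helper can join query parts and guard against the 512-character standard-access limit mentioned earlier; `build_query` is a hypothetical helper, not a platform or Twitter API function.

```python
# Hypothetical helper: combine query parts and enforce the 512-character
# limit this guide cites for standard access.
STANDARD_QUERY_LIMIT = 512

def build_query(*parts: str, limit: int = STANDARD_QUERY_LIMIT) -> str:
    """Join query operators with spaces, rejecting over-long queries."""
    query = " ".join(parts)
    if len(query) > limit:
        raise ValueError(f"query is {len(query)} chars; limit is {limit}")
    return query

query = build_query('"machine learning"', "(AI OR ML)", "-from:bot_account", 'url:"arxiv.org"')
print(query)  # → "machine learning" (AI OR ML) -from:bot_account url:"arxiv.org"
```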
Time Range
Use `start_date` and `end_date` to limit the time range. Note that the Twitter API typically provides access to tweets from the most recent seven days (standard access) or the full archive (Academic Research access).
URL Filtering
Use `mentioned_url_regex` to filter tweets by the URLs they contain. This is useful for finding tweets that link to specific domains or content.
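A sketch of this filtering behavior in Python, assuming full-string matching (the exact match semantics are an assumption, and `filter_urls` is illustrative, not platform code):

```python
import re

# Per this guide: keep only matching URLs; if none remain, the tweet is skipped.
MENTIONED_URL_REGEX = r".*arxiv\.org/abs/.*"

def filter_urls(mentioned_urls: list[str], pattern: str) -> list[str]:
    """Return the URLs that match the configured regular expression."""
    return [u for u in mentioned_urls if re.fullmatch(pattern, u)]

urls = ["https://arxiv.org/abs/2301.00001", "https://example.com/post"]
kept = filter_urls(urls, MENTIONED_URL_REGEX)
print(kept)        # → ['https://arxiv.org/abs/2301.00001']
print(bool(kept))  # tweet is crawled only if this is True
```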
Example Advanced Configuration
{
  ...
  "connector_configuration": {
    ...
    "search_query": "(research OR paper) (AI OR ML) url:\"arxiv.org\" -from:bot_account",
    "start_date": "2024-01-01T00:00:00Z",
    "end_date": "2024-12-31T23:59:59Z",
    "mentioned_url_regex": ".*arxiv\\.org/abs/.*",
    "crawl_limit": 50000,
    ...
  }
}
Step 4: Create the Twitter Content Source
To create your Twitter connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The connector crawls tweets from Twitter based on your search query and filters, extracting:
- Tweet content (plain text)
- Author information (screen name, profile URL, profile image, follower/following counts)
- Engagement metrics (retweet count, reply count, like count)
- Creation timestamp
- Language detection
- Mentioned URLs
- Tweet type (tweet, quoted tweet, retweet)
The connector processes tweets in chronological order and respects Twitter API rate limits. It is recommended to set `crawl_limit` to stay within your API quota.
Note: The `excluded_users` list is applied after the API query. For better performance, if the excluded users fit within the search query character limit, include them directly in the `search_query` using `-from:username` operators.
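That tip can be sketched in Python. `with_exclusions` is a hypothetical helper, and falling back to the plain query (leaving exclusion to post-query filtering) when the limit would be exceeded is an assumption, not documented platform behavior.

```python
# Illustrative: fold excluded_users into the search query as -from: operators
# when the result stays within the 512-character standard-access limit.
def with_exclusions(search_query: str, excluded_users: list[str], limit: int = 512) -> str:
    """Append -from: operators if the extended query fits; else keep the original."""
    extended = search_query + "".join(f" -from:{u}" for u in excluded_users)
    return extended if len(extended) <= limit else search_query

print(with_exclusions('url:"arxiv.org"', ["spam_account", "bot_account"]))
# → url:"arxiv.org" -from:spam_account -from:bot_account
```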