Create a GitHub Connector

A GitHub connector enables you to ingest repository files from your GitHub organization into the Zeta Alpha platform. This guide shows you how to create and configure a GitHub connector for your data ingestion workflows.

Info: This guide presents an example configuration for a GitHub connector. For a complete set of configuration options, see the GitHub Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- GitHub credentials (refer to the PDF tutorial "Connecting GitHub to Zeta Alpha.pdf" for detailed instructions)

Step 1: Create the GitHub Basic Configuration

To create a GitHub connector, define a configuration file with the following basic fields:

- is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
- org_name: (string) The name of the GitHub organization that owns the repositories to be crawled.
- personal_access_token: (string) Token that grants access to the GitHub content. This is obtained during authentication setup.
- allowed_repositories: (array of strings) List of repository names to be crawled.
- logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
    "name": "My GitHub Connector",
    "description": "My GitHub connector",
    "is_indexable": true,
    "connector": "github",
    "connector_configuration": {
        "is_document_owner": true,
        "org_name": "my-organization",
        "personal_access_token": "ghp_your_token_here",
        "allowed_repositories": [
            "repo-1",
            "repo-2",
            "repo-3"
        ],
        "logo_url": "https://example.com/logo.png"
    }
}
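If you generate this configuration from a script rather than writing it by hand, the sketch below shows one way to assemble it and to check that no basic field is empty before submitting. It is only an illustration: the helper names `build_github_config` and `missing_fields` and the `GITHUB_TOKEN` environment variable are hypothetical, and treating every field not marked optional as required is an assumption based on the list above.

```python
import json
import os

# Assumption: every basic field not marked "optional" above is required.
REQUIRED_FIELDS = (
    "is_document_owner",
    "org_name",
    "personal_access_token",
    "allowed_repositories",
)


def build_github_config(org_name, repositories, token, logo_url=None):
    """Assemble the basic configuration from the example as a Python dict."""
    connector_configuration = {
        "is_document_owner": True,
        "org_name": org_name,
        "personal_access_token": token,
        "allowed_repositories": list(repositories),
    }
    # logo_url is optional, so it is only included when provided.
    if logo_url is not None:
        connector_configuration["logo_url"] = logo_url
    return {
        "name": "My GitHub Connector",
        "description": "My GitHub connector",
        "is_indexable": True,
        "connector": "github",
        "connector_configuration": connector_configuration,
    }


def missing_fields(config):
    """Return the required fields that are absent or empty."""
    inner = config.get("connector_configuration", {})
    return [
        field
        for field in REQUIRED_FIELDS
        if field not in inner or inner[field] in ("", None, [])
    ]


if __name__ == "__main__":
    # Hypothetical variable name; reading the token from the environment
    # keeps it out of source control.
    token = os.environ.get("GITHUB_TOKEN", "ghp_your_token_here")
    config = build_github_config(
        "my-organization",
        ["repo-1", "repo-2", "repo-3"],
        token,
        logo_url="https://example.com/logo.png",
    )
    print(json.dumps(config, indent=4))
```

Running the script prints JSON in the same shape as the example configuration, which you can then extend in the following steps.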

Step 2: Add Field Mapping Configuration

When crawling GitHub, the connector extracts document metadata and content as described in the GitHub Connector Configuration Reference. You can map these GitHub fields to your index fields using the field_mappings configuration.

Example Field Mappings

The following example shows field mappings for the default index fields:

{
    ...
    "connector_configuration": {
        ...
        "field_mappings": [
            {
                "content_source_field_name": "full_path",
                "index_field_name": "DCMI.title"
            },
            {
                "content_source_field_name": "last_updated_at",
                "index_field_name": "DCMI.modified"
            },
            {
                "content_source_field_name": "committers.name",
                "index_field_name": "DCMI.creator"
            },
            {
                "content_source_field_name": "content_source_name",
                "index_field_name": "DCMI.source"
            },
            {
                "content_source_field_name": "uri",
                "index_field_name": "uri"
            },
            {
                "content_source_field_name": "document_content_type",
                "index_field_name": "document_content_type"
            },
            {
                "content_source_field_name": "base64_content",
                "index_field_name": "document_content_path.base64_content"
            }
        ],
        ...
    }
}
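Conceptually, each mapping copies the value of one GitHub field into one index field. The following sketch illustrates that idea on a flat sample record. It is not the platform's implementation: `apply_field_mappings` is a hypothetical helper, the sample values are made up, and the handling of nested fields such as committers.name and of unmapped fields is not covered here (this sketch simply omits unmapped fields).

```python
def apply_field_mappings(record, field_mappings):
    """Copy values from GitHub field names to index field names.

    Fields that have no mapping, or that are missing from the record,
    are left out of the result (an assumption of this sketch).
    """
    mapped = {}
    for mapping in field_mappings:
        source = mapping["content_source_field_name"]
        if source in record:
            mapped[mapping["index_field_name"]] = record[source]
    return mapped


field_mappings = [
    {"content_source_field_name": "full_path", "index_field_name": "DCMI.title"},
    {"content_source_field_name": "uri", "index_field_name": "uri"},
]

# Made-up sample record, for illustration only.
record = {
    "full_path": "docs/README.md",
    "uri": "https://example.com/my-organization/repo-1/docs/README.md",
    "size": 1024,
}

print(apply_field_mappings(record, field_mappings))
```

In this sample, the value of full_path ends up under DCMI.title, and size is dropped because no mapping refers to it.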

Step 3: Specify What to Crawl

You can configure the GitHub connector to crawl specific files using the content_configuration field:

repository_files: (object, optional) Configuration for crawling files within the repository.
- path_include_regex_patterns: (array of strings, optional) Files whose paths match any of the regular expressions in this list are crawled. If omitted, all files are crawled.
- path_exclude_regex_patterns: (array of strings, optional) Files whose paths match any of the regular expressions in this list are not crawled. If a file matches both an include and an exclude pattern, the exclude pattern takes precedence and the file is not crawled.

Example Configuration

{
    ...
    "connector_configuration": {
        ...
        "content_configuration": {
            "repository_files": {
                "path_include_regex_patterns": [
                    ".*\\.md$",
                    ".*\\.py$",
                    ".*\\.js$"
                ],
                "path_exclude_regex_patterns": [
                    ".*node_modules.*",
                    ".*\\.test\\..*",
                    ".*__pycache__.*"
                ]
            }
        },
        ...
    }
}
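To preview which paths a set of patterns would select, you can reproduce the include/exclude rule locally. The sketch below is a minimal approximation, assuming the connector's patterns behave like Python's `re.search` and that an empty include list is equivalent to omitting it. The connector's actual regex engine and anchoring are not specified here, and `should_crawl` is a hypothetical helper name.

```python
import re

# The same patterns as in the JSON example. JSON's "\\." becomes "\." once
# parsed, which is written here as a Python raw string.
INCLUDE = [r".*\.md$", r".*\.py$", r".*\.js$"]
EXCLUDE = [r".*node_modules.*", r".*\.test\..*", r".*__pycache__.*"]


def should_crawl(path, include_patterns=None, exclude_patterns=None):
    """Apply the documented rule: exclude wins over include."""
    # An exclude match always rejects the file.
    if any(re.search(p, path) for p in (exclude_patterns or [])):
        return False
    # No include patterns means every remaining file is crawled.
    if not include_patterns:
        return True
    return any(re.search(p, path) for p in include_patterns)


if __name__ == "__main__":
    for path in [
        "docs/guide.md",
        "node_modules/pkg/README.md",
        "src/app.test.js",
        "src/main.go",
    ]:
        print(path, should_crawl(path, INCLUDE, EXCLUDE))
```

With these patterns, only docs/guide.md is selected: the second and third paths match an exclude pattern, and src/main.go matches no include pattern.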

Step 4: Create the GitHub Content Source

To create your GitHub connector in the Zeta Alpha Platform UI:

1. Navigate to your tenant and click View next to your target index
2. Click View under Content Sources for the index
3. Click Create Content Source
4. Paste your JSON configuration
5. Click Submit
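The JSON you paste is the basic configuration from Step 1 with the field_mappings and content_configuration fragments from Steps 2 and 3 merged into connector_configuration. As a rough sketch, the merge can be scripted as follows. The helper name `assemble_config` is hypothetical, and the fragments are shortened for brevity.

```python
import json


def assemble_config(basic, field_mappings=None, content_configuration=None):
    """Merge the optional fragments into a copy of the basic configuration."""
    # A JSON round trip gives a deep copy and fails early if the
    # configuration contains values that cannot be serialized.
    config = json.loads(json.dumps(basic))
    inner = config["connector_configuration"]
    if field_mappings is not None:
        inner["field_mappings"] = field_mappings
    if content_configuration is not None:
        inner["content_configuration"] = content_configuration
    return config


basic = {
    "name": "My GitHub Connector",
    "description": "My GitHub connector",
    "is_indexable": True,
    "connector": "github",
    "connector_configuration": {
        "is_document_owner": True,
        "org_name": "my-organization",
        "personal_access_token": "ghp_your_token_here",
        "allowed_repositories": ["repo-1"],
    },
}
field_mappings = [
    {"content_source_field_name": "uri", "index_field_name": "uri"},
]
content_configuration = {
    "repository_files": {"path_include_regex_patterns": [r".*\.md$"]},
}

full_config = assemble_config(basic, field_mappings, content_configuration)
# The printed text is what you would paste into the UI.
print(json.dumps(full_config, indent=4))
```

Serializing with `json.dumps` also takes care of escaping, so a regular expression such as `.*\.md$` is written as `".*\\.md$"` in the output.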

Crawling Behavior

The connector crawls files from the specified repositories based on your configuration, extracting:

- File content (base64-encoded)
- File path and metadata
- Repository and organization information
- Commit history and authors
- Collaborator information
- File size and type

Access rights are automatically extracted from repository collaborators, including:

- Outside collaborators
- Organization members with direct collaborator access
- Organization members with access through team memberships
- Organization members with default organization permissions
- Organization owners

In Zeta Alpha, users can access only the files that their GitHub repository permissions authorize them to see.
