
Create a GitHub Connector

A GitHub connector enables you to ingest repository files from your GitHub organization into the Zeta Alpha platform. This guide shows you how to create and configure a GitHub connector for your data ingestion workflows.

Info: This guide presents an example configuration for a GitHub connector. For a complete set of configuration options, see the GitHub Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. GitHub credentials (refer to the PDF tutorial "Connecting GitHub to Zeta Alpha.pdf" for detailed instructions)

Step 1: Create the GitHub Basic Configuration

To create a GitHub connector, define a configuration file with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • org_name: (string) The name of the GitHub organization that owns the repositories to be crawled.
  • personal_access_token: (string) Token that grants access to the GitHub content. This is obtained during authentication setup.
  • allowed_repositories: (array of strings) List of repository names to be crawled.
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
  "name": "My GitHub Connector",
  "description": "My GitHub connector",
  "is_indexable": true,
  "connector": "github",
  "connector_configuration": {
    "is_document_owner": true,
    "org_name": "my-organization",
    "personal_access_token": "ghp_your_token_here",
    "allowed_repositories": [
      "repo-1",
      "repo-2",
      "repo-3"
    ],
    "logo_url": "https://example.com/logo.png"
  }
}
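Before pasting a configuration into the UI, it can help to check the basic fields locally. The following is a minimal sketch, not part of the Zeta Alpha platform: the function name check_basic_configuration is hypothetical, and it assumes that every field in the list above that is not marked optional is required.

```python
from typing import Any

# Field names and types are taken from the list in Step 1.
# Assumption: every field not marked "optional" there is required.
REQUIRED_FIELDS: dict[str, type] = {
    "is_document_owner": bool,
    "org_name": str,
    "personal_access_token": str,
    "allowed_repositories": list,
}
OPTIONAL_FIELDS: dict[str, type] = {"logo_url": str}


def check_basic_configuration(connector_configuration: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the basic fields look fine."""
    problems: list[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in connector_configuration:
            problems.append(f"missing field: {name}")
        elif not isinstance(connector_configuration[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        value = connector_configuration.get(name)
        if name in connector_configuration and not isinstance(value, expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    # allowed_repositories is documented as an array of strings.
    repos = connector_configuration.get("allowed_repositories")
    if isinstance(repos, list) and not all(isinstance(r, str) for r in repos):
        problems.append("allowed_repositories should contain only strings")
    return problems


example = {
    "is_document_owner": True,
    "org_name": "my-organization",
    "personal_access_token": "ghp_your_token_here",
    "allowed_repositories": ["repo-1", "repo-2", "repo-3"],
    "logo_url": "https://example.com/logo.png",
}
print(check_basic_configuration(example))
```

This only checks shape and types; it does not verify that the token is valid or that the repositories exist.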

Step 2: Add Field Mapping Configuration

When crawling GitHub, the connector extracts document metadata and content as described in the GitHub Connector Configuration Reference. You can map these GitHub fields to your index fields using the field_mappings configuration.

Example Field Mappings

The following example shows field mappings for the default index fields:

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "full_path",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "last_updated_at",
        "index_field_name": "DCMI.modified"
      },
      {
        "content_source_field_name": "committers.name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "content_source_name",
        "index_field_name": "DCMI.source"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      },
      {
        "content_source_field_name": "base64_content",
        "index_field_name": "document_content_path.base64_content"
      }
    ],
    ...
  }
}
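Because every mapping entry has the same two keys, the list can be generated from a compact table of pairs instead of being written by hand. This is a minimal sketch: the helper build_field_mappings and the constant DEFAULT_MAPPINGS are hypothetical names, and the pairs are copied from the example above.

```python
import json

# Source field -> index field pairs, copied from the example above.
DEFAULT_MAPPINGS: dict[str, str] = {
    "full_path": "DCMI.title",
    "last_updated_at": "DCMI.modified",
    "committers.name": "DCMI.creator",
    "content_source_name": "DCMI.source",
    "uri": "uri",
    "document_content_type": "document_content_type",
    "base64_content": "document_content_path.base64_content",
}


def build_field_mappings(pairs: dict[str, str]) -> list[dict[str, str]]:
    """Expand {source: index} pairs into the field_mappings list structure."""
    return [
        {"content_source_field_name": source, "index_field_name": index}
        for source, index in pairs.items()
    ]


field_mappings = build_field_mappings(DEFAULT_MAPPINGS)
print(json.dumps({"field_mappings": field_mappings}, indent=2))
```

The printed fragment can be placed inside connector_configuration. Keeping the pairs in one table makes it easier to spot a source field that is mapped twice or not at all.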

Step 3: Specify What to Crawl

You can configure the GitHub connector to crawl specific files using the content_configuration field:

  • repository_files: (object, optional) Configuration for crawling files within the repository.
    • path_include_regex_patterns: (array of strings, optional) Files whose paths match any regular expression in this list are crawled. If omitted, all files are crawled.
    • path_exclude_regex_patterns: (array of strings, optional) Files whose paths match any regular expression in this list are not crawled. If a file matches both an include and an exclude pattern, the exclude pattern takes precedence and the file is not crawled.

Example Configuration

{
  ...
  "connector_configuration": {
    ...
    "content_configuration": {
      "repository_files": {
        "path_include_regex_patterns": [
          ".*\\.md$",
          ".*\\.py$",
          ".*\\.js$"
        ],
        "path_exclude_regex_patterns": [
          ".*node_modules.*",
          ".*\\.test\\..*",
          ".*__pycache__.*"
        ]
      }
    },
    ...
  }
}
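To preview which paths a pattern set would select, the include and exclude rules described above can be approximated locally. This is a rough sketch under stated assumptions: the function is_crawled is hypothetical, and it uses Python's re.search, which may not match the connector's regex engine or anchoring behavior exactly. The patterns are the ones from the example, written as Python raw strings, so the doubled backslashes required by JSON become single backslashes.

```python
import re
from typing import Optional, Sequence

INCLUDE = [r".*\.md$", r".*\.py$", r".*\.js$"]
EXCLUDE = [r".*node_modules.*", r".*\.test\..*", r".*__pycache__.*"]


def is_crawled(
    path: str,
    include_patterns: Optional[Sequence[str]] = None,
    exclude_patterns: Optional[Sequence[str]] = None,
) -> bool:
    """Approximate the documented rules: exclude wins, no include list means all files."""
    # Exclude patterns take precedence over include patterns.
    if exclude_patterns and any(re.search(p, path) for p in exclude_patterns):
        return False
    # With no include patterns, every non-excluded file is crawled.
    if not include_patterns:
        return True
    return any(re.search(p, path) for p in include_patterns)


for candidate in [
    "docs/README.md",
    "src/main.py",
    "node_modules/pkg/index.js",
    "src/app.test.js",
    "LICENSE",
]:
    print(candidate, is_crawled(candidate, INCLUDE, EXCLUDE))
```

With these patterns, docs/README.md and src/main.py are selected, while the other three paths are not: two match an exclude pattern and LICENSE matches no include pattern.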

Step 4: Create the GitHub Content Source

To create your GitHub connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit
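The JSON pasted in step 4 combines the fragments from Steps 1 to 3. A minimal sketch of assembling it programmatically follows; the helper build_content_source is a hypothetical name, and the values are shortened versions of the placeholders used in the examples in this guide.

```python
import json
from typing import Any


def build_content_source(
    basic: dict[str, Any],
    field_mappings: list[dict[str, str]],
    content_configuration: dict[str, Any],
) -> dict[str, Any]:
    """Merge the Step 1 to 3 fragments into one content source definition."""
    connector_configuration = {
        **basic,
        "field_mappings": field_mappings,
        "content_configuration": content_configuration,
    }
    return {
        "name": "My GitHub Connector",
        "description": "My GitHub connector",
        "is_indexable": True,
        "connector": "github",
        "connector_configuration": connector_configuration,
    }


content_source = build_content_source(
    basic={
        "is_document_owner": True,
        "org_name": "my-organization",
        "personal_access_token": "ghp_your_token_here",  # placeholder, not a real token
        "allowed_repositories": ["repo-1"],
    },
    field_mappings=[
        {"content_source_field_name": "full_path", "index_field_name": "DCMI.title"},
    ],
    content_configuration={
        "repository_files": {
            "path_include_regex_patterns": [r".*\.md$"],
            "path_exclude_regex_patterns": [r".*node_modules.*"],
        }
    },
)

# json.dumps escapes backslashes, so the regex is emitted as ".*\\.md$".
payload = json.dumps(content_source, indent=2)
print(payload)
```

Generating the JSON this way avoids hand-escaping backslashes in regex patterns and guarantees that the pasted text is syntactically valid.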

Crawling Behavior

The connector crawls files from the specified repositories based on your configuration, extracting:

  • File content (base64-encoded)
  • File path and metadata
  • Repository and organization information
  • Commit history and authors
  • Collaborator information
  • File size and type

Access rights are automatically extracted from repository collaborators, including:

  • Outside collaborators
  • Organization members with direct collaborator access
  • Organization members with access through team memberships
  • Organization members with default organization permissions
  • Organization owners

Users can access a file in Zeta Alpha only if their GitHub repository permissions authorize it.