Create a Custom Web Crawler Connector

A Custom Web Crawler connector enables you to ingest content from websites using Scrapy-based crawlers. This guide shows you how to create and configure a Custom Web Crawler connector for your data ingestion workflows.

Info: This guide presents an example configuration for a Custom Web Crawler connector. For a complete set of configuration options, see the Custom Web Crawler Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. Knowledge of your target website's structure and crawling requirements

Step 1: Create the Custom Web Crawler Basic Configuration

To create a Custom Web Crawler connector, define a configuration file with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • spider_name: (string) Name of the Scrapy spider to use for crawling. This determines the crawling logic.
  • spider_options: (array of objects) List of option dictionaries to pass to the spider. Each object represents a separate spider run with different options.
  • crawler_settings: (object, optional) Scrapy settings to customize crawler behavior (e.g., download delay, concurrent requests, user agent).
  • access_credentials: (object, optional) Basic authentication credentials if the website requires authentication:
    • username: Username for basic authentication
    • password: Password for basic authentication
  • content_source_name: (string, optional) Custom name for the content source.
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
  "name": "My Web Crawler Connector",
  "description": "My web crawler connector",
  "is_indexable": true,
  "connector": "custom_web_crawler",
  "connector_configuration": {
    "is_document_owner": true,
    "spider_name": "documentation_spider",
    "spider_options": [
      {
        "start_urls": [
          "https://example.com/docs"
        ],
        "allowed_domains": [
          "example.com"
        ],
        "max_depth": 3
      }
    ],
    "crawler_settings": {
      "DOWNLOAD_DELAY": 1,
      "CONCURRENT_REQUESTS": 8,
      "USER_AGENT": "ZetaAlpha Bot 1.0"
    },
    "content_source_name": "Example Documentation",
    "logo_url": "https://example.com/logo.png"
  }
}
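
If the target website requires basic authentication, you can add access_credentials to the same connector_configuration block. A minimal sketch with placeholder values (replace your-username and your-password with real credentials before use):

{
  ...
  "connector_configuration": {
    ...
    "access_credentials": {
      "username": "your-username",
      "password": "your-password"
    },
    ...
  }
}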

Step 2: Add Field Mapping Configuration

The crawled objects depend on your spider implementation. You can map the fields emitted by your spider to your index fields using the field_mappings configuration.

Example Field Mappings

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "url",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "content",
        "index_field_name": "document_content"
      },
      {
        "content_source_field_name": "last_modified",
        "index_field_name": "DCMI.modified"
      }
    ],
    ...
  }
}
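
To illustrate how the mapping is applied, assume your spider yields items shaped like the sketch below (the exact fields depend entirely on your spider implementation). With the mappings above, title is stored in DCMI.title, url in uri, content in document_content, and last_modified in DCMI.modified:

{
  "title": "Getting Started",
  "url": "https://example.com/docs/getting-started",
  "content": "Welcome to the documentation...",
  "last_modified": "2024-01-15"
}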

Step 3: Configure Crawler Settings

The crawler_settings object accepts Scrapy settings that control crawler behavior. Commonly used settings include:

  • DOWNLOAD_DELAY: Delay in seconds between requests to the same domain (helps avoid overwhelming servers)
  • CONCURRENT_REQUESTS: Maximum number of concurrent requests (default: 16)
  • CONCURRENT_REQUESTS_PER_DOMAIN: Maximum concurrent requests per domain (default: 8)
  • USER_AGENT: User agent string to identify your crawler
  • ROBOTSTXT_OBEY: Whether to obey robots.txt rules (default: True)
  • DEPTH_LIMIT: Maximum depth to crawl from start URLs
  • CLOSESPIDER_PAGECOUNT: Stop spider after crawling N pages

Example Advanced Configuration

{
  ...
  "connector_configuration": {
    ...
    "crawler_settings": {
      "DOWNLOAD_DELAY": 2,
      "CONCURRENT_REQUESTS": 4,
      "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
      "USER_AGENT": "ZetaAlpha Documentation Crawler",
      "ROBOTSTXT_OBEY": true,
      "DEPTH_LIMIT": 5,
      "CLOSESPIDER_PAGECOUNT": 1000
    },
    ...
  }
}

Step 4: Configure Access Rights

You can configure access rights to control who can view the crawled content:

  • allow_access_rights: (array of objects, optional) Users with any of these access rights will be able to access the documents.
  • deny_access_rights: (array of objects, optional) Users with these access rights will not be able to access the documents.

Example Configuration with Access Rights

{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "public"
      }
    ],
    ...
  }
}
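
You can also restrict access by denying specific rights. The sketch below assumes deny_access_rights uses the same object shape as allow_access_rights and that the access right names exist in your tenant; it allows employees while explicitly denying external contractors:

{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "employees"
      }
    ],
    "deny_access_rights": [
      {
        "name": "external_contractors"
      }
    ],
    ...
  }
}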

Step 5: Create the Custom Web Crawler Content Source

To create your Custom Web Crawler connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit

Crawling Behavior

The connector uses Scrapy to crawl websites based on your spider configuration. The spider determines what gets crawled and how the data is extracted. The crawled items are then mapped to your index fields using the field mappings.

The spider_options array allows you to configure multiple crawl runs with different parameters. Each object in the array triggers a separate spider execution.
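
For example, the following sketch runs the same spider twice, once per site, with different depth limits (the URLs and option values are illustrative and depend on what your spider supports):

{
  ...
  "connector_configuration": {
    ...
    "spider_options": [
      {
        "start_urls": ["https://example.com/docs"],
        "allowed_domains": ["example.com"],
        "max_depth": 3
      },
      {
        "start_urls": ["https://example.org/help"],
        "allowed_domains": ["example.org"],
        "max_depth": 2
      }
    ],
    ...
  }
}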

Best Practices

  1. Respect robots.txt: Set ROBOTSTXT_OBEY: true to respect website crawling rules
  2. Use reasonable delays: Set DOWNLOAD_DELAY to avoid overwhelming servers
  3. Limit concurrent requests: Use CONCURRENT_REQUESTS_PER_DOMAIN to be respectful
  4. Set meaningful user agent: Identify your crawler clearly
  5. Implement depth limits: Use DEPTH_LIMIT to avoid crawling too deep
  6. Use authentication when needed: Configure access_credentials for protected sites

Common Spider Options

Depending on the spider implementation, common options include:

  • start_urls: List of URLs to start crawling from
  • allowed_domains: List of domains the spider is allowed to crawl
  • max_depth: Maximum depth to crawl from start URLs
  • include_patterns: URL patterns to include
  • exclude_patterns: URL patterns to exclude
  • extract_metadata: Whether to extract metadata from pages

Consult the documentation for your spider implementation to see exactly which options it supports.
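
As an illustration, a single spider_options entry combining several of these options might look like the following. The option names come from the list above, but whether each one is supported, and the expected pattern syntax (regular expressions are assumed here), depends on your spider:

{
  "start_urls": ["https://example.com/docs"],
  "allowed_domains": ["example.com"],
  "max_depth": 4,
  "include_patterns": ["/docs/.*"],
  "exclude_patterns": ["/docs/archive/.*", ".*\\.pdf$"],
  "extract_metadata": true
}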