Create a Custom Web Crawler Connector

A Custom Web Crawler connector enables you to ingest content from websites using Scrapy-based crawlers. This guide shows you how to create and configure a Custom Web Crawler connector for your data ingestion workflows.

Info: This guide presents an example configuration for a Custom Web Crawler connector. For a complete set of configuration options, see the Custom Web Crawler Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created
  4. Knowledge of your target website's structure and crawling requirements

Step 1: Create the Custom Web Crawler Basic Configuration

To create a Custom Web Crawler connector, define a configuration file with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • spider_name: (string) Name of the Scrapy spider to use for crawling. This determines the crawling logic.
  • spider_options: (array of objects) List of option dictionaries to pass to the spider. Each object represents a separate spider run with different options.
  • crawler_settings: (object, optional) Scrapy settings to customize crawler behavior (e.g., download delay, concurrent requests, user agent).
  • access_credentials: (object, optional) Basic authentication credentials if the website requires authentication:
    • username: Username for basic authentication
    • password: Password for basic authentication
  • content_source_name: (string, optional) Custom name for the content source.
  • logo_url: (string, optional) The URL of a logo to display on document cards.

Example Configuration

{
  "name": "My Web Crawler Connector",
  "description": "My web crawler connector",
  "is_indexable": true,
  "connector": "custom_web_crawler",
  "connector_configuration": {
    "is_document_owner": true,
    "spider_name": "documentation_spider",
    "spider_options": [
      {
        "start_urls": [
          "https://example.com/docs"
        ],
        "allowed_domains": [
          "example.com"
        ],
        "max_depth": 3
      }
    ],
    "crawler_settings": {
      "DOWNLOAD_DELAY": 1,
      "CONCURRENT_REQUESTS": 8,
      "USER_AGENT": "ZetaAlpha Bot 1.0"
    },
    "content_source_name": "Example Documentation",
    "logo_url": "https://example.com/logo.png"
  }
}
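
If the target website requires basic authentication, you can add access_credentials to the same connector_configuration block. A minimal sketch with placeholder values (replace your-username and your-password with real credentials before use):

{
  ...
  "connector_configuration": {
    ...
    "access_credentials": {
      "username": "your-username",
      "password": "your-password"
    },
    ...
  }
}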

Step 2: Add Field Mapping Configuration

The crawled objects depend on your spider implementation. You can map the fields emitted by your spider to your index fields using the field_mappings configuration.

Example Field Mappings

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "url",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "content",
        "index_field_name": "document_content"
      },
      {
        "content_source_field_name": "last_modified",
        "index_field_name": "DCMI.modified"
      }
    ],
    ...
  }
}
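
To illustrate how the mapping is applied, assume your spider yields items shaped like the sketch below (the exact fields depend entirely on your spider implementation). With the mappings above, title is stored in DCMI.title, url in uri, content in document_content, and last_modified in DCMI.modified:

{
  "title": "Getting Started",
  "url": "https://example.com/docs/getting-started",
  "content": "Welcome to the documentation...",
  "last_modified": "2024-01-15"
}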

Step 3: Configure Crawler Settings

The crawler_settings object accepts Scrapy settings that control crawler behavior. Commonly used settings include:

  • DOWNLOAD_DELAY: Delay in seconds between requests to the same domain (helps avoid overwhelming servers)
  • CONCURRENT_REQUESTS: Maximum number of concurrent requests (default: 16)
  • CONCURRENT_REQUESTS_PER_DOMAIN: Maximum concurrent requests per domain (default: 8)
  • USER_AGENT: User agent string to identify your crawler
  • ROBOTSTXT_OBEY: Whether to obey robots.txt rules (default: True)
  • DEPTH_LIMIT: Maximum depth to crawl from start URLs
  • CLOSESPIDER_PAGECOUNT: Stop spider after crawling N pages

Example Advanced Configuration

{
  ...
  "connector_configuration": {
    ...
    "crawler_settings": {
      "DOWNLOAD_DELAY": 2,
      "CONCURRENT_REQUESTS": 4,
      "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
      "USER_AGENT": "ZetaAlpha Documentation Crawler",
      "ROBOTSTXT_OBEY": true,
      "DEPTH_LIMIT": 5,
      "CLOSESPIDER_PAGECOUNT": 1000
    },
    ...
  }
}

Step 4: Configure Access Rights

You can configure access rights to control who can view the crawled content:

  • allow_access_rights: (array of objects, optional) Users with any of these access rights will be able to access the documents.
  • deny_access_rights: (array of objects, optional) Users with these access rights will not be able to access the documents.

Example Configuration with Access Rights

{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "public"
      }
    ],
    ...
  }
}
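
You can also restrict access by denying specific rights. The sketch below assumes deny_access_rights uses the same object shape as allow_access_rights and that the access right names exist in your tenant; it allows employees while explicitly denying external contractors:

{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "employees"
      }
    ],
    "deny_access_rights": [
      {
        "name": "external_contractors"
      }
    ],
    ...
  }
}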

Step 5: Create the Custom Web Crawler Content Source

To create your Custom Web Crawler connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit

Crawling Behavior

The connector uses Scrapy to crawl websites based on your spider configuration. The spider determines what gets crawled and how the data is extracted. The crawled items are then mapped to your index fields using the field mappings.

The spider_options array allows you to configure multiple crawl runs with different parameters. Each object in the array triggers a separate spider execution.
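
For example, the following sketch runs the same spider twice, once per site, with different depth limits (the URLs and option values are illustrative and depend on what your spider supports):

{
  ...
  "connector_configuration": {
    ...
    "spider_options": [
      {
        "start_urls": ["https://example.com/docs"],
        "allowed_domains": ["example.com"],
        "max_depth": 3
      },
      {
        "start_urls": ["https://example.org/help"],
        "allowed_domains": ["example.org"],
        "max_depth": 2
      }
    ],
    ...
  }
}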

Best Practices

  1. Respect robots.txt: Set ROBOTSTXT_OBEY: true to respect website crawling rules
  2. Use reasonable delays: Set DOWNLOAD_DELAY to avoid overwhelming servers
  3. Limit concurrent requests: Use CONCURRENT_REQUESTS_PER_DOMAIN to be respectful
  4. Set meaningful user agent: Identify your crawler clearly
  5. Implement depth limits: Use DEPTH_LIMIT to avoid crawling too deep
  6. Use authentication when needed: Configure access_credentials for protected sites

Common Spider Options

Depending on the spider implementation, common options include:

  • start_urls: List of URLs to start crawling from
  • allowed_domains: List of domains the spider is allowed to crawl
  • max_depth: Maximum depth to crawl from start URLs
  • include_patterns: URL patterns to include
  • exclude_patterns: URL patterns to exclude
  • extract_metadata: Whether to extract metadata from pages

Consult the documentation for your spider implementation to see exactly which options it supports.
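
As an illustration, a single spider_options entry combining several of these options might look like the following. The option names come from the list above, but whether each one is supported, and the expected pattern syntax (regular expressions are assumed here), depends on your spider:

{
  "start_urls": ["https://example.com/docs"],
  "allowed_domains": ["example.com"],
  "max_depth": 4,
  "include_patterns": ["/docs/.*"],
  "exclude_patterns": ["/docs/archive/.*", ".*\\.pdf$"],
  "extract_metadata": true
}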