Create a Custom Web Crawler Connector
A Custom Web Crawler connector enables you to ingest content from websites using Scrapy-based crawlers. This guide shows you how to create and configure a Custom Web Crawler connector for your data ingestion workflows.
Info: This guide presents an example configuration for a Custom Web Crawler connector. For a complete set of configuration options, see the Custom Web Crawler Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
- Knowledge of your target website's structure and crawling requirements
Step 1: Create the Custom Web Crawler Basic Configuration
To create a Custom Web Crawler connector, define a configuration file with the following basic fields:
is_document_owner
: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.

spider_name
: (string) Name of the Scrapy spider to use for crawling. This determines the crawling logic.

spider_options
: (array of objects) List of option dictionaries to pass to the spider. Each object represents a separate spider run with different options.

crawler_settings
: (object, optional) Scrapy settings to customize crawler behavior (e.g., download delay, concurrent requests, user agent).

access_credentials
: (object, optional) Basic authentication credentials if the website requires authentication:
  - username: Username for basic authentication
  - password: Password for basic authentication

content_source_name
: (string, optional) Custom name for the content source.

logo_url
: (string, optional) The URL of a logo to display on document cards.
Example Configuration
```json
{
  "name": "My Web Crawler Connector",
  "description": "My web crawler connector",
  "is_indexable": true,
  "connector": "custom_web_crawler",
  "connector_configuration": {
    "is_document_owner": true,
    "spider_name": "documentation_spider",
    "spider_options": [
      {
        "start_urls": [
          "https://example.com/docs"
        ],
        "allowed_domains": [
          "example.com"
        ],
        "max_depth": 3
      }
    ],
    "crawler_settings": {
      "DOWNLOAD_DELAY": 1,
      "CONCURRENT_REQUESTS": 8,
      "USER_AGENT": "ZetaAlpha Bot 1.0"
    },
    "content_source_name": "Example Documentation",
    "logo_url": "https://example.com/logo.png"
  }
}
```
Step 2: Add Field Mapping Configuration
The structure of the crawled objects depends on your spider implementation. You can map the fields emitted by your spider to your index fields using the field_mappings configuration.
Example Field Mappings
```json
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "url",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "content",
        "index_field_name": "document_content"
      },
      {
        "content_source_field_name": "last_modified",
        "index_field_name": "DCMI.modified"
      }
    ],
    ...
  }
}
```
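The mappings above assume a spider that yields items with the keys title, url, content, and last_modified. For reference, a minimal Scrapy spider emitting such items could look like the sketch below; the class name, selectors, and Last-Modified handling are illustrative assumptions, not the platform's built-in spider.

```python
import scrapy


class DocumentationSpider(scrapy.Spider):
    """Illustrative spider; the spider actually used is the one named by spider_name."""

    name = "documentation_spider"
    start_urls = ["https://example.com/docs"]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # Each yielded item is a crawled object; its keys are the
        # content_source_field_name values available for field_mappings.
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
            "content": " ".join(response.css("p::text").getall()),
            "last_modified": (response.headers.get(b"Last-Modified") or b"").decode(),
        }

        # Follow in-page links so the crawl goes beyond the start URLs.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Whatever keys your spider yields are the values you can reference as content_source_field_name in field_mappings.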
Step 3: Configure Crawler Settings
Scrapy provides extensive configuration options to control crawler behavior:
DOWNLOAD_DELAY
: Delay in seconds between requests to the same domain (helps avoid overwhelming servers)

CONCURRENT_REQUESTS
: Maximum number of concurrent requests (default: 16)

CONCURRENT_REQUESTS_PER_DOMAIN
: Maximum concurrent requests per domain (default: 8)

USER_AGENT
: User agent string to identify your crawler

ROBOTSTXT_OBEY
: Whether to obey robots.txt rules (default: True)

DEPTH_LIMIT
: Maximum depth to crawl from start URLs

CLOSESPIDER_PAGECOUNT
: Stop the spider after crawling N pages
Example Advanced Configuration
```json
{
  ...
  "connector_configuration": {
    ...
    "crawler_settings": {
      "DOWNLOAD_DELAY": 2,
      "CONCURRENT_REQUESTS": 4,
      "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
      "USER_AGENT": "ZetaAlpha Documentation Crawler",
      "ROBOTSTXT_OBEY": true,
      "DEPTH_LIMIT": 5,
      "CLOSESPIDER_PAGECOUNT": 1000
    },
    ...
  }
}
```
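The keys under crawler_settings are standard Scrapy settings and are handed to the crawler as-is. For comparison, the same values in a standalone Scrapy spider could be expressed through custom_settings, as in this hedged sketch (the spider itself is hypothetical):

```python
import scrapy


class PoliteDocsSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate that crawler_settings
    # maps directly onto Scrapy's standard settings mechanism.
    name = "polite_docs_spider"
    start_urls = ["https://example.com/docs"]

    # Equivalent of the crawler_settings object above, applied per spider.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS": 4,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "USER_AGENT": "ZetaAlpha Documentation Crawler",
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 5,
        "CLOSESPIDER_PAGECOUNT": 1000,
    }

    def parse(self, response):
        yield {"url": response.url}
```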
Step 4: Configure Access Rights
You can configure access rights to control who can view the crawled content:
allow_access_rights
: (array of objects, optional) Users with any of these access rights will be able to access the documents.

deny_access_rights
: (array of objects, optional) Users with these access rights will not be able to access the documents.
Example Configuration with Access Rights
```json
{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "public"
      }
    ],
    ...
  }
}
```
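As a purely conceptual illustration of the allow/deny semantics described above (this is not the platform's implementation, and deny taking precedence over allow is an assumption here), access resolution can be thought of roughly like this:

```python
def can_access(user_rights: set[str], allow: set[str], deny: set[str]) -> bool:
    """Conceptual sketch of the allow/deny rule; deny winning over allow is an assumption."""
    if user_rights & deny:
        return False
    return bool(user_rights & allow)


# A user holding the "public" right can see documents from the configuration
# above, which allows "public" and denies nothing.
print(can_access({"public"}, allow={"public"}, deny=set()))  # True
```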
Step 5: Create the Custom Web Crawler Content Source
To create your Custom Web Crawler connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The connector uses Scrapy to crawl websites based on your spider configuration. The spider determines what gets crawled and how the data is extracted. The crawled items are then mapped to your index fields using the field mappings.
The spider_options array allows you to configure multiple crawl runs with different parameters. Each object in the array triggers a separate spider execution.
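Conceptually, this is similar to launching the same Scrapy spider once per options dictionary, as in the standalone sketch below; the stand-in spider and the orchestration code are illustrative only, since the platform handles scheduling for you.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class DocsSpider(scrapy.Spider):
    """Trivial stand-in spider; on the platform, spider_name selects the real one."""

    name = "documentation_spider"

    def __init__(self, start_urls=None, max_depth=0, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = start_urls or []
        self.max_depth = max_depth

    def parse(self, response):
        yield {"url": response.url}


# Two option dictionaries, as they might appear in spider_options.
spider_options = [
    {"start_urls": ["https://example.com/docs"], "max_depth": 3},
    {"start_urls": ["https://example.com/blog"], "max_depth": 1},
]

process = CrawlerProcess(settings={"DOWNLOAD_DELAY": 1})
for options in spider_options:
    # Each entry corresponds to one spider execution with its own options.
    process.crawl(DocsSpider, **options)
process.start()
```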
Best Practices
- Respect robots.txt: Set ROBOTSTXT_OBEY: true to respect website crawling rules
- Use reasonable delays: Set DOWNLOAD_DELAY to avoid overwhelming servers
- Limit concurrent requests: Use CONCURRENT_REQUESTS_PER_DOMAIN to be respectful
- Set a meaningful user agent: Identify your crawler clearly
- Implement depth limits: Use DEPTH_LIMIT to avoid crawling too deep
- Use authentication when needed: Configure access_credentials for protected sites
Common Spider Options
Depending on the spider implementation, common options include:
start_urls
: List of URLs to start crawling from

allowed_domains
: List of domains the spider is allowed to crawl

max_depth
: Maximum depth to crawl from start URLs

include_patterns
: URL patterns to include

exclude_patterns
: URL patterns to exclude

extract_metadata
: Whether to extract metadata from pages
Consult your spider documentation for specific options supported by your spider implementation.
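As an illustration of how a spider might consume such options, the hedged sketch below accepts them as constructor arguments and uses them to filter which links it follows; the option handling shown here is an assumption, not a prescribed interface.

```python
import re

import scrapy


class ConfigurableSpider(scrapy.Spider):
    """Sketch of a spider that respects the common options listed above."""

    name = "configurable_spider"

    def __init__(self, start_urls=None, allowed_domains=None, max_depth=2,
                 include_patterns=None, exclude_patterns=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = start_urls or []
        self.allowed_domains = allowed_domains or []
        self.max_depth = max_depth
        self.include_patterns = [re.compile(p) for p in (include_patterns or [])]
        self.exclude_patterns = [re.compile(p) for p in (exclude_patterns or [])]

    def _wanted(self, url):
        # Exclude patterns win; if include patterns are given, at least one must match.
        if any(p.search(url) for p in self.exclude_patterns):
            return False
        return not self.include_patterns or any(p.search(url) for p in self.include_patterns)

    def parse(self, response, depth=0):
        yield {"url": response.url, "title": response.css("title::text").get()}
        if depth >= self.max_depth:
            return
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if self._wanted(url):
                yield response.follow(url, callback=self.parse, cb_kwargs={"depth": depth + 1})
```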