Create an ArXiv Connector
An ArXiv connector enables you to ingest academic papers from arXiv into the Zeta Alpha platform. This guide shows you how to create and configure an ArXiv connector for your data ingestion workflows.
Info: This guide presents an example configuration for an ArXiv connector. For a complete set of configuration options, see the ArXiv Connector Configuration Reference.
Prerequisites
Before you begin, ensure you have:
- Access to the Zeta Alpha Platform UI
- A tenant created
- An index created
Note: arXiv does not require authentication credentials. The connector uses arXiv's public OAI-PMH interface.
Step 1: Create the ArXiv Basic Configuration
To create an ArXiv connector, define a configuration file with the following basic fields:
is_document_owner
: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
arxiv_url
: (string) URL of the OAI-PMH API (e.g., "http://export.arxiv.org/oai2"). Check the arXiv documentation for the correct URL to avoid getting banned.
arxiv_sets
: (array of strings) List of arXiv category groups to crawl (e.g., "cs", "stat", "eess").
arxiv_categories
: (array of strings) List of specific arXiv categories to crawl (e.g., "cs.LG", "cs.AI", "stat.ML"). Note: if no categories are passed for a particular group, nothing will be crawled for that group.
max_papers
: (integer, optional) Maximum number of papers to crawl on each run.
since_crawl_date
: (string, optional) Publication date of the oldest paper to crawl (ISO 8601 format). Defaults to today.
logo_url
: (string, optional) The URL of a logo to display on document cards.
Example Configuration
{
  "name": "My ArXiv Connector",
  "description": "My ArXiv connector for AI papers",
  "is_indexable": true,
  "connector": "arxiv",
  "connector_configuration": {
    "is_document_owner": true,
    "arxiv_url": "http://export.arxiv.org/oai2",
    "arxiv_sets": [
      "cs",
      "stat"
    ],
    "arxiv_categories": [
      "cs.AI",
      "cs.LG",
      "cs.CL",
      "stat.ML"
    ],
    "max_papers": 1000,
    "since_crawl_date": "2024-01-01",
    "logo_url": "https://example.com/arxiv-logo.png"
  }
}
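Before pasting a configuration into the UI, it can help to sanity-check it locally. The following is a minimal Python sketch and not part of any Zeta Alpha tooling: the helper name check_basic_config is hypothetical, and the checks only encode the field descriptions given in this guide. Treating max_papers as strictly positive is an assumption.

```python
from datetime import datetime

# Hypothetical helper, not a Zeta Alpha API: it only mirrors the field
# descriptions listed in this guide.
def check_basic_config(config: dict) -> list[str]:
    """Return a list of problems found in an ArXiv connector configuration."""
    problems = []
    cc = config.get("connector_configuration", {})

    if config.get("connector") != "arxiv":
        problems.append('"connector" should be "arxiv"')
    if not isinstance(cc.get("is_document_owner"), bool):
        problems.append('"is_document_owner" should be a boolean')
    if not isinstance(cc.get("arxiv_url"), str):
        problems.append('"arxiv_url" should be a string')
    for key in ("arxiv_sets", "arxiv_categories"):
        value = cc.get(key)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f'"{key}" should be an array of strings')

    # Optional fields are only checked when present.
    if "max_papers" in cc:
        mp = cc["max_papers"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        # Assumption: a useful limit is strictly positive.
        if not isinstance(mp, int) or isinstance(mp, bool) or mp <= 0:
            problems.append('"max_papers" should be a positive integer')
    if "since_crawl_date" in cc:
        try:
            datetime.fromisoformat(cc["since_crawl_date"])
        except (TypeError, ValueError):
            problems.append('"since_crawl_date" should be an ISO 8601 date')
    return problems


example = {
    "name": "My ArXiv Connector",
    "description": "My ArXiv connector for AI papers",
    "is_indexable": True,
    "connector": "arxiv",
    "connector_configuration": {
        "is_document_owner": True,
        "arxiv_url": "http://export.arxiv.org/oai2",
        "arxiv_sets": ["cs", "stat"],
        "arxiv_categories": ["cs.AI", "cs.LG", "cs.CL", "stat.ML"],
        "max_papers": 1000,
        "since_crawl_date": "2024-01-01",
    },
}

print(check_basic_config(example))  # prints []
```

An empty list means the sketch found nothing to flag; it does not guarantee that the platform will accept the configuration.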
Step 2: Add Field Mapping Configuration
When crawling arXiv, the connector extracts document metadata and content as described in the ArXiv Connector Configuration Reference. You can map these ArXiv fields to your index fields using the field_mappings configuration.
Example Field Mappings
The following example shows field mappings for the default index fields:
{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "abstract",
        "index_field_name": "DCMI.abstract"
      },
      {
        "content_source_field_name": "authors.full_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "last_updated_at",
        "index_field_name": "DCMI.modified"
      },
      {
        "content_source_field_name": "categories",
        "index_field_name": "DCMI.subject"
      },
      {
        "content_source_field_name": "identifiers",
        "index_field_name": "DCMI.identifier"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "pdf_url",
        "index_field_name": "document_content_path.url_content.url"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      }
    ],
    ...
  }
}
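Writing the field_mappings array by hand is repetitive, so one option is to generate it from a compact dictionary. The sketch below is illustrative only: build_field_mappings is a hypothetical helper, and rejecting two source fields that target the same index field is an assumption about what counts as a mistake, not a documented platform rule.

```python
import json

# Hypothetical helper: turns {source_field: index_field} pairs into the
# list-of-objects shape that "field_mappings" uses in this guide.
def build_field_mappings(pairs: dict[str, str]) -> list[dict[str, str]]:
    mappings = []
    seen_targets = set()
    for source_field, index_field in pairs.items():
        # Assumption: mapping two source fields to one index field is an error.
        if index_field in seen_targets:
            raise ValueError(f"index field mapped twice: {index_field}")
        seen_targets.add(index_field)
        mappings.append({
            "content_source_field_name": source_field,
            "index_field_name": index_field,
        })
    return mappings


field_mappings = build_field_mappings({
    "title": "DCMI.title",
    "abstract": "DCMI.abstract",
    "authors.full_name": "DCMI.creator",
})

# Print the fragment so it can be pasted into "connector_configuration".
print(json.dumps({"field_mappings": field_mappings}, indent=2))
```

The remaining mappings from the example above can be added to the dictionary in the same way.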
Step 3: Configure Access Rights
You can configure access rights to control who can view the ingested ArXiv papers:
allow_access_rights
: (array of objects, optional) Users with any of these access rights will be able to access the documents. If not passed, no user will be able to retrieve the documents. Each object has the following fields:
  name
  : (string) The name of the access right (e.g., user UUID or "public").
  type
  : (string, optional) The type of access right (e.g., "user_uuid").
  content_source_id
  : (string, optional) The content source ID, if the right is scoped to a particular content source.
deny_access_rights
: (array of objects, optional) Users with these access rights will not be able to access the documents.
Example Configuration with Access Rights
{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "public"
      }
    ],
    ...
  }
}
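To illustrate how the optional fields of an access right combine, here is a small Python sketch that builds access right objects and omits fields that are not set. The helper access_right and the placeholder UUID are hypothetical; only the field names and the "user_uuid" type come from this guide.

```python
import json
from typing import Optional

# Hypothetical helper: builds one access right object, leaving out
# optional fields that were not provided.
def access_right(name: str,
                 right_type: Optional[str] = None,
                 content_source_id: Optional[str] = None) -> dict:
    right = {"name": name}
    if right_type is not None:
        right["type"] = right_type
    if content_source_id is not None:
        right["content_source_id"] = content_source_id
    return right


# Placeholder value for illustration only, not a real user.
PLACEHOLDER_USER_UUID = "00000000-0000-0000-0000-000000000000"

# Documents visible to everyone.
public_access = {"allow_access_rights": [access_right("public")]}

# Documents visible to a single user only.
single_user_access = {
    "allow_access_rights": [
        access_right(PLACEHOLDER_USER_UUID, right_type="user_uuid"),
    ]
}

print(json.dumps(single_user_access, indent=2))
```

Either fragment would be merged into connector_configuration alongside the other fields.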
Step 4: Understand Category Groups and Categories
ArXiv organizes papers into category groups (sets) and specific categories:
- Category Groups (sets): Broad subject areas like "cs" (Computer Science), "stat" (Statistics), "eess" (Electrical Engineering and Systems Science)
- Categories: Specific topics like "cs.AI" (Artificial Intelligence), "cs.LG" (Machine Learning), "stat.ML" (Machine Learning in Statistics)
Important: You must specify both arxiv_sets and arxiv_categories. If you include a category group in arxiv_sets but don't include any of its categories in arxiv_categories, no papers from that group will be crawled.
For example, the following configuration crawls no statistics papers, because "stat" is listed in arxiv_sets but no "stat.*" categories are listed in arxiv_categories:
{
  "arxiv_sets": ["cs", "stat"],
  "arxiv_categories": ["cs.AI", "cs.LG"]
}
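The rule above can be checked mechanically before submitting a configuration. This is a minimal sketch, assuming that a category belongs to the group named before its first dot (true for the examples in this guide, such as "cs.AI" and "stat.ML", but not verified for every arXiv category). The function name uncovered_sets is hypothetical.

```python
# Hypothetical helper: lists the category groups that have no matching
# category and would therefore yield no papers.
def uncovered_sets(arxiv_sets: list[str], arxiv_categories: list[str]) -> list[str]:
    # Assumption: the group of a category is the text before its first dot.
    covered = {category.split(".", 1)[0] for category in arxiv_categories}
    return [s for s in arxiv_sets if s not in covered]


print(uncovered_sets(["cs", "stat"], ["cs.AI", "cs.LG"]))             # prints ['stat']
print(uncovered_sets(["cs", "stat"], ["cs.AI", "cs.LG", "stat.ML"]))  # prints []
```

A non-empty result points at groups that should either gain a category or be removed from arxiv_sets.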
Step 5: Create the ArXiv Content Source
To create your ArXiv connector in the Zeta Alpha Platform UI:
- Navigate to your tenant and click View next to your target index
- Click View under Content Sources for the index
- Click Create Content Source
- Paste your JSON configuration
- Click Submit
Crawling Behavior
The connector crawls papers from arXiv using the OAI-PMH protocol, extracting:
- Paper title and abstract
- Author information
- ArXiv ID and other identifiers (DOI if available)
- Categories and subject classifications
- Creation and update timestamps
- PDF URL for the full paper
- Journal information (if published)
The connector fetches papers based on your date range and category filters, respecting the max_papers limit. Papers are downloaded as PDFs for full-text indexing.
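For readers unfamiliar with OAI-PMH, the sketch below builds the kind of ListRecords request URL that the protocol defines for harvesting records from a set since a given date. It only constructs the URL and sends nothing. This guide does not state which requests the connector actually issues, so the function list_records_url and the choice of the "arXiv" metadata format are assumptions for illustration.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds an OAI-PMH ListRecords URL. The verb and the
# metadataPrefix, set, and from parameters are defined by the OAI-PMH protocol.
def list_records_url(base_url: str, arxiv_set: str, since: str,
                     metadata_prefix: str = "arXiv") -> str:
    params = {
        "verb": "ListRecords",
        # Assumption: the metadata format the connector requests is not documented here.
        "metadataPrefix": metadata_prefix,
        "set": arxiv_set,   # one category group, e.g. "cs"
        "from": since,      # lower date bound, comparable to since_crawl_date
    }
    return f"{base_url}?{urlencode(params)}"


url = list_records_url("http://export.arxiv.org/oai2", "cs", "2024-01-01")
print(url)
# prints http://export.arxiv.org/oai2?verb=ListRecords&metadataPrefix=arXiv&set=cs&from=2024-01-01
```

Because the set parameter selects a whole category group, filtering down to specific categories would have to happen after the records are retrieved, which is consistent with the requirement to list both arxiv_sets and arxiv_categories.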
Note: To avoid being banned by arXiv, use the correct API URL and set reasonable crawl limits (for example, with max_papers). Follow arXiv's usage guidelines.