
Create an ArXiv Connector

An ArXiv connector enables you to ingest academic papers from arXiv into the Zeta Alpha platform. This guide shows you how to create and configure an ArXiv connector for your data ingestion workflows.

Info: This guide presents an example configuration for an ArXiv connector. For a complete set of configuration options, see the ArXiv Connector Configuration Reference.

Prerequisites

Before you begin, ensure you have:

  1. Access to the Zeta Alpha Platform UI
  2. A tenant created
  3. An index created

Note: ArXiv does not require authentication credentials. The connector uses arXiv's public OAI-PMH protocol.

Step 1: Create the ArXiv Basic Configuration

To create an ArXiv connector, define a configuration file with the following basic fields:

  • is_document_owner: (boolean) Indicates whether this connector "owns" the crawled documents. When set to true, other connectors cannot crawl the same documents.
  • arxiv_url: (string) URL of the OAI-PMH API (e.g., "http://export.arxiv.org/oai2"). Check the arXiv documentation for the correct URL to avoid being banned.
  • arxiv_sets: (array of strings) List of arXiv category groups to crawl (e.g., "cs", "stat", "eess").
  • arxiv_categories: (array of strings) List of specific arXiv categories to crawl (e.g., "cs.LG", "cs.AI", "stat.ML"). Note: if no categories are passed for a particular group, nothing will be crawled for that group.
  • max_papers: (integer, optional) Maximum number of papers to crawl on each run.
  • since_crawl_date: (string, optional) Publication date of the oldest paper to crawl (ISO 8601 format). Defaults to today.
  • logo_url: (string, optional) URL of a logo to display on document cards.

Example Configuration

{
  "name": "My ArXiv Connector",
  "description": "My ArXiv connector for AI papers",
  "is_indexable": true,
  "connector": "arxiv",
  "connector_configuration": {
    "is_document_owner": true,
    "arxiv_url": "http://export.arxiv.org/oai2",
    "arxiv_sets": [
      "cs",
      "stat"
    ],
    "arxiv_categories": [
      "cs.AI",
      "cs.LG",
      "cs.CL",
      "stat.ML"
    ],
    "max_papers": 1000,
    "since_crawl_date": "2024-01-01",
    "logo_url": "https://example.com/arxiv-logo.png"
  }
}
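
If you prefer to generate this JSON rather than write it by hand, the following is a minimal Python sketch that assembles the same structure and prints it. The helper name build_arxiv_config and its parameters are hypothetical and not part of the platform; only the JSON keys come from the field list above.

```python
import json
from datetime import date


def build_arxiv_config(name, description, sets, categories, since, max_papers=None):
    """Assemble a content source configuration from the fields described above.

    Hypothetical helper: only the resulting JSON keys are taken from this guide.
    """
    connector_configuration = {
        "is_document_owner": True,
        "arxiv_url": "http://export.arxiv.org/oai2",
        "arxiv_sets": list(sets),
        "arxiv_categories": list(categories),
        # date.isoformat() yields an ISO 8601 date such as "2024-01-01"
        "since_crawl_date": since.isoformat(),
    }
    # max_papers is optional, so only include it when a limit was given
    if max_papers is not None:
        connector_configuration["max_papers"] = max_papers
    return {
        "name": name,
        "description": description,
        "is_indexable": True,
        "connector": "arxiv",
        "connector_configuration": connector_configuration,
    }


config = build_arxiv_config(
    name="My ArXiv Connector",
    description="My ArXiv connector for AI papers",
    sets=["cs", "stat"],
    categories=["cs.AI", "cs.LG", "cs.CL", "stat.ML"],
    since=date(2024, 1, 1),
    max_papers=1000,
)
print(json.dumps(config, indent=2))
```

The printed JSON can then be pasted into the Create Content Source form.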

Step 2: Add Field Mapping Configuration

When crawling ArXiv, the connector extracts document metadata and content as described in the ArXiv Connector Configuration Reference. You can map these ArXiv fields to your index fields using the field_mappings configuration.

Example Field Mappings

The following example shows field mappings for the default index fields:

{
  ...
  "connector_configuration": {
    ...
    "field_mappings": [
      {
        "content_source_field_name": "title",
        "index_field_name": "DCMI.title"
      },
      {
        "content_source_field_name": "abstract",
        "index_field_name": "DCMI.abstract"
      },
      {
        "content_source_field_name": "authors.full_name",
        "index_field_name": "DCMI.creator"
      },
      {
        "content_source_field_name": "created_at",
        "index_field_name": "DCMI.created"
      },
      {
        "content_source_field_name": "last_updated_at",
        "index_field_name": "DCMI.modified"
      },
      {
        "content_source_field_name": "categories",
        "index_field_name": "DCMI.subject"
      },
      {
        "content_source_field_name": "identifiers",
        "index_field_name": "DCMI.identifier"
      },
      {
        "content_source_field_name": "uri",
        "index_field_name": "uri"
      },
      {
        "content_source_field_name": "pdf_url",
        "index_field_name": "document_content_path.url_content.url"
      },
      {
        "content_source_field_name": "document_content_type",
        "index_field_name": "document_content_type"
      }
    ],
    ...
  }
}
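
Because every mapping entry has the same two keys, the list can also be generated from a compact source-to-index dictionary. The sketch below shows one way to do that in Python; build_field_mappings is a hypothetical helper, and only the key names content_source_field_name and index_field_name come from this guide.

```python
import json


def build_field_mappings(pairs):
    """Expand {source_field: index_field} into the list-of-objects form
    used by field_mappings. Hypothetical helper, not a platform API."""
    return [
        {"content_source_field_name": source, "index_field_name": target}
        for source, target in pairs.items()
    ]


field_mappings = build_field_mappings({
    "title": "DCMI.title",
    "abstract": "DCMI.abstract",
    "authors.full_name": "DCMI.creator",
    "pdf_url": "document_content_path.url_content.url",
})
print(json.dumps(field_mappings, indent=2))
```

The resulting list goes under the field_mappings key of connector_configuration.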

Step 3: Configure Access Rights

You can configure access rights to control who can view the ingested ArXiv papers:

  • allow_access_rights: (array of objects, optional) Users with any of these access rights will be able to access the documents. If not passed, no user will be able to retrieve the documents.
    • name: (string) The name of the access right (e.g., user UUID or "public")
    • type: (string, optional) The type of access right (e.g., "user_uuid")
    • content_source_id: (string, optional) If the right is scoped to a particular content source
  • deny_access_rights: (array of objects, optional) Users with these access rights will not be able to access the documents.

Example Configuration with Access Rights

{
  ...
  "connector_configuration": {
    ...
    "allow_access_rights": [
      {
        "name": "public"
      }
    ],
    ...
  }
}
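
To make the allow and deny semantics concrete, here is a small Python sketch of how the two lists could combine for a single user. It is an illustration only, not the platform's implementation: the function can_access is hypothetical, rights are matched by name alone, and the rule that a deny entry overrides an allow entry is an assumption that this guide does not state explicitly.

```python
def can_access(user_right_names, allow_access_rights=None, deny_access_rights=None):
    """Return True if a user holding the given rights could see the documents.

    Hypothetical illustration. Assumptions: rights are compared by "name" only,
    and a matching deny entry takes precedence over any allow entry.
    """
    allowed = {right["name"] for right in (allow_access_rights or [])}
    denied = {right["name"] for right in (deny_access_rights or [])}
    names = set(user_right_names)
    if names & denied:
        return False  # assumption: deny wins over allow
    # With no allow list, nobody matches, so nobody can retrieve the documents
    return bool(names & allowed)


public_only = [{"name": "public"}]
print(can_access(["public"], allow_access_rights=public_only))
print(can_access(["public"]))
print(can_access(["public", "blocked"], public_only, [{"name": "blocked"}]))
```

The second call reflects the rule above that omitting allow_access_rights leaves the documents inaccessible to everyone.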

Step 4: Understand Category Groups and Categories

ArXiv organizes papers into category groups (sets) and specific categories:

  • Category Groups (sets): Broad subject areas like "cs" (Computer Science), "stat" (Statistics), "eess" (Electrical Engineering and Systems Science)
  • Categories: Specific topics like "cs.AI" (Artificial Intelligence), "cs.LG" (Machine Learning), "stat.ML" (Machine Learning in Statistics)

Important: You must specify both arxiv_sets and arxiv_categories. If you include a category group in arxiv_sets but don't include any of its categories in arxiv_categories, no papers from that group will be crawled.

For example:

{
  "arxiv_sets": ["cs", "stat"],
  "arxiv_categories": ["cs.AI", "cs.LG"]
}

❌ No "stat.*" categories are specified, so no statistics papers will be crawled.
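
This mismatch is easy to catch before you submit a configuration. The following Python sketch flags any category group that has no matching category. The function name is hypothetical, and the check assumes a category belongs to the group named by the part before the first dot (as in "cs.AI" and "stat.ML"), which may not hold for every arXiv set.

```python
def sets_without_categories(arxiv_sets, arxiv_categories):
    """Return the category groups that have no category listed.

    Hypothetical pre-flight check. Assumption: the group of a category is the
    text before its first dot, e.g. "cs" for "cs.AI".
    """
    groups_with_categories = {
        category.split(".", 1)[0] for category in arxiv_categories
    }
    return [group for group in arxiv_sets if group not in groups_with_categories]


missing = sets_without_categories(["cs", "stat"], ["cs.AI", "cs.LG"])
if missing:
    print("No categories listed for:", ", ".join(missing))
```

An empty result means every group in arxiv_sets has at least one category in arxiv_categories.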

Step 5: Create the ArXiv Content Source

To create your ArXiv connector in the Zeta Alpha Platform UI:

  1. Navigate to your tenant and click View next to your target index
  2. Click View under Content Sources for the index
  3. Click Create Content Source
  4. Paste your JSON configuration
  5. Click Submit

Crawling Behavior

The connector crawls papers from arXiv using the OAI-PMH protocol, extracting:

  • Paper title and abstract
  • Author information
  • ArXiv ID and other identifiers (DOI if available)
  • Categories and subject classifications
  • Creation and update timestamps
  • PDF URL for the full paper
  • Journal information (if published)

The connector fetches papers based on your date range and category filters, respecting the max_papers limit. Papers are downloaded as PDFs for full-text indexing.
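
For orientation, a generic OAI-PMH harvesting request is an HTTP GET with a verb and a few standard query parameters. The Python sketch below only builds such a URL and sends nothing. It is not necessarily the request the connector issues: the helper name is hypothetical, and the oai_dc metadata prefix is used because the OAI-PMH specification requires every repository to support it.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, from_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for one set, starting at from_date.

    Hypothetical helper for illustration; it does not perform any request.
    Note that the OAI-PMH "from" argument filters on record datestamps.
    """
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": from_date,
    })
    return f"{base_url}?{query}"


url = list_records_url("http://export.arxiv.org/oai2", "cs", "2024-01-01")
print(url)
```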

Note: To avoid being banned by arXiv, use the correct API URL, set reasonable crawl limits (for example, with max_papers), and follow arXiv's usage guidelines.