API Reference

Tenant Settings

A tenant is created or updated by posting to the /v1/tenants endpoint. It has the following configuration:

Request body

  • tenant: (string, required) The unique identifier for the tenant. For example "zeta-alpha".
  • authentication_settings: (object, required) Configures the authentication method for the tenant.
    • protocol: (enum, required) The authentication protocol to use. Possible values are "oidc", "zetaalpha".
    • protocol_settings: (object, required) Settings for the selected authentication protocol.
      • oidc: (object, optional) Configuration for OpenID Connect authentication. Required when protocol is "oidc".
        • issuer: (string, required) The OIDC issuer URL. For example "https://accounts.google.com".
        • client_id: (string, required) The client ID for the OIDC application.
        • redirect_uri: (string, required) The redirect URI after authentication.
      • zetaalpha: (object, optional) Configuration for Zeta Alpha native authentication using username/password. Required when protocol is "zetaalpha". This authentication method allows user registration directly through the platform.
    • allow_anonymous_users: (boolean, optional) Whether to allow users to access the tenant without authentication. Defaults to false.
  • storage_settings: (object, required) Specifies how auxiliary data for this tenant is stored. These settings serve as defaults for all indexes in the tenant unless explicitly overridden in the index configuration. This follows the same structure as index storage settings.
    • ingesting: (object, required) Specifies how large data ingested by the pipeline is stored.
      • backend: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) Configuration for the S3 backend. See index configuration for detailed schema.
      • azure: (object, optional) Configuration for the Azure backend. See index configuration for detailed schema.
      • disk: (object, optional) Configuration for the disk backend. See index configuration for detailed schema.
      • max_file_size: (integer, optional) The maximum size of files to store in the backend in bytes. Defaults to 1073741824 (1GB).
    • processing: (object, required) Specifies how data used by the pipeline is stored.
      • backend: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) Configuration for the S3 backend. See index configuration for detailed schema.
      • azure: (object, optional) Configuration for the Azure backend. See index configuration for detailed schema.
      • disk: (object, optional) Configuration for the disk backend. See index configuration for detailed schema.
      • compression: (object, optional) Configuration for data compression. See index configuration for detailed schema.
  • features: (object, optional) Configures the available features for this tenant.
    • recommendations_settings: (object, optional) Configures the recommendations feature.
      • status: (enum, optional) The status of the recommendations feature. Possible values are "active", "disabled", "hidden". When hidden, the recommendations feature is not visible in the UI, but recommendations will still be generated for benchmarking purposes if an engine is configured.
      • engines: (array of objects, optional) List of configured recommendation engines.
        • legacy: (object, optional) Empty object to enable the legacy recommendations engine.
        • agentic: (object, optional) Configuration for an agent-based recommendations engine.
          • agent_identifier: (string, required) The identifier of the agent to use for recommendations.
      • email: (object, optional) Configuration for recommendations email notifications.
    • llm_configurations: (array of objects, optional) List of Large Language Model configurations available for this tenant.
      • name: (string, required) The unique name for this LLM configuration.
      • vendor: (enum, required) The LLM provider. Possible values are "openai", "ollama", "anthropic".
      • vendor_configuration: (object, required) Provider-specific configuration.
        • openai: (object, optional) Configuration for OpenAI. Required when vendor is "openai".
          • openai_api_key: (string, required) The API key for OpenAI.
          • openai_org: (string, required) The organization ID for OpenAI.
          • openai_api_base: (string, optional) The base URL for the OpenAI API. Use this for custom endpoints.
          • openai_api_type: (string, optional) The API type.
          • openai_api_version: (string, optional) The API version.
        • anthropic: (object, optional) Configuration for Anthropic. Required when vendor is "anthropic".
        • ollama: (object, optional) Configuration for Ollama. Required when vendor is "ollama".
      • model_configuration: (object, required) Configuration for the model to use.
        • name: (string, required) The name of the model. For example "gpt-4", "claude-3-opus".
        • type: (enum, required) The type of model. Possible values are "chat", "prompt", "prompt_with_logits".
        • temperature: (number, required) The sampling temperature to use, between 0 and 2.
        • max_tokens: (integer, optional) The maximum number of tokens to generate.
        • json_output: (boolean, optional) Whether to force JSON output format. Defaults to false.
        • seed: (integer, optional) Random seed for deterministic sampling.
        • reasoning_effort: (string, optional) Constrains effort on reasoning for reasoning models.
        • logprobs: (boolean, optional) Whether to return log probabilities of the output tokens.
        • parallel_tool_calls: (boolean, optional) Whether to enable parallel function calling during tool use.
    • llm_tracing_configurations: (array of objects, optional) List of LLM tracing configurations for monitoring and debugging.
      • name: (string, required) The unique name for this tracing configuration.
      • vendor: (enum, required) The tracing vendor. Possible values are "langfuse", "capture".
      • vendor_configuration: (object, required) Vendor-specific configuration.
    • chat_bot_setups: (array of objects, optional) List of chat bot configurations available for this tenant.
      • bot_identifier: (string, required) The unique identifier for this bot.
      • llm_configuration_name: (string, required) The name of the LLM configuration to use (must reference an entry in llm_configurations).
      • llm_tracing_configuration_name: (string, optional) The name of the tracing configuration to use.
      • agent_name: (string, optional) The name of the agent to use for this bot.
      • serving_url: (string, optional) The URL of the service that serves this bot.
      • bot_configuration: (object, optional) Additional bot-specific configuration.
      • sub_agent_mapping: (object, optional) Mapping of sub-agents for this bot.
    • federated_search_settings: (object, optional) Configuration for federated search across multiple indexes.
      • indexes: (array of objects, required) List of federated search indexes to include.
    • document_upload_settings: (object, optional) Configuration for document upload functionality.
      • status: (enum, optional) The status of the document upload feature. Possible values are "active", "disabled", "hidden".
    • people_settings: (object, optional) Configuration for people/expert search functionality.
  • client_settings: (object, optional) Defines tenant-specific rendering and UI configuration.
    • white_label_settings: (object, optional) Branding and customization settings for the UI.
      • app: (object, optional) Application-level settings like name and description.
      • colors: (object, optional) Color scheme configuration for the UI.
      • favicon: (string, optional) URL of the favicon.
      • logo: (object, optional) Logo configuration.
      • header: (object, optional) Header configuration.
      • footer: (object, optional) Footer configuration.
      • homepage: (object, optional) Homepage customization.
      • icons: (object, optional) Icon customization.
      • main_content: (object, optional) Main content area customization.
      • onboarding: (object, optional) Onboarding flow customization.
      • pages: (object, optional) Custom pages configuration.
      • search_bar: (object, optional) Search bar customization.
      • tags: (object, optional) Tags display customization.
        • default_chip_color: (string, required) The default color for tag chips.
      • pdf_note_export: (object, optional) Configuration for PDF note export.
        • image: (object, optional) Image configuration.
          • logo: (object, optional) Logo configuration.
            • src: (string, required) The URL of the logo to use in the PDF note export.
            • width: (integer, required) The width of the logo in pixels.
            • height: (integer, required) The height of the logo in pixels.
          • position: (enum, optional) The position of the image. Possible values are "left", "right", "centered".
        • header: (object, optional) Header configuration of the PDF note export.
          • text: (string, optional) The text to use in the header.
          • position: (enum, optional) The position of the text. Possible values are "left", "right", "centered".
    • indexes: (array of objects, optional) List of indexes visible to this tenant.
      • type: (enum, required) The type of index. Possible values are "internal", "external".
      • index_id: (string, required) The ID of the index.
    • widgets: (object, optional) Configuration for widgets available in the UI.
      • analytics: (object, optional) Configuration for the analytics widget.
      • expert_search: (object, optional) Configuration for the expert search widget.
        • title: (string, required) The title displayed for this widget.
        • layout: (object, optional) Layout settings for the widget.
      • find_authored_by: (object, optional) Configuration for the "find authored by" widget.
      • find_similar_document: (object, optional) Configuration for the "find similar documents" widget.
      • qa: (object, optional) Configuration for the Q&A widget.
      • query_analysis: (object, optional) Configuration for the query analysis widget.
      • search_results: (object, optional) Configuration for the search results widget.
      • vos_viewer: (object, optional) Configuration for the VOS viewer widget.
    • pdf_viewer: (object, optional) Configuration for the PDF viewer.
    • tags_section: (object, optional) Configuration for the tags section.
      • chat: (object, optional) Chat feature configuration for tags.
        • default_bot_identifier: (string, required) The default bot to use.
        • extra_bot_identifiers: (array of strings, optional) Additional bots that can be used.
        • show_sender_avatar: (boolean, optional) Whether to show sender avatars.
      • sharing_options: (array of strings, optional) List of sharing modes available.
      • export_options: (array of strings, optional) List of export modes available.
    • document_card: (object, optional) Configuration for how document cards are displayed.
    • note_card: (object, optional) Configuration for how note cards are displayed.
    • bot_configurations: (array of objects, optional) List of bot configurations available in the UI.
    • chat_page: (object, optional) Configuration for the chat page.
      • chat: (object, optional) Chat bot configuration.
        • default_bot_identifier: (string, required) The default bot the chat page should use.
        • extra_bot_identifiers: (array of strings, optional) A list of other bots the chat page could use.
        • show_sender_avatar: (boolean, optional) Whether to show sender avatars in the chat.
      • sharing_options: (array of strings, optional) List of sharing modes to share the chat conversation.
      • export_options: (array of strings, optional) List of export modes to export a chat conversation.
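
As a rough illustration of the schema above, the following Python sketch assembles a minimal tenant request body with only the required sections, using OIDC authentication and the disk backend. The helper names (build_tenant_payload, check_protocol_settings), the disk path, and the sample values are hypothetical and not part of the API. The consistency check only mirrors the documented rule that the settings object matching the selected protocol is required; the server may validate differently.

```python
import json


def disk_storage(location: str) -> dict:
    """Return a storage section that selects the disk backend."""
    return {"backend": "disk", "disk": {"disk_location": location}}


def build_tenant_payload(tenant_id: str, issuer: str, client_id: str, redirect_uri: str) -> dict:
    """Assemble a minimal tenant body: OIDC authentication plus disk storage."""
    ingesting = disk_storage("/data/ingesting")  # hypothetical path
    ingesting["max_file_size"] = 1024**3  # documented default, 1073741824 bytes
    return {
        "tenant": tenant_id,
        "authentication_settings": {
            "protocol": "oidc",
            "protocol_settings": {
                "oidc": {
                    "issuer": issuer,
                    "client_id": client_id,
                    "redirect_uri": redirect_uri,
                }
            },
            "allow_anonymous_users": False,
        },
        "storage_settings": {
            "ingesting": ingesting,
            "processing": disk_storage("/data/processing"),  # hypothetical path
        },
    }


def check_protocol_settings(payload: dict) -> None:
    """Raise ValueError if the selected protocol has no matching settings object."""
    auth = payload["authentication_settings"]
    protocol = auth["protocol"]
    if protocol not in auth["protocol_settings"]:
        raise ValueError(f"protocol_settings.{protocol} is required when protocol is {protocol!r}")


if __name__ == "__main__":
    body = build_tenant_payload(
        "zeta-alpha", "https://accounts.google.com", "my-client-id", "https://example.com/callback"
    )
    check_protocol_settings(body)
    print(json.dumps(body, indent=2))
```

The printed JSON is what would be posted to the /v1/tenants endpoint; sending the request is left out because the host and authentication headers depend on the deployment.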

Note: The API also provides endpoints for listing, retrieving, updating, and deleting tenants.

Index configuration

A new index is created by posting to the /v1/indexes endpoint. It has the following configuration:

Query params

  • tenant: The name of the tenant that owns the index.
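
Because tenant is passed as a query parameter rather than in the request body, the target URL might be built as in the sketch below. The base URL and the helper name index_creation_url are assumptions for illustration; only the /v1/indexes path and the tenant parameter come from this page.

```python
from urllib.parse import urlencode


def index_creation_url(base_url: str, tenant: str) -> str:
    """Build the index creation URL with the tenant as a query parameter."""
    # urlencode escapes characters that are not safe in a query string.
    query = urlencode({"tenant": tenant})
    return f"{base_url.rstrip('/')}/v1/indexes?{query}"


if __name__ == "__main__":
    # "https://api.example.com" is a placeholder host, not a real deployment.
    print(index_creation_url("https://api.example.com", "zeta-alpha"))
```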

Request body

  • name: (string, required) Human friendly name given to the index. For example "Zeta Alpha".
  • default: (boolean, optional) Whether the index is used at query time if no other index is specified. Only one index can be set as default per tenant.
  • description: (string, optional) A description of the index. For example "Main index for the research navigator app".
  • cluster_connection: (object, required) It specifies what index backend to use and how to access it.
    • backend: (string, required) The type of backend to use. Possible values are "opensearch".
    • host: (string, required) The hostname of the index. For example "opensearch-cluster-master-headless.opensearch.svc.cluster.local".
    • port: (integer, required) The port of the index. For example 9200.
    • settings: (object, optional) Connection settings.
      • use_ssl: (boolean, optional) Whether to use SSL when connecting to the index.
      • http_auth: (array of strings, optional) Credentials to use when connecting to the index. For example ["username", "password"].
      • verify_certs: (boolean, optional) Whether to verify the SSL certificate when connecting to the index.
      • ssl_show_warn: (boolean, optional) Whether to show a warning when connecting to the index with verify_certs disabled.
      • ca_certs: (string, optional) The path to the CA certificate to use when connecting to the index.
      • client_cert: (string, optional) The path to the client certificate to use when connecting to the index.
      • client_key: (string, optional) The path to the client key to use when connecting to the index.
  • storage_settings: (object, required) It specifies how the auxiliary data for this index is stored.
      • ingesting: (object, required) It specifies how large data ingested by the pipeline is stored. Data ingested into the document_content or document_content_path.base64_content fields is stored in the backend specified here.
      • backend: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) The configuration for the S3 backend.
        • s3_bucket_name: (string, required) The name of the S3 bucket.
        • s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
        • aws_access_key_id: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
        • aws_secret_access_key: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
        • aws_region: (string, optional) The AWS region for S3 operations.
        • aws_endpoint_url: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
      • azure: (object, optional) The configuration for the Azure backend.
        • azure_account_url: (string, required) The URL of the Azure account.
        • azure_container_name: (string, required) The name of the Azure container.
        • azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
        • azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
      • disk: (object, optional) The configuration for the disk backend.
        • disk_location: (string, optional) The path to the directory where the data will be stored.
      • max_file_size: (integer, optional) The maximum size, in bytes, of the files to store in the backend. The default value is 1024**3 (1GB).
    • processing: (object, required) It specifies how the data used by the pipeline is stored.
      • backend: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) The configuration for the S3 backend.
        • s3_bucket_name: (string, required) The name of the S3 bucket.
        • s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
        • aws_access_key_id: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
        • aws_secret_access_key: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
        • aws_region: (string, optional) The AWS region for S3 operations.
        • aws_endpoint_url: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
      • azure: (object, optional) The configuration for the Azure backend.
        • azure_account_url: (string, required) The URL of the Azure account.
        • azure_container_name: (string, required) The name of the Azure container.
        • azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
        • azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
      • disk: (object, optional) The configuration for the disk backend.
        • disk_location: (string, optional) The path to the directory where the data will be stored.
      • compression: (object, optional) The configuration for the compression of the data.
        • compression_algorithm: (enum, required) The algorithm to use for compressing the data. Possible values are "bz2", "gzip", "lzma", "snappy", "zlib", "zstd". The recommended value is "zlib".
        • level: (integer, optional) The level of compression to use. This is an integer between 0 and 9, where 0 is no compression and 9 is the maximum compression.
  • features: (object, optional) Configures the available features for this index.
    • neural_search: (object, optional) Configures the neural search feature.
      • model_serving_url: (string, required) The URL of the model server that will be used to compute vector embeddings. For example "http://sentence-encoder-api.production.svc.cluster.local:8080".
      • model_serving_url_pipeline: (string, optional) The URL of the model server that will be used to compute vector embeddings in the pipeline (offline). For example "http://sentence-encoder-api.production.svc.cluster.local:8080". If not passed, then the model_serving_url will be used.
      • compression_factor: (enum, optional) The compression factor to use for computing vector similarity. This considerably reduces the memory footprint of the index with limited impact on the quality of the results. Possible values are null, "1x", "2x", "4x", "8x", "16x", "32x". The default value is null (no compression), which is equivalent to "1x".
      • embedding_dimension: (integer, optional) The dimension of the embeddings. This depends on the model served at model_serving_url. If not passed, the default value is 768.
      • hnsw_search_params: (object, optional) Settings for HNSW search.
        • k: (object, optional) Overrides the k parameter by index type. For example, {"documents": 250, "chunks": 150} sets k to 250 for document retrieval and 150 for chunks (based on index name suffix).
    • tags: (object, optional) Include an empty object to enable the tag indexing feature.
  • capacity_configuration: (object, optional) Configures the storage and replication parameters for this index.
    • storage_units: (integer >= 1, optional) The number of storage units to provision for the index. One storage unit supports approximately 50GB of data. Defaults to 1. Note: this value can't be changed after the index is created (as of now).
    • replication_factor: (integer >= 1, optional) The number of times the data is replicated in the index. For example, if the replication factor is 3, the data is stored 3 times in the index. If the replication factor is 1, the data is stored only once in the index. This setting is useful for high availability and fault tolerance. Defaults to 1.
  • document_fields_configuration: (array of objects, optional) Specifies the name of the fields that the tenant wants in the index, as well as how they behave during indexing and retrieval.
    • name: (string, required) The name of the field that will be used to store values in the index, as well as when retrieving and filtering documents. This can be a nested field, for example authors.first_name.
    • type: (enum, required) The type of field. Possible values are "document_id", "string", "date", "number", "geolocation", "bounding_box", "document_content". Note that any of these types can be multi-valued, meaning they will accept a list of values. Furthermore, when a dotted field name such as authors.first_name is defined, the values are stored in a flat structure. For example, if we define authors.first_name and authors.last_name and later ingest data like {"authors": [{"first_name": "John", "last_name": "Doe"}, {"first_name": "Jane", "last_name": "Doe"}]}, then the index will contain the following fields: authors.first_name: ["John", "Jane"] and authors.last_name: ["Doe", "Doe"]. This structure minimizes indexing time and storage as well as retrieval time. However, it will not be possible to restrict search results to documents in which the same author has first_name=John and last_name=Doe. If this is a requirement, then the tenant should define the field with type nested (see nested_fields). The types document_id and document_content are only relevant when used inside a nested field.
    • alias: (string, optional) An alternative field name that can be used to retrieve documents. The alias must be unique. Aliases also enable special processing of certain fields. Set the alias to "metadata.DCMI.title" to have the field treated as the document title, and to "metadata.DCMI.abstract" to have it treated as the document description; both the title and the description are used, along with the content, to create embeddings. The alias "metadata.DCMI.created" marks the field that holds the creation date of the document; during processing, a field with this alias defaults to the ingestion date if no value is passed, and the frontend uses it for sorting.
    • search_options: (object, optional) How the field behaves during indexing and retrieval. Note that it also determines how the field is configured under the hood.
      • is_sort_field: (boolean, optional) Whether the field can be used for sorting documents at retrieval time.
      • is_facet_field: (boolean, optional) Whether the field can be used for faceted search. In other words, the search API is able to return a list of existing values for this field along with the document counts.
      • is_filter_field: (boolean, optional) Whether the field can be used for filtering documents at retrieval time.
      • is_returned_in_search_results: (boolean, optional) Whether the field is part of the search API response payload.
      • is_used_in_search: (boolean, optional) Whether the field is used in full text search.
      • supporting_subqueries: (boolean, optional) Whether the field can be used in subqueries. Subqueries are used to filter nested objects as if they were root-level documents. Search results return the parent document with the nested object filtered.
    • analyzer_options: (object, optional) How the field is analyzed during indexing and retrieval. This is only for fields of type string that also have search_options.is_used_in_search set to true.
      • analyzer: (string, optional) The name of the analyzer to use. Some default analyzers are provided, such as zav-en-nostem. Otherwise, the name must match a custom analyzer defined in the document_field_analysis section of the index payload. This analyzer is used for both indexing and retrieval. To use separate analyzers for indexing and retrieval, leave analyzer undefined and pass the analyzer names in the index_analyzer and search_analyzer fields instead. A list of built-in analyzers can be found here.
      • search_analyzer: (string, optional) The name of the analyzer to use at retrieval time.
      • index_analyzer: (string, optional) The name of the analyzer to use at indexing time.
    • nested_fields: (array of objects, optional) When type=nested then this contains the list of nested fields. The schema of each object in the array is the same as for the document_fields_configuration field. Note that the name field should not be prefixed with the name of the parent field. For example, if the parent field is authors, then the nested fields should be first_name and last_name, not authors.first_name and authors.last_name.
  • document_field_analysis: (object, optional) Defines custom analyzers and normalizers that can be used in the document_fields_configuration. A custom analyzer can be defined using character filters (char_filter), token filters (filter) and a tokenizer.
    • filter: (object, optional) Use this to define a token filter. You can find a list of custom filters here. Each key of this object is the name of the filter and the value depends on the filter type.
    • char_filter: (object, optional) Use this to configure a character filter; you can find available filters here. Each key of this object is the name of the filter and the value depends on the filter type.
    • normalizer: (object, optional) A normalizer is like an analyzer, but it has no tokenizer and outputs only a single token. Each key of this object is the name of a custom normalizer, and its value is an object with the following schema:
      • type: (string, required) Normalizer type. Use custom to declare a custom normalizer.
      • char_filter: (array of strings, required) List of character filters to use for the normalizer. Each entry refers to a key defined in the char_filter object of document_field_analysis.
      • filter: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key defined in the filter object of document_field_analysis.
    • analyzer: (object, optional) Use this to configure an analyzer. Each key of this object is the name of an analyzer, and its value is the analyzer configuration with the following schema:
      • type: (string, required) Type of analyzer. Use one of the types defined here, or custom to define a custom analyzer.
      • filter: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key defined in the filter object of document_field_analysis.
      • tokenizer: (string, required) Tokenizer to use in the analyzer. A list of built-in tokenizers can be found here.
  • client_settings: (object, optional) Defines index specific rendering information.
    • display_configuration: (object, optional) Defines how the index content is rendered in the frontend. The field names refer to the names provided in the document_fields_configuration.
      • title_field: (string, optional) The field used for rendering the title of the document card.
      • date_field: (string, optional) The field used for rendering the date of the document card.
      • created_by_field: (string, optional) The field used for rendering the authors of the document card.
      • description_field: (string, optional) The field used for rendering the description of the document card.
      • url_field: (string, optional) The field used to render the link to the source content.
      • source_field: (string, optional) The field used for rendering the source of the document card.
      • bounding_boxes_field: (string, optional) The field used for rendering the bounding boxes in the PDF viewer.
      • image_url_field: (string, optional) The field used for rendering the image of the document card.
      • document_metadata_fields: (array of objects, optional) Defines how other metadata is rendered in the card. This could be the number of references to this document (for example in scientific documents), a list of documents linked to the current one, etc.
        • type: (enum, required) The type of metadata. Possible values are "github", "twitter", "counter".
        • field_name: (string, optional) The field that contains the data to be rendered (used in the counter type).
        • url_field: (string, optional) If the rendered element is clickable, this specifies the link URL.
        • list_field_name: (string, optional) The field that contains the list of metadata to be rendered (used for the github and twitter types).
        • icon: (enum, optional) The icon associated with the rendered element. Possible values are "github", "twitter", "reference", "citation".
        • label: (string, optional) The label associated with the rendered element. This can be used in the tooltip, for example.
    • editable_fields: (array of objects, optional) Defines which document fields can be edited by users in the UI.
      • field_name: (string, required) The name of the field that can be edited. This refers to a field defined in document_fields_configuration.
      • label: (string, required) The display label for this editable field in the UI.
      • input_type: (enum, required) The type of input control to use. Possible values are "string", "number", "date", "strings" (for multi-valued string fields).
    • search_filters_configuration: (array of objects, optional) When defined, the search filters will be limited to the ones defined in this list of filters.
      • field_name: (string, required) This refers to the field name in the document_fields_configuration that this filter will filter by.
      • display_name: (string, required) The name that will be displayed as the filter name in the front end.
      • filter_type: (string, optional) Identifier of the filter type. The front end uses this string to choose the widget that displays the filter.
      • url_param: (string, optional) The string the front end uses as the URL parameter for this filter.
      • filter_type_settings: (object, optional) Filter-specific configuration. This can include default values and display names.
        • checkbox: (object, optional) Display configuration for checkboxes.
          • values: (array, required) List of values to be displayed and filtered by in the checkbox.
            • label: (string, required) Display name of the value to filter by.
            • value: (string, required) Value to filter by.
        • autocomplete: (object, optional) Display configuration for autocomplete.
          • values: (array of strings, required) List of values to be displayed and filtered by in the autocomplete.
        • nested_checkbox: (object, optional) Display configuration for nested checkboxes. A nested checkbox is a checkbox that contains other checkboxes. This is useful for visualizing filter values that have a nested structure.
          • separator: (string, required) The separator used to separate the nested fields. For example, if the field is Category.Subcategory, then the separator is . and Subcategory will be displayed inside Category.
          • values: (array of strings, required) List of values to be displayed and filtered by in the nested checkbox.
        • faceted_checkbox: (object, optional) Display configuration for faceted checkboxes. A faceted checkbox shows counts for each filter value and supports additional settings.
          • separator: (string, optional) The separator used to separate nested fields. For example, if the field is Category.Subcategory, then the separator is . and Subcategory will be displayed inside Category. No nested structure will be shown if this is not defined.
          • zero_count: (enum, required) Whether to display facet values with zero counts. Possible values are "show", "hide", "grayed_out".
          • count_mode: (enum, required) How to compute facet counts. Possible values are "filtered" (counts consider current facet filters) or "static" (counts ignore own facet filters).
          • max_terms: (integer, optional) Maximum number of terms to display in the facet.
          • sort: (enum, required) Sorting order for facet values. Possible values are "term_count" or "term_value".
      • retrieval_unit_configuration: (object, optional) Defines units used for retrieving content. For example, documents can be retrieved as a whole (document) or in chunks.
        • display_name: (string, required) The display name for the retrieval unit selection.
        • url_param: (string, required) The URL parameter used to identify the selected retrieval unit.
        • retrieval_units: (array of objects, required) List of available retrieval units.
          • display_name: (string, required) Display name of the retrieval unit.
          • value: (enum, required) Value of the retrieval unit. Possible values are "document" or "chunk".
        • default_retrieval_unit: (enum, required) Default retrieval unit to use. Must be one of the values defined in retrieval_units.
      • retrieval_method_configuration: (object, optional) Defines the available methods for retrieving content.
        • default: (object, required) Default retrieval method configuration.
          • label: (string, required) Display name of the default retrieval method.
          • value: (string, required) Value of the default retrieval method.
        • status: (enum, optional) Whether to show the retrieval method selector on the front end. Possible values are "active" (show the selector) and "hidden" (hide it).
        • options: (array of objects, optional) List of available retrieval methods.
          • label: (string, required) Display name of the retrieval method.
          • value: (string, required) Value of the retrieval method.
    • search_sorting_configuration: (object, optional) Defines the ordering options for the frontend. The field names refer to the names provided in the document_fields_configuration.
      • field_name: (string, required) The field to sort by.
      • display_name: (string, required) The name that will be rendered in the frontend for this sorting option.
      • url_param: (string, optional) The parameter that the frontend will use in the URL when this sorting option is selected.
      • retrieval_unit: (string, optional) If specified, this option is only shown for the selected retrieval unit.
    • search_relevance_configuration: (array of objects, optional) Defines the default search profile to use, per retrieval unit.
      • retrieval_unit: (string, optional) If specified, this search profile will only be used as default when searching for the retrieval unit.
      • retrieval_method: (string, optional) If specified, this search profile will only be used as default when searching with the retrieval method.
      • search_profile_name: (string, required) Search profile to use as default. This name refers to the name field of a profile defined under the search_profiles_configuration.
  • default_filters_configuration: (object, optional) Defines the default filters that are always applied to the search queries. The field names refer to the names provided in the document_fields_configuration.
    • and_operator: (array of objects, optional) Defines the filters that are applied with the AND operator. Each object in the array has the same schema as the default_filters_configuration object.
    • or_operator: (array of objects, optional) Defines the filters that are applied with the OR operator. Each object in the array has the same schema as the default_filters_configuration object.
    • not_operator: (object, optional) Defines the filters that are applied with the NOT operator. The object has the same schema as the default_filters_configuration object.
    • nested_operator: (object, optional) Defines the filters that are applied with the nested operator. This filter is used with nested fields.
      • field_path: (string, required) The path to the nested field. For example, if the nested field is authors, then the nested filters may refer to authors.first_name.
      • nested_filter: (object, required) The filter to apply to the nested fields. The object has the same schema as the default_filters_configuration object. The field names used inside this filter could be relative to the nested field or not. For example for authors, the field names could be first_name or authors.first_name.
    • exists: (object, optional) Defines the filters that are applied with the exists operator.
      • field_path: (string, required) The path to the field.
    • equals_to: (object, optional) Defines the filters that are applied with the exact match operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • greater_than: (object, optional) Defines the filters that are applied with the greater than operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • greater_than_or_equal_to: (object, optional) Defines the filters that are applied with the greater than or equal to operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • less_than: (object, optional) Defines the filters that are applied with the less than operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • less_than_or_equal_to: (object, optional) Defines the filters that are applied with the less than or equal to operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • is_in: (object, optional) Defines the filters that are applied with the is in operator.
      • field_path: (string, required) The path to the field.
      • field_values: (array of strings, required) The values to compare to.
    • geo_distance: (object, optional) Defines the filters that are applied with the geo distance operator.
      • field_path: (string, required) The path to the field.
      • point: (object, required) The point to compare to.
        • lat: (float, required) The latitude of the point to compare to.
        • lon: (float, required) The longitude of the point to compare to.
      • distance: (string, required) The distance to compare to.
    • geo_bounding_box: (object, optional) Defines the filters that are applied with the geo bounding box operator.
      • field_path: (string, required) The path to the field.
      • top_left_point: (object, required) The top left point of the bounding box.
        • lat: (float, required) The latitude of the point.
        • lon: (float, required) The longitude of the point.
      • bottom_right_point: (object, required) The bottom right point of the bounding box.
        • lat: (float, required) The latitude of the point.
        • lon: (float, required) The longitude of the point.
  • search_profiles_configuration: (object, optional) Defines the set of search profiles available for the index. A search profile is a collection of settings that changes the default search relevance of the documents. The relevance of a document is determined by the retrieval method (keyword, knn, mixed or reranked), as well as by boosting based on document attributes.
    • profiles: (array of objects, required) Holds all search profiles for this index. Only one set of settings can be defined per search profile. Mixed and reranker profiles can reference other search profiles.
      • name: (string, optional) Name of the search profile. This name is used to select a default profile in the search_relevance_configuration or at query time using the search_profile_name field.
      • keyword_settings: (object, optional) Use this configuration to use keyword search and define value based boosting weights.
        • query_settings: (object, optional) Defines boosting for documents in which the query is contained in one or more document fields. For example, if the user queries for the term "physics", you can boost the documents that have a field "topic" with the value "physics".
          • field_search_configs: (array of objects, required) Each element of this array represents a document field to be boosted.
            • field_path: (string, required) The path of the field to be boosted.
            • must_match_query: (boolean, optional) If true, only documents containing the search query on the document field will be returned. Defaults to false.
            • boosting_score: (object, optional) If the query is included in the document field, the document will be boosted.
              • weight: (float, required) Multiplier to be used for boosting.
            • constant_score: (object, optional) If defined and the query is included in the document, then the document relevance score will be equal to the weight.
              • weight: (float, required) Relevance score of the document.
        • functions_boosting_settings: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can do so by boosting a document when its "source" field is one of the trusted sources.
          • score_aggregation_method: (enum, optional) Refers to the method that is used to combine the scores given by each of the functions on function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
          • function_configs: (array of objects, required) List of functions to be applied. Each function can refer to different document fields or the same document field but with different values to be boosted.
            • field_path: (string, required) The path of the field.
            • function_type: (enum, required) Specifies the function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field has a specific value, for example to boost a specific source or a specific author.
            • function_config: (object, required) Defines the function configuration.
              • weighted_value: (object, required) Configuration for the weighted_value function.
                • field_value: (numeric or string, required) The function will be applied when the document field is equal to this parameter.
                • weight: (float, required) Score multiplier.
      • knn_settings: (object, optional) Use this configuration to use knn search. You can also specify functions to tune the relevance.
        • hnsw_settings: (object, optional) Settings for HNSW search.
          • k: (integer, optional) The number of nearest neighbors to retrieve.
        • functions_boosting_settings: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can do so by boosting a document when its "source" field is one of the trusted sources.
          • score_aggregation_method: (enum, optional) Refers to the method that is used to combine the scores given by each of the functions on function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
          • function_configs: (array of objects, required) List of functions to be applied. Each function can refer to different document fields or the same document field but with different values to be boosted.
            • field_path: (string, required) The path of the field.
            • function_type: (enum, required) Specifies the function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field has a specific value, for example to boost a specific source or a specific author.
            • function_config: (object, required) Defines the function configuration.
              • weighted_value: (object, required) Configuration for the weighted_value function.
                • field_value: (numeric or string, required) The function will be applied when the document field is equal to this parameter.
                • weight: (float, required) Score multiplier.
      • mixed_retrieval_settings: (object, optional) Use this configuration for the hybrid retrieval method, which takes multiple retrievers and combines their results. Two combination algorithms are currently supported. The base retrievers can either be fully defined in the configuration or referenced by profile name.
        • mixing_strategy_name: (enum, required) Mixing strategy name. This can be either rrf for Reciprocal Rank Fusion or linear_combination for a weighted sum of the scores of the base retrievers.
        • mixing_strategy_config: (object, required) Configuration for the mixing strategy.
          • rrf: (object, optional) Configuration for Reciprocal Rank Fusion
            • k: (integer, required) Rank constant. Must be greater than or equal to 1. The greater the constant, the more influence lower-ranked documents have.
            • max_documents_multiplier: (integer, required) Determines how many documents from each individual retriever are considered when mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
          • linear_combination: (object, optional) Configuration for linear combination strategy.
            • weights: (array of floats, required) List of weights by which each score will be multiplied. This list should have the same length as the retrievers defined either on retriever_profile_names or retriever_profiles and be in the same order.
            • max_documents_multiplier: (integer, required) Determines how many documents from each individual retriever are considered when mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
        • retriever_profile_names: (array of strings, optional) Profile names of the base retrievers. Each name should refer to a profile defined in profiles.
        • retriever_profiles: (array of objects, optional) Array of search profile objects for the base retrievers. The schema of each object in the array is the same as for the profiles field.
      • reranker_settings: (object, optional) Use this configuration to configure a reranked retriever.
        • reranker_service: (string, optional) Reranker service endpoint, for example "http://reranker-api.production.svc.cluster.local:8080".
        • aggregate_results: (boolean, optional) If the same document appears multiple times in the results of the base retriever, whether to aggregate its scores into one score per document. Defaults to true.
        • rerank_top_n: (integer, optional) Number of documents to rerank. Defaults to 50.
        • retriever_profile_names: (array of strings, optional) Profile names of the base retrievers. Each name should refer to a profile defined in profiles.
        • retriever_profiles: (array of objects, optional) Array of search profile objects for the base retrievers. The schema of each object in the array is the same as for the profiles field.
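
To make the roles of k and max_documents_multiplier in the Reciprocal Rank Fusion settings more concrete, the following is a minimal sketch of the textbook RRF formula, where each document scores the sum of 1 / (k + rank) over the rankings that contain it. It is an illustration under assumptions, not the platform's implementation: the function name, the tie-breaking rule and the example document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, page_size=10, max_documents_multiplier=3):
    """Fuse ranked lists of document ids with the textbook RRF formula.

    Ranks start at 1. Only the first page_size * max_documents_multiplier
    documents of each ranking are considered, mirroring the description of
    max_documents_multiplier (assumed behaviour, not confirmed by the API).
    """
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    limit = page_size * max_documents_multiplier
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:limit], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id (an assumption).
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:page_size]]


# Hypothetical results of a keyword retriever and a knn retriever.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
knn_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([keyword_ranking, knn_ranking], k=60, page_size=3)
print(fused)
```

With a larger k the term 1 / (k + rank) varies less between ranks, which is why lower-ranked documents gain relative influence.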

On successful response, the Zeta Alpha platform will create and configure a new index for the tenant. Take note of the id field in the response payload. This field will be the required index_id field of other endpoints.

Note: The API also provides endpoints for listing, retrieving, updating, and deleting indexes.
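
As an illustration of how the default_filters_configuration and search_profiles_configuration fields described above fit together, here is a hedged sketch of a fragment of the index request body built in Python. The profile names, field values and weights are hypothetical, the field paths are borrowed from the default DCMI fields, and check_mixed_profiles is a helper invented for this example that only handles retrievers referenced through retriever_profile_names.

```python
import json

# Always restrict results to English documents from two (hypothetical) sources.
default_filters_configuration = {
    "and_operator": [
        {"equals_to": {"field_path": "DCMI.language", "field_value": "en"}},
        {"is_in": {"field_path": "DCMI.source",
                   "field_values": ["source-a", "source-b"]}},
    ]
}

search_profiles_configuration = {
    "profiles": [
        {   # Keyword search that boosts one trusted source.
            "name": "keyword_boosted",
            "keyword_settings": {
                "functions_boosting_settings": {
                    "score_aggregation_method": "sum",
                    "function_configs": [{
                        "field_path": "DCMI.source",
                        "function_type": "weighted_value",
                        "function_config": {
                            "weighted_value": {"field_value": "source-a", "weight": 2.0}
                        },
                    }],
                }
            },
        },
        {"name": "semantic", "knn_settings": {"hnsw_settings": {"k": 100}}},
        {   # Hybrid profile referencing the two profiles above by name.
            "name": "hybrid",
            "mixed_retrieval_settings": {
                "mixing_strategy_name": "linear_combination",
                "mixing_strategy_config": {
                    "linear_combination": {"weights": [0.3, 0.7],
                                           "max_documents_multiplier": 3}
                },
                "retriever_profile_names": ["keyword_boosted", "semantic"],
            },
        },
    ]
}


def check_mixed_profiles(config):
    """Return a list of problems found in mixed profiles (hypothetical helper)."""
    defined = {profile.get("name") for profile in config["profiles"]}
    problems = []
    for profile in config["profiles"]:
        mixed = profile.get("mixed_retrieval_settings")
        if mixed is None:
            continue
        names = mixed.get("retriever_profile_names", [])
        for name in names:
            if name not in defined:
                problems.append(f"{profile.get('name')}: unknown profile {name!r}")
        linear = mixed["mixing_strategy_config"].get("linear_combination")
        if linear is not None and len(linear["weights"]) != len(names):
            problems.append(f"{profile.get('name')}: weights and retrievers differ in length")
    return problems


problems = check_mixed_profiles(search_profiles_configuration)
payload_fragment = json.dumps({
    "default_filters_configuration": default_filters_configuration,
    "search_profiles_configuration": search_profiles_configuration,
}, indent=2)
print(problems)
```

Checking the references locally before posting is a design choice of this sketch; the API may perform its own validation.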

Content Source configuration

A new content source is created by posting to the /v1/indexes/{index_id}/content-sources/ endpoint. It has the following configuration:

Path params

  • index_id: (string, required) The id of the index that will store the data from this content source.

Query params

  • tenant: The name of the tenant that owns the content source. Note that this must be the same tenant that owns the index, otherwise the endpoint will return an error.

Request body

  • name: (string, required) Human friendly name given to the content source. For example, "google-drive-daily".
  • connector: (enum, required) What connector type this is. Possible values are "s3", "custom_enhancement", "custom", "join_enhancement", "arxiv", "google_drive", "twitter", "bluesky", "box", "custom_web_crawler", "github", "slack", "confluence", "notes", "openreview", "user_documents", "sharepoint", "teams", "tags_enhancement", "metadata_extractor_enhancement", "agent_processor_enhancement", "federated_search", "edit_enhancement".
  • description: (string, optional) A description of the content source. For example "Daily ingestion of google drive files".
  • integration_id: (string, optional) The ID of an integration to use for this content source. Used when the connector requires OAuth or other external authentication.
  • is_indexable: (boolean, optional, default: true) Whether the ingested data will be processed and indexed. One use case for the false value is when the crawled data only serves as a source for an enhancement connector, and the tenant doesn't want individual documents to be searchable.
  • schedule: (string, optional) A cron-style schedule that defines how often the connector will run. This works for all connectors except "custom" and "custom_enhancement". If this value is not passed, the connector runs only when an ingestion job is manually triggered (explained later).
  • workflow_name_overrides: (object, optional) Custom workflow names to use for different operations.
    • ingest: (string, optional) Custom workflow name for ingestion operations.
    • reingest: (string, optional) Custom workflow name for reingestion operations.
    • delete: (string, optional) Custom workflow name for deletion operations.
    • access_rights: (string, optional) Custom workflow name for access rights operations.
  • connector_configuration: (object, optional) The connector-specific configuration. Only one configuration must be passed, corresponding to the value chosen in connector.

On successful response, the Zeta Alpha platform will create and configure the source connector for the selected tenant and index. Take note of the id field in the response payload. This field will be used in the content_source_id field of other endpoints.

Note: The API also provides endpoints for listing, retrieving, updating, and deleting content sources.
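
The request described above can be sketched as follows. This is a minimal example under assumptions: the base URL, the index id, the content source name and the cron expression are hypothetical, authentication headers are omitted because they are not covered here, and the request is only built, not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your deployment


def build_content_source_request(index_id, tenant, body):
    """Build (but do not send) the POST request that creates a content source."""
    path = f"/v1/indexes/{quote(index_id, safe='')}/content-sources/"
    query = urlencode({"tenant": tenant})
    return Request(
        f"{BASE_URL}{path}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = {
    "name": "arxiv-daily",                 # hypothetical name
    "connector": "arxiv",
    "description": "Daily ingestion of arXiv papers",
    "schedule": "0 3 * * *",               # assumed cron syntax: every day at 03:00
    "connector_configuration": {
        "arxiv_url": "http://export.arxiv.org/oai2",
        "arxiv_sets": ["cs"],
        "arxiv_categories": ["cs.LG"],
    },
}

request = build_content_source_request("my-index-id", "zeta-alpha", body)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; the id field of the response payload is the value to keep for later calls.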

The following subsections will explain how each connector is configured.

Custom connector

This is a generic connector for ingesting custom documents. It lets the user ingest custom documents directly through the document batches endpoint.

Crawled Objects

The crawled object can be any document with its fields defined on the document_fields_configuration in the Index configuration or in the field_mappings of the connector configuration.

Connector Configuration

The configuration object passed to the content source endpoint is the following:

S3 connector

This connector crawls documents stored in an S3 bucket. For a document to be ingested by this connector, a metadata JSON file should be provided along with the document content file:

  1. Metadata file: This file contains the metadata of the document, such as title, authors, etc. Its filename should be the content filename without the file extension, followed by .metadata.json. For example, if the content filename is my_document.pdf, the metadata filename should be my_document.metadata.json. An example metadata JSON for an index with the default fields looks like this:
    {
      "DCMI.title": "Title of the document",
      "DCMI.abstract": "summary of the document",
      "DCMI.creator": [
        {
          "full_name": "author name"
        }
      ],
      "DCMI.created": "2025",
      "DCMI.date": "2025-01-12",
      "DCMI.source": "source",
      "DCMI.language": "en",
      "DCMI.identifier": [
        "https://example.com/123"
      ],
      "access_rights": [
        "user_role_id:XYZ"
      ],
      "uri": "https://example.com/123"
    }

Alternatively, the metadata JSON can use the fields declared in the field_mappings of the connector configuration. For an index with custom fields, the JSON fields should be the ones defined in the document_fields_configuration of the index configuration (or in the field_mappings of the connector configuration).

  2. Content file: The content file to be ingested. Supported extensions are ".pdf", ".txt", ".html", ".md", ".csv", ".json", ".doc", ".docx", ".odt", ".xls", ".xlsx", ".ods", ".ppt", ".pptx", ".odp", ".vsd" and ".odg".
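
The naming convention for the metadata file can be sketched as a small helper. The function name is hypothetical, and the extension check is exact-match against the list above, which is an assumption about how strictly the connector compares extensions.

```python
from pathlib import PurePosixPath

SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".html", ".md", ".csv", ".json", ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".ods", ".ppt", ".pptx", ".odp", ".vsd", ".odg",
}


def metadata_key_for(content_key):
    """Return the metadata JSON key expected next to a content file key."""
    path = PurePosixPath(content_key)
    if path.suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported content file extension: {path.suffix!r}")
    # Drop the content file extension, then append ".metadata.json".
    return str(path.with_name(path.stem + ".metadata.json"))


print(metadata_key_for("my_document.pdf"))
print(metadata_key_for("reports/2025/summary.docx"))
```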

Crawled Objects

The crawled object consists of the fields declared in the metadata json file with the content file as document content.

Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • logo_url
  • is_document_owner
  • field_mappings
  • bucket_name: (string, required) Name of the bucket where the documents are.
  • prefix: (string, optional) Prefix within the bucket under which the documents are located.
  • aws_access_key_id: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
  • aws_secret_access_key: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
  • aws_region: (string, optional) The AWS region for S3 operations.
  • aws_endpoint_url: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
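
As a rough sketch of how these fields combine, the helper below builds an S3 connector configuration and leaves out every optional field that is not set, so that static credentials are only present when explicitly provided. The helper name, bucket name, endpoint and credential placeholders are hypothetical, and the rule that both credentials must be passed together is an assumption of this example.

```python
def s3_connector_configuration(bucket_name, prefix=None, aws_region=None,
                               aws_endpoint_url=None, aws_access_key_id=None,
                               aws_secret_access_key=None):
    """Build the connector_configuration object for an S3 content source."""
    if not bucket_name:
        raise ValueError("bucket_name is required")
    # Assumption: a key id without its secret (or vice versa) is a mistake.
    if (aws_access_key_id is None) != (aws_secret_access_key is None):
        raise ValueError("pass both static credentials or neither")
    optional = {
        "prefix": prefix,
        "aws_region": aws_region,
        "aws_endpoint_url": aws_endpoint_url,
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
    }
    config = {"bucket_name": bucket_name}
    config.update({key: value for key, value in optional.items() if value is not None})
    return config


# S3-compatible storage (for example MinIO) with static credentials.
minio_config = s3_connector_configuration(
    "documents",
    prefix="manuals/",
    aws_endpoint_url="http://minio.local:9000",
    aws_access_key_id="EXAMPLE_KEY_ID",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# AWS-hosted bucket relying on IRSA, so no static credentials are included.
irsa_config = s3_connector_configuration("documents", aws_region="eu-west-1")
print(irsa_config)
```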

ArXiv connector

This connector crawls arXiv papers using the OAI-PMH protocol (https://arxiv.org/help/oa/index). Currently arXiv doesn't require authentication for this procedure.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not supposed to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • abstract_url: (string, required) The url to the abstract. For example "https://arxiv.org/abs/{arxiv_id}"
  • title (mapped to title): (string, required) Title of the paper.
  • abstract: (string, required) Abstract of the paper.
  • categories: (array of strings, required) ArXiv categories that the paper belongs to.
  • created_at (mapped to created_at): (datetime, required) The date at which the paper was originally submitted to arXiv.
  • last_updated_at (mapped to last_updated_at): (datetime, required) The date at which the paper submission was last updated.
  • authors: (array of objects, required) Authors of the paper.
    • first_name: (string, required) First name of the author.
    • last_name: (string, required) Last name of the author.
    • full_name (mapped to authors): (string, required) The first name concatenated with the last name.
  • identifiers: (array of strings, required) List of paper identifiers. It includes the arXiv id, the abstract URL, DOI (if known), and DOI URL (if known).
  • journal: (string, required) The journal where the paper was published (if known). Otherwise it's an empty string.
  • content_source_name: (string, required) "arXiv".
  • format: (string, required) "scientific paper".
  • language: (string, required) "EN".
  • pdf_url: (string, required) The arXiv URL to the paper's PDF.
  • (reserved)uri (mapped to uri): (string, required) Same as abstract_url.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, required) "application/pdf".
  • (reserved)document_content_url (mapped to document_content_path.url_content.url): (string, required) Same as pdf_url.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • arxiv_url: (string, required) URL of the OAI-PMH API. For example, "http://export.arxiv.org/oai2". Check the arXiv documentation for the right URL to use, to avoid getting the connector banned.
  • arxiv_sets: (array of strings, required) List of arXiv category groups to crawl. For example "cs", "stat", "eess".
  • arxiv_categories: (array of strings, required) List of arXiv categories to crawl. For example "cs.LG", "cs.AI", "stat.ML". Note that if no categories are passed for a particular group, then nothing will be crawled for that group. That means that if arxiv_sets contains "eess" but no category that starts with "eess." is passed in arxiv_categories, then no papers in the "eess" group are crawled.
  • max_papers: (integer, optional) Maximum number of papers to crawl on each run.
  • since_crawl_date: (string, optional) Defines the publication date of the oldest paper to crawl. If not passed, it defaults to today.
  • allow_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.
    • name: (string, required) The name of the access right. For example the user UUID of a particular user, or "public" to allow everyone with access to the tenant.
    • type: (string, optional) The type of access right. For example "user_uuid" if the right is for a particular user.
    • content_source_id: (string, optional) If the access right is scoped to only a particular content source, then the id should be passed here. For example if a tenant wants to prevent a user from having access to this content source, even though they may have access to it in the source's system.
  • deny_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
  • logo_url
  • is_document_owner
  • field_mappings
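
Because a set listed in arxiv_sets is not crawled unless at least one matching category is listed in arxiv_categories, it can help to check a configuration for such sets before creating the content source. The following sketch does that; the function name is hypothetical.

```python
def sets_without_categories(arxiv_sets, arxiv_categories):
    """Return the sets for which no category with the "<set>." prefix is listed."""
    return [
        arxiv_set
        for arxiv_set in arxiv_sets
        if not any(category.startswith(arxiv_set + ".") for category in arxiv_categories)
    ]


configuration = {
    "arxiv_url": "http://export.arxiv.org/oai2",
    "arxiv_sets": ["cs", "stat", "eess"],
    "arxiv_categories": ["cs.LG", "cs.AI", "stat.ML"],
}

ignored = sets_without_categories(
    configuration["arxiv_sets"], configuration["arxiv_categories"]
)
print(ignored)  # sets that would not contribute any papers
```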

Google Drive connector

This connector crawls Google Drive documents using the Google Drive V3 API. Authentication credentials are needed to access the API in the form of a service account JSON file. The service account must have access to the Google Drive documents of the tenant.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not supposed to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • id: (string, required) Id of the document as given by the Google Drive API.
  • web_view_link: (string, required) A link for opening the file in a relevant Google editor or viewer in a browser.
  • created_time (mapped to created_at): (string, required) The time at which the file was created (RFC 3339 date-time).
  • size: (string, optional) The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive.
  • full_file_extension: (string, optional) The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive.
  • mime_type: (string, required) The MIME type of the file.
  • modified_time (mapped to last_updated_at): (string, required) The last time the file was modified by anyone (RFC 3339 date-time).
  • name (mapped to title): (string, required) The name of the file. This is not necessarily unique within a folder.
  • path: (string, required) The path of the file within Google Drive. The root is the name of the (shared) drive.
  • content_source_name: (string, required) "Google Drive".
  • owners: (array of objects, required) The owners of the file. Currently, only certain legacy files may have more than one owner. Not populated for items in shared drives.
    • display_name (mapped to authors): (string, required) A plain text displayable name for this user.
    • permission_id: (string, required) The user's ID as visible in Permission resources.
    • email_address: (email, required) The email address of the user. This may not be present in certain contexts if the user has not made their email address visible to the requester.
    • photo_link: (string, required) A link to the user's profile photo, if available.
  • version (mapped to version): (string, required) A monotonically increasing version number for the file. This reflects every change made to the file on the server, even those not visible to the user.
  • permissions: (array of objects, required) The Google Drive permissions assigned to the document.
    • id: (string, required) A unique identifier for this permission.
    • display_name: (string, required) The "pretty" name of the value of the permission (for example the user's full name, the name of a Google Group, the domain).
    • type: (string, required) The type of grantee. Possible values are "user", "group", "domain", "anyone".
    • email_address: (string, required) The email address of the user or group to which this permission refers.
    • role: (string, required) The role granted by this permission. Possible values are "owner", "organizer", "fileOrganizer", "writer", "commenter", "reader".
    • deleted: (boolean, required) Whether the account associated with this permission has been deleted.
  • export_links: (object, optional) A mapping between mime types and the link to export the document in such a mime type.
  • (reserved)uri (mapped to uri): (string, required) Same as web_view_link.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions field.
    • name: (string, required) The name of the access right. This is given by the value of permissions.email_address.
    • type: (string, optional) The type of access right. This is set to google_drive_{permissions.type}.
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) One of the supported mime types.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded document content. The document content is converted to PDF using Google Drive's export functionality if its mime type is not one of the supported mime types. Otherwise the document is downloaded as is and encoded using base64.
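
The derivation of the reserved access_rights field from permissions, as described above, can be sketched like this. The function name and the sample permissions are hypothetical, and the sketch does not cover cases the list above leaves open, such as deleted accounts or permissions without an email address.

```python
def access_rights_from_permissions(permissions):
    """Map Google Drive permissions to access right objects."""
    return [
        {
            "name": permission["email_address"],
            "type": f"google_drive_{permission['type']}",
            "content_source_id": None,
        }
        for permission in permissions
    ]


# Hypothetical permissions, reduced to the fields used by the mapping.
permissions = [
    {"type": "user", "email_address": "alice@example.com", "role": "owner"},
    {"type": "group", "email_address": "research@example.com", "role": "reader"},
]

access_rights = access_rights_from_permissions(permissions)
print(access_rights)
```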

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • service_account: (object, required) Credentials that grant access to the Google Drive documents of the tenant. This is the JSON file exported in the authentication setup.
    • type: (string, required)
    • project_id: (string, required)
    • private_key_id: (string, required)
    • private_key: (string, required)
    • client_email: (string, required)
    • client_id: (string, required)
    • auth_uri: (string, required)
    • token_uri: (string, required)
    • auth_provider_x509_cert_url: (string, required)
    • client_x509_cert_url: (string, required)
  • path_include_regex_patterns: (array of strings, optional) Google Drive documents whose path matches any of the regular expressions in the list will be crawled. If a document matches both an include and exclude pattern, the exclude pattern takes precedence and the document is not crawled. If not passed, then all documents are crawled.
  • path_exclude_regex_patterns: (array of strings, optional) Google Drive documents whose path matches any of the regular expressions in the list will not be crawled. If a document matches both an include and exclude pattern, the exclude pattern takes precedence and the document is not crawled.
  • logo_url
  • is_document_owner
  • field_mappings
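
The precedence rule for path_include_regex_patterns and path_exclude_regex_patterns can be sketched as follows. This is a minimal sketch, not the connector's code: the helper should_crawl is hypothetical, and it assumes patterns are matched anywhere in the path with re.search, which this reference does not specify.

```python
import re


def should_crawl(path, include_patterns=None, exclude_patterns=None):
    """Decide whether a document path would be crawled.

    Assumption: a pattern "matches" if re.search finds it anywhere in the path.
    """
    # Exclude patterns take precedence over include patterns.
    if exclude_patterns and any(re.search(p, path) for p in exclude_patterns):
        return False
    # With no include patterns, every non-excluded document is crawled.
    if not include_patterns:
        return True
    return any(re.search(p, path) for p in include_patterns)


include = [r"^/Reports/"]
exclude = [r"/drafts/"]
crawled = should_crawl("/Reports/2023/q1.pdf", include, exclude)
skipped = should_crawl("/Reports/drafts/q1.pdf", include, exclude)
```

The same rule is described for the include and exclude patterns of the other connectors in this section.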

SharePoint connector

This type of connector crawls documents stored on a SharePoint site. Authentication credentials for the account are needed to crawl the documents.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • document_id: (string, required) Id of the document. This is derived from the id on the SharePoint API.
  • content_source_name: (string, required) A name for the content source. This can be set in the connector configuration and defaults to "SharePoint".
  • formatted_id: (string, required) A formatted id composed of the site name, drive name and item path.
  • name: (string, required) The name of the document.
  • path: (string, required) The path of the document within SharePoint.
  • parent: (object, required) Reference to the parent folder.
    • id: (string, required) Id of the parent folder.
    • name: (string, required) Name of the parent folder.
    • path: (string, required) Path of the parent folder.
  • drive: (object, required) Information about the drive containing the document.
    • id: (string, required) Id of the drive.
    • name: (string, required) Name of the drive.
    • web_url: (string, required) URL to access the drive in SharePoint.
  • site: (object, required) Information about the site containing the document.
    • id: (string, required) Id of the site.
    • name: (string, optional) Name of the site.
    • web_url: (string, required) URL to access the site in SharePoint.
  • author: (array of objects, required) List of authors of the document.
    • id: (string, optional) Id of the author.
    • display_name: (string, optional) Display name of the author.
  • last_modified_by: (object, required) Identity of the user who last modified the document.
    • group: (object, optional) Group information if modified by a group.
      • id: (string, optional) Id of the group.
      • display_name: (string, optional) Display name of the group.
    • user: (object, optional) User information if modified by a user.
      • id: (string, optional) Id of the user.
      • display_name: (string, optional) Display name of the user.
  • created_date_time: (datetime, required) Creation time of the document.
  • last_modified_date_time: (datetime, optional) Last modification time of the document.
  • c_tag: (string, required) An eTag for the file content.
  • e_tag: (string, required) An eTag for the file's properties.
  • shared: (object, optional) Sharing information for the document.
  • (reserved)uri (mapped to uri): (string, required) URL to the document.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions endpoint.
    • name: (string, required) Id of the permission. This can be a Microsoft Group Id, a SharePoint Group Id or a user email.
    • type: (string, optional) The type of access right. This could be msft_user, msft_group, sharepoint_site_group or sharepoint_site_user.
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) One of the supported mime types.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded document content. The document content is converted to PDF using the endpoint format functionality. If the content is not converted to PDF and its mime type is not one of the supported mime types, the document is skipped.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • access_credentials: (object, required) The credentials needed to access SharePoint.
    • client_id: (string, required) The client ID of the Azure AD application.
    • client_secret: (string, required) The client secret of the Azure AD application.
    • tenant_id: (string, required) The tenant ID of the Azure AD.
  • include_one_drives: (boolean, optional, default: false) Whether to crawl OneDrive drives.
  • path_inclusion_regex_patterns: (array of strings, optional) Files whose path matches any of the regular expressions in the list will be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
  • path_exclusion_regex_patterns: (array of strings, optional) Files whose path matches any of the regular expressions in the list will not be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
  • drive_inclusion_regex_patterns: (array of strings, optional) Drives whose name matches any of the regular expressions in the list will be crawled. If a drive name matches both an include and exclude pattern, the exclude pattern takes precedence and the drive is not crawled. If not passed, then all drives are crawled.
  • drive_exclusion_regex_patterns: (array of strings, optional) Drives whose name matches any of the regular expressions in the list will not be crawled. If a drive name matches both an include and exclude pattern, the exclude pattern takes precedence and the drive is not crawled.
  • site_inclusion_regex_patterns: (array of strings, optional) Sites whose path matches any of the regular expressions in the list will be crawled. If a site path matches both an include and exclude pattern, the exclude pattern takes precedence and the site is not crawled. If not passed, then all sites are crawled.
  • site_exclusion_regex_patterns: (array of strings, optional) Sites whose path matches any of the regular expressions in the list will not be crawled. If a site path matches both an include and exclude pattern, the exclude pattern takes precedence and the site is not crawled.
  • site_paths: (array of objects, optional) List of specific SharePoint sites to crawl.
    • collection_hostname: (string, required) The hostname of the SharePoint site collection.
    • site_relative_path: (string, required) The relative path of the site within the collection.
  • one_drive_users: (array of strings, optional) List of email addresses of users whose OneDrive contents should be crawled. Only applicable when include_one_drives is true.
  • include_sub_sites: (boolean, optional, default: false) Whether to crawl sub-sites of the main sites.
  • logo_url
  • is_document_owner
  • field_mappings
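
For orientation, a SharePoint configuration object might look like the sketch below. All values are placeholders chosen for illustration (the hostname, site path, user and credentials are hypothetical), and the exact format expected for site_relative_path is an assumption.

```python
import json

# Hypothetical SharePoint connector configuration; every value is a placeholder.
sharepoint_config = {
    "access_credentials": {
        "client_id": "<AZURE_AD_CLIENT_ID>",
        "client_secret": "<AZURE_AD_CLIENT_SECRET>",
        "tenant_id": "<AZURE_AD_TENANT_ID>",
    },
    # Crawl one specific site and its sub-sites.
    "site_paths": [
        {
            "collection_hostname": "contoso.sharepoint.com",
            "site_relative_path": "sites/research",
        }
    ],
    "include_sub_sites": True,
    # one_drive_users only applies because include_one_drives is true.
    "include_one_drives": True,
    "one_drive_users": ["ada@contoso.com"],
    # Skip anything under an Archive folder.
    "path_exclusion_regex_patterns": [r"/Archive/"],
}

payload = json.dumps(sharepoint_config, indent=2)
```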

Box connector

This connector crawls Box files using the Box SDK. Authentication credentials for the account are needed to crawl the documents.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • file_id: (string, required) Id of the file as given by the Box API.
  • share_url: (string, optional) The URL used to share the file.
  • content_created_at (mapped to created_at): (datetime, optional) The time at which the file content was created.
  • content_modified_at (mapped to last_updated_at): (datetime, optional) The time at which the file content was last modified.
  • created_at: (datetime, optional) The time at which the file was created in Box.
  • modified_at: (datetime, optional) The time at which the file was last modified in Box.
  • etag: (string, optional) The file's etag.
  • classification: (string, optional) Classification of the file if available.
  • description: (string, optional) Description of the file.
  • comment_count: (integer, required) Number of comments on the file.
  • extension: (string, optional) Extension of the file.
  • metadata: (object, optional) Box metadata associated with the file.
  • name: (string, required) Name of the file.
  • filepath: (string, required) Full path of the file in Box.
  • sequence_id: (string, required) A unique string to identify the version of the file.
  • shared_link_access: (string, optional) Access level of the shared link.
  • size: (integer, optional) Size of the file in bytes.
  • tags: (array of strings, required) List of tags applied to the file.
  • content_source_name: (string, required) Name identifier for the content source, defaults to "Box".
  • owners: (array of objects, required) List with information about the file owner.
    • id: (string, required) Box user ID of the owner.
    • display_name: (string, optional) Display name of the owner.
    • email_address: (string, optional) Email address of the owner.
  • permissions: (array of objects, required) List of users and groups with access to the file.
    • collab_id: (string, required) ID of the collaboration.
    • entity_id: (string, required) ID of the user or group.
    • display_name: (string, optional) Display name of the user or group.
    • email_address: (string, optional) Email address if entity is a user.
    • role: (string, optional) Access role assigned to the entity.
    • status: (string, optional) Status of the collaboration.
    • entity_type: (string, required) Type of entity ("user" or "group").
  • (reserved)uri (mapped to uri): (string, required) URL to access the file in Box.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions field.
    • name: (string, required) Email address or entity ID of the permission holder.
    • type: (string, required) Type of the access right, prefixed with "box_", e.g. "box_user", "box_group".
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, required) One of the supported mime types.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded file content.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • access_credentials: (object, required) The credentials needed to access Box.
    • client_id: (string, required) The client ID of the Box application.
    • client_secret: (string, required) The client secret of the Box application.
    • enterprise_id: (string, required) The enterprise ID of the Box account.
  • path_inclusion_regex_patterns: (array of strings, optional) Files whose path matches any of the regular expressions in the list will be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
  • path_exclusion_regex_patterns: (array of strings, optional) Files whose path matches any of the regular expressions in the list will not be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
  • folder_ids: (array of strings, optional) List of Box folder IDs to crawl. A folder id is the numeric sequence on a folder url. If not passed, then all folders are crawled.
  • since_crawl_date: (datetime, optional) Only crawl files updated after this datetime. If full_crawl is false and this field is not set, it defaults to the last day. The format is "YYYY-MM-DDTHH:MM:SSZ" (including the timezone). For example "2023-01-01T00:00:00Z".
  • crawl_limit: (integer, optional) The maximum number of files to crawl.
  • full_crawl: (boolean, optional) Whether to perform a full crawl of all files in the Box account. If the connector has been previously scheduled, it will crawl only the files that have been updated since the last crawl. Performing a full crawl will ingest all files, but it will not delete any files that have been removed from Box.
  • last_streaming_token: (string, optional) The last streaming position received from Box. This is used to resume the streaming of files from the last point. This attribute will be overridden by the next connector run.
  • logo_url
  • is_document_owner
  • field_mappings
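
The since_crawl_date format can be produced from a timezone-aware datetime as in the sketch below. The helper format_since_crawl_date, the folder id and the credential placeholders are hypothetical and only serve the illustration.

```python
from datetime import datetime, timezone


def format_since_crawl_date(moment):
    """Format an aware datetime as YYYY-MM-DDTHH:MM:SSZ, expressed in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical Box connector configuration; every value is a placeholder.
box_config = {
    "access_credentials": {
        "client_id": "<BOX_CLIENT_ID>",
        "client_secret": "<BOX_CLIENT_SECRET>",
        "enterprise_id": "<BOX_ENTERPRISE_ID>",
    },
    "folder_ids": ["123456789"],
    "full_crawl": False,
    "since_crawl_date": format_since_crawl_date(
        datetime(2023, 1, 1, tzinfo=timezone.utc)
    ),
}
```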

Federated Search connector

This connector queries a federated index using the search API and ingests matching documents based on specified queries.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • abstract: (string, required) Abstract of the document.
  • title: (string, required) Title of the document.
  • created_at (mapped to created_at): (datetime, required) The date at which the document was created.
  • authors: (array of objects, required) Authors of the document.
    • full_name (mapped to authors): (string, required) The full name of the author.
  • content_source_name: (string, required) The name of the content source where the document originated.
  • (reserved)uri (mapped to uri): (string, required) The URI of the document.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Images associated with the document.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) The MIME type of the document content, typically "application/pdf".
  • (reserved)document_content_url (mapped to document_content_path.url_content.url): (string, optional) URL to the document's PDF or content file.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • logo_url
  • is_document_owner
  • field_mappings
  • queries: (array of objects, required) List of query specifications. Each object must contain at least a query_string field and any additional parameters such as filters, year, date, or sources. For more information, see the Zeta Alpha Search API documentation.
  • source_index_id: (string, required) The name of the search engine to be used. Supported values are "google_scholar", "google", and "bing".
  • search_api_url: (string, optional) The URL of the search API endpoint. Defaults to the tenant's search endpoint.
  • sort_by_relevance: (boolean, optional, default: true) Whether to sort results by relevance score.
  • max_number_of_pages: (integer, optional, default: 10) Maximum number of pages of results to fetch.
  • page_size: (integer, optional, default: 10) Number of documents to fetch per page.
  • authorization: (string, optional) Authorization header value to include in Search API requests.
  • stop_early: (boolean, optional, default: false) Stop fetching further pages when a result is found that was already ingested before.
  • request_headers: (object, optional) Additional HTTP headers to include in the request.
  • fetch_abstracts: (boolean, optional, default: false) Whether to fetch document abstracts directly from the document source.
  • allow_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.
    • name: (string, required) The name of the access right.
    • type: (string, optional) The type of access right.
    • content_source_id: (string, optional) If the access right is scoped to only a particular content source.
  • deny_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
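
A configuration for this connector might be assembled as in the following sketch. The query strings and the access right name are hypothetical. The check at the end only illustrates the rule that every query object needs a query_string field; it is not the service's own validation.

```python
# Hypothetical Federated Search connector configuration.
federated_config = {
    "source_index_id": "google_scholar",
    "queries": [
        {"query_string": "dense retrieval"},
        {"query_string": "neural search evaluation"},
    ],
    "page_size": 10,
    "max_number_of_pages": 2,
    "stop_early": True,
    # Without allow_access_rights no user can retrieve the crawled documents.
    "allow_access_rights": [{"name": "research-team"}],
}

# Collect the positions of query objects that lack the required field.
invalid_queries = [
    index
    for index, query in enumerate(federated_config["queries"])
    if "query_string" not in query
]
```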

Twitter connector

This connector crawls tweets using the Twitter API. Authentication credentials are needed to access the API in the form of a Bearer token.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • id: (string, required) Unique identifier of the Tweet.
  • mentioned_urls: (array of strings, required) List of URLs mentioned in the Tweet.
  • retweet_count: (integer, required) Number of times this Tweet has been Retweeted.
  • reply_count: (integer, required) Number of Replies of this Tweet.
  • like_count: (integer, required) Number of Likes of this Tweet.
  • created_at (mapped to created_at): (datetime, optional) Creation time of the Tweet.
  • lang: (string, optional) Language of the Tweet, if detected by Twitter. Returned as a BCP47 language tag.
  • user_screen_name: (string, required) The Twitter screen name, handle, or alias that the Tweet author identifies themselves with.
  • user_followers_count: (integer, required) Number of followers that the Tweet author has.
  • user_following_count: (integer, required) Number of users that the Tweet author is following.
  • user_profile_url: (string, required) URL to the Tweet author's profile.
  • user_profile_image_url: (string, required) The URL to the profile image for the Tweet author.
  • tweet_type: (enum, required) Type of Tweet. Possible values are "tweet", "quoted", "retweet".
  • (reserved)uri (mapped to uri): (string, required) Link to the Tweet.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) "text/plain".
  • (reserved)document_content (mapped to document_content_path.base64_content): (string, optional) The Tweet content as a plain string.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • api_url: (string, required) URL of the Twitter API. For example "https://api.twitter.com/2/tweets/search/recent".
  • bearer_token: (string, required) Token that grants access to the Twitter API. This is obtained in the authentication setup.
  • search_query: (string, required) One query for matching Tweets. For more information, see the Twitter documentation on building queries for Search Tweets. Note that the string size is limited to 512 characters for standard access, and to 1024 characters for Academic Research access. For example url:"https://arxiv.org" -from:PaperTrending -from:arxivabs -from:mathITbot -from:arxiv_cscv -from:arxivml.
  • crawl_limit: (integer, optional) The maximum number of Tweets to crawl on every run. If not passed, the limit is set to not exceed the Twitter API monthly limit.
  • start_date: (datetime, optional) The oldest UTC timestamp (from most recent seven days) from which the Tweets will be provided. Timestamp is in second granularity and is inclusive (for example, 12:00:01 includes the first second of the minute). If not passed, it will crawl Tweets from up to seven days ago.
  • end_date: (datetime, optional) The newest, most recent UTC timestamp to which the Tweets will be crawled. Timestamp is in second granularity and is exclusive (for example, 12:00:01 excludes the first second of the minute). If not passed, it will crawl Tweets from as recent as 30 seconds ago.
  • mentioned_url_regex: (string, optional) Limits the crawled mentioned_urls to only those that match the regular expression. If the filtered list of mentioned_urls is empty, the Tweet will not be crawled.
  • excluded_users: (array of strings, optional) Tweets whose author is in this list will not be crawled. Note: It's more efficient to pass the list of excluded users in the search_query if they fit within the character limit.
  • logo_url
  • is_document_owner
  • field_mappings
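
Because search_query is limited in size, it can help to check the length before submitting the configuration. The sketch below is a minimal illustration: the helper query_fits is hypothetical, and it assumes the limit is counted in characters as Python's len does.

```python
STANDARD_LIMIT = 512   # standard access
ACADEMIC_LIMIT = 1024  # Academic Research access


def query_fits(search_query, academic_access=False):
    """Return True if the query is within the documented size limit."""
    limit = ACADEMIC_LIMIT if academic_access else STANDARD_LIMIT
    return len(search_query) <= limit


# The example query from the search_query description above.
search_query = (
    'url:"https://arxiv.org" -from:PaperTrending -from:arxivabs '
    "-from:mathITbot -from:arxiv_cscv -from:arxivml"
)

twitter_config = {
    "api_url": "https://api.twitter.com/2/tweets/search/recent",
    "bearer_token": "<BEARER_TOKEN>",  # placeholder
    "search_query": search_query,
}
```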

Bluesky connector

This connector crawls posts from the Bluesky app. Authentication credentials for the account are needed to crawl the documents.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • id: (string, required) Unique identifier of the post.
  • mentioned_urls: (array of strings, required) List of URLs mentioned in the post. If the mentioned_url_domain filter is set, only URLs of the required domain are included.
  • repost_count: (integer, required) Number of times this post has been reposted.
  • reply_count: (integer, required) Number of Replies of this post.
  • like_count: (integer, required) Number of Likes of this post.
  • created_at (mapped to created_at): (datetime, optional) Creation time of the post.
  • lang: (string, optional) Language of the post, if detected by Bluesky. Returned as a BCP47 language tag.
  • user_screen_name: (string, required) The Bluesky handle of the post author.
  • user_followers_count: (integer, required) Number of followers that the post author has.
  • user_following_count: (integer, required) Number of users that the post author is following.
  • user_profile_url: (string, required) URL to the post author's profile.
  • user_profile_image_url: (string, required) The URL to the profile image for the post author.
  • (reserved)uri (mapped to uri): (string, required) Link to the post.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) "text/plain".
  • (reserved)document_content (mapped to document_content_path.base64_content): (string, optional) The post content as a plain string.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • search_query: (string, required) Search query string, Lucene query syntax is recommended.
  • credentials: (object, required) Account credentials.
    • username: (string, required) Bluesky handle.
    • password: (string, required) Bluesky account password.
  • crawl_limit: (integer, optional) Maximum number of posts to crawl.
  • since_crawl_date: (date, optional) Filters posts to those after the indicated date (inclusive). Defaults to today.
  • until_crawl_date: (date, optional) Filters posts to those before the indicated date (exclusive).
  • mentioned_url_domain: (string, optional) Filters posts to those with URLs linking to the given domain (hostname).
  • excluded_users: (array of strings, optional) List of user handles. Posts from the listed users are excluded.
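
The effect of mentioned_url_domain on the mentioned_urls field can be pictured with the sketch below. The helper filter_mentioned_urls is hypothetical, and exact hostname equality is an assumption; this reference does not say whether subdomains also match.

```python
from urllib.parse import urlparse


def filter_mentioned_urls(urls, domain):
    """Keep only the URLs whose hostname equals the given domain.

    Assumption: subdomains such as "www.arxiv.org" do not match "arxiv.org".
    """
    return [url for url in urls if urlparse(url).hostname == domain]


urls = ["https://arxiv.org/abs/1234.5678", "https://example.com/post"]
kept = filter_mentioned_urls(urls, "arxiv.org")
```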

GitHub connector

This connector crawls GitHub repository content using the GitHub API. Authentication credentials are needed to access the API in the form of a Personal Access Token.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • id: (string, required) SHA-1 hash of the Git blob object.
  • org: (string, required) Name of the organization that owns the repository.
  • repo: (string, required) Name of the repository where the file lives.
  • path: (string, required) Path of the file within the repository.
  • full_path (mapped to title): (string, required) Same as path but also including the repository name.
  • size: (integer, optional) Size of the file in bytes.
  • collaborators: (array of objects, required) List of collaborators of the repository. This includes outside collaborators, organization members who are direct collaborators, organization members with access through team memberships, organization members with access through default organization permissions, and organization owners.
    • id: (integer, required) Id of the collaborator in GitHub's database.
    • node_id: (string, required) Id of the collaborator in GitHub's GraphQL database.
    • login: (string, required) Username of the collaborator.
    • type: (string, required) Type of collaborator, for example "User".
    • role_name: (string, optional) Name of the role that the collaborator has within the repository. Possible values are "write", "maintain", "admin".
  • committers: (array of objects, required) List of users who have committed changes to the file. Note that GitHub may not always match the author of the commit to a GitHub user (it depends on the Git config of the committer).
    • id: (integer, optional) Id of the commit author in GitHub's database. This is only populated if GitHub can match the commit author to a GitHub user.
    • node_id: (string, optional) Id of the commit author in GitHub's GraphQL database. This is only populated if GitHub can match the commit author to a GitHub user.
    • login: (string, optional) Username of the commit author. This is only populated if GitHub can match the commit author to a GitHub user.
    • name (mapped to authors): (string, required) Name of the commit author. This is set in the committer's Git configuration.
    • email: (string, required) Email of the commit author. This is set in the committer's Git configuration.
  • content_source_name: (string, required) "Private GitHub".
  • last_updated_at (mapped to last_updated_at): (datetime, optional) Date of the latest commit to the file.
  • (reserved)uri (mapped to uri): (string, required) Link to the file in GitHub's website.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) The logo_url passed in the connector configuration.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the collaborators field.
    • name: (string, required) The name of the access right. This is given by the value of collaborators.login.
    • type: (string, optional) The type of access right. This is set to github_{collaborators.type.lower()}.
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) One of the supported mime types. This is guessed from the file extension.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded file content.
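
The access_rights mapping for GitHub follows the pattern github_{collaborators.type.lower()} given above. The sketch below illustrates it with a hypothetical helper and a made-up collaborator; it is not the connector's own code.

```python
def github_access_rights(collaborators):
    """Build access right entries from repository collaborators."""
    return [
        {
            "name": collaborator["login"],
            # The collaborator type is lowercased, e.g. "User" becomes "user".
            "type": f"github_{collaborator['type'].lower()}",
            "content_source_id": None,
        }
        for collaborator in collaborators
    ]


# Made-up collaborator, for illustration only.
collaborators = [{"id": 1, "node_id": "abc", "login": "octocat", "type": "User"}]
rights = github_access_rights(collaborators)
```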

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • org_name: (string, required) Name of the organization that owns the repositories to be crawled.
  • personal_access_token: (string, required) Token that grants access to the GitHub content of the tenant. This is obtained in the authentication setup.
  • allowed_repositories: (array of strings, required) List of repositories to be crawled.
  • content_configuration: (object, required) Configuration of the repository content to be crawled.
    • repository_files: (object, optional) Configuration for crawling files within the repository.
      • path_include_regex_patterns: (array of strings, optional) Files whose path (given by the path field in the crawled object) matches any of the regular expressions in the list will be crawled. If a file matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
      • path_exclude_regex_patterns: (array of strings, optional) Files whose path (given by the path field in the crawled object) matches any of the regular expressions in the list will not be crawled. If a file matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
  • logo_url
  • is_document_owner
  • field_mappings
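
A GitHub configuration object might be put together as in this sketch. The organization, repositories and patterns are hypothetical, and the check for required keys only mirrors the fields marked as required above.

```python
# Hypothetical GitHub connector configuration; every value is a placeholder.
github_config = {
    "org_name": "example-org",
    "personal_access_token": "<PERSONAL_ACCESS_TOKEN>",
    "allowed_repositories": ["docs", "platform"],
    "content_configuration": {
        "repository_files": {
            # Crawl Markdown files, but nothing under vendor/.
            "path_include_regex_patterns": [r"\.md$"],
            "path_exclude_regex_patterns": [r"^vendor/"],
        }
    },
}

REQUIRED_KEYS = (
    "org_name",
    "personal_access_token",
    "allowed_repositories",
    "content_configuration",
)
missing_keys = [key for key in REQUIRED_KEYS if key not in github_config]
```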

Slack connector

This connector crawls Slack messages using the Slack API. Authentication credentials are needed to access the API in the form of a Bearer token.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • formatted_id (mapped to title): (string, required) The channel name where the message lives. Note that the channel name in a direct message is a comma separated list of the member names.
  • content_source_name: (string, required) "Slack".
  • channel_name: (string, required) Same as formatted_id but without formatting (i.e. without the # symbol for channels).
  • created_at (mapped to created_at): (datetime, required) The date at which the message was created.
  • last_updated_at (mapped to last_updated_at): (datetime, optional) The date at which the message was last edited.
  • thread_ts: (string, optional) Unique identifier of either a thread's parent message or a message in the thread. Messages without replies will not have this field populated.
  • member_info: (array of objects, required) Information of the user that created the message.
    • id: (string, required) Id of the user given by the Slack API.
    • email: (string, required) Email of the Slack user.
    • name (mapped to authors): (string, required) Full real name of the Slack user.
    • is_bot: (boolean, required) Whether the user is a bot.
    • avatar_url: (string, optional) The URL of the user avatar if it exists, otherwise it falls back to the team logo, otherwise it falls back to the logo_url passed in the connector configuration.
  • (reserved)uri (mapped to uri): (string, required) Permalink to the Slack message.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) The avatar of the message author (member_info.avatar_url).
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the members of the channel or direct message. If the message is in a public channel, then an extra "domain" access right is included.
    • name: (string, required) The name of the access right. This is given by the value of member_info.email for members of the channel or direct message, and the value of the team domain for the extra permission in public channels (for example "zeta-alpha-workspace").
    • type: (string, optional) The type of access right. This is set to slack_user for members of the channel or direct message, and to slack_workspace for the extra permission in public channels.
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) Either "text/plain" or "text/markdown". Note that Slack has its own markup language, so the connector converts it to Markdown for better compatibility.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded message content.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • workspace_id: (string, required) Id of the workspace to crawl. This can be found by opening the workspace in the browser. The URL will look like https://app.slack.com/client/<WORKSPACE ID>.
  • api_token: (string, required) Token that grants access to the Slack messages of the tenant. This is obtained in the authentication setup.
  • file_include_regex_patterns: (array of strings, optional) Attached files whose name matches any of the regular expressions in the list will be crawled. If a file matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
  • file_exclude_regex_patterns: (array of strings, optional) Attached files whose name matches any of the regular expressions in the list will not be crawled. If a file matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
  • channel_include_names: (array of strings, optional) Channel names in the list will be crawled. If a channel name is in both the include and exclude lists, then the channel will not be crawled. If not passed, then all channels are crawled.
  • channel_exclude_names: (array of strings, optional) Channel names in the list will not be crawled. If a channel name is in both the include and exclude lists, then the channel will not be crawled.
  • entities: (array of enums, required) Types of Slack channels to crawl. Possible values are "public_channel", "private_channel", "group_message", "private_message".
  • include_bot_messages: (boolean, optional, default: false) Whether to include messages created by bots.
  • include_archived_messages: (boolean, optional, default: false) Whether to include messages in archived channels.
  • logo_url
  • is_document_owner
  • field_mappings
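
The include/exclude precedence described above can be sketched in Python. This is a minimal illustration, not the connector's implementation: the helper should_crawl is hypothetical, and the use of re.search (unanchored matching) is an assumption, since the matching mode is not specified here. The placeholder values in slack_config must be replaced with real ones.

```python
import re


def should_crawl(name, include_patterns=None, exclude_patterns=None):
    """Hypothetical helper: decide whether a file name passes the filters."""
    # An exclude match always wins, even if an include pattern also matches.
    if exclude_patterns and any(re.search(p, name) for p in exclude_patterns):
        return False
    # When no include patterns are passed, everything else is crawled.
    if not include_patterns:
        return True
    return any(re.search(p, name) for p in include_patterns)


# Example configuration object using only the fields documented above.
slack_config = {
    "workspace_id": "<WORKSPACE ID>",
    "api_token": "<API TOKEN>",
    "file_include_regex_patterns": [r"\.pdf$", r"\.md$"],
    "file_exclude_regex_patterns": [r"^draft"],
    "entities": ["public_channel", "private_channel"],
    "include_bot_messages": False,
    "include_archived_messages": False,
}
```

With these patterns, "report.pdf" would be crawled, while "draft.pdf" would be skipped because the exclude pattern takes precedence.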

Confluence connector

This connector crawls pages using the Confluence API. Authentication credentials are needed to access the API in the form of an API token.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • space_key: (string, required) The key of the space, as it appears in the space URL. For example "https://example.confluence.com/wiki/spaces/{space_key}/overview".
  • space_name: (string, required) The name of the space.
  • labels: (array of strings, required) List of labels of the page.
  • created_date (mapped to created_at): (datetime, required) The date at which the page was originally created.
  • last_updated (mapped to last_updated_at): (datetime, required) The date at which the page was last edited.
  • created_by: (object, required) Original creator of the page. The object schema is exactly the same as for contributors.
  • title (mapped to title): (string, required) The title of the page.
  • content_source_name: (string, required) "Confluence".
  • contributors: (array of objects, required) List of users that have contributed to the page.
    • account_id: (string, required) Account id of the user, as it appears in the profile URL. For example "https://example.confluence.com/wiki/people/{account_id}".
    • email: (string, required) Email of the user.
    • display_name (mapped to authors): (string, required) Registered name of the user.
    • avatar_url: (string, optional) URL of the profile picture of the user.
  • status: (string, required) Page status, either "archived" or "active".
  • item_type: (enum, required) Type of page, either global or personal.
  • parent_id: (string, optional) Id of the parent page.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • access_credentials: (object, required) Credentials to be used to crawl the content:
    • instance_url: Your Confluence instance URL, example: "https://example.confluence.com".
    • username: The user of the crawling account.
    • password: The API token created for this user.
  • page_include_regex_patterns: (array of strings, optional) Pages whose title matches any of the regular expressions in the list will be crawled. If a page title matches both an include and exclude pattern, the exclude pattern takes precedence and the page is not crawled. If not passed, then all pages are crawled.
  • page_exclude_regex_patterns: (array of strings, optional) Pages whose title matches any of the regular expressions in the list will not be crawled. If a page title matches both an include and exclude pattern, the exclude pattern takes precedence and the page is not crawled.
  • space_include_keys: (array of strings, optional) Space keys in the list will be crawled. If a space key is in both the include and exclude lists, then the space will not be crawled. If not passed, then all spaces are crawled.
  • space_exclude_keys: (array of strings, optional) Space keys in the list will not be crawled. If a space key is in both the include and exclude lists, then the space will not be crawled.
  • include_archived_spaces: (boolean, required) Whether to include archived spaces.
  • include_personal_spaces: (boolean, required) Whether to include personal spaces.
  • logo_url
  • is_document_owner
  • field_mappings

Teams connector

This connector crawls Microsoft Teams messages using the Microsoft Graph API. Authentication credentials are needed to access the API.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • formatted_id (mapped to title): (string, optional) The channel name where the message lives, formatted as {team_name}/{channel_name}.
  • content_source_name: (string, required) Name identifier for the content source, defaults to "Teams".
  • message_id: (string, optional) Unique identifier of the message.
  • author: (array of objects, required) Information about the message author.
    • id: (string, optional) Id of the author.
    • display_name (mapped to authors): (string, optional) Display name of the author.
  • created_date_time (mapped to created_at): (datetime, optional) Creation time of the message.
  • last_edited_date_time: (datetime, optional) Last time the message was edited.
  • last_modified_date_time (mapped to last_updated_at): (datetime, optional) Last time the message was modified.
  • message_type: (string, optional) Type of the message.
  • reply_to_id: (string, optional) ID of the message this is a reply to.
  • channel: (object, required) Information about the channel containing the message.
    • id: (string, required) Channel ID.
    • web_url: (string, optional) URL to the channel.
    • display_name: (string, optional) Display name of the channel.
  • team: (object, required) Information about the team containing the message.
    • id: (string, required) Team ID.
    • web_url: (string, optional) URL to the team.
    • display_name: (string, optional) Display name of the team.
  • (reserved)uri (mapped to uri): (string, required) Permalink to the Teams message.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the team members.
    • name: (string, required) User ID or email of the team member.
    • type: (string, required) Type of access right, such as "msft_user", "msft_user_id", or "teams_user_entity_id".
    • content_source_id: (string, optional) Null.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) Either "text/plain" or "text/html".
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded message content.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • access_credentials: (object, required) The Microsoft credentials needed to access Teams.
    • client_id: (string, required) The client ID of the Azure AD application.
    • client_secret: (string, required) The client secret of the Azure AD application.
    • tenant_id: (string, required) The tenant ID of the Azure AD.
  • team_name_inclusion_regex_patterns: (array of strings, optional) Teams whose names match any of the regular expressions in the list will be crawled. If a team name matches both an include and exclude pattern, the exclude pattern takes precedence and the team is not crawled. If not passed, then all teams are crawled.
  • team_name_exclusion_regex_patterns: (array of strings, optional) Teams whose names match any of the regular expressions in the list will not be crawled. If a team name matches both an include and exclude pattern, the exclude pattern takes precedence and the team is not crawled.
  • channel_name_inclusion_regex_patterns: (array of strings, optional) Channels whose names match any of the regular expressions in the list will be crawled. If a channel name matches both an include and exclude pattern, the exclude pattern takes precedence and the channel is not crawled. If not passed, then all channels are crawled.
  • channel_name_exclusion_regex_patterns: (array of strings, optional) Channels whose names match any of the regular expressions in the list will not be crawled. If a channel name matches both an include and exclude pattern, the exclude pattern takes precedence and the channel is not crawled.
  • content_source_name: (string, optional) Custom name for the content source. Defaults to "Teams".
  • logo_url
  • is_document_owner
  • field_mappings

OpenReview connector

This connector crawls papers from OpenReview conferences using the OpenReview API.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • title (mapped to title): (string, optional) Title of the paper.
  • abstract: (string, optional) Abstract of the paper.
  • description: (string, optional) Description of the paper.
  • subject: (array of strings, required) Subject areas of the paper.
  • created (mapped to created_at): (datetime, optional) Date when the paper was originally submitted.
  • modified (mapped to last_updated_at): (datetime, optional) Date when the paper was last modified.
  • authors: (array of objects, required) Authors of the paper.
    • full_name (mapped to authors): (string, required) Full name of the author.
  • identifiers: (array of strings, required) List of paper identifiers.
  • bibliographic_citation: (string, required) Citation information for the paper.
  • source: (string, required) Source conference name.
  • references: (string, required) Paper references.
  • subsource: (string, required) Subsource information.
  • acceptance: (string, optional) Acceptance status of the paper.
  • (reserved)uri (mapped to uri): (string, required) URL to the paper on OpenReview.
  • (reserved)document_content_path (mapped to document_content_path): (object, required) Path to the paper content (typically PDF).
  • (reserved)document_content_type (mapped to document_content_type): (string, required) Content type of the document.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • base_url: (string, required) URL of the OpenReview API. For example "https://api.openreview.net" or "https://api2.openreview.net".
  • conference: (object, required) Configuration for the conference to crawl.
    • name: (string, required) Name of the conference.
    • venues: (array of strings, required) List of venue identifiers to crawl within the conference.
    • id_type: (string, required) Type of identifier used for venues, typically "venueid" or "invitation".
  • since: (datetime, required) Only crawl papers submitted or updated after this datetime. The format is "YYYY-MM-DDTHH:MM:SSZ".
  • crawl_no_decision: (boolean, optional, default: true) Whether to crawl papers that have no acceptance decision yet.
  • limit: (integer, optional) Maximum number of papers to crawl.
  • skips: (integer, optional) Number of papers to skip at the beginning.
  • allow_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.
    • name: (string, required) The name of the access right.
    • type: (string, optional) The type of access right.
    • content_source_id: (string, optional) If the access right is scoped to only a particular content source.
  • deny_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
  • content_source_name: (string, optional) Custom name for the content source. Defaults to "OpenReview".
  • logo_url
  • is_document_owner
  • field_mappings
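
As an illustration of the configuration above, the following Python sketch builds an OpenReview configuration object and formats the since value as "YYYY-MM-DDTHH:MM:SSZ". The helper format_since, the conference name, the venue identifier, and the access right name are assumptions chosen for the example, not values prescribed by the platform.

```python
from datetime import datetime, timezone


def format_since(dt):
    """Hypothetical helper: render an aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


openreview_config = {
    "base_url": "https://api2.openreview.net",
    "conference": {
        "name": "ICLR 2024",                    # illustrative conference name
        "venues": ["ICLR.cc/2024/Conference"],  # illustrative venue identifier
        "id_type": "venueid",
    },
    "since": format_since(datetime(2024, 1, 1, tzinfo=timezone.utc)),
    "crawl_no_decision": True,
    "limit": 100,
    # Without allow_access_rights, no user can retrieve the crawled documents.
    "allow_access_rights": [{"name": "public"}],  # hypothetical access right name
    "content_source_name": "OpenReview",
}
```

Passing an explicit allow_access_rights list matters here because, as noted above, omitting it leaves the crawled documents inaccessible to every user.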

Notes connector

This connector ingests user-created notes. Notes are typically created through the Zeta Alpha UI and linked to documents or annotations.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • identifier: (string, required) Unique identifier for the note.
  • created_at (mapped to created_at): (string, required) Creation date of the note.
  • last_updated_at (mapped to last_updated_at): (string, optional) Last update date of the note.
  • format: (string, required) Format of the note, derived from object type with "_note" suffix.
  • references: (array of strings, required) List of document IDs that this note references.
  • summary: (string, required) Content of the note, including annotation highlights if present.
  • (reserved)document_id: (string, required) Document ID, same as identifier.
  • (reserved)uri (mapped to uri): (string, required) URI to the note, formatted as notes/{note_id}.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) Access rights specifying who can view the note.
    • name: (string, required) The name of the access right.
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) "text/markdown".
  • (reserved)document_content (mapped to document_content): (string, optional) The note content in markdown format.

Connector configuration

The configuration object passed to the content source endpoint is the following:

Custom Web Crawler connector

This connector uses Scrapy to crawl websites. It provides flexibility to configure custom crawling behavior through spider options and settings.

Crawled objects

The crawled objects depend on the spider implementation and configuration. The spider should yield items that match the document structure defined in the index configuration.

Connector configuration

The configuration object passed to the content source endpoint is the following:

  • spider_name: (string, required) Name of the Scrapy spider to use for crawling.
  • spider_options: (array of objects, required) List of option dictionaries to pass to the spider. Each object in the array represents a separate spider run with different options.
  • crawler_settings: (object, optional) Scrapy settings to customize the crawler behavior. For example, you can set download delay, concurrent requests, user agent, etc.
  • access_credentials: (object, optional) Basic authentication credentials if the website requires authentication.
    • username: (string, required) Username for basic authentication.
    • password: (string, required) Password for basic authentication.
  • allow_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.
    • name: (string, required) The name of the access right.
    • type: (string, optional) The type of access right.
    • content_source_id: (string, optional) If the access right is scoped to only a particular content source.
  • deny_access_rights: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
  • content_source_name: (string, optional) Custom name for the content source.
  • logo_url
  • is_document_owner
  • field_mappings
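
A configuration for this connector might look like the sketch below. DOWNLOAD_DELAY, CONCURRENT_REQUESTS, and USER_AGENT are standard Scrapy setting names; the spider name, the keys inside spider_options, and the access right are hypothetical, since they depend on the spider implementation and on the tenant setup.

```python
import json

web_crawler_config = {
    "spider_name": "generic_spider",  # hypothetical spider name
    # Each entry triggers a separate spider run with its own options.
    "spider_options": [
        {"start_urls": ["https://example.com/docs"]},  # hypothetical option keys
        {"start_urls": ["https://example.com/blog"]},
    ],
    "crawler_settings": {
        "DOWNLOAD_DELAY": 1.0,       # seconds to wait between requests
        "CONCURRENT_REQUESTS": 4,    # maximum number of parallel requests
        "USER_AGENT": "example-crawler/1.0",
    },
    "allow_access_rights": [{"name": "public"}],  # hypothetical access right name
    "content_source_name": "Example website",
}

# Serialize the object as it would be sent in a request body.
payload = json.dumps(web_crawler_config, indent=2)
```

In this sketch the two entries in spider_options would result in two spider runs, one per start URL list.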

User Documents connector

This connector is used for ingesting documents uploaded directly by users through the Zeta Alpha UI. Users can upload files and provide metadata through the platform interface.

Crawled objects

The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.

  • identifier: (string, required) Unique identifier for the user document.
  • title (mapped to title): (string, optional) Title of the document provided by the user.
  • description: (string, optional) Description of the document provided by the user.
  • year: (integer, optional) Year of the document (e.g., publication year).
  • date: (datetime, optional) Date associated with the document.
  • source: (string, required) Source of the document, defaults to "private-document".
  • authors: (array of strings, optional) List of author names.
  • (reserved)document_id: (string, required) Unique document ID in the system.
  • (reserved)uri (mapped to uri): (string, required) URI for the user document, formatted as "user_documents/{owner}/{id}" or "external_documents/{owner}/{id}".
  • (reserved)uri_hash: (string, required) Hash of the URI for deduplication.
  • (reserved)image_urls (mapped to image_urls): (array of strings, optional) Null.
  • (reserved)access_rights (mapped to allow_access_rights): (array of objects, required) Access rights specifying who can view the document.
    • name: (string, required) The name of the access right (typically the owner or user ID).
  • (reserved)document_content_type (mapped to document_content_type): (string, optional) The MIME type of the uploaded document.
  • (reserved)base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded document content of the uploaded file.

Connector configuration

The configuration object passed to the content source endpoint is the following:

Metadata Extractor Enhancement

With this enhancement, you can extract metadata information from the document content. This enhancement usually extracts information using an AI agent configured in the tenant configuration.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • enhancement_id: (string, required) The id of the enhancement, should always be "metadata".
  • target_content_source_ids: (array of strings, required) The content source ids that the enhancement will be applied to. If not passed, the enhancement will be applied to all content sources.
  • extractor_backend: (string, required) The backend that will be used to extract the metadata. Use "agent" for the AI extractor.
  • extractor_backend_configuration: (object, required) The configuration of the backend.
    • agent: (object, required) The agent that will be used to extract the metadata.
      • agent_identifier: (string, required) The id of the agent that will be used to extract the metadata. This should be one of the agents configured in the tenant configuration.
  • fields_to_extract: (array of objects, required) The fields that will be extracted from the document content.
    • field_name: (string, required) The name of the field that will be extracted.
    • field_type: (string, required) The type of the field that will be extracted.
  • field_mappings
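
The nesting of the fields above is easier to see in a concrete object. The following Python sketch assumes that agent is an object holding agent_identifier, as the nested list suggests; the identifiers and the field_type values are hypothetical placeholders.

```python
metadata_enhancement_config = {
    "enhancement_id": "metadata",
    "target_content_source_ids": ["<CONTENT SOURCE ID>"],
    "extractor_backend": "agent",
    "extractor_backend_configuration": {
        # Must reference an agent configured in the tenant configuration.
        "agent": {"agent_identifier": "<AGENT ID>"},
    },
    "fields_to_extract": [
        {"field_name": "publication_year", "field_type": "integer"},  # hypothetical field
        {"field_name": "document_kind", "field_type": "string"},      # hypothetical field
    ],
}
```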

Agent Processor Enhancement

This enhancement connector processes documents using AI agents configured in the tenant. The agent can perform various tasks such as summarization, classification, or custom processing based on the agent's configuration. The processed output can be stored in specific fields of the document.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • enhancement_id: (string, required) The id of the enhancement, must be "agent_output".
  • field_mappings

Tags Enhancement

This enhancement connector adds or removes tags from documents based on user events. Tags are managed through the Zeta Alpha UI.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • enhancement_id: (string, required) The id of the enhancement, must be "tags".

Edit Enhancement

This enhancement connector tracks and stores user edits to document fields. When users edit document metadata through the Document Edits API, these changes are stored as enhancements.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • enhancement_id: (string, required) The id of the enhancement, must be "edit".
  • field_mappings

Join Enhancement

This enhancement connector joins data from one content source (the enhancer) with documents from another content source (the target) based on matching field values. This is useful for enriching documents with additional metadata from related sources.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • enhancement_id: (string, required) The id of the enhancement.
  • target_content_source_id: (string, required) The id of the content source containing the documents to be enhanced.
  • enhancer_content_source_id: (string, required) The id of the content source containing the enhancing documents.
  • join_fields: (object, required) Specifies which fields to use for joining the two content sources.
    • target_field_name: (string, required) The field name in the target documents to match on. Typically "document_id" or "uri".
    • enhancer_field_name: (string, required) The field name in the enhancer documents to match on.
  • custom_metadata: (object, optional) Static custom metadata to add to all enhanced documents.
  • custom_metadata_aggregates: (array of objects, optional) Defines how to aggregate data from the enhancer documents. Each object can be one of the following types:
    • CustomMetadataAggregateOnField: Aggregate values from a specific field.
      • index_field_name: (string, required) The name of the field in the index where the aggregated value will be stored.
      • enhancer_field_name: (string, required) The name of the field in the enhancer documents to aggregate.
      • operator: (enum, required) The aggregation operator. Possible values: "sum", "max", "first".
    • CustomMetadataAggregateArray: Collect values into an array.
      • index_field_name: (string, required) The name of the field in the index where the array will be stored.
      • operator: (enum, required) The aggregation operator. Must be "push".
      • array_field_mapping: (array of objects, optional) Field mappings for array elements.
        • enhancer_field_name: (string, required) The name of the field in the enhancer documents.
        • index_field_name: (string, required) The name of the field in the index.
    • CustomMetadataAggregateCount: Count matching documents.
      • index_field_name: (string, required) The name of the field in the index where the count will be stored.
      • operator: (enum, required) The aggregation operator. Must be "count".
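
A join configuration combining the three aggregate types could be sketched as follows. All ids and field names are hypothetical, and the small check at the end only verifies that each operator is one of the documented values; it is not the platform's own validation.

```python
ALLOWED_OPERATORS = {"sum", "max", "first", "push", "count"}

join_enhancement_config = {
    "enhancement_id": "<ENHANCEMENT ID>",
    "target_content_source_id": "<TARGET CONTENT SOURCE ID>",
    "enhancer_content_source_id": "<ENHANCER CONTENT SOURCE ID>",
    "join_fields": {
        "target_field_name": "uri",
        "enhancer_field_name": "paper_uri",  # hypothetical enhancer field
    },
    "custom_metadata_aggregates": [
        # CustomMetadataAggregateOnField: aggregate one enhancer field.
        {"index_field_name": "total_views", "enhancer_field_name": "views", "operator": "sum"},
        # CustomMetadataAggregateArray: collect selected fields into an array.
        {
            "index_field_name": "reviews",
            "operator": "push",
            "array_field_mapping": [
                {"enhancer_field_name": "reviewer", "index_field_name": "name"},
            ],
        },
        # CustomMetadataAggregateCount: count the matching enhancer documents.
        {"index_field_name": "review_count", "operator": "count"},
    ],
}

operators = [a["operator"] for a in join_enhancement_config["custom_metadata_aggregates"]]
unknown_operators = [op for op in operators if op not in ALLOWED_OPERATORS]
```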

Custom Enhancement

This is a generic enhancement connector that allows you to implement custom enhancement logic by posting enhancement batches directly to the API.

Enhancement Connector Configuration

The configuration object passed to the content source endpoint is the following:

  • content_source_id: (string, required) The id of the content source that this enhancement connector enhances.

Shared connector configuration

This section defines the configuration that is shared across different content sources.

BaseConnectorConfiguration

  • logo_url: (string, optional) Fallback image assigned to the document when none can be extracted from the PDF. Typically this would be the logo of the connector.
  • is_document_owner: (boolean, optional, default: true) Whether this connector "owns" the documents that are crawled. The context of this setting is that a document can't be crawled by multiple different sources, so setting this field to false will allow another source to crawl the same document.

ContentSourceFieldMappings

  • field_mappings: (array of objects, optional) Maps fields crawled by the connector to fields configured in the document_fields_configuration of the index. See the field mapping section for more details on how this works.
    • content_source_field_name: (string, required) Name of the field crawled by the connector. Each connector documentation specifies which values can be passed here. Note that the name can reference nested structures, for example "owners.display_name".
    • index_field_name: (string, required) Name of the target field as given in the document_fields_configuration of the index. Note that the name can reference nested structures, for example "authors.full_name".
    • inner_field_mappings: (array of objects, optional) The schema of each object is the same as field_mappings. This is used for configuring nested object mappings. See the field mapping section for more details and examples.

Field mapping

The goal of the field mapping configuration is to convert the data structure of the crawled objects into a data structure that is suitable for the tenant's index. The simplest case is renaming keys without changing the structure. For example:

// Crawled object
{
  "formatted_id": "#Engineering",
  "content_source_name": "Slack",
  "image_urls": ["cool-stuff.jpeg"],
  "user": "Bert",
  "user_id": "bert@transformers"
}
// Field mappings
[
  {
    "content_source_field_name": "formatted_id",
    "index_field_name": "title"
  },
  {
    "content_source_field_name": "content_source_name",
    "index_field_name": "source"
  },
  {
    "content_source_field_name": "image_urls",
    "index_field_name": "images"
  },
  {
    "content_source_field_name": "user",
    "index_field_name": "creator_name"
  },
  {
    "content_source_field_name": "user_id",
    "index_field_name": "creator_id"
  }
]
// Results in the mapped object
{
  "images": ["cool-stuff.jpeg"],
  "source": "Slack",
  "title": "#Engineering",
  "creator_name": "Bert",
  "creator_id": "bert@transformers"
}

A slightly more involved case is picking values from the crawled object and placing them in a nested object. For example:

// Crawled object
{
  "user": "Bert",
  "user_id": "bert@transformers"
}
// Field mappings
[
  {
    "content_source_field_name": "user",
    "index_field_name": "creator.full_name"
  },
  {
    "content_source_field_name": "user_id",
    "index_field_name": "creator.id"
  }
]
// Results in the mapped object
{
  "creator": {
    "id": "bert@transformers",
    "full_name": "Bert"
  }
}

Here the syntax key_1.key_2 represents the value located under key_2, which itself is under key_1. This works for more than two keys as well, for example key_1.key_2.key_3.

This syntax also works in content_source_field_name. The next example shows the mapping of a list of objects:

// Crawled object
{
  "owners": [
    {
      "display_name": "Bert",
      "permission_id": "123",
      "email_address": "bert@transformers",
      "photo_link": "bert.jpeg"
    },
    {
      "display_name": "RoBerta",
      "permission_id": "123",
      "email_address": "roberta@transformers",
      "photo_link": "roberta.jpeg"
    }
  ]
}
// Field mappings
[
  {
    "content_source_field_name": "owners.display_name",
    "index_field_name": "creator.full_names"
  },
  {
    "content_source_field_name": "owners.email_address",
    "index_field_name": "creator.ids"
  }
]
// Results in the mapped object
{
  "creator": {
    "ids": ["bert@transformers", "roberta@transformers"],
    "full_names": ["Bert", "RoBerta"]
  }
}

The reasoning behind this behavior is that the key display_name under owners is not unique, therefore the value of owners.display_name is a list.
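
The dotted-path behavior described so far can be sketched in a few lines of Python. This is only an illustration of the documented examples, not the platform's implementation: the function names get_path, set_path, and apply_field_mappings are hypothetical, and missing keys are not handled.

```python
def get_path(obj, path):
    """Resolve a dotted path; a list along the way fans out into a list of values."""
    for key in path.split("."):
        if isinstance(obj, list):
            obj = [item[key] for item in obj]
        else:
            obj = obj[key]
    return obj


def set_path(target, path, value):
    """Store value under a dotted path, creating intermediate objects as needed."""
    *parents, last = path.split(".")
    for key in parents:
        target = target.setdefault(key, {})
    target[last] = value


def apply_field_mappings(crawled, field_mappings):
    """Build the mapped object; crawled fields without a mapping are dropped."""
    mapped = {}
    for mapping in field_mappings:
        value = get_path(crawled, mapping["content_source_field_name"])
        set_path(mapped, mapping["index_field_name"], value)
    return mapped


crawled = {
    "owners": [
        {"display_name": "Bert", "email_address": "bert@transformers"},
        {"display_name": "RoBerta", "email_address": "roberta@transformers"},
    ]
}
field_mappings = [
    {"content_source_field_name": "owners.display_name", "index_field_name": "creator.full_names"},
    {"content_source_field_name": "owners.email_address", "index_field_name": "creator.ids"},
]
mapped = apply_field_mappings(crawled, field_mappings)
```

Because owners is a list, get_path returns one value per element, which is why the mapped fields end up as lists.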

What if we want to keep the original nested structure, but just change the key names? For that we can use the inner_field_mappings:

// Crawled object
{
  "owners": {
    "display_name": "Bert",
    "permission_id": "123",
    "email_address": "bert@transformers",
    "photo_link": "bert.jpeg"
  }
}
// Field mappings
[
  {
    "content_source_field_name": "owners",
    "index_field_name": "creator",
    "inner_field_mappings": [
      {
        "content_source_field_name": "display_name",
        "index_field_name": "full_name"
      },
      {
        "content_source_field_name": "email_address",
        "index_field_name": "id"
      }
    ]
  }
]
// Results in the mapped object
{
  "creator": {
    "id": "bert@transformers",
    "full_name": "Bert"
  }
}

The idea here is that we first map the entire object from the key owners to the key creator, and then we apply the mapping rules recursively inside the creator object. Note that the key names used in inner_field_mappings are relative to their parent keys.

The same principle applies for nested lists:

// Crawled object
{
  "owners": [
    {
      "display_name": "Bert",
      "permission_id": "123",
      "email_address": "bert@transformers",
      "photo_link": "bert.jpeg"
    },
    {
      "display_name": "RoBerta",
      "permission_id": "123",
      "email_address": "roberta@transformers",
      "photo_link": "roberta.jpeg"
    }
  ],
  "co_creators": ["gpt3"]
}
// Field mappings
[
  {
    "content_source_field_name": "owners",
    "index_field_name": "creator",
    "inner_field_mappings": [
      {
        "content_source_field_name": "display_name",
        "index_field_name": "full_name"
      },
      {
        "content_source_field_name": "email_address",
        "index_field_name": "id"
      }
    ]
  },
  {
    "content_source_field_name": "co_creators",
    "index_field_name": "co_creator",
    "inner_field_mappings": [
      {
        "content_source_field_name": ".",
        "index_field_name": "full_name"
      }
    ]
  }
]
// Results in the mapped object
{
  "creator": [
    {
      "id": "bert@transformers",
      "full_name": "Bert"
    },
    {
      "id": "roberta@transformers",
      "full_name": "RoBerta"
    }
  ],
  "co_creator": [
    {
      "full_name": "gpt3"
    }
  ]
}
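
The recursive behavior of inner_field_mappings can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the platform's implementation: map_object is a hypothetical name, dotted paths are left out for brevity, and "." is treated as "the value itself", which is how the co_creators example above reads.

```python
def map_object(obj, mappings):
    """Apply mappings to an object, or to every element when obj is a list."""
    if isinstance(obj, list):
        return [map_object(item, mappings) for item in obj]
    mapped = {}
    for mapping in mappings:
        source = mapping["content_source_field_name"]
        if source == ".":
            value = obj  # "." refers to the current value, e.g. a plain string
        elif isinstance(obj, dict) and source in obj:
            value = obj[source]
        else:
            continue  # skip fields that are absent from the crawled object
        inner = mapping.get("inner_field_mappings")
        # Inner mappings are relative to the value that was just selected.
        mapped[mapping["index_field_name"]] = map_object(value, inner) if inner else value
    return mapped


crawled = {
    "owners": [
        {"display_name": "Bert", "email_address": "bert@transformers", "photo_link": "bert.jpeg"},
        {"display_name": "RoBerta", "email_address": "roberta@transformers", "photo_link": "roberta.jpeg"},
    ],
    "co_creators": ["gpt3"],
}
field_mappings = [
    {
        "content_source_field_name": "owners",
        "index_field_name": "creator",
        "inner_field_mappings": [
            {"content_source_field_name": "display_name", "index_field_name": "full_name"},
            {"content_source_field_name": "email_address", "index_field_name": "id"},
        ],
    },
    {
        "content_source_field_name": "co_creators",
        "index_field_name": "co_creator",
        "inner_field_mappings": [
            {"content_source_field_name": ".", "index_field_name": "full_name"},
        ],
    },
]
mapped = map_object(crawled, field_mappings)
```

The same function handles both the single-object and the list case, since a list is simply mapped element by element.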

Ingestion jobs

Ingestion jobs are the mechanism by which connectors are triggered. They are created automatically for scheduled connectors, but they can also be created manually at any time. They are useful for observability because they track the number of crawled documents and enhancements that were indexed, updated, or deleted, as well as the number that failed indexing.

In fact, when the Document batches endpoints are called manually without an ingestion job, a new ingestion job is created. This special ingestion job does not trigger anything; it is only used to track the progress of the manual ingestion batch.

A new ingestion job is created by posting to the /v1/indexes/{index_id}/ingestion-jobs endpoint. Depending on the request body, this may also trigger a content source run. It has the following configuration:

Path params

  • index_id: (string, required) The id of the index that will be associated with this ingestion job.

Query params

  • tenant: The name of the tenant that will be associated with this ingestion job. Note that this must be the same tenant that owns the index, otherwise the endpoint will return an error.

Request body

  • type: (string, optional) The type of ingestion job to create. This is mostly used for organizational purposes. A scheduled connector creates jobs with type "scheduled". Manual document batches that are created without an ingestion job use one of the types "_manual_document_batch", "_manual_document_delete_batch", or "_manual_document_access_rights_batch". Note that types that start with "_manual" do not trigger a content source run.
  • content_source_id: (string, optional) The id of the content source to trigger. If this value is not passed then no content source is triggered.

On a successful response, the Zeta Alpha platform returns the id of the ingestion job. This id is used as content_ingestion_job_id in other endpoints. The content source with id content_source_id is triggered, provided that type doesn't start with "_manual".
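
As a rough illustration, the request described above could be assembled with Python's standard library as shown below. The host https://api.example.com, the helper name build_ingestion_job_request, and the example ids are hypothetical. Authentication headers are omitted because they are not covered in this section.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your deployment


def build_ingestion_job_request(
    index_id: str,
    tenant: str,
    content_source_id: Optional[str] = None,
    job_type: Optional[str] = None,
) -> Request:
    """Build (but do not send) a POST to /v1/indexes/{index_id}/ingestion-jobs."""
    query = urlencode({"tenant": tenant})
    url = f"{BASE_URL}/v1/indexes/{quote(index_id, safe='')}/ingestion-jobs?{query}"

    # Both body fields are optional; only include the ones that were given.
    body = {}
    if job_type is not None:
        body["type"] = job_type
    if content_source_id is not None:
        # Without content_source_id, no content source is triggered.
        body["content_source_id"] = content_source_id

    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ingestion_job_request("my-index", "zeta-alpha", content_source_id="my-source")
```

The request can then be sent with urllib.request.urlopen(request) once the appropriate authentication headers have been added.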

Note: The API also provides endpoints for listing, retrieving, and updating ingestion jobs.

Document batches

Creating document batches is the mechanism by which new or existing documents are ingested into the pipeline. This endpoint is typically called by the content source connectors under the hood, but it can also be called manually when needed.

Document batches are created by posting to the /v1/indexes/{index_id}/document-batches endpoint. It has the following configuration:

Path params

  • index_id: (string, required) The id of the index that will store the documents.

Query params

  • tenant: The name of the tenant that owns the documents. Note that this must be the same tenant that owns the index, otherwise the endpoint will return an error.

Request body

  • documents: (array of objects, required) The batch of documents to be ingested. Note that this mirrors the Document request model.
    • document_id: (string, optional, default: random UUID V4) Unique identifier of the document (unique within the index, tenant combination).
    • is_indexable: (boolean, optional, default: true) Whether the ingested data will be processed and indexed. One use case of the false value is when the document will only serve as a source for another connector and the tenant doesn’t want the individual documents to be searchable.
    • custom_metadata: (object, optional) Relevant data of the document that doesn't fit within the declared fields.
    • title: (string, optional) The title of the document.
    • logo_url: (string, optional) URL to a logo image for the document.
    • authors: (string or array of strings, optional) The authors of the document.
    • created_at: (datetime, optional) The time at which the document was created. Note that this refers to the document content rather than the request for ingestion.
    • last_updated_at: (datetime, optional) The time at which the document was last modified. Note that this refers to the document content rather than the request for ingestion.
    • uri: (string, optional) The URI of the document.
    • uri_hash: (string, optional) A hash of the uri.
    • image_urls: (array of strings, optional) Images associated with the document.
    • content_source_id: (string, optional) The id of the content source associated with this document.
    • content_ingestion_job_id: (string, optional) The id of the ingestion job associated with this document.
    • document_content_type: (enum, optional) The mime type of the document content. Possible values are listed in Supported mime types.
    • document_content: (string, optional) The actual content of the document in plain text.
    • document_content_path: (object, optional) Alternative ways of passing the document content when document_content is not ideal. Pick only one.
      • base64_content: (string, optional) The content of the document as a base64-encoded string.
      • blob_content: (object, optional) Points to an object storage like AWS S3, or local storage, that holds the document content.
        • url: (string, required) Path or URL where the content is.
        • backend: (enum, required) The service that holds the document. Possible values are "s3", "azure", or "disk" (local storage).
      • url_content: (object, optional) Points to a URL that holds the document content.
        • url: (string, required) The URL string.
    • allow_access_rights: (array of objects, optional) Grants access to this document to Zeta Alpha users that hold any of the access rights in the list. If not passed, no user will be able to retrieve the document.
      • name: (string, required) The name of the access right. For example, the user UUID of a particular user, or "public" to allow everyone with access to the tenant.
      • type: (string, optional) The type of access right. For example, "user_uuid" if the right is for a particular user.
      • content_source_id: (string, optional) If the access right is scoped to a particular content source, the id should be passed here. For example, a tenant may want to prevent a user from accessing this document even though the user has access to it in the source's system.
    • deny_access_rights: (array of objects, optional) Denies access to this document to Zeta Alpha users that hold any of the access rights in the list. The object schema is exactly the same as for allow_access_rights.
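
To make the interaction between the two lists concrete, here is a minimal sketch of how an access check could be reasoned about. It is not the platform's implementation. The function name can_access is hypothetical, rights are matched by name only (type and content_source_id scoping are ignored), and the sketch assumes that a matching deny right takes precedence over a matching allow right.

```python
def can_access(user_rights: set[str], document: dict) -> bool:
    """Return True if the user holds an allowed right and no denied right."""
    allowed = {right["name"] for right in document.get("allow_access_rights") or []}
    denied = {right["name"] for right in document.get("deny_access_rights") or []}

    # Assumption: any matching deny right blocks access, even if an allow right matches.
    if user_rights & denied:
        return False
    # With no allow_access_rights at all, nobody can retrieve the document.
    return bool(user_rights & allowed)


document = {
    "allow_access_rights": [{"name": "public"}],
    "deny_access_rights": [{"name": "user-123", "type": "user_uuid"}],
}
```

In this sketch the caller is expected to include "public" in user_rights for every user that has access to the tenant.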

On a successful response, the Zeta Alpha platform creates the request, and the documents are then processed and indexed. The response specifies which documents were successfully recorded, which were not (and were therefore not ingested), and why. Note that this endpoint does not wait for the documents to be ingested, processed, and indexed. Instead, it sends an event so that this happens asynchronously.
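
The following sketch shows how a small batch could be assembled, with one document passing plain text through document_content and another passing a base64-encoded payload through document_content_path. The host https://api.example.com, the helper name build_document_batch_request, and all ids and contents are hypothetical. Authentication headers are omitted because they are not covered in this section.

```python
import base64
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your deployment


def build_document_batch_request(index_id: str, tenant: str, documents: list[dict]) -> Request:
    """Build (but do not send) a POST to /v1/indexes/{index_id}/document-batches."""
    query = urlencode({"tenant": tenant})
    url = f"{BASE_URL}/v1/indexes/{quote(index_id, safe='')}/document-batches?{query}"
    body = json.dumps({"documents": documents}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


documents = [
    {
        # Plain-text content passed inline.
        "document_id": "doc-1",
        "title": "Plain text example",
        "document_content_type": "text/plain",
        "document_content": "Hello, world.",
        "allow_access_rights": [{"name": "public"}],
    },
    {
        # Binary content passed as base64; only one document_content_path option is set.
        "title": "Base64 example",
        "document_content_type": "application/pdf",
        "document_content_path": {
            "base64_content": base64.b64encode(b"%PDF-1.4 placeholder bytes").decode("ascii"),
        },
        "allow_access_rights": [{"name": "public"}],
    },
]

request = build_document_batch_request("my-index", "zeta-alpha", documents)
```

Because ingestion is asynchronous, a successful response to this request only confirms which documents were recorded, not that they have been indexed.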

Note: The API also provides endpoints for deleting documents, creating enhancement batches, and creating batches of access rights updates.

Document Request model

The document request model encodes a request for data ingestion. This is the interface used by the Document batches endpoint. The same interface is also used internally for ingesting documents from content sources. It has the following fields:

  • index_id: (string, required) Id of the index that the request belongs to.
  • tenant: (string, required) Tenant that owns the request.
  • is_indexable: (boolean, optional, default: true) Whether the ingested data will be processed and indexed. One use case of the false value is when the document will only serve as a source for another connector and the tenant doesn’t want the individual documents to be searchable.
  • request_created_at: (datetime, optional, default: now) Time at which the request was created.
  • request_last_updated_at: (datetime, optional, default: now) Time at which the request was last modified.
  • document_id: (string, optional, default: random UUID V4) Unique identifier of the document (unique within the index, tenant combination).
  • deleted: (boolean, optional, default: false) Whether the document is marked for deletion.
  • custom_metadata: (object, optional) Relevant data of the document that doesn't fit within the declared fields.
  • (reserved)logo_url: (string, optional) URL to a logo image for the document.
  • (reserved)title: (string, optional) The title of the document.
  • (reserved)authors: (string or array of strings, optional) The authors of the document.
  • (reserved)created_at: (datetime, optional) The time at which the document was created. Note that this refers to the document content rather than the request for ingestion.
  • (reserved)last_updated_at: (datetime, optional) The time at which the document was last modified. Note that this refers to the document content rather than the request for ingestion.
  • (reserved)uri: (string, optional) The URI of the document.
  • (reserved)uri_hash: (string, optional) A hash of the uri.
  • (reserved)image_urls: (array of strings, optional) Images associated with the document.
  • (reserved)content_source_id: (string, optional) The id of the content source that created this request.
  • (reserved)content_ingestion_job_id: (string, optional) The id of the ingestion job that created this request.
  • (reserved)document_content_type: (enum, optional) The mime type of the document content. Possible values are listed in Supported mime types.
  • (reserved)document_content: (string, optional) The actual content of the document in plain text.
  • (reserved)document_content_path: (object, optional) Alternative ways of passing the document content when document_content is not ideal. Pick only one.
    • base64_content: (string, optional) The content of the document as a base64-encoded string.
    • blob_content: (object, optional) Points to an object storage like AWS S3, or local storage, that holds the document content.
      • url: (string, required) Path or URL where the content is.
      • backend: (enum, required) The service that holds the document. Possible values are "s3", "azure", or "disk" (local storage).
    • url_content: (object, optional) Points to a URL that holds the document content.
      • url: (string, required) The URL string.
  • (reserved)allow_access_rights: (array of objects, optional) Grants access to this document to Zeta Alpha users that hold any of the access rights in the list. If not passed, no user will be able to retrieve the document.
    • name: (string, required) The name of the access right. For example, the user UUID of a particular user, or "public" to allow everyone with access to the tenant.
    • type: (string, optional) The type of access right. For example, "user_uuid" if the right is for a particular user.
    • content_source_id: (string, optional) If the access right is scoped to a particular content source, the id should be passed here. For example, a tenant may want to prevent a user from accessing this document even though the user has access to it in the source's system.
  • (reserved)deny_access_rights: (array of objects, optional) Denies access to this document to Zeta Alpha users that hold any of the access rights in the list. The object schema is exactly the same as for allow_access_rights.

The model contains fields that refer to the request for ingestion, as well as fields that refer to the document to be ingested. The latter are custom_metadata and all the fields with (reserved) next to their name. The purpose of the reserved fields is to provide a reasonable default structure for holding information about documents in arbitrary domains. They are all optional. When the document structure doesn't fit these fields, it is recommended to use custom_metadata instead. Note that some functionality is only supported through reserved fields. For example, access control relies on allow_access_rights and deny_access_rights. Tracking source provenance and organizing ingestion requests require content_source_id and content_ingestion_job_id respectively. Finally, storage of large content requires document_content or document_content_path.
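
One way to apply this guidance on the client side is to route every key that is not a declared field into custom_metadata. The sketch below is illustrative only: the helper name to_document_fields and the example record are hypothetical, and the field sets are copied from the lists in this section.

```python
RESERVED_FIELDS = frozenset({
    "logo_url", "title", "authors", "created_at", "last_updated_at",
    "uri", "uri_hash", "image_urls", "content_source_id",
    "content_ingestion_job_id", "document_content_type", "document_content",
    "document_content_path", "allow_access_rights", "deny_access_rights",
})
# Request-level fields that a document in a batch may also carry.
REQUEST_FIELDS = frozenset({"document_id", "is_indexable"})


def to_document_fields(record: dict) -> dict:
    """Keep declared fields at the top level and move the rest to custom_metadata."""
    document = {}
    # Start from any custom_metadata the record already has.
    custom_metadata = dict(record.get("custom_metadata") or {})
    for key, value in record.items():
        if key == "custom_metadata":
            continue
        if key in RESERVED_FIELDS or key in REQUEST_FIELDS:
            document[key] = value
        else:
            custom_metadata[key] = value
    if custom_metadata:
        document["custom_metadata"] = custom_metadata
    return document


record = {
    "title": "Quarterly report",
    "uri": "https://example.com/reports/q1",
    "department": "R&D",
    "page_count": 3,
}
document = to_document_fields(record)
```

Here title and uri stay at the top level, while department and page_count end up under custom_metadata.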

Supported mime types

The currently supported mime types (i.e. ones that can be converted to text in the pipeline) are:

  • application/pdf
  • text/plain
  • text/html
  • text/markdown
  • text/csv
  • application/json
  • application/msword
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • application/vnd.oasis.opendocument.text
  • application/vnd.ms-excel
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • application/vnd.oasis.opendocument.spreadsheet
  • application/vnd.ms-powerpoint
  • application/vnd.openxmlformats-officedocument.presentationml.presentation
  • application/vnd.oasis.opendocument.presentation
  • application/vnd.visio
  • application/vnd.oasis.opendocument.graphics
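
A client may want to check a content type against this list before submitting a document. The sketch below is a client-side convenience, not part of the API: the helper name is_supported_mime_type is hypothetical, and stripping parameters such as "; charset=utf-8" before comparing is an assumption of this example.

```python
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "text/plain",
    "text/html",
    "text/markdown",
    "text/csv",
    "application/json",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.visio",
    "application/vnd.oasis.opendocument.graphics",
})


def is_supported_mime_type(content_type: str) -> bool:
    """Check a content type against the list above, ignoring parameters and case."""
    # Assumption: parameters after ";" are dropped before the comparison.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in SUPPORTED_MIME_TYPES
```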