Ingestion API Reference

This reference provides comprehensive documentation of the Ingestion API, including endpoints, request parameters, request bodies, and responses.

Authentication

To access this API, the caller must hold a bearer token or an API key that has the ingestion-manager role for the tenant. A bearer token must be passed in the Authorization header. An API key must be passed in the X-Auth header.
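
The two credential options above can be sketched as a small helper that returns the right header. This is a minimal illustration, not part of the API: the function name build_auth_headers is hypothetical, and the "Bearer " prefix is an assumption based on the usual convention for bearer tokens.

```python
from typing import Optional


def build_auth_headers(bearer_token: Optional[str] = None,
                       api_key: Optional[str] = None) -> dict:
    """Return the auth header for exactly one credential type."""
    if (bearer_token is None) == (api_key is None):
        raise ValueError("Pass exactly one of bearer_token or api_key")
    if bearer_token is not None:
        # Assumption: the conventional "Bearer <token>" scheme is expected.
        return {"Authorization": f"Bearer {bearer_token}"}
    # API keys go in the X-Auth header, as described above.
    return {"X-Auth": api_key}
```

The resulting dictionary can be merged into the headers of any request to the endpoints below.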

Endpoints

Ingest Document Batch

POST https://api.zeta-alpha.com/v0/service/ingestion/documents/document-batches

Creating document batches is the mechanism by which new or existing documents are ingested into the pipeline.

Request

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the documents.

Request Body:

  • index_id (string, optional): The id of the index that will store the documents. If not provided, the default index for the tenant will be used.
  • documents (array of objects, required): The batch of documents to be ingested.
    • document_id (string, optional, default: random UUID V4): Unique identifier of the document (unique within the index, tenant combination). See footnote 1.
    • is_indexable (boolean, optional, default: true): Whether the ingested data will be processed and indexed.
    • custom_metadata (object, optional): Custom data for the document. The allowed keys are defined in the index configuration.
    • document_content (object, optional): The content of the document.
      • content_type (string, required): The mime type of the document content.
      • from_ (object, required): Alternative ways of passing the document content.
        • plain_text_string (string, optional): Content of the document as a plain text string. This only makes sense when the content_type can be represented as plain text.
        • base64_string (string, optional): Content of the document as a base64-encoded string.
        • url (string, optional): URL holding the document content. The content will be downloaded by the platform so the URL must be accessible.
    • allow_access_rights (array of objects, optional): Grants access to this document to platform users with any of the listed access rights. If not passed, no user will be able to retrieve the document.
      • name (string, required): The name of the access right. For example, the user UUID of a user, a role id, or public. The latter means anyone who has access to the tenant.
      • type (string, optional): The type of access right. For example, user_uuid, user_role_id.
      • content_source_id (string, optional): If the access right is scoped to only a specific content source, this field should be set to the content source id.
    • uri (string, optional): The URI of the document.
    • content_source_id (string, optional): The id of the content source.
    • content_ingestion_job_id (string, optional): The id of the ingestion job. If not provided, a new ingestion job will be created.
    • title (string, optional): The title of the document. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
    • authors (string or array of strings, optional): The authors of the document. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
    • created_at (datetime, optional): The time at which the document was created. Note that this refers to the document content rather than the request for ingestion. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
    • last_updated_at (datetime, optional): The time at which the document was last modified. Note that this refers to the document content rather than the request for ingestion. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
    • image_urls (array of strings, optional): Images associated with the document. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
    • logo_url (string, optional): The logo of the document. Only relevant if the index has no custom schema. Otherwise, use the declared value in custom_metadata.
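
As an illustration of the request described above, the following sketch builds (but does not send) a batch ingestion request with the Python standard library. It assumes the body is sent as JSON and that an API key is used. The helper name build_ingest_request, the tenant name, the example URI, and the placeholder key are all hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.zeta-alpha.com/v0/service/ingestion/documents"


def build_ingest_request(tenant: str, documents: list, api_key: str,
                         index_id: Optional[str] = None) -> Request:
    """Build a POST request for the document-batches endpoint."""
    body = {"documents": documents}
    if index_id is not None:
        # Omitting index_id lets the tenant's default index be used.
        body["index_id"] = index_id
    url = f"{BASE_URL}/document-batches?{urlencode({'tenant': tenant})}"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the API expects a JSON request body.
        headers={"Content-Type": "application/json", "X-Auth": api_key},
        method="POST",
    )


# A document whose content is passed inline as plain text.
example_document = {
    "uri": "https://example.com/notes/1",
    "document_content": {
        "content_type": "text/plain",
        "from_": {"plain_text_string": "Hello, ingestion pipeline."},
    },
    "allow_access_rights": [{"name": "public"}],
}
request = build_ingest_request("my-tenant", [example_document],
                               api_key="YOUR_API_KEY")
```

Sending is left to the caller, for example with urllib.request.urlopen(request). Only one of the from_ alternatives is set here. Which combinations the platform accepts is not stated in this reference.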

Response

Response Body:

  • tenant (string, required): The name of the tenant that owns the documents.
  • index_id (string, required): The id of the index that will store the documents.
  • documents (array of objects, required): The batch of documents that were ingested.
    • document_id (string, required): Unique identifier of the document (unique within the index, tenant combination).
    • version (string, required, default: 1): The version of the document.
    • content_source_id (string, optional): The id of the content source associated with this document.
    • content_ingestion_job_id (string, optional): The id of the ingestion job associated with this document.
    • status (enum, required): The status of the document. Possible values are ingesting, indexing, indexing_update, marked_for_deletion, marked_for_quarantine, quarantined, quarantine_failed, indexed, ingested, indexed_update, failed, not_found. Note: This will always be ingesting for the initial request.
    • error_codes (array of strings, optional): The error codes associated with the document. These are never passed in this endpoint.
    • error_message (string, optional): The error message associated with the document. This is never passed in this endpoint.
  • errors (array of objects, required): Errors that occurred during ingestion. Only present if there were errors in the batch. Note: Check this field to see if there were any errors in the batch.
    • document_id (string, required): Unique identifier of the document (unique within the index, tenant combination).
    • error_message (string, required): The error message.
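
Since the errors field is the place to check for problems in a batch, a client might split the response as sketched below. The function name summarize_batch_response is hypothetical, and the sketch assumes the response body has already been parsed from JSON into a dictionary.

```python
def summarize_batch_response(response: dict) -> tuple:
    """Split a batch response into accepted ids and per-document errors."""
    accepted = [doc["document_id"] for doc in response.get("documents", [])]
    # `errors` may be missing or empty when the whole batch was accepted.
    rejected = {
        err["document_id"]: err["error_message"]
        for err in response.get("errors") or []
    }
    return accepted, rejected
```

A caller can then retry or log only the documents listed in the second element of the returned pair.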

Ingest Document Access Rights Batch

POST https://api.zeta-alpha.com/v0/service/ingestion/documents/access-rights-batches

This endpoint allows the user to quickly update the access rights of a batch of documents.

Request

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the documents.

Request Body:

  • index_id (string, optional): The id of the index that contains the documents. If not provided, the default index for the tenant will be used.
  • content_source_id (string, optional): The id of the content source associated with this document.
  • content_ingestion_job_id (string, optional): The id of the ingestion job associated with this document. If not provided, a new ingestion job will be created.
  • documents (array of objects, required): The batch of documents to be updated.
    • document_id (string, required): Unique identifier of the document (unique within the index, tenant combination).
    • allow_access_rights (array of objects, optional): Grants access to this document to platform users with any of the listed access rights. If not passed, no user will be able to retrieve the document.
      • name (string, required): The name of the access right. For example, the user UUID of a user, a role id, or public. The latter means anyone who has access to the tenant.
      • type (string, optional): The type of access right. For example, user_uuid, user_role_id.
      • content_source_id (string, optional): If the access right is scoped to only a specific content source, this field should be set to the content source id.
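
A request body for this endpoint could be assembled from a mapping of document ids to access rights, as in the sketch below. The helper name build_access_rights_body and the example ids are hypothetical. The field names follow the list above.

```python
from typing import Optional


def build_access_rights_body(rights_by_document: dict,
                             index_id: Optional[str] = None) -> dict:
    """Build the request body for an access-rights batch."""
    documents = [
        # Copy each access right so the caller's objects are not shared.
        {"document_id": doc_id, "allow_access_rights": [dict(r) for r in rights]}
        for doc_id, rights in rights_by_document.items()
    ]
    body = {"documents": documents}
    if index_id is not None:
        body["index_id"] = index_id
    return body


body = build_access_rights_body({
    "doc-1": [{"name": "public"}],
    "doc-2": [{"name": "example-role-id", "type": "user_role_id"}],
})
```

Note that, per the field description, a document sent without allow_access_rights becomes unretrievable for all users, so the helper always sets the field explicitly.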

Response

The response structure is the same as the response of the document batch ingestion endpoint.

Delete Document Batch

POST https://api.zeta-alpha.com/v0/service/ingestion/documents/delete-document-batches

Delete a batch of documents.

Request

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the documents.

Request Body:

  • index_id (string, optional): The id of the index that contains the documents. If not provided, the default index for the tenant will be used.
  • content_source_id (string, optional): The id of the content source associated with this document.
  • content_ingestion_job_id (string, optional): The id of the ingestion job associated with this document. If not provided, a new ingestion job will be created.
  • document_ids (array of strings, required): The ids of the documents to be deleted. These are the values of the document_id field in the document object.
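
The sketch below builds a delete request body from a list of internal identifiers. It assumes the documents were ingested with ids derived by the SHA-1 function given in footnote 1. The helper name build_delete_body is hypothetical.

```python
from hashlib import sha1
from typing import Iterable, Optional


def compute_document_id(internal_id: str) -> str:
    # Same derivation as in footnote 1: SHA-1 hex digest of the UTF-8 bytes.
    return sha1(internal_id.encode("utf-8")).hexdigest()


def build_delete_body(internal_ids: Iterable[str],
                      index_id: Optional[str] = None) -> dict:
    """Build the request body for a delete-document batch."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_ids = list(dict.fromkeys(internal_ids))
    body = {"document_ids": [compute_document_id(i) for i in unique_ids]}
    if index_id is not None:
        body["index_id"] = index_id
    return body
```

If the document ids are already known, for example from an earlier ingestion response, they can be passed in document_ids directly without hashing.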

Response

The response structure is the same as the response of the document batch ingestion endpoint.

Retrieve Document

GET https://api.zeta-alpha.com/v0/service/ingestion/documents/{document_id}

Retrieve a document by its id.

Request

Path Parameters:

  • document_id (string, required): The id of the document to be retrieved.

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the document.
  • index_id (string, optional): The id of the index that contains the document. If not provided, the default index for the tenant will be used.
  • include_document_status (boolean, optional, default: false): Whether to include the status of the document in the response.
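
The path and query parameters above combine into a URL as sketched here. The helper name build_retrieve_url is hypothetical, and serializing the boolean as the lowercase string "true" is an assumption about how the API parses query values.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://api.zeta-alpha.com/v0/service/ingestion/documents"


def build_retrieve_url(document_id: str, tenant: str,
                       index_id: Optional[str] = None,
                       include_document_status: bool = False) -> str:
    """Build the GET URL for retrieving a single document."""
    params = {"tenant": tenant}
    if index_id is not None:
        params["index_id"] = index_id
    if include_document_status:
        # Assumption: booleans are passed as lowercase "true".
        params["include_document_status"] = "true"
    # Percent-encode the id so it is safe as a single path segment.
    return f"{BASE_URL}/{quote(document_id, safe='')}?{urlencode(params)}"
```

Leaving include_document_status at its default keeps the parameter out of the URL, which matches the documented default of false.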

Response

Response Body:

  • tenant (string, required): The name of the tenant that owns the document.
  • index_id (string, required): The id of the index that contains the document.
  • document_id (string, required): Unique identifier of the document.
  • deleted (boolean, required): Whether the document is marked for deletion.
  • request_created_at (datetime, required): The time at which the ingestion request was created.
  • request_last_updated_at (datetime, required): The time at which the ingestion request was last updated.
  • is_indexable (boolean, optional): Whether the ingested data will be processed and indexed.
  • custom_metadata (object, optional): Relevant data that doesn't fit within the declared fields. The allowed keys are defined in the index configuration.
  • document_content (object, optional): The content of the document.
    • content_type (string, required): The mime type of the document content.
    • from_ (object, required): This field is write-only so it will always be empty.
    • download_url (string, optional): The endpoint to download the document content.
  • allow_access_rights (array of objects, optional): Grants access to this document to platform users with any of the listed access rights. If not set, no user will be able to retrieve the document.
    • name (string, required): The name of the access right. For example, the user UUID of a user, a role id, or public. The latter means anyone who has access to the tenant.
    • type (string, optional): The type of access right. For example, user_uuid, user_role_id.
    • content_source_id (string, optional): If the access right is scoped to only a specific content source, this field should be set to the content source id.
  • uri (string, optional): The URI of the document.
  • content_source_id (string, optional): The id of the content source.
  • content_ingestion_job_id (string, optional): The id of the ingestion job.
  • title (string, optional): The title of the document. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • authors (string or array of strings, optional): The authors of the document. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • created_at (datetime, optional): The time at which the document was created. Note that this refers to the document content rather than the request for ingestion. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • last_updated_at (datetime, optional): The time at which the document was last modified. Note that this refers to the document content rather than the request for ingestion. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • image_urls (array of strings, optional): Images associated with the document. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • logo_url (string, optional): The logo of the document. Only relevant if the index has no custom schema. Otherwise, the value will be in custom_metadata.
  • status (enum, optional): The status of the document. Possible values are ingesting, indexing, indexing_update, marked_for_deletion, marked_for_quarantine, quarantined, quarantine_failed, indexed, ingested, indexed_update, failed, not_found.
  • error_codes (array of strings, optional): The error codes associated with the document.

Retrieve Document Content

GET https://api.zeta-alpha.com/v0/service/ingestion/documents/{document_id}/content

Download the content of a document by its id.

Request

Path Parameters:

  • document_id (string, required): The id of the document to be retrieved.

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the documents.
  • index_id (string, optional): The id of the index that contains the document. If not provided, the default index for the tenant will be used.

Response

The content bytes are returned in the response body, with headers that mark the content as an attachment. If the content was stored as a URL, the response instead includes a Location header with the URL from which the content can be downloaded. Most clients follow this redirect automatically and download the content.
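
A download might look like the following sketch, which uses only the Python standard library. It assumes the attachment is described by a Content-Disposition header, which this reference does not name explicitly. The function names attachment_filename and download_content and the fallback file name are hypothetical.

```python
import os
from email.message import Message
from typing import Optional
from urllib.request import Request, urlopen


def attachment_filename(content_disposition: Optional[str],
                        default: str = "document.bin") -> str:
    """Extract the filename parameter from a Content-Disposition value."""
    if not content_disposition:
        return default
    message = Message()
    message["Content-Disposition"] = content_disposition
    # basename() drops any directory components a server might send.
    return os.path.basename(message.get_filename() or "") or default


def download_content(content_url: str, headers: dict,
                     target_dir: str = ".") -> str:
    """Download the content and return the path it was written to."""
    # urlopen follows the Location redirect automatically.
    with urlopen(Request(content_url, headers=headers)) as response:
        name = attachment_filename(response.headers.get("Content-Disposition"))
        path = os.path.join(target_dir, name)
        with open(path, "wb") as handle:
            handle.write(response.read())
    return path
```

One thing to keep in mind with this approach: urllib re-sends custom request headers to the redirect target, so a stricter client may prefer to handle the redirect itself and drop the credentials before following it.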

Filter Documents

GET https://api.zeta-alpha.com/v0/service/ingestion/documents

Filter documents by a set of simple criteria.

Request

Query Parameters:

  • tenant (string, required): The name of the tenant that owns the documents.
  • index_id (string, optional): The id of the index that contains the documents. If not provided, the default index for the tenant will be used.
  • content_source_id (string, optional): Filter by the content source id.
  • content_ingestion_job_id (string, optional): Filter by the content ingestion job id.
  • request_last_updated_after (datetime, optional): Return only documents that were last updated after this date.
  • document_id_in (array of strings, optional): Return only documents with the specified ids.
  • uri_in (array of strings, optional): Return only documents with the specified URIs.
  • include_document_status (boolean, optional, default: false): Whether to include the status of the document in the response.
  • page (integer, optional, default: 1): The page number to return.
  • page_size (integer, optional, default: 100): The number of documents to return per page.

Response

Response Body:

  • count (integer, required): The total number of documents that match the criteria.
  • page (integer, optional): The page number returned.
  • page_size (integer, optional): The number of documents returned in the page.
  • next (object, optional): Only present if there are more pages.
    • page (integer, required): The page number of the next page.
    • page_size (integer, required): The number of documents to return per page.
  • previous (object, optional): Only present if there are previous pages.
    • page (integer, required): The page number of the previous page.
    • page_size (integer, required): The number of documents to return per page.
  • results (array of objects, required): The documents that match the criteria. The structure is the same as the response of the retrieve document endpoint.
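
The next object makes it straightforward to walk through all pages. The generator below is a minimal sketch of that loop. The name iter_documents is hypothetical, and fetch_page stands in for whatever function the caller uses to perform the GET request and return the parsed response body for a given page and page size.

```python
from typing import Callable, Iterator


def iter_documents(fetch_page: Callable[[int, int], dict],
                   page_size: int = 100) -> Iterator[dict]:
    """Yield every document across all pages of a filter query."""
    page = 1
    while True:
        payload = fetch_page(page, page_size)
        yield from payload.get("results", [])
        # `next` is only present when more pages exist.
        next_page = payload.get("next")
        if not next_page:
            break
        page = next_page["page"]
        page_size = next_page.get("page_size", page_size)
```

Passing the HTTP call in as a function keeps the pagination logic independent of the HTTP client and easy to exercise with canned responses.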

Footnotes

  1. Currently the only supported document_id format is a string generated by the following function:

    from hashlib import sha1

    def compute_document_id(internal_id: str) -> str:
        # Hash the UTF-8 bytes of the internal id; the hex digest is the document_id.
        internals = internal_id.encode("utf-8")
        return sha1(internals).hexdigest()