API Reference
Tenant Settings
A tenant is created or updated by posting to the /v1/tenants endpoint. It has the following configuration:
Request body
tenant
: (string, required) The unique identifier for the tenant. For example "zeta-alpha".
authentication_settings
: (object, required) Configures the authentication method for the tenant.
protocol
: (enum, required) The authentication protocol to use. Possible values are "oidc", "zetaalpha".
protocol_settings
: (object, required) Settings for the selected authentication protocol.
oidc
: (object, optional) Configuration for OpenID Connect authentication. Required when protocol is "oidc".
issuer
: (string, required) The OIDC issuer URL. For example "https://accounts.google.com".
client_id
: (string, required) The client ID for the OIDC application.
redirect_uri
: (string, required) The redirect URI after authentication.
zetaalpha
: (object, optional) Configuration for Zeta Alpha native authentication using username/password. Required when protocol is "zetaalpha". This authentication method allows user registration directly through the platform.
allow_anonymous_users
: (boolean, optional) Whether to allow users to access the tenant without authentication. Defaults to false.
storage_settings
: (object, required) Specifies how auxiliary data for this tenant is stored. These settings serve as defaults for all indexes in the tenant unless explicitly overridden in the index configuration. This follows the same structure as index storage settings.
ingesting
: (object, required) Specifies how large data ingested by the pipeline is stored.
backend
: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
s3
: (object, optional) Configuration for the S3 backend. See index configuration for detailed schema.
azure
: (object, optional) Configuration for the Azure backend. See index configuration for detailed schema.
disk
: (object, optional) Configuration for the disk backend. See index configuration for detailed schema.
max_file_size
: (integer, optional) The maximum size of files to store in the backend in bytes. Defaults to 1073741824 (1GB).
processing
: (object, required) Specifies how data used by the pipeline is stored.
backend
: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
s3
: (object, optional) Configuration for the S3 backend. See index configuration for detailed schema.
azure
: (object, optional) Configuration for the Azure backend. See index configuration for detailed schema.
disk
: (object, optional) Configuration for the disk backend. See index configuration for detailed schema.
compression
: (object, optional) Configuration for data compression. See index configuration for detailed schema.
features
: (object, optional) Configures the available features for this tenant.
recommendations_settings
: (object, optional) Configures the recommendations feature.
status
: (enum, optional) The status of the recommendations feature. Possible values are "active", "disabled", "hidden". When hidden, the recommendations feature is not visible in the UI, but recommendations will still be generated for benchmarking purposes if an engine is configured.
engines
: (array of objects, optional) List of configured recommendation engines.
legacy
: (object, optional) Empty object to enable the legacy recommendations engine.
agentic
: (object, optional) Configuration for an agent-based recommendations engine.
agent_identifier
: (string, required) The identifier of the agent to use for recommendations.
email
: (object, optional) Configuration for recommendations email notifications.
llm_configurations
: (array of objects, optional) List of Large Language Model configurations available for this tenant.
name
: (string, required) The unique name for this LLM configuration.
vendor
: (enum, required) The LLM provider. Possible values are "openai", "ollama", "anthropic".
vendor_configuration
: (object, required) Provider-specific configuration.
openai
: (object, optional) Configuration for OpenAI. Required when vendor is "openai".
openai_api_key
: (string, required) The API key for OpenAI.
openai_org
: (string, required) The organization ID for OpenAI.
openai_api_base
: (string, optional) The base URL for the OpenAI API. Use this for custom endpoints.
openai_api_type
: (string, optional) The API type.
openai_api_version
: (string, optional) The API version.
anthropic
: (object, optional) Configuration for Anthropic. Required when vendor is "anthropic".
ollama
: (object, optional) Configuration for Ollama. Required when vendor is "ollama".
model_configuration
: (object, required) Configuration for the model to use.
name
: (string, required) The name of the model. For example "gpt-4", "claude-3-opus".
type
: (enum, required) The type of model. Possible values are "chat", "prompt", "prompt_with_logits".
temperature
: (number, required) The sampling temperature to use, between 0 and 2.
max_tokens
: (integer, optional) The maximum number of tokens to generate.
json_output
: (boolean, optional) Whether to force JSON output format. Defaults to false.
seed
: (integer, optional) Random seed for deterministic sampling.
reasoning_effort
: (string, optional) Constrains effort on reasoning for reasoning models.
logprobs
: (boolean, optional) Whether to return log probabilities of the output tokens.
parallel_tool_calls
: (boolean, optional) Whether to enable parallel function calling during tool use.
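The constraints listed for llm_configurations (allowed vendors, a vendor_configuration object matching the chosen vendor, allowed model types, and the temperature range) can be checked on the client before a request is sent. The following is a minimal sketch of such a check; the function name check_llm_configuration, the example entry, and the placeholder credentials are assumptions for illustration and are not part of the API.

```python
ALLOWED_VENDORS = {"openai", "ollama", "anthropic"}
ALLOWED_MODEL_TYPES = {"chat", "prompt", "prompt_with_logits"}


def check_llm_configuration(config: dict) -> list:
    """Return a list of problems found in one llm_configurations entry."""
    problems = []
    vendor = config.get("vendor")
    if vendor not in ALLOWED_VENDORS:
        problems.append(f"unknown vendor: {vendor!r}")
    elif vendor not in config.get("vendor_configuration", {}):
        # The object named after the vendor is required when that vendor is selected.
        problems.append(f"vendor_configuration is missing the {vendor!r} object")
    model = config.get("model_configuration", {})
    if model.get("type") not in ALLOWED_MODEL_TYPES:
        problems.append(f"unknown model type: {model.get('type')!r}")
    temperature = model.get("temperature")
    if not isinstance(temperature, (int, float)) or not 0 <= temperature <= 2:
        problems.append("temperature must be a number between 0 and 2")
    return problems


# Hypothetical entry; the key and organization values are placeholders.
example = {
    "name": "default-chat",
    "vendor": "openai",
    "vendor_configuration": {
        "openai": {"openai_api_key": "<api-key>", "openai_org": "<org-id>"}
    },
    "model_configuration": {"name": "gpt-4", "type": "chat", "temperature": 0.2},
}
print(check_llm_configuration(example))  # []
```

An empty list means the entry satisfies the constraints checked here; the server may apply further validation that this sketch does not cover.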
llm_tracing_configurations
: (array of objects, optional) List of LLM tracing configurations for monitoring and debugging.
name
: (string, required) The unique name for this tracing configuration.
vendor
: (enum, required) The tracing vendor. Possible values are "langfuse", "capture".
vendor_configuration
: (object, required) Vendor-specific configuration.
chat_bot_setups
: (array of objects, optional) List of chat bot configurations available for this tenant.
bot_identifier
: (string, required) The unique identifier for this bot.
llm_configuration_name
: (string, required) The name of the LLM configuration to use (must reference an entry in llm_configurations).
llm_tracing_configuration_name
: (string, optional) The name of the tracing configuration to use.
agent_name
: (string, optional) The name of the agent to use for this bot.
serving_url
: (string, optional) The URL of the service that serves this bot.
bot_configuration
: (object, optional) Additional bot-specific configuration.
sub_agent_mapping
: (object, optional) Mapping of sub-agents for this bot.
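Because llm_configuration_name must reference an entry in llm_configurations, it can help to check the cross-reference before submitting a tenant. The sketch below shows one way to do that; the function name and the sample data are hypothetical.

```python
def find_dangling_llm_references(llm_configurations: list, chat_bot_setups: list) -> list:
    """Return the bot identifiers whose llm_configuration_name matches no LLM configuration."""
    known_names = {config["name"] for config in llm_configurations}
    return [
        bot["bot_identifier"]
        for bot in chat_bot_setups
        if bot["llm_configuration_name"] not in known_names
    ]


# Hypothetical data: the second bot points at a configuration that does not exist.
llms = [{"name": "default-chat"}]
bots = [
    {"bot_identifier": "support-bot", "llm_configuration_name": "default-chat"},
    {"bot_identifier": "research-bot", "llm_configuration_name": "missing-config"},
]
print(find_dangling_llm_references(llms, bots))  # ['research-bot']
```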
federated_search_settings
: (object, optional) Configuration for federated search across multiple indexes.
indexes
: (array of objects, required) List of federated search indexes to include.
document_upload_settings
: (object, optional) Configuration for document upload functionality.
status
: (enum, optional) The status of the document upload feature. Possible values are "active", "disabled", "hidden".
people_settings
: (object, optional) Configuration for people/expert search functionality.
client_settings
: (object, optional) Defines tenant-specific rendering and UI configuration.
white_label_settings
: (object, optional) Branding and customization settings for the UI.
app
: (object, optional) Application-level settings like name and description.
colors
: (object, optional) Color scheme configuration for the UI.
favicon
: (string, optional) URL of the favicon.
logo
: (object, optional) Logo configuration.
header
: (object, optional) Header configuration.
footer
: (object, optional) Footer configuration.
homepage
: (object, optional) Homepage customization.
icons
: (object, optional) Icon customization.
main_content
: (object, optional) Main content area customization.
onboarding
: (object, optional) Onboarding flow customization.
pages
: (object, optional) Custom pages configuration.
search_bar
: (object, optional) Search bar customization.
tags
: (object, optional) Tags display customization.
default_chip_color
: (string, required) The default color for tag chips.
pdf_note_export
: (object, optional) Configuration for PDF note export.
image
: (object, optional) Image configuration.
logo
: (object, optional) Logo configuration.
src
: (string, required) The URL of the logo to use in the PDF note export.
width
: (integer, required) The width of the logo in pixels.
height
: (integer, required) The height of the logo in pixels.
position
: (enum, optional) The position of the image. Possible values are "left", "right", "centered".
header
: (object, optional) Header configuration of the PDF note export.
text
: (string, optional) The text to use in the header.
position
: (enum, optional) The position of the text. Possible values are "left", "right", "centered".
indexes
: (array of objects, optional) List of indexes visible to this tenant.
type
: (enum, required) The type of index. Possible values are "internal", "external".
index_id
: (string, required) The ID of the index.
widgets
: (object, optional) Configuration for widgets available in the UI.
analytics
: (object, optional) Configuration for the analytics widget.
expert_search
: (object, optional) Configuration for the expert search widget.
title
: (string, required) The title displayed for this widget.
layout
: (object, optional) Layout settings for the widget.
find_authored_by
: (object, optional) Configuration for the "find authored by" widget.
find_similar_document
: (object, optional) Configuration for the "find similar documents" widget.
qa
: (object, optional) Configuration for the Q&A widget.
query_analysis
: (object, optional) Configuration for the query analysis widget.
search_results
: (object, optional) Configuration for the search results widget.
vos_viewer
: (object, optional) Configuration for the VOS viewer widget.
pdf_viewer
: (object, optional) Configuration for the PDF viewer.
tags_section
: (object, optional) Configuration for the tags section.
chat
: (object, optional) Chat feature configuration for tags.
default_bot_identifier
: (string, required) The default bot to use.
extra_bot_identifiers
: (array of strings, optional) Additional bots that can be used.
show_sender_avatar
: (boolean, optional) Whether to show sender avatars.
sharing_options
: (array of strings, optional) List of sharing modes available.
export_options
: (array of strings, optional) List of export modes available.
document_card
: (object, optional) Configuration for how document cards are displayed.
note_card
: (object, optional) Configuration for how note cards are displayed.
bot_configurations
: (array of objects, optional) List of bot configurations available in the UI.
chat_page
: (object, optional) Configuration for the chat page.
chat
: (object, optional) Chat bot configuration.
default_bot_identifier
: (string, required) The default bot the chat page should use.
extra_bot_identifiers
: (array of strings, optional) A list of other bots the chat page could use.
show_sender_avatar
: (boolean, optional) Whether to show sender avatars in the chat.
sharing_options
: (array of strings, optional) List of sharing modes to share the chat conversation.
export_options
: (array of strings, optional) List of export modes to export a chat conversation.
Note: The API also provides endpoints for listing, retrieving, updating, and deleting tenants.
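To make the tenant schema more concrete, the sketch below assembles a request body containing only the required top-level fields: tenant, authentication_settings, and storage_settings. The helper name build_tenant_payload, the disk paths, and the use of an empty zetaalpha object are assumptions for illustration; check them against your deployment before use. The body would be sent to the /v1/tenants endpoint with whatever HTTP client and credentials your environment requires, which are not shown here.

```python
import json


def build_tenant_payload(tenant_id: str) -> dict:
    """Assemble a minimal tenant body using native authentication and disk storage."""
    return {
        "tenant": tenant_id,
        "authentication_settings": {
            "protocol": "zetaalpha",
            # Assumption: an empty object is enough to select native authentication.
            "protocol_settings": {"zetaalpha": {}},
        },
        "storage_settings": {
            "ingesting": {
                "backend": "disk",
                "disk": {"disk_location": "/data/ingesting"},  # hypothetical path
                "max_file_size": 1024**3,  # the documented default, in bytes
            },
            "processing": {
                "backend": "disk",
                "disk": {"disk_location": "/data/processing"},  # hypothetical path
            },
        },
    }


payload = build_tenant_payload("zeta-alpha")
print(json.dumps(payload, indent=2))
```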
Index configuration
A new index is created by posting to the /v1/indexes endpoint. It has the following configuration:
Query params
tenant
: The name of the tenant that owns the index.
Request body
name
: (string, required) Human friendly name given to the index. For example "Zeta Alpha".
default
: (boolean, optional) Whether the index is used at query time if no other index is specified. Only one index can be set as default per tenant.
description
: (string, optional) A description of the index. For example "Main index for the research navigator app".
cluster_connection
: (object, required) It specifies what index backend to use and how to access it.
backend
: (string, required) The type of backend to use. Possible values are "opensearch".
host
: (string, required) The hostname of the index. For example "opensearch-cluster-master-headless.opensearch.svc.cluster.local".
port
: (integer, required) The port of the index. For example 9200.
settings
: (object, optional) Connection settings.
use_ssl
: (boolean, optional) Whether to use SSL when connecting to the index.
http_auth
: (array of strings, optional) Credentials to use when connecting to the index. For example ["username", "password"].
verify_certs
: (boolean, optional) Whether to verify the SSL certificate when connecting to the index.
ssl_show_warn
: (boolean, optional) Whether to show a warning when connecting to the index with verify_certs disabled.
ca_certs
: (string, optional) The path to the CA certificate to use when connecting to the index.
client_cert
: (string, optional) The path to the client certificate to use when connecting to the index.
client_key
: (string, optional) The path to the client key to use when connecting to the index.
storage_settings
: (object, required) It specifies how the auxiliary data for this index is stored.
ingesting
: (object, required) It specifies how large data ingested by the pipeline is stored. When data is ingested into the document_content or document_content_path.base64_content fields, it is stored in the backend specified here.
backend
: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
s3
: (object, optional) The configuration for the S3 backend.
s3_bucket_name
: (string, required) The name of the S3 bucket.
s3_key_prefix
: (string, optional) The prefix to use when storing the data in the S3 bucket.
aws_access_key_id
: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
aws_secret_access_key
: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
aws_region
: (string, optional) The AWS region for S3 operations.
aws_endpoint_url
: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
azure
: (object, optional) The configuration for the Azure backend.
azure_account_url
: (string, required) The URL of the Azure account.
azure_container_name
: (string, required) The name of the Azure container.
azure_blob_prefix
: (string, optional) The prefix to use when storing the data in the Azure container.
azure_credential
: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
disk
: (object, optional) The configuration for the disk backend.
disk_location
: (string, optional) The path to the directory where the data will be stored.
max_file_size
: (integer, optional) The maximum size, in bytes, of the files to store in the backend. The default value is 1024**3 (1073741824 bytes, 1GB).
processing
: (object, required) It specifies how the data used by the pipeline is stored.
backend
: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
s3
: (object, optional) The configuration for the S3 backend.
s3_bucket_name
: (string, required) The name of the S3 bucket.
s3_key_prefix
: (string, optional) The prefix to use when storing the data in the S3 bucket.
aws_access_key_id
: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
aws_secret_access_key
: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.
aws_region
: (string, optional) The AWS region for S3 operations.
aws_endpoint_url
: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
azure
: (object, optional) The configuration for the Azure backend.
azure_account_url
: (string, required) The URL of the Azure account.
azure_container_name
: (string, required) The name of the Azure container.
azure_blob_prefix
: (string, optional) The prefix to use when storing the data in the Azure container.
azure_credential
: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
disk
: (object, optional) The configuration for the disk backend.
disk_location
: (string, optional) The path to the directory where the data will be stored.
compression
: (object, optional) The configuration for the compression of the data.
compression_algorithm
: (enum, required) The algorithm to use for compressing the data. Possible values are "bz2", "gzip", "lzma", "snappy", "zlib", "zstd". The recommended value is "zlib".
level
: (integer, optional) The level of compression to use. This is an integer between 0 and 9, where 0 is no compression and 9 is the maximum compression.
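As an illustration of the storage_settings structure, the sketch below builds an S3-backed configuration that omits static credentials, in line with the preference for IRSA stated above. The helper name s3_section, the bucket name, the prefixes, and the compression level are assumptions chosen for the example.

```python
from typing import Optional


def s3_section(bucket: str, prefix: Optional[str] = None, endpoint_url: Optional[str] = None) -> dict:
    """Build the backend and s3 keys of a storage section without static credentials."""
    s3 = {"s3_bucket_name": bucket}
    if prefix is not None:
        s3["s3_key_prefix"] = prefix
    if endpoint_url is not None:
        # Only needed for S3-compatible services such as MinIO or LocalStack.
        s3["aws_endpoint_url"] = endpoint_url
    return {"backend": "s3", "s3": s3}


# Hypothetical bucket and prefixes.
storage_settings = {
    "ingesting": {
        **s3_section("example-bucket", prefix="ingesting/"),
        "max_file_size": 1024**3,  # 1073741824 bytes
    },
    "processing": {
        **s3_section("example-bucket", prefix="processing/"),
        "compression": {"compression_algorithm": "zlib", "level": 6},
    },
}
```

For a local MinIO instance, a call such as s3_section("example-bucket", endpoint_url="http://localhost:9000") adds the aws_endpoint_url key; the URL shown is a placeholder.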
features
: (object, optional) Configures the available features for this index.
neural_search
: (object, optional) Configures the neural search feature.
model_serving_url
: (string, required) The URL of the model server that will be used to compute vector embeddings. For example "http://sentence-encoder-api.production.svc.cluster.local:8080".
model_serving_url_pipeline
: (string, optional) The URL of the model server that will be used to compute vector embeddings in the pipeline (offline). For example "http://sentence-encoder-api.production.svc.cluster.local:8080". If not passed, then the model_serving_url will be used.
compression_factor
: (enum, optional) The compression factor to use for computing vector similarity. This considerably reduces the memory footprint of the index with limited impact on the quality of the results. Possible values are null, "1x", "2x", "4x", "8x", "16x", "32x". The default value is null (no compression), which is equivalent to "1x".
embedding_dimension
: (integer, optional) The dimension of the embeddings. This depends on the model that is served at the model_serving_url. If not passed, then the default value is 768.
hnsw_search_params
: (object, optional) Settings for HNSW search.
k
: (object, optional) Overrides the k parameter by index type. For example, {"documents": 250, "chunks": 150} sets k to 250 for document retrieval and 150 for chunks (based on index name suffix).
tags
: (object, optional) Include an empty object to enable the tag indexing feature.
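The k override under hnsw_search_params is described as being selected by index name suffix. The sketch below illustrates that idea only; the platform's actual matching logic is not documented here, so the function resolve_k, the use of str.endswith, the index names, and the fallback value are all assumptions.

```python
def resolve_k(index_name: str, k_overrides: dict, default_k: int) -> int:
    """Pick the override whose key matches the end of the index name, else the default."""
    for suffix, k in k_overrides.items():
        if index_name.endswith(suffix):
            return k
    return default_k


overrides = {"documents": 250, "chunks": 150}
# Hypothetical index names that end in the override keys.
print(resolve_k("acme_documents", overrides, default_k=100))  # 250
print(resolve_k("acme_chunks", overrides, default_k=100))     # 150
print(resolve_k("acme_people", overrides, default_k=100))     # 100
```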
capacity_configuration
: (object, optional) Configures the storage and replication parameters for this index.
storage_units
: (integer >= 1, optional) The number of storage units to provision for the index. One storage unit supports approximately 50GB of data. Defaults to 1. Note: this value currently cannot be changed after the index is created.
replication_factor
: (integer >= 1, optional) The number of times the data is replicated in the index. For example, if the replication factor is 3, the data is stored 3 times in the index. If the replication factor is 1, the data is stored only once in the index. This setting is useful for high availability and fault tolerance. Defaults to 1.
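Since storage_units cannot be changed after creation, it is worth estimating the value up front. The sketch below rounds an expected data size up to whole storage units using the approximate 50GB figure above. It assumes that figure refers to data size before replication, which this page does not state; the function name and the sample numbers are illustrative.

```python
import math

GB_PER_STORAGE_UNIT = 50  # approximate figure quoted for one storage unit


def estimate_storage_units(expected_data_gb: float) -> int:
    """Round the expected data size up to whole storage units, with a minimum of 1."""
    return max(1, math.ceil(expected_data_gb / GB_PER_STORAGE_UNIT))


capacity_configuration = {
    "storage_units": estimate_storage_units(120),  # 120 / 50 = 2.4, rounded up to 3
    "replication_factor": 2,
}
print(capacity_configuration)  # {'storage_units': 3, 'replication_factor': 2}
```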
document_fields_configuration
: (array of objects, optional) Specifies the names of the fields that the tenant wants in the index, as well as how they behave during indexing and retrieval.
name
: (string, required) The name of the field that will be used to store values in the index, as well as when retrieving and filtering documents. This can be a nested field, for example `authors.first_name`.
type
: (enum, required) The type of field. Possible values are "document_id", "string", "date", "number", "geolocation", "bounding_box", "document_content". Note that any of these types can be multi-valued, meaning they will accept a list of values. Furthermore, when defining a nested field name like authors.first_name, the values will be stored in a flat structure. For example, if we define authors.first_name and authors.last_name and later ingest data like {"authors": [{"first_name": "John", "last_name": "Doe"}, {"first_name": "Jane", "last_name": "Doe"}]}, then the index will contain the following fields: authors.first_name: ["John", "Jane"] and authors.last_name: ["Doe", "Doe"]. This structure minimizes indexing time and storage as well as retrieval time. However, it will not be possible to restrict search results to the ones that contain at least one author with first_name=John and last_name=Doe. If this is a requirement, then the tenant should define the field as nested. The fields document_id and document_content are only relevant when used inside a nested field.
alias
: (string, optional) An alternative field name that can be used to retrieve documents. The alias must be unique. Aliases are also used to enable special processing of fields. To mark a field as the document title, set its alias to "metadata.DCMI.title"; to mark a field as the document description, set its alias to "metadata.DCMI.abstract". Both the title and the description are used to create embeddings along with the content. Another special alias is "metadata.DCMI.created", which tells the system that the field holds the creation date of the document. During processing, a field with this alias defaults to the ingestion date if no value is passed, and the frontend uses it for sorting.
search_options
: (object, optional) How the field behaves during indexing and retrieval. Note that it also determines how the field is configured under the hood.
is_sort_field
: (boolean, optional) Whether the field can be used for sorting documents at retrieval time.
is_facet_field
: (boolean, optional) Whether the field can be used for faceted search. In other words, the search API is able to return a list of existing values for this field along with the document counts.
is_filter_field
: (boolean, optional) Whether the field can be used for filtering documents at retrieval time.
is_returned_in_search_results
: (boolean, optional) Whether the field is part of the search API response payload.
is_used_in_search
: (boolean, optional) Whether the field is used in full text search.
supporting_subqueries
: (boolean, optional) Whether the field can be used in subqueries. Subqueries are used to filter nested objects as if they were root-level documents. Search results return the parent document with the nested object filtered.
analyzer_options
: (object, optional) How the field is analyzed during indexing and retrieval. This only applies to fields of type string that also have search_options.is_used_in_search set to true.
analyzer
: (string, optional) The name of the analyzer to use. We provide some default analyzers, such as zav-en-nostem. Otherwise the name needs to match a custom analyzer defined in the document_field_analysis section of the index payload. This analyzer will be used for both indexing and retrieval. If the tenant wants separate analyzers for indexing and retrieval, they should not define analyzer and should instead pass the analyzer names in the index_analyzer and search_analyzer fields. A list of built-in analyzers can be found here.
search_analyzer
: (string, optional) The name of the analyzer to use at retrieval time.
index_analyzer
: (string, optional) The name of the analyzer to use at indexing time.
nested_fields
: (array of objects, optional) When type=nested, this contains the list of nested fields. The schema of each object in the array is the same as for the document_fields_configuration field. Note that the name field should not be prefixed with the name of the parent field. For example, if the parent field is authors, then the nested fields should be first_name and last_name, not authors.first_name and authors.last_name.
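The difference between flat and nested storage described under type can be reproduced in a few lines. The sketch below is a plain-Python model of the behavior, not the index implementation; the flatten helper is hypothetical, and the first author's last name is changed to "Smith" so that the false positive of flat matching becomes visible.

```python
def flatten(document: dict, parent: str, children: list) -> dict:
    """Collect each child field of a list of objects into one flat, multi-valued field."""
    return {
        f"{parent}.{child}": [item[child] for item in document[parent]]
        for child in children
    }


document = {
    "authors": [
        {"first_name": "John", "last_name": "Smith"},
        {"first_name": "Jane", "last_name": "Doe"},
    ]
}
flat = flatten(document, "authors", ["first_name", "last_name"])

# Flat matching checks each field on its own, so the pairing between values is lost.
flat_match = "John" in flat["authors.first_name"] and "Doe" in flat["authors.last_name"]

# Nested matching checks both conditions against the same author object.
nested_match = any(
    a["first_name"] == "John" and a["last_name"] == "Doe" for a in document["authors"]
)

print(flat_match, nested_match)  # True False
```

The flat structure reports a match for "John Doe" even though no such author exists, which is why a nested field is needed when conditions must hold for the same object.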
document_field_analysis
: (object, optional) Defines custom analyzers and normalizers that can be used in the document_fields_configuration. A custom analyzer can be defined using character filters (char_filter), token filters (filter) and a tokenizer.
filter
: (object, optional) Use this to define a token filter. You can find a list of custom filters here. Each key of this object is the name of the filter and the value depends on the filter type.
char_filter
: (object, optional) Use this to configure a character filter; you can find available filters here. Each key of this object is the name of the filter and the value depends on the filter type.
normalizer
: (object, optional) A normalizer is like an analyzer but with no tokenizer, and it only outputs a single token. Each key of this object is the name of a custom normalizer and its value is an object with the following schema:
type
: (string, required) Normalizer type. Use custom to declare a custom normalizer.
char_filter
: (array of strings, required) List of character filters to use for the normalizer. Each entry refers to a key defined in the char_filter object of document_field_analysis.
filter
: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key of the filter object of document_field_analysis.
analyzer
: (object, optional) Use this to configure an analyzer. Each key of this object is the name of an analyzer, and its value is the analyzer configuration with the following schema:
type
: (string, required) Type of analyzer. You can use one of the types defined here, or custom to define a custom analyzer.
filter
: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key of the filter object of document_field_analysis.
tokenizer
: (string, required) Tokenizer to use in the analyzer. A list of built-in tokenizers can be found here.
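The sketch below shows what a document_field_analysis object with one custom token filter and one custom analyzer might look like, together with a check that every filter an analyzer references is either defined in the filter object or known as built-in. It assumes that filter and tokenizer definitions follow OpenSearch conventions (the "stop" filter type, the "standard" tokenizer, the "lowercase" filter); the names english_stop and my_english, the partial built-in list, and the helper function are hypothetical.

```python
BUILT_IN_FILTERS = {"lowercase", "asciifolding"}  # partial list, for illustration only

document_field_analysis = {
    "filter": {
        # Assumption: filter definitions are written in OpenSearch syntax.
        "english_stop": {"type": "stop", "stopwords": "_english_"},
    },
    "analyzer": {
        "my_english": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stop"],
        }
    },
}


def undefined_filters(analysis: dict) -> dict:
    """Map each analyzer name to the filters it uses that are neither custom nor built-in."""
    known = BUILT_IN_FILTERS | set(analysis.get("filter", {}))
    missing = {}
    for name, analyzer in analysis.get("analyzer", {}).items():
        unknown = [f for f in analyzer.get("filter", []) if f not in known]
        if unknown:
            missing[name] = unknown
    return missing


print(undefined_filters(document_field_analysis))  # {}
```

A field would then opt into this analyzer by setting analyzer_options.analyzer to "my_english" in its document_fields_configuration entry.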
client_settings
: (object, optional) Defines index-specific rendering information.
display_configuration
: (object, optional) Defines how the index content is rendered in the frontend. The field names refer to the names provided in the document_fields_configuration.
title_field
: (string, optional) The field used for rendering the title of the document card.
date_field
: (string, optional) The field used for rendering the date of the document card.
created_by_field
: (string, optional) The field used for rendering the authors of the document card.
description_field
: (string, optional) The field used for rendering the description of the document card.
url_field
: (string, optional) The field used to render the link to the source content.
source_field
: (string, optional) The field used for rendering the source of the document card.
bounding_boxes_field
: (string, optional) The field used for rendering the bounding boxes in the PDF viewer.
image_url_field
: (string, optional) The field used for rendering the image of the document card.
document_metadata_fields
: (array of objects, optional) Defines how other metadata is rendered in the card. This could be the number of references to this document (for example in scientific documents), a list of documents linked to the current one, etc.
type
: (enum, required) The type of metadata. Possible values are "github", "twitter", "counter".
field_name
: (string, optional) The field that contains the data to be rendered (used in the counter type).
url_field
: (string, optional) If the rendered element is clickable, this specifies the link URL.
list_field_name
: (string, optional) The field that contains the list of metadata to be rendered (used for the github and twitter types).
icon
: (enum, optional) The icon associated with the rendered element. Possible values are "github", "twitter", "reference", "citation".
label
: (string, optional) The label associated with the rendered element. This can be used in the tooltip, for example.
editable_fields
: (array of objects, optional) Defines which document fields can be edited by users in the UI.
field_name
: (string, required) The name of the field that can be edited. This refers to a field defined in document_fields_configuration.
label
: (string, required) The display label for this editable field in the UI.
input_type
: (enum, required) The type of input control to use. Possible values are "string", "number", "date", "strings" (for multi-valued string fields).
search_filters_configuration
: (array of objects, optional) When defined, the search filters will be limited to the ones defined in this list of filters.
field_name
: (string, required) This refers to the field name in the document_fields_configuration that this filter will filter by.
display_name
: (string, required) The name that will be displayed as the filter name in the front end.
filter_type
: (string, optional) Identifier of the filter type. The front end uses this string to choose the widget that displays the filter.
url_param
: (string, optional) The string the front end uses as the URL parameter for this filter.
filter_type_settings
: (object, optional) Filter-specific configuration. This can include default values and display names.
checkbox
: (object, optional) Display configuration for checkboxes.
values
: (array, required) List of values to be displayed and filtered by in the checkbox.
label
: (string, required) Display name of the value to filter by.
value
: (string, required) Value to filter by.
autocomplete
: (object, optional) Display configuration for autocomplete.
values
: (array of strings, required) List of values to be displayed and filtered by in the autocomplete.
nested_checkbox
: (object, optional) Display configuration for nested checkboxes. A nested checkbox is a checkbox that contains other checkboxes. This is useful for visualizing a filter whose values have a nested structure.
separator
: (string, required) The separator used to separate the nested fields. For example, if the field is Category.Subcategory, then the separator is "." and Subcategory will be displayed inside Category.
values
: (array of strings, required) List of values to be displayed and filtered by in the nested checkbox.
faceted_checkbox
: (object, optional) Display configuration for faceted checkboxes. A faceted checkbox shows counts for each filter value and supports additional settings.
separator
: (string, optional) The separator used to separate nested fields. For example, if the field is Category.Subcategory, then the separator is "." and Subcategory will be displayed inside Category. No nested structure will be shown if this is not defined.
zero_count
: (enum, required) Whether to display facet values with zero counts. Possible values are "show", "hide", "grayed_out".
count_mode
: (enum, required) How to compute facet counts. Possible values are "filtered" (counts consider current facet filters) or "static" (counts ignore own facet filters).
max_terms
: (integer, optional) Maximum number of terms to display in the facet.
sort
: (enum, required) Sorting order for facet values. Possible values are "term_count" or "term_value".
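The role of separator in the nested_checkbox and faceted_checkbox settings can be illustrated by splitting values into a tree. The sketch below models how a front end might group the values; it is not the actual rendering code, and the function name and sample values are hypothetical.

```python
def build_checkbox_tree(values: list, separator: str) -> dict:
    """Turn separator-joined values into a nested dict of checkbox labels."""
    tree = {}
    for value in values:
        node = tree
        for part in value.split(separator):
            # Descend into the child for this part, creating it on first use.
            node = node.setdefault(part, {})
    return tree


tree = build_checkbox_tree(["Science.Physics", "Science.Biology", "Arts"], ".")
print(tree)  # {'Science': {'Physics': {}, 'Biology': {}}, 'Arts': {}}
```

Here Physics and Biology are displayed inside Science, while Arts has no children.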
retrieval_unit_configuration
: (object, optional) Defines units used for retrieving content. For example, documents can be retrieved as a whole (document) or in chunks.
display_name
: (string, required) The display name for the retrieval unit selection.
url_param
: (string, required) The URL parameter used to identify the selected retrieval unit.
retrieval_units
: (array of objects, required) List of available retrieval units.
display_name
: (string, required) Display name of the retrieval unit.
value
: (enum, required) Value of the retrieval unit. Possible values are "document" or "chunk".
default_retrieval_unit
: (enum, required) Default retrieval unit to use. Must be one of the values defined in retrieval_units.
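A retrieval_unit_configuration object could look like the sketch below, which also checks the rule that default_retrieval_unit must be one of the values listed in retrieval_units. The display names, the url_param value, and the helper function are assumptions for illustration.

```python
retrieval_unit_configuration = {
    "display_name": "Search in",  # hypothetical label
    "url_param": "unit",          # hypothetical URL parameter
    "retrieval_units": [
        {"display_name": "Documents", "value": "document"},
        {"display_name": "Passages", "value": "chunk"},
    ],
    "default_retrieval_unit": "document",
}


def default_unit_is_listed(config: dict) -> bool:
    """Check that the default retrieval unit appears among the configured units."""
    listed = {unit["value"] for unit in config["retrieval_units"]}
    return config["default_retrieval_unit"] in listed


print(default_unit_is_listed(retrieval_unit_configuration))  # True
```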
retrieval_method_configuration
: (object, optional) Defines the available methods for retrieving content.default
: (object, required) Default retrieval method configuration.label
: (string, required) Display name of the default retrieval method.value
: (string, required) Value of the default retrieval method.
status
: (enum, optional) Whether to show the retrieval method selector on the frontend. Possible values are "active" (the selector is shown) or "hidden".
options
: (array of objects, optional) List of available retrieval methods.label
: (string, required) Display name of the retrieval method.value
: (string, required) Value of the retrieval method.
search_sorting_configuration
: (object, optional) Defines the ordering options for the frontend. The field names refer to the names provided in the document_fields_configuration.
field_name
: (string, required) The field to sort by.
display_name
: (string, required) The name that will be rendered in the frontend for this sorting option.
url_param
: (string, optional) The parameter that the frontend will use in the URL when this sorting option is selected.
retrieval_unit
: (string, optional) If specified, this option is only shown for the selected retrieval unit.
search_relevance_configuration
: (array of objects, optional) Defines the default search profile to use, per retrieval unit.
retrieval_unit
: (string, optional) If specified, this search profile is only used as the default when searching for this retrieval unit.
retrieval_method
: (string, optional) If specified, this search profile is only used as the default when searching with this retrieval method.
search_profile_name
: (string, required) Search profile to use as the default. This name refers to the name field of one of the profiles defined under search_profiles_configuration.
default_filters_configuration
: (object, optional) Defines the default filters that are always applied to the search queries. The field names refer to the names provided in the document_fields_configuration.
and_operator
: (array of objects, optional) Defines the filters that are applied with the AND operator. Each object in the array has the same schema as the default_filters_configuration object.
or_operator
: (array of objects, optional) Defines the filters that are applied with the OR operator. Each object in the array has the same schema as the default_filters_configuration object.
not_operator
: (object, optional) Defines the filters that are applied with the NOT operator. The object has the same schema as the default_filters_configuration object.
nested_operator
: (object, optional) Defines the filters that are applied with the nested operator. This filter is used with nested fields.
field_path
: (string, required) The path to the nested field. For example, if the nested field is authors, then the nested filters may refer to authors.first_name.
nested_filter
: (object, required) The filter to apply to the nested fields. The object has the same schema as the default_filters_configuration object. Field names inside this filter may be written either relative to the nested field or as full paths. For example, for authors, the field names could be first_name or authors.first_name.
exists
: (object, optional) Defines the filters that are applied with the exists operator.field_path
: (string, required) The path to the field.
equals_to
: (object, optional) Defines the filters that are applied with the exact match operator.field_path
: (string, required) The path to the field.field_value
: (string, required) The value to compare to.
greater_than
: (object, optional) Defines the filters that are applied with the greater than operator.field_path
: (string, required) The path to the field.field_value
: (string, required) The value to compare to.
greater_than_or_equal_to
: (object, optional) Defines the filters that are applied with the greater than or equal to operator.field_path
: (string, required) The path to the field.field_value
: (string, required) The value to compare to.
less_than
: (object, optional) Defines the filters that are applied with the less than operator.field_path
: (string, required) The path to the field.field_value
: (string, required) The value to compare to.
less_than_or_equal_to
: (object, optional) Defines the filters that are applied with the less than or equal to operator.field_path
: (string, required) The path to the field.field_value
: (string, required) The value to compare to.
is_in
: (object, optional) Defines the filters that are applied with the is in operator.field_path
: (string, required) The path to the field.field_values
: (array of strings, required) The values to compare to.
geo_distance
: (object, optional) Defines the filters that are applied with the geo distance operator.field_path
: (string, required) The path to the field.point
: (object, required) The point to compare to.lat
: (float, required) The latitude of the point to compare to.lon
: (float, required) The longitude of the point to compare to.
distance
: (string, required) The distance to compare to.
geo_bounding_box
: (object, optional) Defines the filters that are applied with the geo bounding box operator.field_path
: (string, required) The path to the field.top_left_point
: (object, required) The top left point of the bounding box.lat
: (float, required) The latitude of the point.lon
: (float, required) The longitude of the point.
bottom_right_point
: (object, required) The bottom right point of the bounding box.lat
: (float, required) The latitude of the point.lon
: (float, required) The longitude of the point.
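The filter operators above compose recursively, which is easier to see in a concrete payload. The following Python sketch builds a `default_filters_configuration` object from small helper functions. The helpers and the field names ("language", "source", "authors.first_name") are hypothetical and only serve the illustration; they must match fields declared in your own document_fields_configuration.

```python
import json


def equals_to(field_path, field_value):
    """Build an exact-match filter object."""
    return {"equals_to": {"field_path": field_path, "field_value": field_value}}


def is_in(field_path, field_values):
    """Build an 'is in' filter object."""
    return {"is_in": {"field_path": field_path, "field_values": list(field_values)}}


# language == "en" AND NOT (source in [...]) AND nested(authors.first_name == "Ada")
default_filters_configuration = {
    "and_operator": [
        equals_to("language", "en"),
        {"not_operator": is_in("source", ["draft", "archive"])},
        {
            "nested_operator": {
                "field_path": "authors",
                "nested_filter": equals_to("authors.first_name", "Ada"),
            }
        },
    ]
}

print(json.dumps(default_filters_configuration, indent=2))
```

Each element of `and_operator` is itself a filter object, so operators can be nested to any depth.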
search_profiles_configuration
: (object, optional) Defines a set of search profiles available for the index. A search profile is a collection of configuration that changes the default search relevance of the documents. The relevance of the documents is determined by the retrieval method (keyword, knn, mixed or reranked), as well as by boosting based on document attributes.
profiles
: (array of objects, required) Holds all search profiles for this index. Only one set of settings can be defined per search profile. Mixed and reranker profiles can reference other search profiles.
name
: (string, optional) Name of the search profile. This name is used to select a default profile in the search_relevance_configuration or at query time using the search_profile_name field.
keyword_settings
: (object, optional) Use this configuration to use keyword search and define value-based boosting weights.
query_settings
: (object, optional) Used to define boosting when the query is contained in one or more document fields. For example, if the user queries for the term "physics", you can boost the documents that have a field "topic" with a "physics" value.
field_search_configs
: (array of objects, required) Each element of this array represents a document field to be boosted.field_path
: (string, required) The path of the field to be boosted.must_match_query
: (boolean, optional) If true, only documents containing the search query on the document field will be returned. Defaults to false.boosting_score
: (object, optional) If the query is included in the document field, the document will be boosted.weight
: (float, required) Multiplier to be used for boosting.
constant_score
: (object, optional) If defined and the query is included in the document, then the document relevance score will be equal to the weight.
weight
: (float, required) Relevance score of the document.
functions_boosting_settings
: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can boost a document when its "source" field is one of the trusted sources.
score_aggregation_method
: (enum, optional) The method used to combine the scores given by each of the functions in function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
function_configs
: (array of objects, required) List of functions to be applied. Each function can refer to a different document field, or to the same document field with a different value to be boosted.
field_path
: (string, required) The path of the field.
function_type
: (enum, required) The function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field is set to a specific value, for example boosting a specific source or a specific author.
function_config
: (object, required) Defines the function configuration.
weighted_value
: (object, required) Configuration for the weighted_value function.
field_value
: (numeric or string, required) The function is applied when the document field is equal to this value.
weight
: (float, required) Score multiplier.
knn_settings
: (object, optional) Use this configuration to use knn search. You can also specify functions to tune the relevance.
hnsw_settings
: (object, optional) Settings for HNSW search.k
: (integer, optional) The number of nearest neighbors to retrieve.
functions_boosting_settings
: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can boost a document when its "source" field is one of the trusted sources.
score_aggregation_method
: (enum, optional) The method used to combine the scores given by each of the functions in function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
function_configs
: (array of objects, required) List of functions to be applied. Each function can refer to a different document field, or to the same document field with a different value to be boosted.
field_path
: (string, required) The path of the field.
function_type
: (enum, required) The function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field is set to a specific value, for example boosting a specific source or a specific author.
function_config
: (object, required) Defines the function configuration.
weighted_value
: (object, required) Configuration for the weighted_value function.
field_value
: (numeric or string, required) The function is applied when the document field is equal to this value.
weight
: (float, required) Score multiplier.
mixed_retrieval_settings
: (object, optional) Use this configuration for the hybrid retrieval method, which takes multiple retrievers and combines their results. Two combination algorithms are currently supported. The base retrievers can either be fully defined in the configuration or be referenced by their profile name.
mixing_strategy_name
: (enum, required) Mixing strategy name. This can be either "rrf" for Reciprocal Rank Fusion or "linear_combination" for a weighted sum of the scores of the base retrievers.
mixing_strategy_config
: (object, required) Configuration for the mixing strategy.
rrf
: (object, optional) Configuration for Reciprocal Rank Fusion.
k
: (integer, required) Rank constant. Must be greater than or equal to 1. The greater the constant, the more influence lower-ranked documents have.
max_documents_multiplier
: (integer, required) Determines how many documents from each individual retriever are considered when mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
linear_combination
: (object, optional) Configuration for the linear combination strategy.
weights
: (array of floats, required) List of weights by which each score will be multiplied. This list should have the same length as the retrievers defined in either retriever_profile_names or retriever_profiles, and be in the same order.
max_documents_multiplier
: (integer, required) Determines how many documents from each individual retriever are considered when mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
retriever_profile_names
: (array of strings, optional) Profile names of the base retrievers. Each retriever should be defined as one of the profiles.
retriever_profiles
: (array of objects, optional) Array of search profile objects of the base retrievers. The schema of each object in the array is the same as for the profiles field.
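The two mixing strategies can be sketched in a few lines of Python. This uses the standard Reciprocal Rank Fusion formula, in which each retriever contributes 1 / (k + rank) for every document it returns, and a plain weighted sum for the linear combination. Whether the platform follows exactly this formulation (ranks starting at 1, documents missing from a retriever contributing nothing) is an assumption, and the function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids using the standard RRF formula."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def linear_combination(score_maps, weights):
    """Weighted sum of per-retriever scores, with weights in retriever order."""
    if len(score_maps) != len(weights):
        raise ValueError("weights must have one entry per retriever")
    combined = {}
    for scores, weight in zip(score_maps, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return combined


keyword_ranking = ["a", "b", "c"]
knn_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([keyword_ranking, knn_ranking], k=1)
mixed = linear_combination([{"a": 1.0, "b": 0.5}, {"b": 1.0}], [0.7, 0.3])
```

With `k=1`, document "b" wins because it ranks well in both lists. A larger `k` flattens the differences between ranks, which matches the description of the rank constant above.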
reranker_settings
: (object, optional) Use this configuration to configure a reranked retriever.
reranker_service
: (string, optional) Reranker service endpoint, for example "http://reranker-api.production.svc.cluster.local:8080".
aggregate_results
: (boolean, optional) If the same document is returned multiple times by the base retriever, whether to aggregate its scores into one score per document. Defaults to true.
rerank_top_n
: (integer, optional) Number of documents to rerank. Defaults to 50.
retriever_profile_names
: (array of strings, optional) Profile names of the base retrievers. Each retriever should be defined as one of the profiles.
retriever_profiles
: (array of objects, optional) Array of search profile objects of the base retrievers. The schema of each object in the array is the same as for the profiles field.
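Because mixed and reranker profiles refer to other profiles by name, it can help to check those references before sending the index configuration. The sketch below is a client-side check written for this purpose; `check_profile_references`, the profile names and the weights are hypothetical, and the platform may perform its own validation.

```python
profiles = [
    {"name": "keyword", "keyword_settings": {}},
    {"name": "knn", "knn_settings": {"hnsw_settings": {"k": 100}}},
    {
        "name": "hybrid",
        "mixed_retrieval_settings": {
            "mixing_strategy_name": "linear_combination",
            "mixing_strategy_config": {
                "linear_combination": {"weights": [0.3, 0.7], "max_documents_multiplier": 3}
            },
            "retriever_profile_names": ["keyword", "knn"],
        },
    },
    {
        "name": "reranked",
        "reranker_settings": {"rerank_top_n": 50, "retriever_profile_names": ["hybrid"]},
    },
]


def check_profile_references(profiles):
    """Return a list of human-readable problems found in the profiles."""
    names = {p["name"] for p in profiles if "name" in p}
    problems = []
    for profile in profiles:
        label = profile.get("name", "<unnamed>")
        # Every referenced base retriever must exist as a named profile.
        for key in ("mixed_retrieval_settings", "reranker_settings"):
            for ref in profile.get(key, {}).get("retriever_profile_names", []):
                if ref not in names:
                    problems.append(f"{label}: unknown profile {ref!r}")
        # Linear combination needs one weight per base retriever.
        mixed = profile.get("mixed_retrieval_settings", {})
        linear = mixed.get("mixing_strategy_config", {}).get("linear_combination")
        if linear is not None:
            retrievers = mixed.get("retriever_profile_names") or mixed.get("retriever_profiles") or []
            if len(linear["weights"]) != len(retrievers):
                problems.append(f"{label}: weights and retrievers differ in length")
    return problems


problems = check_profile_references(profiles)
```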
On a successful response, the Zeta Alpha platform will create and configure a new index for the tenant. Take note of the id field in the response payload. This field will be the required index_id field of other endpoints.
Note: The API also provides endpoints for listing, retrieving, updating, and deleting indexes.
Content Source configuration
A new content source is created by posting to the /v1/indexes/{index_id}/content-sources/ endpoint. It has the following configuration:
Path params
index_id
: (string, required) The id of the index that will store the data from this content source.
Query params
tenant
: The name of the tenant that owns the content source. Note that this must be the same tenant that owns the index, otherwise the endpoint will return an error.
Request body
name
: (string, required) Human friendly name given to the content source. For example, "google-drive-daily".connector
: (enum, required) What connector type this is. Possible values are "s3", "custom_enhancement", "custom", "join_enhancement", "arxiv", "google_drive", "twitter", "bluesky", "box", "custom_web_crawler", "github", "slack", "confluence", "notes", "openreview", "user_documents", "sharepoint", "teams", "tags_enhancement", "metadata_extractor_enhancement", "agent_processor_enhancement", "federated_search", "edit_enhancement".description
: (string, optional) A description of the content source. For example "Daily ingestion of google drive files".integration_id
: (string, optional) The ID of an integration to use for this content source. Used when the connector requires OAuth or other external authentication.is_indexable
: (boolean, optional, default: true) Whether the ingested data will be processed and indexed. One use case for the false value is when the crawled data only serves as a source for an enhancement connector, and the tenant doesn't want individual documents to be searchable.schedule
: (string, optional) A cron-style schedule that defines how often the connector will run. This works for all connectors except "custom" and "custom_enhancement". If this value is not passed, the connector runs only when an ingestion job is manually triggered (explained later).workflow_name_overrides
: (object, optional) Custom workflow names to use for different operations.ingest
: (string, optional) Custom workflow name for ingestion operations.reingest
: (string, optional) Custom workflow name for reingestion operations.delete
: (string, optional) Custom workflow name for deletion operations.access_rights
: (string, optional) Custom workflow name for access rights operations.
connector_configuration
: (object, optional) The connector-specific configuration. Only one configuration may be passed, and it must correspond to the value chosen in connector.
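As a rough illustration of how the path parameter, query parameter and request body fit together, the following Python sketch builds (but does not send) the request. The base URL is a placeholder, authentication headers are omitted because they are not covered in this section, and `build_content_source_request` is a hypothetical helper.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder, replace with your API host


def build_content_source_request(index_id, tenant, name, connector,
                                 schedule=None, connector_configuration=None):
    """Build a POST request for creating a content source."""
    body = {"name": name, "connector": connector}
    if schedule is not None:
        body["schedule"] = schedule
    if connector_configuration is not None:
        body["connector_configuration"] = connector_configuration
    # index_id goes in the path, tenant in the query string.
    url = (
        f"{BASE_URL}/v1/indexes/{quote(index_id, safe='')}/content-sources/"
        f"?{urlencode({'tenant': tenant})}"
    )
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_content_source_request(
    "my-index-id", "zeta-alpha", "google-drive-daily", "google_drive",
    schedule="0 3 * * *",  # cron-style: every day at 03:00
)
```

The request can then be sent with `urllib.request.urlopen(request)` or translated to any other HTTP client.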
On a successful response, the Zeta Alpha platform will create and configure the source connector for the selected tenant and index. Take note of the id field in the response payload. This field will be used in the content_source_id field of other endpoints.
Note: The API also provides endpoints for listing, retrieving, updating, and deleting content sources.
The following subsections will explain how each connector is configured.
Custom connector
This is a generic connector that lets the user ingest custom documents directly through the document batches endpoint.
Crawled Objects
The crawled object can be any document whose fields are defined in the document_fields_configuration of the Index configuration or in the field_mappings of the connector configuration.
Connector Configuration
The configuration object passed to the content source endpoint is the following:
S3 connector
This connector crawls documents stored in an S3 bucket. A metadata JSON file should be provided along with the document content file for the documents to be ingested by this connector:
- Metadata file: This file contains the metadata information about the document, such as title, authors, etc. The filename of the metadata JSON should be the content filename without its extension, followed by .metadata.json. For example, if the content filename is my_document.pdf, the metadata filename should be my_document.metadata.json. An example metadata JSON for an index with the default fields would look like this:
{
  "DCMI.title": "Title of the document",
  "DCMI.abstract": "summary of the document",
  "DCMI.creator": [
    {
      "full_name": "author name"
    }
  ],
  "DCMI.created": "2025",
  "DCMI.date": "2025-01-12",
  "DCMI.source": "source",
  "DCMI.language": "en",
  "DCMI.identifier": [
    "https://example.com/123"
  ],
  "access_rights": [
    "user_role_id:XYZ"
  ],
  "uri": "https://example.com/123"
}
Alternatively, you can define the fields declared in the field_mappings of the connector configuration.
For an index with custom fields, the fields of the JSON should be the ones defined in the document_fields_configuration of the Index configuration (or in the field_mappings of the connector configuration).
- Content file: The file to be ingested. Supported extensions are ".pdf", ".txt", ".html", ".md", ".csv", ".json", ".doc", ".docx", ".odt", ".xls", ".xlsx", ".ods", ".ppt", ".pptx", ".odp", ".vsd" and ".odg".
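The naming rule for metadata files can be expressed as a small helper. This is a sketch that strips only the last extension of the content file, which matches the my_document.pdf example; how multi-part extensions are treated is an assumption. `metadata_key_for` is a hypothetical name.

```python
from pathlib import PurePosixPath


def metadata_key_for(content_key):
    """Return the metadata file key that accompanies a content file key."""
    path = PurePosixPath(content_key)
    # Drop the content file extension, then append ".metadata.json".
    return str(path.with_name(path.stem + ".metadata.json"))


top_level = metadata_key_for("my_document.pdf")
nested = metadata_key_for("reports/2025/my_document.pdf")
```

The helper preserves any key prefix, so the metadata file sits next to the content file in the bucket.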
Crawled Objects
The crawled object consists of the fields declared in the metadata json file with the content file as document content.
Connector Configuration
The configuration object passed to the content source endpoint is the following:
logo_url
is_document_owner
field_mappings
bucket_name
: (string, required) Name of the bucket where the documents are stored.
prefix
: (string, optional) Key prefix within the bucket under which the documents are located.
aws_access_key_id
: (string, optional) The AWS access key ID for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.aws_secret_access_key
: (string, optional) The AWS secret access key for S3 authentication. For security, using IAM roles for service accounts (IRSA) is preferred over static credentials.aws_region
: (string, optional) The AWS region for S3 operations.aws_endpoint_url
: (string, optional) Custom endpoint URL for S3-compatible services. Use this for MinIO, LocalStack, or other S3-compatible storage systems that are not hosted on AWS.
ArXiv connector
This connector crawls arXiv papers using the OAI-PMH protocol (https://arxiv.org/help/oa/index). Currently arXiv doesn't require authentication for this procedure.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not supposed to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
abstract_url
: (string, required) The URL of the abstract. For example "https://arxiv.org/abs/{arxiv_id}".
title (mapped to title)
: (string, required) Title of the paper.
abstract
: (string, required) Abstract of the paper.
categories
: (array of strings, required) ArXiv categories that the paper belongs to.
created_at (mapped to created_at)
: (datetime, required) The date at which the paper was originally submitted to arXiv.
last_updated_at (mapped to last_updated_at)
: (datetime, required) The date at which the paper submission was last updated.
authors
: (array of objects, required) Authors of the paper.
first_name
: (string, required) First name of the author.
last_name
: (string, required) Last name of the author.
full_name (mapped to authors)
: (string, required) The first name concatenated with the last name.
identifiers
: (array of strings, required) List of paper identifiers. It includes the arXiv id, the abstract URL, DOI (if known), and DOI URL (if known).journal
: (string, required) The journal where the paper was published (if known). Otherwise it's an empty string.content_source_name
: (string, required) "arXiv".format
: (string, required) "scientific paper".language
: (string, required) "EN".pdf_url
: (string, required) The arXiv URL to the paper's PDF.
- (reserved) uri (mapped to uri): (string, required) Same as abstract_url.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Null.
- (reserved) document_content_type (mapped to document_content_type): (string, required) "application/pdf".
- (reserved) document_content_url (mapped to document_content_path.url_content.url): (string, required) Same as pdf_url.
Connector configuration
The configuration object passed to the content source endpoint is the following:
arxiv_url
: (string, required) URL of the OAI-PMH API. For example, "http://export.arxiv.org/oai2". Check the arXiv documentation for the right URL to use, to avoid getting the connector banned.
arxiv_sets
: (array of strings, required) List of arXiv category groups to crawl. For example "cs", "stat", "eess".
arxiv_categories
: (array of strings, required) List of arXiv categories to crawl. For example "cs.LG", "cs.AI", "stat.ML". Note that if no categories are passed for a particular group, then nothing will be crawled for that group. That means that if arxiv_sets contains "eess" but no category that starts with "eess." is passed in arxiv_categories, then no papers in the "eess" group are crawled.
max_papers
: (integer, optional) Maximum number of papers to crawl on each run.since_crawl_date
: (string, optional) Defines the publication date of the oldest paper to crawl. If not passed, it defaults to today.allow_access_rights
: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.name
: (string, required) The name of the access right. For example the user UUID of a particular user, or "public" to allow everyone with access to the tenant.type
: (string, optional) The type of access right. For example "user_uuid" if the right is for a particular user.content_source_id
: (string, optional) If the access right is scoped to only a particular content source, then the id should be passed here. For example if a tenant wants to prevent a user from having access to this content source, even though they may have access to it in the source's system.
deny_access_rights
: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
logo_url
is_document_owner
field_mappings
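Since a group listed in arxiv_sets is silently skipped when no matching category is passed, a quick client-side check can catch that before the connector is created. The helper below is a hypothetical sketch that follows the "starts with the group name plus a dot" rule described above.

```python
def sets_without_categories(arxiv_sets, arxiv_categories):
    """Return the groups in arxiv_sets that have no category in arxiv_categories."""
    return [
        group
        for group in arxiv_sets
        if not any(category.startswith(group + ".") for category in arxiv_categories)
    ]


skipped = sets_without_categories(
    ["cs", "stat", "eess"],
    ["cs.LG", "cs.AI", "stat.ML"],
)
```

Here `skipped` contains "eess", meaning no papers from that group would be crawled with this configuration.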
Google Drive connector
This connector crawls Google Drive documents using the Google Drive V3 API. Authentication credentials are needed to access the API in the form of a service account JSON file. The service account must have access to the Google Drive documents of the tenant.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not supposed to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
id
: (string, required) Id of the document as given by the Google Drive API.web_view_link
: (string, required) A link for opening the file in a relevant Google editor or viewer in a browser.
created_time (mapped to created_at)
: (string, required) The time at which the file was created (RFC 3339 date-time).
size
: (string, optional) The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive.full_file_extension
: (string, optional) The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive.mime_type
: (string, required) The MIME type of the file.
modified_time (mapped to last_updated_at)
: (string, required) The last time the file was modified by anyone (RFC 3339 date-time).
name (mapped to title)
: (string, required) The name of the file. This is not necessarily unique within a folder.
path
: (string, required) The path of the file within Google Drive. The root is the name of the (shared) drive.content_source_name
: (string, required) "Google Drive".owners
: (array of objects, required) The owners of the file. Currently, only certain legacy files may have more than one owner. Not populated for items in shared drives.
display_name (mapped to authors)
: (string, required) A plain text displayable name for this user.
permission_id
: (string, required) The user's ID as visible in Permission resources.email_address
: (email, required) The email address of the user. This may not be present in certain contexts if the user has not made their email address visible to the requester.photo_link
: (string, required) A link to the user's profile photo, if available.
version (mapped to version)
: (string, required) A monotonically increasing version number for the file. This reflects every change made to the file on the server, even those not visible to the user.
permissions
: (array of objects, required) The Google Drive permissions assigned to the document.id
: (string, required) A unique identifier for this permission.display_name
: (string, required) The "pretty" name of the value of the permission (for example the user's full name, the name of a Google Group, the domain).type
: (string, required) The type of grantee. Possible values are "user", "group", "domain", "anyone".email_address
: (string, required) The email address of the user or group to which this permission refers.role
: (string, required) The role granted by this permission. Possible values are "owner", "organizer", "fileOrganizer", "writer", "commenter", "reader".deleted
: (boolean, required) Whether the account associated with this permission has been deleted.
export_links
: (object, optional) A mapping between MIME types and the link to export the document in that MIME type.
- (reserved) uri (mapped to uri): (string, required) Same as web_view_link.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Null.
- (reserved) access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions field.
name
: (string, required) The name of the access right. This is given by the value of permissions.email_address.
type
: (string, optional) The type of access right. This is set to google_drive_{permissions.type}.
content_source_id
: (string, optional) Null.
- (reserved) document_content_type (mapped to document_content_type): (string, optional) One of the supported MIME types.
- (reserved) base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded document content. The document content is converted to PDF using Google Drive's export functionality if its MIME type is not one of the supported MIME types. Otherwise the document is downloaded as is and encoded using base64.
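The mapping from Google Drive permissions to the reserved access_rights field can be sketched as follows. This mirrors only the rules stated above (name from email_address, type prefixed with google_drive_, content_source_id set to null). How deleted accounts or permissions without an email address are handled is not documented here, so the sketch does not attempt it; the function name is hypothetical.

```python
def access_rights_from_permissions(permissions):
    """Convert Google Drive permission objects into access right objects."""
    return [
        {
            "name": permission["email_address"],
            "type": f"google_drive_{permission['type']}",
            "content_source_id": None,
        }
        for permission in permissions
    ]


rights = access_rights_from_permissions(
    [
        {"email_address": "ada@example.com", "type": "user", "role": "reader"},
        {"email_address": "research@example.com", "type": "group", "role": "writer"},
    ]
)
```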
Connector configuration
The configuration object passed to the content source endpoint is the following:
service_account
: (object, required) Credentials that grant access to the Google Drive documents of the tenant. This is the JSON file exported in the authentication setup.
type
: (string, required)project_id
: (string, required)private_key_id
: (string, required)private_key
: (string, required)client_email
: (string, required)client_id
: (string, required)auth_uri
: (string, required)token_uri
: (string, required)auth_provider_x509_cert_url
: (string, required)client_x509_cert_url
: (string, required)
path_include_regex_patterns
: (array of strings, optional) Google Drive documents whose path matches any of the regular expressions in the list will be crawled. If a document matches both an include and exclude pattern, the exclude pattern takes precedence and the document is not crawled. If not passed, then all documents are crawled.path_exclude_regex_patterns
: (array of strings, optional) Google Drive documents whose path matches any of the regular expressions in the list will not be crawled. If a document matches both an include and exclude pattern, the exclude pattern takes precedence and the document is not crawled.logo_url
is_document_owner
field_mappings
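The precedence between the include and exclude patterns can be summarized in a short Python sketch. It assumes that a pattern "matches" when it is found anywhere in the path (`re.search`); whether the connector anchors patterns differently is not stated here. `should_crawl` is a hypothetical helper.

```python
import re


def should_crawl(path, include_patterns=None, exclude_patterns=None):
    """Decide whether a document path is crawled, with exclude taking precedence."""
    if exclude_patterns and any(re.search(p, path) for p in exclude_patterns):
        return False
    if not include_patterns:
        # No include patterns passed: everything not excluded is crawled.
        return True
    return any(re.search(p, path) for p in include_patterns)


include = [r"^Shared/"]
exclude = [r"/Drafts/"]
crawled = should_crawl("Shared/Reports/q1.pdf", include, exclude)
excluded = should_crawl("Shared/Drafts/q1.pdf", include, exclude)
```

A path under `Shared/Drafts/` matches both lists and is therefore skipped, while other paths under `Shared/` are crawled.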
SharePoint connector
This type of connector crawls documents stored on a SharePoint site. Authentication credentials for the account are needed to crawl the documents.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Note that reserved fields are not supposed to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
document_id
: (string, required) Id of the document. This is derived from the id on the SharePoint API.
content_source_name
: (string, required) A name for the content source. This can be set in the connector configuration and defaults to "SharePoint".
formatted_id
: (string, required) A formatted id composed of the site name, drive name and item path.name
: (string, required) The name of the document.path
: (string, required) The path of the document within SharePoint.parent
: (object, required) Reference to the parent folder.id
: (string, required) Id of the parent folder.name
: (string, required) Name of the parent folder.path
: (string, required) Path of the parent folder.
drive
: (object, required) Information about the drive containing the document.id
: (string, required) Id of the drive.name
: (string, required) Name of the drive.web_url
: (string, required) URL to access the drive in SharePoint.
site
: (object, required) Information about the site containing the document.id
: (string, required) Id of the site.name
: (string, optional) Name of the site.web_url
: (string, required) URL to access the site in SharePoint.
author
: (array of objects, required) List of authors of the document.id
: (string, optional) Id of the author.display_name
: (string, optional) Display name of the author.
last_modified_by
: (object, required) Identity of the user who last modified the document.group
: (object, optional) Group information if modified by a group.id
: (string, optional) Id of the group.display_name
: (string, optional) Display name of the group.
user
: (object, optional) User information if modified by a user.id
: (string, optional) Id of the user.display_name
: (string, optional) Display name of the user.
created_date_time
: (datetime, required) Creation time of the document.last_modified_date_time
: (datetime, optional) Last modification time of the document.c_tag
: (string, required) An eTag for the file content.e_tag
: (string, required) An eTag for the file's properties.shared
: (object, optional) Sharing information for the document.
- (reserved) uri (mapped to uri): (string, required) URL to the document.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Null.
- (reserved) access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions endpoint.
name
: (string, required) Id of the permission. This can be a Microsoft Group Id, a SharePoint Group Id or a user email.
type
: (string, optional) The type of access right. This can be msft_user, msft_group, sharepoint_site_group or sharepoint_site_user.
content_source_id
: (string, optional) Null.
- (reserved) document_content_type (mapped to document_content_type): (string, optional) One of the supported MIME types.
- (reserved) base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded document content. The document content is converted to PDF using the endpoint format functionality. If the content is not converted to PDF and its MIME type is not one of the supported MIME types, the document is skipped.
Connector configuration
The configuration object passed to the content source endpoint is the following:
access_credentials
: (object, required) The credentials needed to access SharePoint.
    client_id
    : (string, required) The client ID of the Azure AD application.
    client_secret
    : (string, required) The client secret of the Azure AD application.
    tenant_id
    : (string, required) The tenant ID of the Azure AD.
include_one_drives
: (boolean, optional, default: false) Whether to crawl OneDrive drives.
path_inclusion_regex_patterns
: (array of strings, optional) Files whose path matches any of the regular expressions in the list will be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
path_exclusion_regex_patterns
: (array of strings, optional) Files whose path matches any of the regular expressions in the list will not be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
drive_inclusion_regex_patterns
: (array of strings, optional) Drives whose name matches any of the regular expressions in the list will be crawled. If a drive name matches both an include and exclude pattern, the exclude pattern takes precedence and the drive is not crawled. If not passed, then all drives are crawled.
drive_exclusion_regex_patterns
: (array of strings, optional) Drives whose name matches any of the regular expressions in the list will not be crawled. If a drive name matches both an include and exclude pattern, the exclude pattern takes precedence and the drive is not crawled.
site_inclusion_regex_patterns
: (array of strings, optional) Sites whose path matches any of the regular expressions in the list will be crawled. If a site path matches both an include and exclude pattern, the exclude pattern takes precedence and the site is not crawled. If not passed, then all sites are crawled.
site_exclusion_regex_patterns
: (array of strings, optional) Sites whose path matches any of the regular expressions in the list will not be crawled. If a site path matches both an include and exclude pattern, the exclude pattern takes precedence and the site is not crawled.
site_paths
: (array of objects, optional) List of specific SharePoint sites to crawl.
    collection_hostname
    : (string, required) The hostname of the SharePoint site collection.
    site_relative_path
    : (string, required) The relative path of the site within the collection.
one_drive_users
: (array of strings, optional) List of email addresses of the users whose OneDrive contents should be crawled. Only applicable when include_one_drives is true.
include_sub_sites
: (boolean, optional, default: false) Whether to crawl sub-sites of the main sites.
logo_url
is_document_owner
field_mappings
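The precedence rule shared by the path, drive and site patterns above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the connector's implementation: the function name `is_crawled` is hypothetical, and the sketch assumes that patterns are matched anywhere in the value (`re.search`) and that an empty inclusion list behaves like an omitted one.

```python
import re
from typing import Optional, Sequence


def is_crawled(
    value: str,
    inclusion_patterns: Optional[Sequence[str]] = None,
    exclusion_patterns: Optional[Sequence[str]] = None,
) -> bool:
    """Decide whether a file path, drive name or site path would be crawled."""
    # An exclusion match always wins, even if an inclusion pattern also matches.
    if any(re.search(pattern, value) for pattern in exclusion_patterns or []):
        return False
    # Assumption: no inclusion patterns means everything not excluded is crawled.
    if not inclusion_patterns:
        return True
    return any(re.search(pattern, value) for pattern in inclusion_patterns)
```

For example, with an inclusion pattern of `\.pdf$` and an exclusion pattern of `^/docs/`, the path `/docs/a.pdf` matches both and is therefore not crawled, while `/reports/a.pdf` is.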
Box connector
This connector crawls Box files using the Box SDK. Authentication credentials for the account are needed to crawl the documents.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
file_id
: (string, required) Id of the file as given by the Box API.
share_url
: (string, optional) The URL used to share the file.
content_created_at (mapped to created_at)
: (datetime, optional) The time at which the file content was created.
content_modified_at (mapped to last_updated_at)
: (datetime, optional) The time at which the file content was last modified.
created_at
: (datetime, optional) The time at which the file was created in Box.
modified_at
: (datetime, optional) The time at which the file was last modified in Box.
etag
: (string, optional) The file's etag.
classification
: (string, optional) Classification of the file if available.
description
: (string, optional) Description of the file.
comment_count
: (integer, required) Number of comments on the file.
extension
: (string, optional) Extension of the file.
metadata
: (object, optional) Box metadata associated with the file.
name
: (string, required) Name of the file.
filepath
: (string, required) Full path of the file in Box.
sequence_id
: (string, required) A unique string to identify the version of the file.
shared_link_access
: (string, optional) Access level of the shared link.
size
: (integer, optional) Size of the file in bytes.
tags
: (array of strings, required) List of tags applied to the file.
content_source_name
: (string, required) Name identifier for the content source, defaults to "Box".
owners
: (array of objects, required) List with information about the file owner.
    id
    : (string, required) Box user ID of the owner.
    display_name
    : (string, optional) Display name of the owner.
    email_address
    : (string, optional) Email address of the owner.
permissions
: (array of objects, required) List of users and groups with access to the file.
    collab_id
    : (string, required) ID of the collaboration.
    entity_id
    : (string, required) ID of the user or group.
    display_name
    : (string, optional) Display name of the user or group.
    email_address
    : (string, optional) Email address if entity is a user.
    role
    : (string, optional) Access role assigned to the entity.
    status
    : (string, optional) Status of the collaboration.
    entity_type
    : (string, required) Type of entity ("user" or "group").
- (reserved) uri (mapped to uri): (string, required) URL to access the file in Box.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Null.
- (reserved) access_rights (mapped to allow_access_rights): (array of objects, required) The access rights parsed from the permissions field.
    name
    : (string, required) Email address or entity ID of the permission holder.
    type
    : (string, required) Type of the access right, prefixed with "box_", e.g. "box_user", "box_group".
    content_source_id
    : (string, optional) Null.
- (reserved) document_content_type (mapped to document_content_type): (string, required) One of the supported mime types.
- (reserved) base64_content (mapped to document_content_path.base64_content): (string, optional) The base64-encoded file content.
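As a rough illustration of how the reserved access_rights entries relate to the permissions field, the following Python sketch derives one from the other. The function name `permissions_to_access_rights` is hypothetical, and the choice to prefer the email address over the entity ID when both are present is an assumption; the connector's actual parsing logic is not documented here.

```python
from typing import Any, Dict, List


def permissions_to_access_rights(
    permissions: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Build access_rights entries from Box permissions entries."""
    access_rights = []
    for permission in permissions:
        # Assumption: use the email address when available, else the entity ID.
        name = permission.get("email_address") or permission["entity_id"]
        access_rights.append(
            {
                "name": name,
                # The type is the entity type prefixed with "box_".
                "type": f"box_{permission['entity_type']}",
                # Documented as Null for this connector.
                "content_source_id": None,
            }
        )
    return access_rights
```

Under these assumptions, a user collaboration with an email address yields an entry of type "box_user" named after that address, while a group without an email address yields a "box_group" entry named after its entity ID.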
Connector configuration
The configuration object passed to the content source endpoint is the following:
access_credentials
: (object, required) The credentials needed to access Box.
    client_id
    : (string, required) The client ID of the Box application.
    client_secret
    : (string, required) The client secret of the Box application.
    enterprise_id
    : (string, required) The enterprise ID of the Box account.
path_inclusion_regex_patterns
: (array of strings, optional) Files whose path matches any of the regular expressions in the list will be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled. If not passed, then all files are crawled.
path_exclusion_regex_patterns
: (array of strings, optional) Files whose path matches any of the regular expressions in the list will not be crawled. If a file path matches both an include and exclude pattern, the exclude pattern takes precedence and the file is not crawled.
folder_ids
: (array of strings, optional) List of Box folder IDs to crawl. A folder ID is the numeric sequence in a folder URL. If not passed, then all folders are crawled.
since_crawl_date
: (datetime, optional) Only crawl files updated after this datetime. If full_crawl is false and this field is not set, it defaults to the last day. The format is "YYYY-MM-DDTHH:MM:SSZ" (including the timezone). For example "2023-01-01T00:00:00Z".
crawl_limit
: (integer, optional) The maximum number of files to crawl.
full_crawl
: (boolean, optional) Whether to perform a full crawl of all files in the Box account. If the connector has been previously scheduled, it will crawl only the files that have been updated since the last crawl. Performing a full crawl will ingest all files, but it will not delete any files that have been removed from Box.
last_streaming_token
: (string, optional) The last streaming position received from Box. This is used to resume the streaming of files from the last point. This attribute will be overridden by the next connector run.
logo_url
is_document_owner
field_mappings
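The since_crawl_date format and its default can be sketched as follows. This is a minimal sketch under stated assumptions, not the connector's code: the helper names are hypothetical, "last day" is interpreted as 24 hours before the current time, naive datetimes are treated as UTC, and a full crawl without an explicit date is assumed to have no lower bound.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# strftime pattern for the documented "YYYY-MM-DDTHH:MM:SSZ" format.
SINCE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_since_crawl_date(moment: datetime) -> str:
    """Render a datetime in the documented format, expressed in UTC."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime(SINCE_FORMAT)


def effective_since(
    since_crawl_date: Optional[str], full_crawl: bool, now: datetime
) -> Optional[str]:
    """Return the lower bound that would apply to a crawl, or None if unbounded."""
    if since_crawl_date is not None:
        return since_crawl_date
    if full_crawl:
        # Assumption: a full crawl without an explicit date has no lower bound.
        return None
    # Assumption: "last day" means 24 hours before the current time.
    return format_since_crawl_date(now - timedelta(days=1))
```

Passing `now` explicitly keeps the helper deterministic, which makes the default easy to check in isolation.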
Federated Search connector
This connector queries a federated index using the search API and ingests matching documents based on specified queries.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
abstract
: (string, required) Abstract of the document.
title
: (string, required) Title of the document.
created_at (mapped to created_at)
: (datetime, required) The date at which the document was created.
authors
: (array of objects, required) Authors of the document.
    full_name (mapped to authors)
    : (string, required) The full name of the author.
content_source_name
: (string, required) The name of the content source where the document originated.
- (reserved) uri (mapped to uri): (string, required) The URI of the document.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Images associated with the document.
- (reserved) document_content_type (mapped to document_content_type): (string, optional) The MIME type of the document content, typically "application/pdf".
- (reserved) document_content_url (mapped to document_content_path.url_content.url): (string, optional) URL to the document's PDF or content file.
Connector configuration
The configuration object passed to the content source endpoint is the following:
logo_url
is_document_owner
field_mappings
queries
: (array of objects, required) List of query specifications. Each object must contain at least a query_string field and may contain additional parameters such as filters, year, date, or sources. For more information, see the Zeta Alpha Search API documentation.
source_index_id
: (string, required) The name of the search engine to be used. Supported values are "google_scholar", "google", and "bing".
search_api_url
: (string, optional) The URL of the search API endpoint. Defaults to the tenant's search endpoint.
sort_by_relevance
: (boolean, optional, default: true) Whether to sort results by relevance score.
max_number_of_pages
: (integer, optional, default: 10) Maximum number of pages of results to fetch.
page_size
: (integer, optional, default: 10) Number of documents to fetch per page.
authorization
: (string, optional) Authorization header value to include in Search API requests.
stop_early
: (boolean, optional, default: false) Stop fetching further pages when a result that was already ingested is encountered.
request_headers
: (object, optional) Additional HTTP headers to include in the request.
fetch_abstracts
: (boolean, optional, default: false) Whether to fetch document abstracts directly from the document source.
allow_access_rights
: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will be able to access the documents crawled by this connector. If not passed, then no user will be able to retrieve any of the documents.
    name
    : (string, required) The name of the access right.
    type
    : (string, optional) The type of access right.
    content_source_id
    : (string, optional) If the access right is scoped to only a particular content source.
deny_access_rights
: (array of objects, optional) If a Zeta Alpha user has any of the access rights in the list, then they will not be able to access the documents crawled by this connector. The object schema is exactly the same as for allow_access_rights.
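The interaction between max_number_of_pages, page_size and stop_early can be pictured with a small Python sketch. It is illustrative only: `collect_results` and the `fetch_page` callable are hypothetical stand-ins for the connector's internals, pages are assumed to be numbered from 1, and results are assumed to be identified by a "uri" key.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: (page_number, page_size) -> list of results.
FetchPage = Callable[[int, int], List[Dict[str, str]]]


def collect_results(
    fetch_page: FetchPage,
    already_ingested: Set[str],
    max_number_of_pages: int = 10,
    page_size: int = 10,
    stop_early: bool = False,
) -> List[Dict[str, str]]:
    """Gather results page by page, optionally stopping at a known document."""
    collected: List[Dict[str, str]] = []
    for page_number in range(1, max_number_of_pages + 1):
        results = fetch_page(page_number, page_size)
        if not results:
            # No more results are available for this query.
            break
        for result in results:
            if stop_early and result["uri"] in already_ingested:
                # A previously ingested result was hit: stop fetching.
                return collected
            collected.append(result)
    return collected
```

With the defaults, at most 10 pages of 10 documents each are fetched per query, so a single query contributes at most 100 documents in this sketch.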
Twitter connector
This connector crawls tweets using the Twitter API. Authentication credentials are needed to access the API in the form of a Bearer token.
Crawled objects
The following list specifies the schema of each crawled document along with its default mapping to the DocumentRequest model (see Document Request model). Reserved fields are not meant to be mapped in the connector configuration; this is done under the hood. They are listed here for completeness.
id
: (string, required) Unique identifier of the Tweet.
mentioned_urls
: (array of strings, required) List of URLs mentioned in the Tweet.
retweet_count
: (integer, required) Number of times this Tweet has been Retweeted.
reply_count
: (integer, required) Number of Replies of this Tweet.
like_count
: (integer, required) Number of Likes of this Tweet.
created_at (mapped to created_at)
: (datetime, optional) Creation time of the Tweet.
lang
: (string, optional) Language of the Tweet, if detected by Twitter. Returned as a BCP47 language tag.
user_screen_name
: (string, required) The Twitter screen name, handle, or alias that the Tweet author identifies themselves with.
user_followers_count
: (integer, required) Number of followers that the Tweet author has.
user_following_count
: (integer, required) Number of users that the Tweet author is following.
user_profile_url
: (string, required) URL to the Tweet author's profile.
user_profile_image_url
: (string, required) The URL to the profile image for the Tweet author.
tweet_type
: (enum, required) Type of Tweet. Possible values are "tweet", "quoted", "retweet".
- (reserved) uri (mapped to uri): (string, required) Link to the Tweet.
- (reserved) image_urls (mapped to image_urls): (array of strings, optional) Null.
- (reserved) document_content_type (mapped to document_content_type): (string, optional) "text/plain".
- (reserved) document_content (mapped to document_content_path.base64_content): (string, optional) The Tweet content as a plain string.