Concept Reference

Index configuration

A new index is created by posting to the /v1/indexes endpoint. It has the following configuration:

Query params

tenant: The name of the tenant that owns the index.

Request body

name: (string, required) Human friendly name given to the index. For example "Zeta Alpha".
default: (boolean, optional) Whether the index is used at query time if no other index is specified. Only one index can be set as default per tenant.
description: (string, optional) A description of the index. For example "Main index for the research navigator app".
cluster_connection: (object, required) It specifies what index backend to use and how to access it.
- backend: (string, required) The type of backend to use. The only possible value is "opensearch".
- host: (string, required) The hostname of the index. For example "opensearch-cluster-master-headless.opensearch.svc.cluster.local".
- port: (integer, required) The port of the index. For example 9200.
- settings: (object, optional) Connection settings.
  - use_ssl: (boolean, optional) Whether to use SSL when connecting to the index.
  - http_auth: (array of strings, optional) Credentials to use when connecting to the index. For example ["username", "password"].
  - verify_certs: (boolean, optional) Whether to verify the SSL certificate when connecting to the index.
  - ssl_show_warn: (boolean, optional) Whether to show a warning when connecting to the index with verify_certs disabled.
  - ca_certs: (string, optional) The path to the CA certificate to use when connecting to the index.
  - client_cert: (string, optional) The path to the client certificate to use when connecting to the index.
  - client_key: (string, optional) The path to the client key to use when connecting to the index.
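
As an illustration, a cluster_connection object for a TLS-protected OpenSearch cluster might look like the Python dictionary below. The host and port are the example values given above; the credentials and the certificate path are placeholders, not real values.

```python
import json

# Sketch of a cluster_connection object.
# Credentials and the certificate path are placeholders.
cluster_connection = {
    "backend": "opensearch",
    "host": "opensearch-cluster-master-headless.opensearch.svc.cluster.local",
    "port": 9200,
    "settings": {
        "use_ssl": True,
        "http_auth": ["username", "password"],
        "verify_certs": True,
        "ca_certs": "/path/to/ca.pem",  # placeholder path
    },
}

# Serialized form, as it would appear inside the request body.
print(json.dumps(cluster_connection, indent=2))
```
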
storage_settings: (object, required) It specifies how the auxiliary data for this index is stored.
- ingesting: (object, required) It specifies how large data ingested by the pipeline is stored. Data ingested into the document_content or document_content_path.base64_content fields is stored in the backend specified here.
  - backend: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
  - s3: (object, optional) The configuration for the S3 backend.
    - s3_bucket_name: (string, required) The name of the S3 bucket.
    - s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
  - azure: (object, optional) The configuration for the Azure backend.
    - azure_account_url: (string, required) The URL of the Azure account.
    - azure_container_name: (string, required) The name of the Azure container.
    - azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
    - azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
  - disk: (object, optional) The configuration for the disk backend.
    - disk_location: (string, optional) The path to the directory where the data will be stored.
  - max_file_size: (integer, optional) The maximum size, in bytes, of the files to store in the backend. The default value is 1024**3 (1 GiB).
- processing: (object, required) It specifies how the data used by the pipeline is stored.
  - backend: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
  - s3: (object, optional) The configuration for the S3 backend.
    - s3_bucket_name: (string, required) The name of the S3 bucket.
    - s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
  - azure: (object, optional) The configuration for the Azure backend.
    - azure_account_url: (string, required) The URL of the Azure account.
    - azure_container_name: (string, required) The name of the Azure container.
    - azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
    - azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
  - disk: (object, optional) The configuration for the disk backend.
    - disk_location: (string, optional) The path to the directory where the data will be stored.
  - compression: (object, optional) The configuration for the compression of the data.
    - compression_algorithm: (enum, required) The algorithm to use for compressing the data. Possible values are "bz2", "gzip", "lzma", "snappy", "zlib", "zstd". The recommended value is "zlib".
    - level: (integer, optional) The level of compression to use. This is an integer between 0 and 9, where 0 is no compression and 9 is the maximum compression.
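
A storage_settings object could be sketched as follows, with S3 for ingested data and a local disk with zlib compression for processing data. The bucket name, key prefix and disk path are hypothetical. The small check at the end assumes that the object matching the chosen backend has to be present, which the reference above does not state explicitly.

```python
# Sketch of a storage_settings object; bucket, prefix and path are hypothetical.
ONE_GIB = 1024**3  # 1,073,741,824 bytes

storage_settings = {
    "ingesting": {
        "backend": "s3",
        "s3": {
            "s3_bucket_name": "example-ingest-bucket",
            "s3_key_prefix": "ingested/",
        },
        "max_file_size": ONE_GIB,
    },
    "processing": {
        "backend": "disk",
        "disk": {"disk_location": "/data/processing"},
        "compression": {"compression_algorithm": "zlib", "level": 6},
    },
}


def backend_is_configured(section: dict) -> bool:
    """Return True if the object named by 'backend' exists in the section.

    Assumption: a backend needs its matching configuration object.
    """
    return section["backend"] in section


for name, section in storage_settings.items():
    print(name, section["backend"], backend_is_configured(section))
```
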
features: (object, optional) Configures the available features for this index.
- neural_search: (object, optional) Configures the neural search feature.
  - model_serving_url: (string, required) The URL of the model server that will be used to compute vector embeddings. For example "http://sentence-encoder-api.production.svc.cluster.local:8080/v0.6".
  - model_serving_url_pipeline: (string, optional) The URL of the model server that will be used to compute vector embeddings in the pipeline (offline). For example "http://sentence-encoder-api.production.svc.cluster.local:8080/v0.6". If not passed, then the model_serving_url will be used.
  - compression_factor: (enum, optional) The compression factor to use for computing vector similarity. This considerably reduces the memory footprint of the index with limited impact on the quality of the results. Possible values are null, "1x", "2x", "4x", "8x", "16x", "32x". The default value is null (no compression), which is equivalent to "1x".
  - embedding_dimension: (integer, optional) The dimension of the embeddings. This depends on the model that is selected in the model_serving_url. If not passed, then the default value is 768.
capacity_configuration: (object, optional) Configures the storage and replication parameters for this index.
- storage_units: (integer >= 1, optional) The number of storage units to provision for the index. One storage unit supports approximately 10GB of data. Defaults to 1. Note: this value cannot currently be changed after the index is created.
- replication_factor: (integer >= 1, optional) The number of times the data is replicated in the index. For example, if the replication factor is 3, the data is stored 3 times in the index. If the replication factor is 1, the data is stored only once in the index. This setting is useful for high availability and fault tolerance. Defaults to 1.
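
Because storage_units cannot be changed after creation, it can help to estimate it up front. The sketch below derives a value from the approximate 10GB-per-unit figure above and pairs it with a features object. The helper estimate_storage_units and the 35 GB input are illustrative assumptions, not part of the API.

```python
import math

GB_PER_STORAGE_UNIT = 10  # approximate figure from the description above


def estimate_storage_units(expected_data_gb: float) -> int:
    """Hypothetical helper: round the expected data size up to whole units."""
    return max(1, math.ceil(expected_data_gb / GB_PER_STORAGE_UNIT))


capacity_configuration = {
    "storage_units": estimate_storage_units(35),  # 35 GB of data -> 4 units
    "replication_factor": 2,
}

features = {
    "neural_search": {
        "model_serving_url": "http://sentence-encoder-api.production.svc.cluster.local:8080/v0.6",
        "compression_factor": "4x",
        "embedding_dimension": 768,
    }
}

print(capacity_configuration)
```
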
document_fields_configuration: (array of objects, optional) Specifies the name of the fields that the tenant wants in the index, as well as how they behave during indexing and retrieval.
- name: (string, required) The name of the field that will be used to store values in the index, as well as when retrieving and filtering documents. This can be a nested field, for example authors.first_name.
- type: (enum, required) The type of field. Possible values are "document_id", "string", "date", "number", "geolocation", "bounding_box", "document_content". Note that any of these types can be multi-valued, meaning they will accept a list of values. Furthermore, when defining a nested field name like authors.first_name, the values will be stored in a flat structure. For example, if we define authors.first_name and authors.last_name and later ingest data like {"authors": [{"first_name": "John", "last_name": "Doe"}, {"first_name": "Jane", "last_name": "Doe"}]}, then the index will contain the following fields: authors.first_name: ["John", "Jane"] and authors.last_name: ["Doe", "Doe"]. This structure minimizes indexing time and storage as well as retrieval time. However, it will not be possible to restrict search results to the ones that contain at least one author with first_name=John and last_name=Doe. If this is a requirement, then the tenant should define the field with type nested and list its sub-fields in nested_fields. The types document_id and document_content are only relevant when used inside a nested field.
- alias: (string, optional) An alternative field name that can be used to retrieve documents. Note that the alias must be unique. This field is also used to enable special processing of the fields. To let the system know that a field should be treated as the document title, set its alias to "metadata.DCMI.title". To mark a field as the document description, set its alias to "metadata.DCMI.abstract". Both the title and the description are used to create embeddings along with the content. Another special alias is "metadata.DCMI.created", which tells the system that the field holds the creation date of the document. During processing, a field with this alias defaults to the ingestion date if not passed, and it is also used for sorting by the frontend.
- search_options: (object, optional) How the field behaves during indexing and retrieval. Note that it also determines how the field is configured under the hood.
  - is_sort_field: (boolean, optional) Whether the field can be used for sorting documents at retrieval time.
  - is_facet_field: (boolean, optional) Whether the field can be used for faceted search. In other words, the search API is able to return a list of existing values for this field along with the document counts.
  - is_filter_field: (boolean, optional) Whether the field can be used for filtering documents at retrieval time.
  - is_returned_in_search_results: (boolean, optional) Whether the field is part of the search API response payload.
  - is_used_in_search: (boolean, optional) Whether the field is used in full text search.
  - supporting_subqueries: (boolean, optional) Whether the field can be used in subqueries. Subqueries are used to filter nested objects as if they were root-level documents. Search results return the parent document with the nested object filtered.
- analyzer_options: (object, optional) How the field is analyzed during indexing and retrieval. This is only for fields of type string that also have search_options.is_used_in_search set to true.
  - analyzer: (string, optional) The name of the analyzer to use. We provide some default analyzers, like zav-en-nostem. Otherwise, the name needs to match a custom analyzer defined in the document_field_analysis section of the index payload. This analyzer will be used for both indexing and retrieval. If the tenant wants to have separate analyzers for indexing and retrieval, then they should not define analyzer and instead pass the names of the analyzers in the index_analyzer and search_analyzer fields. A list of built-in analyzers can be found here.
  - search_analyzer: (string, optional) The name of the analyzer to use at retrieval time.
  - index_analyzer: (string, optional) The name of the analyzer to use at indexing time.
- nested_fields: (array of objects, optional) When type=nested then this contains the list of nested fields. The schema of each object in the array is the same as for the document_fields_configuration field. Note that the name field should not be prefixed with the name of the parent field. For example, if the parent field is authors, then the nested fields should be first_name and last_name, not authors.first_name and authors.last_name.
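
The difference between flat and nested storage can be illustrated with a short sketch. The flatten function below only mimics the flat structure described for dotted field names; it is not the backend's actual implementation, and the helper names are invented for this example. With two authors, "John Doe" and "Jane Smith", the flat representation wrongly matches a search for an author named "John Smith", while a per-object (nested) match does not.

```python
from collections import defaultdict


def flatten(doc: dict) -> dict:
    """Collect leaf values under dotted paths, as in the flat structure."""
    flat = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                walk(item, path)
        else:
            flat[path].append(value)

    walk(doc, "")
    return dict(flat)


def flat_match(flat: dict, first: str, last: str) -> bool:
    # Each field is checked independently, so values from different
    # authors can satisfy the two conditions.
    return (first in flat.get("authors.first_name", [])
            and last in flat.get("authors.last_name", []))


def nested_match(doc: dict, first: str, last: str) -> bool:
    # Both conditions must hold for the same author object.
    return any(a["first_name"] == first and a["last_name"] == last
               for a in doc["authors"])


doc = {"authors": [{"first_name": "John", "last_name": "Doe"},
                   {"first_name": "Jane", "last_name": "Smith"}]}
flat = flatten(doc)

print(flat_match(flat, "John", "Smith"))   # True: a false positive
print(nested_match(doc, "John", "Smith"))  # False

# Field definition that keeps the per-author structure (type nested).
authors_field = {
    "name": "authors",
    "type": "nested",
    "nested_fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}
```

Note that the nested field names are first_name and last_name, without the authors prefix.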
document_field_analysis: (object, optional) Defines custom analyzers and normalizers that can be used in the document_fields_configuration. A custom analyzer can be defined using character filters (char_filter), token filters (filter) and a tokenizer.
- filter: (object, optional) Use this to define a token filter. You can find a list of custom filters here. Each key of this object is the name of the filter and the value depends on the filter type.
- char_filter: (object, optional) Use this to configure a character filter; you can find available filters here. Each key of this object is the name of the filter and the value depends on the filter type.
- normalizer: (object, optional) A normalizer is like an analyzer but has no tokenizer and outputs only a single token. Each key of this object is the name of a custom normalizer and its value is an object with the following schema:
  - type: (string, required) Normalizer type. Use custom to declare a custom normalizer.
  - char_filter: (array of strings, required) List of character filters to use for the normalizer. Each entry refers to one of the keys defined in the char_filter object of document_field_analysis.
  - filter: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key of the filter object of document_field_analysis.
- analyzer: (object, optional) Use this to configure an analyzer. Each key of this object is the name of an analyzer, and its value is the analyzer configuration with the following schema:
  - type: (string, required) Type of analyzer. You can use one of the types defined here, or custom to define a custom analyzer.
  - filter: (array of strings, required) List of filter names. Each filter can be a built-in filter or a key of the filter object of document_field_analysis.
  - tokenizer: (string, required) Tokenizer to use in the analyzer. A list of built-in tokenizers can be found here.
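
A document_field_analysis object with one custom analyzer and one custom normalizer might be sketched as below. The bodies of the filter and char_filter entries use OpenSearch-style definitions (a "stop" token filter and a "mapping" character filter), on the assumption that they are passed through to the OpenSearch backend. All names such as my_english are invented for the example, and missing_filters is a hypothetical helper that checks that every referenced filter is either defined or assumed to be built-in.

```python
document_field_analysis = {
    "char_filter": {
        "ampersand_to_and": {"type": "mapping", "mappings": ["& => and"]},
    },
    "filter": {
        "english_stop": {"type": "stop", "stopwords": "_english_"},
    },
    "analyzer": {
        "my_english": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stop"],
        },
    },
    "normalizer": {
        "lowercase_keyword": {
            "type": "custom",
            "char_filter": ["ampersand_to_and"],
            "filter": ["lowercase"],
        },
    },
}

# A string field that uses the custom analyzer for indexing and retrieval.
title_field = {
    "name": "title",
    "type": "string",
    "search_options": {"is_used_in_search": True},
    "analyzer_options": {"analyzer": "my_english"},
}

BUILT_IN_FILTERS = {"lowercase"}  # assumed built-in; not an exhaustive list


def missing_filters(analysis: dict) -> set:
    """Return filter names that are referenced but neither defined nor built-in."""
    defined = set(analysis.get("filter", {})) | BUILT_IN_FILTERS
    referenced = set()
    for section in ("analyzer", "normalizer"):
        for config in analysis.get(section, {}).values():
            referenced.update(config.get("filter", []))
    return referenced - defined


print(missing_filters(document_field_analysis))  # set()
```
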
client_settings: (object, optional) Defines index specific rendering information.
- display_configuration: (object, optional) Defines how the index content is rendered in the frontend. The field names refer to the names provided in the document_fields_configuration.
  - title_field: (string, optional) The field used for rendering the title of the document card.
  - date_field: (string, optional) The field used for rendering the date of the document card.
  - created_by_field: (string, optional) The field used for rendering the authors of the document card.
  - description_field: (string, optional) The field used for rendering the description of the document card.
  - url_field: (string, optional) The field used to render the link to the source content.
  - source_field: (string, optional) The field used for rendering the source of the document card.
  - bounding_boxes_field: (string, optional) The field used for rendering the bounding boxes in the PDF viewer.
  - image_url_field: (string, optional) The field used for rendering the image of the document card.
  - document_metadata_fields: (array of objects, optional) Defines how other metadata is rendered in the card. This could be the number of references to this document (for example in scientific documents), a list of documents linked to the current one, etc.
    - type: (enum, required) The type of metadata. Possible values are "github", "twitter", "counter".
    - field_name: (string, optional) The field that contains the data to be rendered (used in the counter type).
    - url_field: (string, optional) If the rendered element is clickable, this specifies the link URL.
    - list_field_name: (string, optional) The field that contains the list of metadata to be rendered (used for the github and twitter types).
    - icon: (enum, optional) The icon associated with the rendered element. Possible values are "github", "twitter", "reference", "citation".
    - label: (string, optional) The label associated with the rendered element. This can be used in the tooltip, for example.
- search_filters_configuration: (array of objects, optional) When defined, the search filters will be limited to the ones defined in this list of filters.
  - field_name: (string, required) This refers to the field name in the document_fields_configuration that this filter will filter by.
  - display_name: (string, required) The name that will be displayed as the filter name in the front end.
  - filter_type: (string, optional) Identifier of the filter type. The frontend uses this string to choose the widget that displays this filter.
  - url_param: (string, optional) The name the frontend uses for this filter as a URL parameter.
  - filter_type_settings: (object, optional) Filter specific configuration, this could include default values and display names.
    - checkbox: (object, optional) Display configuration for checkboxes.
      - values: (array, required) List of values to be displayed and filtered by in the checkbox.
        - label: (string, required) Display name of the value to filter by.
        - value: (string, required) Value to filter by.
- search_sorting_configuration: (object, optional) Defines the ordering options for the frontend. The field names refer to the names provided in the document_fields_configuration.
  - field_name: (string, required) The field to sort by.
  - display_name: (string, required) The name that will be rendered in the frontend for this sorting option.
  - url_param: (string, optional) The parameter that the frontend will use in the URL when this sorting option is selected.
  - retrieval_unit: (string, optional) If specified, this option is only shown for the selected retrieval unit.
- search_relevance_configuration: (array of objects, optional) Defines the default search profile to use, per retrieval unit.
  - retrieval_unit: (string, optional) If specified, this search profile will only be used as default when searching for the retrieval unit.
  - search_profile_name: (string) The search profile to use as default. This name refers to the name field of the profiles defined under search_profiles_configuration.
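
Since client_settings refers to fields by the names given in document_fields_configuration, a quick consistency check before posting can catch typos. The following is a minimal sketch with invented field names. The filter_type identifier "checkbox" and the profile name "default_keyword" are assumptions, and unknown_field_references is a hypothetical helper, not part of the API.

```python
document_fields_configuration = [
    {"name": "title", "type": "string", "alias": "metadata.DCMI.title"},
    {"name": "created_at", "type": "date", "alias": "metadata.DCMI.created",
     "search_options": {"is_sort_field": True}},
    {"name": "source", "type": "string",
     "search_options": {"is_filter_field": True}},
]

client_settings = {
    "display_configuration": {
        "title_field": "title",
        "date_field": "created_at",
        "source_field": "source",
    },
    "search_filters_configuration": [
        {
            "field_name": "source",
            "display_name": "Source",
            "filter_type": "checkbox",  # assumed identifier
            "url_param": "source",
            "filter_type_settings": {
                "checkbox": {
                    "values": [
                        {"label": "arXiv", "value": "arxiv"},
                        {"label": "Blog posts", "value": "blog"},
                    ]
                }
            },
        }
    ],
    "search_sorting_configuration": {
        "field_name": "created_at",
        "display_name": "Newest first",
        "url_param": "date",
    },
    # Hypothetical profile name; it would be defined in search_profiles_configuration.
    "search_relevance_configuration": [{"search_profile_name": "default_keyword"}],
}


def unknown_field_references(settings: dict, fields: list) -> set:
    """Return field names used in client_settings that are not defined."""
    known = {field["name"] for field in fields}
    referenced = set()
    display = settings.get("display_configuration", {})
    referenced.update(v for k, v in display.items() if k.endswith("_field"))
    for flt in settings.get("search_filters_configuration", []):
        referenced.add(flt["field_name"])
    sorting = settings.get("search_sorting_configuration")
    if sorting:
        referenced.add(sorting["field_name"])
    return referenced - known


print(unknown_field_references(client_settings, document_fields_configuration))  # set()
```
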
default_filters_configuration: (object, optional) Defines the default filters that are always applied to the search queries. The field names refer to the names provided in the document_fields_configuration.
- and_operator: (array of objects, optional) Defines the filters that are applied with the AND operator. Each object in the array has the same schema as the default_filters_configuration object.
- or_operator: (array of objects, optional) Defines the filters that are applied with the OR operator. Each object in the array has the same schema as the default_filters_configuration object.
- not_operator: (object, optional) Defines the filters that are applied with the NOT operator. The object has the same schema as the default_filters_configuration object.
- nested_operator: (object, optional) Defines the filters that are applied with the nested operator. This filter is used with nested fields.
  - field_path: (string, required) The path to the nested field. For example, if the nested field is authors, then the nested filters may refer to authors.first_name.
  - nested_filter: (object, required) The filter to apply to the nested fields. The object has the same schema as the default_filters_configuration object. The field names used inside this filter could be relative to the nested field or not. For example for authors, the field names could be first_name or authors.first_name.
- exists: (object, optional) Defines the filters that are applied with the exists operator.
  - field_path: (string, required) The path to the field.
- equals_to: (object, optional) Defines the filters that are applied with the exact match operator.
  - field_path: (string, required) The path to the field.
  - field_value: (string, required) The value to compare to.
- greater_than: (object, optional) Defines the filters that are applied with the greater than operator.
  - field_path: (string, required) The path to the field.
  - field_value: (string, required) The value to compare to.
- greater_than_or_equal_to: (object, optional) Defines the filters that are applied with the greater than or equal to operator.
  - field_path: (string, required) The path to the field.
  - field_value: (string, required) The value to compare to.
- less_than: (object, optional) Defines the filters that are applied with the less than operator.
  - field_path: (string, required) The path to the field.
  - field_value: (string, required) The value to compare to.
- less_than_or_equal_to: (object, optional) Defines the filters that are applied with the less than or equal to operator.
  - field_path: (string, required) The path to the field.
  - field_value: (string, required) The value to compare to.
- is_in: (object, optional) Defines the filters that are applied with the is in operator.
  - field_path: (string, required) The path to the field.
  - field_values: (array of strings, required) The values to compare to.
- geo_distance: (object, optional) Defines the filters that are applied with the geo distance operator.
  - field_path: (string, required) The path to the field.
  - point: (object, required) The point to compare to.
    - lat: (float, required) The latitude of the point to compare to.
    - lon: (float, required) The longitude of the point to compare to.
  - distance: (string, required) The distance to compare to.
- geo_bounding_box: (object, optional) Defines the filters that are applied with the geo bounding box operator.
  - field_path: (string, required) The path to the field.
  - top_left_point: (object, required) The top left point of the bounding box.
    - lat: (float, required) The latitude of the point.
    - lon: (float, required) The longitude of the point.
  - bottom_right_point: (object, required) The bottom right point of the bounding box.
    - lat: (float, required) The latitude of the point.
    - lon: (float, required) The longitude of the point.
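
To show how the operators compose, here is a small default_filters_configuration together with a toy evaluator for a subset of the operators. The evaluator only illustrates the intended semantics on plain dictionaries; it is not how the service applies filters, and it assumes that each filter object carries exactly one operator. The field names language and status are invented.

```python
# Keep English documents that are neither drafts nor archived.
default_filters_configuration = {
    "and_operator": [
        {"equals_to": {"field_path": "language", "field_value": "en"}},
        {"not_operator": {
            "is_in": {"field_path": "status",
                      "field_values": ["draft", "archived"]},
        }},
    ]
}


def matches(filter_cfg: dict, doc: dict) -> bool:
    """Evaluate a filter object against a flat document (illustration only)."""
    if len(filter_cfg) != 1:
        raise ValueError("this sketch expects exactly one operator per object")
    operator, body = next(iter(filter_cfg.items()))
    if operator == "and_operator":
        return all(matches(sub, doc) for sub in body)
    if operator == "or_operator":
        return any(matches(sub, doc) for sub in body)
    if operator == "not_operator":
        return not matches(body, doc)
    if operator == "exists":
        return body["field_path"] in doc
    if operator == "equals_to":
        return str(doc.get(body["field_path"])) == body["field_value"]
    if operator == "is_in":
        return str(doc.get(body["field_path"])) in body["field_values"]
    raise ValueError(f"operator not covered by this sketch: {operator}")


print(matches(default_filters_configuration,
              {"language": "en", "status": "published"}))  # True
print(matches(default_filters_configuration,
              {"language": "en", "status": "draft"}))      # False
```
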
search_profiles_configuration: (object, optional) Defines the set of search profiles available for the index. A search profile is a collection of settings that changes the default search relevance of the documents. The relevance of a document is determined by the retrieval method (keyword, knn, mixed or reranked), as well as by boosting based on document attributes.
- profiles: (array of objects, required) Holds all search profiles for this index. Only one set of settings can be defined per search profile. Mixed and reranker profiles can make reference to other search profiles.
  - name: (string, optional) Name of the search profile. This name is used to select a default profile in the search_relevance_configuration or at query time using the search_profile_name field.
  - keyword_settings: (object, optional) Use this configuration to use keyword search and define value based boosting weights.
    - query_settings: (object, optional) This is used to define boosting when the query is contained in one or more document fields. For example, if the user queries for the term "physics", you can boost the documents that have a field "topic" with a "physics" value.
      - field_search_configs: (array of objects, required) Each element of this array represents a document field to be boosted.
        - field_path: (string, required) The path of the field to be boosted.
        - must_match_query: (boolean, optional) If true, only documents containing the search query on the document field will be returned. Defaults to false.
        - boosting_score: (object, optional) If the query is included in the document field, the document will be boosted.
          - weight: (float, required) Multiplier to be used for boosting.
        - constant_score: (object, optional) If defined and the query is included in the document, then the document relevance score will be equal to the weight.
          - weight: (float, required) Relevance score of the document.
    - functions_boosting_settings: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can boost a document when its "source" field is one of the trusted sources.
      - score_aggregation_method: (enum, optional) Refers to the method that is used to combine the scores given by each of the functions in function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
      - function_configs: (array of objects, required) List of functions to be applied. Each function can refer to different document fields or the same document field but with different values to be boosted.
        - field_path: (string, required) The path of the field.
        - function_type: (enum, required) This specifies the function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field value is set to a specific value, for example boosting a specific source or a specific author.
        - function_config: (object, required) Defines the function configuration.
          - weighted_value: (object, required) Configuration for the weighted_value function.
            - field_value: (numeric or string, required) The function will be applied when the document field is equal to this parameter.
            - weight: (float, required) Score multiplier.
  - knn_settings: (object, optional) Use this configuration to use knn search. You can also specify functions to tune the relevance.
    - functions_boosting_settings: (object, optional) Use this configuration to boost a document based on its field value. For example, if your documents have a "source" field and you want to boost trusted sources, you can boost a document when its "source" field is one of the trusted sources.
      - score_aggregation_method: (enum, optional) Refers to the method that is used to combine the scores given by each of the functions in function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
      - function_configs: (array of objects, required) List of functions to be applied. Each function can refer to different document fields or the same document field but with different values to be boosted.
        - field_path: (string, required) The path of the field.
        - function_type: (enum, required) This specifies the function to be applied. Currently, the only supported function is "weighted_value", which boosts the document when the field value is set to a specific value, for example boosting a specific source or a specific author.
        - function_config: (object, required) Defines the function configuration.
          - weighted_value: (object, required) Configuration for the weighted_value function.
            - field_value: (numeric or string, required) The function will be applied when the document field is equal to this parameter.
            - weight: (float, required) Score multiplier.
  - mixed_retrieval_settings: (object, optional) Use this configuration for the hybrid retrieval method, which takes multiple retrievers and combines their results. Currently, two combination algorithms are supported. The base retrievers can either be fully defined in the configuration or referenced by profile name.
    - mixing_strategy_name: (enum, required) Mixing strategy name. This can be either rrf for Reciprocal Rank Fusion or linear_combination for a weighted sum of the scores of the base retrievers.
    - mixing_strategy_config: (object, required) Configuration for the mixing strategy.
      - rrf: (object, optional) Configuration for Reciprocal Rank Fusion.
        - k: (integer, required) Rank constant, must be greater than or equal to 1. The greater the constant, the more influence lower ranked documents have.
        - max_documents_multiplier: (integer, required) This parameter determines how many documents from each individual retriever are considered for mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
      - linear_combination: (object, optional) Configuration for the linear combination strategy.
        - weights: (array of floats, required) List of weights by which each score will be multiplied. This list should have the same length as the retrievers defined in either retriever_profile_names or retriever_profiles, and be in the same order.
        - max_documents_multiplier: (integer, required) This parameter determines how many documents from each individual retriever are considered for mixing. The number of documents considered is the page size (determined at query time) multiplied by this number.
    - retriever_profile_names: (array of strings, optional) Profile names of the base retrievers. Each name should match a profile defined in profiles.
    - retriever_profiles: (array of objects, optional) Array of search profile objects for the base retrievers. The schema of each object in the array is the same as for the profiles field.
  - reranker_settings: (object, optional) Use this configuration to configure a reranked retriever.
    - reranker_service: (string, optional) Reranker service endpoint, for example "http://reranker-api.production.svc.cluster.local:8080/v2".
    - aggregate_results: (boolean, optional) If the same document appears multiple times in the results of the base retriever, whether to aggregate the scores into one score per document. Defaults to True.
    - rerank_top_n: (integer, optional) Number of documents to rerank. Defaults to 50.
    - retriever_profile_names: (array of strings, optional) Profile names of the base retrievers. Each name should match a profile defined in profiles.
    - retriever_profiles: (array of objects, optional) Array of search profile objects for the base retrievers. The schema of each object in the array is the same as for the profiles field.
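
The profile options above can be combined as in the following sketch: a keyword profile with boosting, a knn profile, and a hybrid profile that mixes the two with Reciprocal Rank Fusion. Profile and field names are invented, and the empty knn_settings object is assumed to select plain knn search. The reciprocal_rank_fusion function implements the standard RRF formula (each document scores the sum of 1 / (k + rank) over the rankings it appears in) purely to illustrate the effect of k; the service's exact implementation is not documented here.

```python
search_profiles_configuration = {
    "profiles": [
        {
            "name": "keyword_trusted",
            "keyword_settings": {
                "query_settings": {
                    "field_search_configs": [
                        {"field_path": "title", "boosting_score": {"weight": 2.0}},
                    ]
                },
                "functions_boosting_settings": {
                    "score_aggregation_method": "sum",
                    "function_configs": [
                        {
                            "field_path": "source",
                            "function_type": "weighted_value",
                            "function_config": {
                                "weighted_value": {"field_value": "arxiv", "weight": 1.5}
                            },
                        }
                    ],
                },
            },
        },
        {"name": "semantic", "knn_settings": {}},  # assumed: plain knn search
        {
            "name": "hybrid",
            "mixed_retrieval_settings": {
                "mixing_strategy_name": "rrf",
                "mixing_strategy_config": {
                    "rrf": {"k": 60, "max_documents_multiplier": 3}
                },
                "retriever_profile_names": ["keyword_trusted", "semantic"],
            },
        },
    ]
}


def candidates_per_retriever(page_size: int, max_documents_multiplier: int) -> int:
    """Number of documents taken from each base retriever before mixing."""
    return page_size * max_documents_multiplier


def reciprocal_rank_fusion(rankings: list, k: int) -> list:
    """Standard RRF: sum 1 / (k + rank) over all rankings, best score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by document id to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]
knn_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([keyword_hits, knn_hits], k=60))  # ['a', 'c', 'b', 'd']
print(candidates_per_retriever(page_size=10, max_documents_multiplier=3))  # 30
```
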

On a successful response, the pipeline service creates and configures a new index for the tenant. Take note of the id field in the response payload: other endpoints require it as the index_id field.
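
Putting the pieces together, a minimal create-index call could be sketched with the Python standard library as follows. The base URL and tenant name are hypothetical, authentication is omitted because it is not covered in this section, and build_create_index_request is an illustrative helper. Only the required body fields (name, cluster_connection, storage_settings) are set.

```python
import json
import urllib.parse
import urllib.request


def build_create_index_request(base_url: str, tenant: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /v1/indexes endpoint."""
    query = urllib.parse.urlencode({"tenant": tenant})
    return urllib.request.Request(
        url=f"{base_url}/v1/indexes?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


index_body = {
    "name": "Zeta Alpha",
    "description": "Main index for the research navigator app",
    "cluster_connection": {
        "backend": "opensearch",
        "host": "opensearch-cluster-master-headless.opensearch.svc.cluster.local",
        "port": 9200,
    },
    "storage_settings": {
        "ingesting": {"backend": "disk", "disk": {"disk_location": "/data/ingesting"}},
        "processing": {"backend": "disk", "disk": {"disk_location": "/data/processing"}},
    },
}

# Hypothetical base URL and tenant name.
request = build_create_index_request("https://pipeline.example.com", "acme", index_body)
print(request.get_method(), request.full_url)

# Sending the request and reading the index id from the response payload:
# with urllib.request.urlopen(request) as response:
#     index_id = json.load(response)["id"]
```
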
