Concept Reference

Index configuration

A new index is created by sending a POST request to the /v1/indexes endpoint. The request accepts the following configuration:

Query params

  • tenant: The name of the tenant that owns the index.
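
As a minimal sketch of how the tenant query parameter is attached to the endpoint, the snippet below builds the request URL with the Python standard library. The base URL and the helper name build_create_index_url are assumptions made for illustration; only the /v1/indexes path and the tenant parameter come from this page.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace it with the host of your own deployment.
BASE_URL = "https://api.example.com"


def build_create_index_url(tenant: str, base_url: str = BASE_URL) -> str:
    """Return the /v1/indexes URL with the tenant passed as a query parameter."""
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url.rstrip('/')}/v1/indexes?{urlencode({'tenant': tenant})}"


print(build_create_index_url("research team"))
# https://api.example.com/v1/indexes?tenant=research+team
```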

Request body

  • name: (string, required) Human friendly name given to the index. For example "Zeta Alpha".
  • default: (boolean, optional) Whether the index is used at query time if no other index is specified. Only one index can be set as default per tenant.
  • description: (string, optional) A description of the index. For example "Main index for the research navigator app".
  • cluster_connection: (object, required) Specifies which index backend to use and how to access it.
    • backend: (string, required) The type of backend to use. Possible values are "opensearch".
    • host: (string, required) The hostname of the index. For example "opensearch-cluster-master-headless.opensearch.svc.cluster.local".
    • port: (integer, required) The port of the index. For example 9200.
    • settings: (object, optional) Connection settings.
      • use_ssl: (boolean, optional) Whether to use SSL when connecting to the index.
      • http_auth: (array of strings, optional) Credentials to use when connecting to the index. For example ["username", "password"].
      • verify_certs: (boolean, optional) Whether to verify the SSL certificate when connecting to the index.
      • ssl_show_warn: (boolean, optional) Whether to show a warning when connecting to the index with verify_certs disabled.
      • ca_certs: (string, optional) The path to the CA certificate to use when connecting to the index.
      • client_cert: (string, optional) The path to the client certificate to use when connecting to the index.
      • client_key: (string, optional) The path to the client key to use when connecting to the index.
  • storage_settings: (object, required) Specifies how the auxiliary data for this index is stored.
    • ingesting: (object, required) Specifies how large data ingested by the pipeline is stored. Data ingested into the document_content or document_content_path.base64_content fields is stored in the backend specified here.
      • backend: (enum, required) The backend to use for storing the ingested data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) The configuration for the S3 backend.
        • s3_bucket_name: (string, required) The name of the S3 bucket.
        • s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
      • azure: (object, optional) The configuration for the Azure backend.
        • azure_account_url: (string, required) The URL of the Azure account.
        • azure_container_name: (string, required) The name of the Azure container.
        • azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
        • azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
      • disk: (object, optional) The configuration for the disk backend.
        • disk_location: (string, optional) The path to the directory where the data will be stored.
      • max_file_size: (integer, optional) The maximum size, in bytes, of the files to store in the backend. The default value is 1024**3 (1 GiB).
    • processing: (object, required) Specifies how the data used by the pipeline is stored.
      • backend: (enum, required) The backend to use for storing the data. Possible values are "s3", "azure", "disk".
      • s3: (object, optional) The configuration for the S3 backend.
        • s3_bucket_name: (string, required) The name of the S3 bucket.
        • s3_key_prefix: (string, optional) The prefix to use when storing the data in the S3 bucket.
      • azure: (object, optional) The configuration for the Azure backend.
        • azure_account_url: (string, required) The URL of the Azure account.
        • azure_container_name: (string, required) The name of the Azure container.
        • azure_blob_prefix: (string, optional) The prefix to use when storing the data in the Azure container.
        • azure_credential: (string, optional) The storage account shared key (account key or access key) to use when connecting to the Azure account.
      • disk: (object, optional) The configuration for the disk backend.
        • disk_location: (string, optional) The path to the directory where the data will be stored.
      • compression: (object, optional) The configuration for the compression of the data.
        • compression_algorithm: (enum, required) The algorithm to use for compressing the data. Possible values are "bz2", "gzip", "lzma", "snappy", "zlib", "zstd". The recommended value is "zlib".
        • level: (integer, optional) The level of compression to use. This is an integer between 0 and 9, where 0 is no compression and 9 is the maximum compression.
  • features: (object, optional) Configures the available features for this index.
  • document_fields_configuration: (array of objects, optional) Specifies the names of the fields that the tenant wants in the index, as well as how they behave during indexing and retrieval.
    • name: (string, required) The name of the field that will be used to store values in the index, as well as when retrieving and filtering documents. This can be a dotted path, for example authors.first_name.
    • type: (enum, required) The type of field. Possible values are "document_id", "string", "date", "number", "geolocation", "bounding_box", "document_content", "nested". Any of these types can be multi-valued, meaning they accept a list of values. When a field name is defined as a dotted path such as authors.first_name, the values are stored in a flat structure. For example, if we define authors.first_name and authors.last_name and later ingest data like {"authors": [{"first_name": "John", "last_name": "Doe"}, {"first_name": "Jane", "last_name": "Doe"}]}, then the index will contain the following fields: authors.first_name: ["John", "Jane"] and authors.last_name: ["Doe", "Doe"]. This structure minimizes indexing time, storage, and retrieval time. However, because the link between the values of a single author is lost, it is not possible to restrict search results to documents that contain at least one author with first_name=John and last_name=Doe. If this is a requirement, the tenant should define the field with type nested (see nested_fields). The types document_id and document_content are only relevant when used inside a nested field.
    • alias: (string, optional) An alternative field name that can be used to retrieve documents. Note that the alias must be unique.
    • search_options: (object, optional) How the field behaves during indexing and retrieval. Note that it also determines how the field is configured under the hood.
      • is_sort_field: (boolean, optional) Whether the field can be used for sorting documents at retrieval time.
      • is_facet_field: (boolean, optional) Whether the field can be used for faceted search. In other words, the search API is able to return a list of existing values for this field along with the document counts.
      • is_filter_field: (boolean, optional) Whether the field can be used for filtering documents at retrieval time.
      • is_returned_in_search_results: (boolean, optional) Whether the field is part of the search API response payload.
      • is_used_in_search: (boolean, optional) Whether the field is used in full text search.
      • supporting_subqueries: (boolean, optional) Whether the field can be used in subqueries. Subqueries are used to filter nested objects as if they were root-level documents. Search results return the parent document with the nested object filtered.
    • analyzer_options: (object, optional) How the field is analyzed during indexing and retrieval. This applies only to fields of type string that also have search_options.is_used_in_search set to true.
      • analyzer: (string, optional) The name of the analyzer to use. Some default analyzers are provided, such as zav-en-nostem. Otherwise, the name must match a custom analyzer defined in the document_field_analysis section of the index payload. This analyzer is used for both indexing and retrieval. To use separate analyzers for indexing and retrieval, leave analyzer undefined and pass the analyzer names in the index_analyzer and search_analyzer fields instead.
      • search_analyzer: (string, optional) The name of the analyzer to use at retrieval time.
      • index_analyzer: (string, optional) The name of the analyzer to use at indexing time.
    • nested_fields: (array of objects, optional) When type is nested, this contains the list of nested fields. The schema of each object in the array is the same as for the document_fields_configuration field. Note that the name field should not be prefixed with the name of the parent field. For example, if the parent field is authors, then the nested fields should be first_name and last_name, not authors.first_name and authors.last_name.
  • document_field_analysis: (object, optional) Defines custom analyzers that can be used in the document_fields_configuration.
    • filter: (object, optional)
    • char_filter: (object, optional)
    • normalizer: (object, optional)
    • analyzer: (object, optional)
  • client_settings: (object, optional) Defines index-specific rendering information.
    • display_configuration: (object, optional) Defines how the index content is rendered in the frontend. The field names refer to the names provided in the document_fields_configuration.
      • title_field: (string, optional) The field used for rendering the title of the document card.
      • date_field: (string, optional) The field used for rendering the date of the document card.
      • created_by_field: (string, optional) The field used for rendering the authors of the document card.
      • description_field: (string, optional) The field used for rendering the description of the document card.
      • url_field: (string, optional) The field used for rendering the link to the source content.
      • source_field: (string, optional) The field used for rendering the source of the document card.
      • bounding_boxes_field: (string, optional) The field used for rendering the bounding boxes in the PDF viewer.
      • image_url_field: (string, optional) The field used for rendering the image of the document card.
      • document_metadata_fields: (array of objects, optional) Defines how other metadata is rendered in the card. This could be the number of references to this document (for example in scientific documents), a list of documents linked to the current one, etc.
        • type: (enum, required) The type of metadata. Possible values are "github", "twitter", "counter".
        • field_name: (string, optional) The field that contains the data to be rendered (used in the counter type).
        • url_field: (string, optional) If the rendered element is clickable, this specifies the link URL.
        • list_field_name: (string, optional) The field that contains the list of metadata to be rendered (used for the github and twitter types).
        • icon: (enum, optional) The icon associated with the rendered element. Possible values are "github", "twitter", "reference", "citation".
        • label: (string, optional) The label associated with the rendered element. This can be used in the tooltip, for example.
    • search_filters_configuration: (array of objects, optional) When defined, the search filters will be limited to the ones defined in this list of filters.
      • field_name: (string, required) The field name in the document_fields_configuration that this filter applies to.
      • display_name: (string, required) The name that will be displayed as the filter name in the frontend.
      • filter_type: (string, optional) Identifier of the filter type. The frontend uses this string to choose the widget that displays the filter.
      • url_param: (string, optional) The parameter that the frontend will use in the URL for this filter.
      • filter_type_settings: (object, optional) Filter-specific configuration. This could include default values and display names.
        • checkbox: (object, optional) Display configuration for checkboxes.
          • values: (array, required) List of values to be displayed and filtered by in the checkbox.
            • label: (string, required) Display name of the value to filter by.
            • value: (string, required) Value to filter by.
    • search_sorting_configuration: (object, optional) Defines the ordering options for the frontend. The field names refer to the names provided in the document_fields_configuration.
      • field_name: (string, required) The field to sort by.
      • display_name: (string, required) The name that will be rendered in the frontend for this sorting option.
      • url_param: (string, optional) The parameter that the frontend will use in the URL when this sorting option is selected.
      • retrieval_unit: (string, optional) If specified, this option is only shown for the selected retrieval unit.
    • search_relevance_configuration: (array of objects, optional) Defines the default search profile to use, per retrieval unit.
      • retrieval_unit: (string, optional) If specified, this search profile will only be used as default when searching for the retrieval unit.
      • search_profile_name: The search profile to use as default. This name refers to the name field of the profiles defined under the search_profiles_configuration.
  • default_filters_configuration: (object, optional) Defines the default filters that are always applied to the search queries. The field names refer to the names provided in the document_fields_configuration.
    • and_operator: (array of objects, optional) Defines the filters that are applied with the AND operator. Each object in the array has the same schema as the default_filters_configuration object.
    • or_operator: (array of objects, optional) Defines the filters that are applied with the OR operator. Each object in the array has the same schema as the default_filters_configuration object.
    • not_operator: (object, optional) Defines the filters that are applied with the NOT operator. The object has the same schema as the default_filters_configuration object.
    • nested_operator: (object, optional) Defines the filters that are applied with the nested operator. This filter is used with nested fields.
      • field_path: (string, required) The path to the nested field. For example, if the nested field is authors, then the nested filters may refer to authors.first_name.
      • nested_filter: (object, required) The filter to apply to the nested fields. The object has the same schema as the default_filters_configuration object. The field names used inside this filter could be relative to the nested field or not. For example for authors, the field names could be first_name or authors.first_name.
    • exists: (object, optional) Defines the filters that are applied with the exists operator.
      • field_path: (string, required) The path to the field.
    • equals_to: (object, optional) Defines the filters that are applied with the exact match operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • greater_than: (object, optional) Defines the filters that are applied with the greater than operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • greater_than_or_equal_to: (object, optional) Defines the filters that are applied with the greater than or equal to operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • less_than: (object, optional) Defines the filters that are applied with the less than operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • less_than_or_equal_to: (object, optional) Defines the filters that are applied with the less than or equal to operator.
      • field_path: (string, required) The path to the field.
      • field_value: (string, required) The value to compare to.
    • is_in: (object, optional) Defines the filters that are applied with the is in operator.
      • field_path: (string, required) The path to the field.
      • field_values: (array of strings, required) The values to compare to.
    • geo_distance: (object, optional) Defines the filters that are applied with the geo distance operator.
      • field_path: (string, required) The path to the field.
      • point: (object, required) The point to compare to.
        • lat: (float, required) The latitude of the point to compare to.
        • lon: (float, required) The longitude of the point to compare to.
      • distance: (string, required) The distance to compare to.
    • geo_bounding_box: (object, optional) Defines the filters that are applied with the geo bounding box operator.
      • field_path: (string, required) The path to the field.
      • top_left_point: (object, required) The top left point of the bounding box.
        • lat: (float, required) The latitude of the point.
        • lon: (float, required) The longitude of the point.
      • bottom_right_point: (object, required) The bottom right point of the bounding box.
        • lat: (float, required) The latitude of the point.
        • lon: (float, required) The longitude of the point.
  • search_profiles_configuration: (object, optional) Defines the search pipelines that are used when searching.
    • profiles: (array of objects, required) Holds all relevance profiles for this index.
      • name: (string, optional) Name of the search profile. This name is used to select a default profile in the search_relevance_configuration or at query time using the search_profile_name field.
      • query_settings: (object, optional) Defines the configuration for boosting document fields in keyword search.
        • field_search_configs: (array of objects, required) Each element of this array represents a document field to be boosted.
          • field_path: (string, required) The path of the field.
          • must_match_query: (boolean, optional) If true, only documents that contain the search query in the document field are returned. Defaults to false.
          • boosting_score: (object, optional) If the document field contains the query, the document is boosted.
            • weight: (float, required) Multiplier to be used for boosting.
          • constant_score: (object, optional) If defined and the query is included in the document, the document relevance score is set to the weight.
            • weight: (float, required) Relevance score of the document.
      • functions_boosting_settings: (object, optional) Use this configuration to boost a document based on its field value.
        • score_aggregation_method: (enum, optional) The method used to combine the scores given by each of the functions in function_configs. Possible values are "sum" (default), "multiply", "avg", "first", "max", "min".
        • function_configs: (array of objects, required) List of functions to be applied.
          • field_path: (string, required) The path of the field.
          • function_type: (enum, required) The function to apply. Currently the only supported function is "weighted_value", which takes the value of the document field and multiplies it by a weight.
          • function_config: (object, required) Defines the function configuration.
            • weighted_value: (object, required) Configuration for the weighted_value function.
              • field_value: (numeric or string, required) The function will be applied when the document field is equal to this parameter.
              • weight: (float, required) Score multiplier.
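
The difference between dotted field names and nested fields described under document_fields_configuration can be sketched in a few lines of Python. This is only a model of the documented behavior, not the backend implementation: the helpers flatten, flat_match and nested_match are hypothetical, and the second author's last name is changed to "Smith" so that the cross-matching effect becomes visible.

```python
from collections import defaultdict
from typing import Any

# Flat definition: dotted names, values end up in parallel lists.
flat_fields = [
    {"name": "authors.first_name", "type": "string"},
    {"name": "authors.last_name", "type": "string"},
]

# Nested definition: child names are not prefixed with the parent name.
nested_field = {
    "name": "authors",
    "type": "nested",
    "nested_fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}


def flatten(document: dict[str, Any]) -> dict[str, list[Any]]:
    """Collect leaf values under dotted paths (a model, not the backend code)."""
    flat: dict[str, list[Any]] = defaultdict(list)

    def visit(value: Any, path: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                visit(item, path)
        else:
            flat[path].append(value)

    visit(document, "")
    return dict(flat)


def flat_match(flat: dict[str, list[Any]], first: str, last: str) -> bool:
    """Flat storage can only test each list independently."""
    return first in flat["authors.first_name"] and last in flat["authors.last_name"]


def nested_match(document: dict[str, Any], first: str, last: str) -> bool:
    """Nested storage keeps each author together, so both values must co-occur."""
    return any(
        a["first_name"] == first and a["last_name"] == last
        for a in document["authors"]
    )


document = {
    "authors": [
        {"first_name": "John", "last_name": "Doe"},
        {"first_name": "Jane", "last_name": "Smith"},
    ]
}
flat = flatten(document)
print(flat)
# {'authors.first_name': ['John', 'Jane'], 'authors.last_name': ['Doe', 'Smith']}
print(flat_match(flat, "John", "Smith"))        # True: a false positive
print(nested_match(document, "John", "Smith"))  # False: no such author exists
```

The flat layout is cheaper to index and query, which is why it is the default. The nested definition is the one to reach for when filters must hold for a single author.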

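Because default_filters_configuration is recursive, a small example may help. The sketch below composes a few of the operators listed above and walks the tree to collect every field path, which can then be compared against the names in document_fields_configuration. The field names language, status and authors, and the helper referenced_fields, are assumptions made for illustration.

```python
from typing import Any

# Hypothetical default filter: English documents that are neither drafts nor
# archived and that have at least one author whose last name is "Doe".
default_filters_configuration = {
    "and_operator": [
        {"equals_to": {"field_path": "language", "field_value": "en"}},
        {
            "not_operator": {
                "is_in": {"field_path": "status", "field_values": ["draft", "archived"]}
            }
        },
        {
            "nested_operator": {
                "field_path": "authors",
                "nested_filter": {
                    "equals_to": {
                        "field_path": "authors.last_name",
                        "field_value": "Doe",
                    }
                },
            }
        },
    ]
}


def referenced_fields(node: dict[str, Any]) -> set[str]:
    """Return every field_path used anywhere in a filter object."""
    fields: set[str] = set()
    for operator, body in node.items():
        if operator in ("and_operator", "or_operator"):
            # Arrays of filter objects with the same schema.
            for child in body:
                fields |= referenced_fields(child)
        elif operator == "not_operator":
            # A single filter object with the same schema.
            fields |= referenced_fields(body)
        elif operator == "nested_operator":
            fields.add(body["field_path"])
            fields |= referenced_fields(body["nested_filter"])
        else:
            # Leaf operators (equals_to, is_in, exists, ...) carry a field_path.
            fields.add(body["field_path"])
    return fields


print(sorted(referenced_fields(default_filters_configuration)))
# ['authors', 'authors.last_name', 'language', 'status']
```
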
On a successful response, the pipeline service creates and configures a new index for the tenant. Take note of the id field in the response payload. Its value is the index_id that other endpoints require.
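
To tie the pieces together, the following sketch assembles a request body that contains only the fields marked as required and prepares the POST request with the Python standard library. The URL, the tenant name, the disk locations and the sample response are assumptions. Authentication is not covered on this page and is left out, and nothing about the response is assumed beyond the id field.

```python
import json
from typing import Any
from urllib.request import Request

# Only the required fields; the disk locations are hypothetical paths.
payload = {
    "name": "Zeta Alpha",
    "cluster_connection": {
        "backend": "opensearch",
        "host": "opensearch-cluster-master-headless.opensearch.svc.cluster.local",
        "port": 9200,
    },
    "storage_settings": {
        "ingesting": {
            "backend": "disk",
            "disk": {"disk_location": "/data/ingesting"},
        },
        "processing": {
            "backend": "disk",
            "disk": {"disk_location": "/data/processing"},
        },
    },
}


def build_create_index_request(url: str, body: dict[str, Any]) -> Request:
    """Prepare (but do not send) the POST request that creates the index."""
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_index_id(response_body: str) -> Any:
    """Read the id field from the JSON response payload."""
    return json.loads(response_body)["id"]


# Hypothetical host and tenant.
request = build_create_index_request(
    "https://api.example.com/v1/indexes?tenant=my-tenant", payload
)
# Passing `request` to urllib.request.urlopen would perform the call.

# Made-up response body, used only to show where the id is read from.
index_id = extract_index_id('{"id": "hypothetical-index-id"}')
```

Keeping the id returned here is what allows later calls to pass it as index_id.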