Adding New Fields to the Metadata Extractor
This guide walks you through the end-to-end process of adding a new field to an existing metadata extraction pipeline. Every layer of the stack that touches metadata — the index, the agent, the content source, and the pipeline workflow — must be updated consistently for the new field to be extracted, stored, and queryable.
If you haven't set up metadata extraction yet, start with the Extract Metadata using AI Agents guide first.
Overview
Adding a new metadata field requires changes in four places, in the following order:
| # | Component | What to update | Why |
|---|---|---|---|
| 1 | Index configuration | document_fields_configuration | The field must exist in the index schema before documents can be indexed with it. |
| 2 | Agent (bot) configuration | metadata_descriptions in bot_configuration | The LLM agent needs to know what to extract and how. |
| 3 | Content source | field_mappings and fields_to_extract | The pipeline needs to know how to map the extracted field to the index and which fields to look for. |
| 4 | Workflow (if needed) | metadata_extractor task settings | Usually no changes are needed, but verify the processor is configured correctly. |
The field names used in the agent configuration do not need to match the field names in the index. The content source's field_mappings is responsible for translating between the two. For example, the agent can extract a field called document_category while the index stores it as main_applications.
The rest of this guide uses a running example: adding a document_category field to an existing metadata extractor that already extracts title, authors, summary, date, year, and source.
Prerequisites
- A working metadata extraction pipeline (see Extract Metadata using AI Agents)
- Access to the Zeta Alpha Platform UI with admin privileges
Step 1: Add the Field to the Index Configuration
Before the pipeline can store the new field, it must be declared in the index's document_fields_configuration. If the field already exists in the index, you can skip this step.
In our example, we want to extract a document_category field from the document content and store it in the index field main_applications. The index already has this field defined, so we can skip this step.
If your target index field does not exist yet, add it to the document_fields_configuration. Two entries are typically needed: one with the metadata. prefix (the physical storage path) and one without (the alias used in queries and field mappings).
```json
{
  "name": "main_applications",
  "type": "string",
  "search_options": {
    "is_sort_field": false,
    "is_facet_field": true,
    "is_filter_field": true,
    "is_returned_in_search_results": true,
    "is_used_in_search": false
  }
}
```
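When both entries are needed, the pair might look like the following sketch. This is an assumption to verify against your own index schema: the exact `search_options` on the `metadata.`-prefixed entry may differ in your deployment.

```json
[
  {
    "name": "metadata.main_applications",
    "type": "string",
    "search_options": {
      "is_facet_field": true,
      "is_filter_field": true,
      "is_returned_in_search_results": true
    }
  },
  {
    "name": "main_applications",
    "type": "string",
    "search_options": {
      "is_facet_field": true,
      "is_filter_field": true,
      "is_returned_in_search_results": true
    }
  }
]
```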
Search Options Reference
Configure search_options based on how you want the field to behave:
| Option | Description |
|---|---|
| `is_facet_field` | Show aggregated counts in the search UI (useful for category-like fields) |
| `is_filter_field` | Allow filtering search results by this field |
| `is_sort_field` | Allow sorting search results by this field (useful for dates and numbers) |
| `is_returned_in_search_results` | Include the field value in search result responses |
| `is_used_in_search` | Include the field in full-text search matching |
Apply the Change
Update the index configuration through the Platform UI:
- Navigate to your tenant and click View next to the target index
- Edit the index configuration to include the new field entries
- Save the updated configuration
For string fields that should be searchable with full-text search, consider adding an analyzer_options block:
```json
{
  "analyzer_options": {
    "analyzer": "zav-en-nostem"
  }
}
```
Step 2: Add the Field to the Agent Configuration
The metadata extraction agent needs to know about the new field so it can instruct the LLM to extract it from documents. Update the agent's bot_configuration in the tenant's chat_bot_setups.
Add a Metadata Description Item
Add a new entry to the metadata_descriptions.items array in the agent's bot_configuration:
```json
{
  "field_name": "document_category",
  "field_type": "NotRequired[str]",
  "field_description": "The main application area or category that this document belongs to (e.g., 'Coatings', 'Adhesives', 'Water Treatment'). If the document covers multiple categories, return the primary one. If you cannot confidently determine the category, don't return this field."
}
```
The field_name you choose here is the key the agent will use in its JSON output. It does not need to match the index field name — the content source's field_mappings (Step 3) handles that translation. Choose a name that is clear and descriptive for the LLM.
Full Agent Configuration Example
After adding the document_category field, the full agent configuration looks like this:
```json
{
  "bot_identifier": "generic_metadata_extractor",
  "llm_configuration_name": "gpt-4o-mini",
  "llm_tracing_configuration_name": "langfuse",
  "agent_name": "affiliations_extractor",
  "bot_configuration": {
    "agent_config": {
      "max_document_chars": 5000,
      "max_response_tokens": 4096
    },
    "metadata_descriptions": {
      "items": [
        {
          "field_name": "title",
          "field_type": "str",
          "field_description": "The title of the given document. If you cannot confidently determine the title, don't return this field."
        },
        {
          "field_name": "authors",
          "field_type": "NotRequired[List[str]]",
          "field_description": "A list of authors that wrote this document. If you cannot confidently determine the authors, don't return this field."
        },
        {
          "field_name": "summary",
          "field_type": "str",
          "field_description": "Sometimes the summary is explicitly marked in the text, then extract it exactly without any paraphrasing. Otherwise produce a summary that is about 10 sentences long. If you cannot confidently determine the summary, don't return this field."
        },
        {
          "field_name": "date",
          "field_type": "NotRequired[date]",
          "field_description": "The publication date of the document. Be careful that dates in documents may be in US format or International format. If the date is ambiguous, always look at other dates in the document to verify the format used. When writing this field, convert the date into the format: YYYY-MM-DD. If you cannot confidently determine the date, don't return this field."
        },
        {
          "field_name": "year",
          "field_type": "NotRequired[int]",
          "field_description": "The year of publication of the document. Years cannot be in the future."
        },
        {
          "field_name": "source",
          "field_type": "NotRequired[str]",
          "field_description": "The source that the document belongs to."
        },
        {
          "field_name": "document_category",
          "field_type": "NotRequired[str]",
          "field_description": "The main application area or category that this document belongs to (e.g., 'Coatings', 'Adhesives', 'Water Treatment'). If the document covers multiple categories, return the primary one. If you cannot confidently determine the category, don't return this field."
        }
      ]
    }
  },
  "serving_url": "http://chat-service.production.svc.cluster.local:8080"
}
```
Writing Good Field Descriptions
The field_description is the primary instruction the LLM receives for extraction. Follow these guidelines:
- Be specific about the expected format. For dates, specify `YYYY-MM-DD`. For lists, clarify whether you want a flat list of strings or a list of objects.
- Include fallback behavior. Tell the agent what to do when it can't confidently extract the field — typically "don't return this field."
- Add disambiguation rules when the field is commonly ambiguous (e.g., date format, language vs. programming language).
- Use `NotRequired[...]` as the type for fields that may not be present in every document. This tells the agent it's acceptable to omit the field.
Agent Config Parameters
| Parameter | Description |
|---|---|
| `max_document_chars` | Maximum number of characters from the document sent to the LLM. Increase this if extraction quality is poor for longer documents, but be mindful of token costs. |
| `max_response_tokens` | Maximum tokens the LLM can return. Increase if extracted JSON is being truncated. |
| `use_reflection_prompt` | When true, the agent makes a second LLM call to self-check the output. Improves accuracy at the cost of latency and token usage. |
Step 3: Update the Content Source Configuration
The metadata enhancement content source defines the mapping between extracted fields (from the agent) and index fields (in the search engine). Two sections need updating: fields_to_extract and field_mappings.
Add the Field to fields_to_extract
The fields_to_extract array tells the pipeline processor which fields to look for in the document metadata. Add an entry for the new field:
```json
{
  "field_name": "document_category",
  "field_type": "string"
}
```
The field_name values in fields_to_extract must match the field_name values defined in the agent's metadata_descriptions. These are the names the agent uses in its JSON output, and the pipeline processor uses them to identify which fields were extracted. If they don't match, the processor won't recognize the extracted field.
The field_type here uses simple type names (string, integer, date) rather than Python type hints. This is different from the field_type in the agent's metadata_descriptions, which uses Python notation like NotRequired[str].
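Because the same name must appear in three places, a quick local consistency check can catch typos before you deploy. The snippet below is an illustrative helper, not a platform tool; `check_field_names` is an invented name, and it assumes you have the two JSON configurations loaded as Python dicts:

```python
# Hypothetical local check: cross-reference field names between the agent's
# bot_configuration and the content source's connector configuration.
def check_field_names(bot_configuration, connector_config):
    agent = {i["field_name"]
             for i in bot_configuration["metadata_descriptions"]["items"]}
    extract = {f["field_name"] for f in connector_config["fields_to_extract"]}
    mapped = {m["content_source_field_name"]
              for m in connector_config["field_mappings"]}
    return {
        "not_extracted": agent - extract,   # described to the agent, never read
        "not_described": extract - agent,   # expected by the pipeline, never produced
        "not_mapped": extract - mapped,     # extracted but stored nowhere
    }

bot = {"metadata_descriptions": {"items": [{"field_name": "document_category"}]}}
src = {"fields_to_extract": [{"field_name": "document_category"}],
       "field_mappings": [{"content_source_field_name": "document_category",
                           "index_field_name": "main_applications"}]}
print(check_field_names(bot, src))
# all three sets are empty when the configurations agree
```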
Add the Field Mapping
The field_mappings array maps the content_source_field_name (the name the agent returns) to the index_field_name (where it gets stored in the index). Add a mapping for the new field:
```json
{
  "content_source_field_name": "document_category",
  "index_field_name": "main_applications"
}
```
Note how the agent's field name (document_category) differs from the index field name (main_applications). The field mapping is what connects the two — you can freely choose descriptive names for the agent that make sense for LLM extraction, independent of your index schema.
Understanding Field Mappings
Field mappings connect three naming conventions:
```text
Agent output field name  →  content_source_field_name  →  index_field_name
"document_category"         "document_category"            "main_applications"
```
- The agent output field name is the JSON key returned by the LLM (set by `field_name` in `metadata_descriptions`).
- The `content_source_field_name` must match the agent output field name exactly.
- The `index_field_name` is the field path in the search index. This can be any valid field name in your index.
Nested Field Mappings
For fields with complex types like authors, use inner_field_mappings to map nested structures:
```json
{
  "content_source_field_name": "authors",
  "index_field_name": "DCMI.creator",
  "inner_field_mappings": [
    {
      "content_source_field_name": "name",
      "index_field_name": "full_name"
    }
  ]
}
```
This maps the agent's output {"authors": [{"name": "John Doe"}]} to the index structure {"DCMI.creator": [{"full_name": "John Doe"}]}.
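The nested translation can be sketched in Python as follows. This is an illustrative recursion under the stated mapping structure; `map_field` is an invented name, not a platform function:

```python
def map_field(value, mapping):
    """Apply one field mapping, recursing into inner_field_mappings
    for list-of-object values."""
    inner = mapping.get("inner_field_mappings")
    if not inner:
        return value  # scalar or no nested mapping: pass through unchanged
    by_name = {m["content_source_field_name"]: m for m in inner}
    return [
        {by_name[k]["index_field_name"]: map_field(v, by_name[k])
         for k, v in item.items() if k in by_name}
        for item in value
    ]

authors_mapping = {
    "content_source_field_name": "authors",
    "index_field_name": "DCMI.creator",
    "inner_field_mappings": [
        {"content_source_field_name": "name", "index_field_name": "full_name"}
    ],
}
print(map_field([{"name": "John Doe"}], authors_mapping))
# [{'full_name': 'John Doe'}]
```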
Full Content Source Example
After adding the document_category field, the content source configuration looks like this:
```json
{
  "name": "user_documents_metadata_enhancement",
  "is_indexable": true,
  "connector": "metadata_extractor_enhancement",
  "workflow_name_overrides": {
    "ingest": "enhance-metadata",
    "reingest": "enhance-metadata"
  },
  "connector_configuration": {
    "metadata_extractor_enhancement": {
      "enhancement_id": "metadata",
      "extractor_backend": "agent",
      "extractor_backend_configuration": {
        "agent": {
          "agent_identifier": "generic_metadata_extractor"
        }
      },
      "fields_to_extract": [
        { "field_name": "authors", "field_type": "string" },
        { "field_name": "title", "field_type": "string" },
        { "field_name": "summary", "field_type": "string" },
        { "field_name": "date", "field_type": "string" },
        { "field_name": "year", "field_type": "integer" },
        { "field_name": "source", "field_type": "string" },
        { "field_name": "document_category", "field_type": "string" }
      ],
      "field_mappings": [
        { "content_source_field_name": "title", "index_field_name": "DCMI.title" },
        { "content_source_field_name": "summary", "index_field_name": "DCMI.abstract" },
        { "content_source_field_name": "source", "index_field_name": "DCMI.source" },
        { "content_source_field_name": "date", "index_field_name": "DCMI.created" },
        { "content_source_field_name": "year", "index_field_name": "DCMI.date" },
        {
          "content_source_field_name": "authors",
          "index_field_name": "DCMI.creator",
          "inner_field_mappings": [
            { "content_source_field_name": "name", "index_field_name": "full_name" }
          ]
        },
        { "content_source_field_name": "document_category", "index_field_name": "main_applications" }
      ]
    }
  }
}
```
Apply the Change
Update the content source via the Platform UI:
- Navigate to your tenant → target index → Content Sources
- Find the metadata enhancement content source and click Edit
- Update the JSON configuration with the new field in both `fields_to_extract` and `field_mappings`
- Save the changes
Step 4: Verify the Workflow Configuration
In most cases, the existing workflow does not need changes when adding a new field. However, verify these settings on the metadata_extractor task:
```json
{
  "name": "metadata_extractor_daily",
  "local_settings": {
    "default_ai_extraction": true,
    "always_extract_metadata": true,
    "preserved_fields": ["source", "title"]
  }
}
```
Key Settings to Check
| Setting | Impact on new fields |
|---|---|
| `always_extract_metadata` | When true, extracts all fields on every processing run — including the new field on already-processed documents. When false, only extracts fields that are missing. |
| `preserved_fields` | Lists fields that should not be overwritten by AI extraction if they already have a value. Uses content source field names (i.e., the same names used in the agent's `metadata_descriptions` and `content_source_field_name` in field mappings), not index field names. For example, use `"source"` not `"DCMI.source"`. If your new field should be protected, add its content source field name here. |
| `default_ai_extraction` | When true, enables AI-based extraction. Must be true for agent-backed extraction. |
If preserved_fields contains "*", all mapped fields are preserved when they already have values. This means the new field will only be extracted for documents where it doesn't already exist — which is usually the desired behavior for reprocessing.
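The interaction between always_extract_metadata and preserved_fields can be summarized with a small decision sketch. This is illustrative Python under the semantics described above; `should_extract` is an invented name, and the platform's exact behavior may differ in edge cases:

```python
def should_extract(field, existing_metadata, preserved_fields, always_extract):
    """Sketch: does the extractor (re)write this field on a processing run?"""
    has_value = existing_metadata.get(field) not in (None, "", [])
    preserved = "*" in preserved_fields or field in preserved_fields
    if preserved and has_value:
        return False  # preserved fields with a value are never overwritten
    # Otherwise: always re-extract, or fill in only when the field is missing.
    return always_extract or not has_value

existing = {"source": "Internal wiki"}
# New field: extracted even though "source" is preserved.
print(should_extract("document_category", existing, ["source"], True))  # True
# Preserved field that already has a value: left alone.
print(should_extract("source", existing, ["source"], True))             # False
```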
Step 5: Extract the New Field from Existing Documents
New documents will automatically have the new field extracted during ingestion. For documents already in the index, you need to trigger reprocessing.
Create a Reprocessing Workflow
If you don't already have one, create a reprocessing workflow that runs the metadata extractor and re-indexes the results:
```json
{
  "name": "reprocess-metadata",
  "steps": [
    { "service": "start", "next_services": ["metadata_extractor"] },
    { "service": "metadata_extractor", "next_services": ["index_updater"] },
    { "service": "index_updater", "next_services": ["pipeline_source"] }
  ],
  "tasks": [
    {
      "name": "metadata_extractor",
      "local_settings": {
        "default_ai_extraction": true,
        "always_extract_metadata": true
      }
    },
    {
      "name": "index_updater",
      "processor_settings": { "always_run": true, "skip_deleted": false }
    },
    {
      "name": "pipeline_source",
      "processor_settings": { "always_run": true, "skip_deleted": false }
    }
  ]
}
```
Trigger Reprocessing
- Navigate to your tenant → target index → Content Sources
- Select the content source whose documents you want to reprocess
- Click View under Ingested Documents
- Click Create Reingestion
- Enter
reprocess-metadataas the Workflow Name Override - Click Submit
If you only want to extract the new field without re-extracting existing fields, set always_extract_metadata to false in the reprocessing workflow. The extractor will then only fill in fields that are currently missing — including your new field.
Complete Checklist
Use this checklist when adding a new metadata field:
- Index: Field exists in `document_fields_configuration` (if it's a new index field)
- Agent: Field description added to `metadata_descriptions.items` in the tenant's `chat_bot_setups`
- Content source: Field added to `fields_to_extract` (matching the agent's `field_name`)
- Content source: Field mapping added to `field_mappings` (mapping agent field name → index field name)
- Workflow: `preserved_fields` updated if the new field should be protected (uses content source field names)
- Reprocessing: Existing documents reprocessed to extract the new field
Troubleshooting
Field is not being extracted
- Verify the `field_name` in the agent's `metadata_descriptions` matches exactly with the `content_source_field_name` in `field_mappings` and the `field_name` in `fields_to_extract`. All three must use the same name.
- Check that `default_ai_extraction` is set to `true` in the workflow task.
- Check that the field is not listed in `preserved_fields` (which would skip extraction if the field already has a value). Remember, `preserved_fields` uses content source field names, not index field names.
Field is extracted but not appearing in search results
- Confirm the field exists in the index's `document_fields_configuration`.
- Verify that the `index_field_name` in the field mapping matches the field name in the index configuration (without the `metadata.` prefix).
- Check that `is_returned_in_search_results` is set to `true` in the field's `search_options`.
Field is extracted but has wrong values
- Review the `field_description` in the agent configuration — it may need more specific instructions.
- Consider increasing `max_document_chars` if the relevant content appears late in the document.
- Enable `use_reflection_prompt` to have the agent self-check its output.
Extraction is not triggered for existing documents
- Ensure you triggered reprocessing with `always_extract_metadata: true` to force re-extraction of all fields.
- If using `always_extract_metadata: false`, the extractor only extracts fields that are missing — if the document already went through extraction before, fields that were previously extracted (even as empty) won't be retried.
See Also
- Extract Metadata using AI Agents — Initial setup guide
- Content Source Custom Metadata — Static metadata attached to all documents in a content source
- Agent Processor Enhancement — Custom AI processing beyond metadata extraction