Adding New Fields to the Metadata Extractor
This guide walks you through the end-to-end process of adding a new field to an existing metadata extraction pipeline. Every layer of the stack that touches metadata — the index, the agent, the content source, and the pipeline workflow — must be updated consistently for the new field to be extracted, stored, and queryable.
If you haven't set up metadata extraction yet, start with the Extract Metadata using AI Agents guide first.
Overview
Adding a new metadata field requires changes in four places, in the following order:
| # | Component | What to update | Why |
|---|---|---|---|
| 1 | Index configuration | document_fields_configuration | The field must exist in the index schema before documents can be indexed with it. |
| 2 | Agent (bot) configuration | metadata_descriptions in bot_configuration | The LLM agent needs to know what to extract and how. |
| 3 | Content source | field_mappings and fields_to_extract | The pipeline needs to know how to map the extracted field to the index and which fields to look for. |
| 4 | Workflow (if needed) | metadata_extractor task settings | Usually no changes are needed, but verify the processor is configured correctly. |
The field names used in the agent configuration do not need to match the field names in the index. The content source's field_mappings is responsible for translating between the two. For example, the agent can extract a field called document_category while the index stores it as main_applications.
The rest of this guide uses a running example: adding a document_category field to an existing metadata extractor that already extracts title, authors, summary, date, year, and source.
Prerequisites
- A working metadata extraction pipeline (see Extract Metadata using AI Agents)
- Access to the Zeta Alpha Platform UI with admin privileges
Step 1: Add the Field to the Index Configuration
Before the pipeline can store the new field, it must be declared in the index's document_fields_configuration. If the field already exists in the index, you can skip this step.
In our example, we want to extract a document_category field from the document content and store it in the index field main_applications. The index already has this field defined, so we can skip this step.
If your target index field does not exist yet, add it to the document_fields_configuration. Two entries are typically needed: one with the metadata. prefix (the physical storage path) and one without (the alias used in queries and field mappings).
```json
{
  "name": "main_applications",
  "type": "string",
  "search_options": {
    "is_sort_field": false,
    "is_facet_field": true,
    "is_filter_field": true,
    "is_returned_in_search_results": true,
    "is_used_in_search": false
  }
}
```
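When both entries are needed, the pair might look like the following sketch. This is an assumption to verify against your own index schema: the exact `search_options` on the `metadata.`-prefixed entry may differ in your deployment.

```json
[
  {
    "name": "metadata.main_applications",
    "type": "string",
    "search_options": {
      "is_facet_field": true,
      "is_filter_field": true,
      "is_returned_in_search_results": true
    }
  },
  {
    "name": "main_applications",
    "type": "string",
    "search_options": {
      "is_facet_field": true,
      "is_filter_field": true,
      "is_returned_in_search_results": true
    }
  }
]
```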
Search Options Reference
Configure search_options based on how you want the field to behave:
| Option | Description |
|---|---|
| `is_facet_field` | Show aggregated counts in the search UI (useful for category-like fields) |
| `is_filter_field` | Allow filtering search results by this field |
| `is_sort_field` | Allow sorting search results by this field (useful for dates and numbers) |
| `is_returned_in_search_results` | Include the field value in search result responses |
| `is_used_in_search` | Include the field in full-text search matching |
Apply the Change
Update the index configuration through the Platform UI:
- Navigate to your tenant and click View next to the target index
- Edit the index configuration to include the new field entries
- Save the updated configuration
For string fields that should be searchable with full-text search, consider adding an analyzer_options block:
```json
{
  "analyzer_options": {
    "analyzer": "zav-en-nostem"
  }
}
```
Step 2: Add the Field to the Agent Configuration
The metadata extraction agent needs to know about the new field so it can instruct the LLM to extract it from documents. Update the agent's bot_configuration in the tenant's chat_bot_setups.
Add a Metadata Description Item
Add a new entry to the metadata_descriptions.items array in the agent's bot_configuration:
```json
{
  "field_name": "document_category",
  "field_type": "NotRequired[str]",
  "field_description": "The main application area or category that this document belongs to (e.g., 'Coatings', 'Adhesives', 'Water Treatment'). If the document covers multiple categories, return the primary one. If you cannot confidently determine the category, don't return this field."
}
```
The field_name you choose here is the key the agent will use in its JSON output. It does not need to match the index field name — the content source's field_mappings (Step 3) handles that translation. Choose a name that is clear and descriptive for the LLM.
Full Agent Configuration Example
After adding the document_category field, the full agent configuration looks like this:
```json
{
  "bot_identifier": "generic_metadata_extractor",
  "llm_configuration_name": "gpt-4o-mini",
  "llm_tracing_configuration_name": "langfuse",
  "agent_name": "affiliations_extractor",
  "bot_configuration": {
    "agent_config": {
      "max_document_chars": 5000,
      "max_response_tokens": 4096
    },
    "metadata_descriptions": {
      "items": [
        {
          "field_name": "title",
          "field_type": "str",
          "field_description": "The title of the given document. If you cannot confidently determine the title, don't return this field."
        },
        {
          "field_name": "authors",
          "field_type": "NotRequired[List[str]]",
          "field_description": "A list of authors that wrote this document. If you cannot confidently determine the authors, don't return this field."
        },
        {
          "field_name": "summary",
          "field_type": "str",
          "field_description": "Sometimes the summary is explicitly marked in the text, then extract it exactly without any paraphrasing. Otherwise produce a summary that is about 10 sentences long. If you cannot confidently determine the summary, don't return this field."
        },
        {
          "field_name": "date",
          "field_type": "NotRequired[date]",
          "field_description": "The publication date of the document. Be careful that dates in documents may be in US format or International format. If the date is ambiguous, always look at other dates in the document to verify the format used. When writing this field, convert the date into the format: YYYY-MM-DD. If you cannot confidently determine the date, don't return this field."
        },
        {
          "field_name": "year",
          "field_type": "NotRequired[int]",
          "field_description": "The year of publication of the document. Years cannot be in the future."
        },
        {
          "field_name": "source",
          "field_type": "NotRequired[str]",
          "field_description": "The source that the document belongs to."
        },
        {
          "field_name": "document_category",
          "field_type": "NotRequired[str]",
          "field_description": "The main application area or category that this document belongs to (e.g., 'Coatings', 'Adhesives', 'Water Treatment'). If the document covers multiple categories, return the primary one. If you cannot confidently determine the category, don't return this field."
        }
      ]
    }
  },
  "serving_url": "http://chat-service.production.svc.cluster.local:8080"
}
```
Writing Good Field Descriptions
The field_description is the primary instruction the LLM receives for extraction. Follow these guidelines:
- Be specific about the expected format. For dates, specify `YYYY-MM-DD`. For lists, clarify whether you want a flat list of strings or a list of objects.
- Include fallback behavior. Tell the agent what to do when it can't confidently extract the field — typically "don't return this field."
- Add disambiguation rules when the field is commonly ambiguous (e.g., date format, language vs. programming language).
- Use `NotRequired[...]` as the type for fields that may not be present in every document. This tells the agent it's acceptable to omit the field.
Agent Config Parameters
| Parameter | Description |
|---|---|
| `max_document_chars` | Maximum number of characters from the document sent to the LLM. Increase this if extraction quality is poor for longer documents, but be mindful of token costs. |
| `max_response_tokens` | Maximum tokens the LLM can return. Increase if extracted JSON is being truncated. |
| `use_reflection_prompt` | When true, the agent makes a second LLM call to self-check the output. Improves accuracy at the cost of latency and token usage. |
Step 3: Update the Content Source Configuration
The metadata enhancement content source defines the mapping between extracted fields (from the agent) and index fields (in the search engine). Two sections need updating: fields_to_extract and field_mappings.
Add the Field to fields_to_extract
The fields_to_extract array tells the pipeline processor which fields to look for in the document metadata. Add an entry for the new field:
```json
{
  "field_name": "document_category",
  "field_type": "string"
}
```
The field_name values in fields_to_extract must match the field_name values defined in the agent's metadata_descriptions. These are the names the agent uses in its JSON output, and the pipeline processor uses them to identify which fields were extracted. If they don't match, the processor won't recognize the extracted field.
The field_type here uses simple type names (string, integer, date) rather than Python type hints. This is different from the field_type in the agent's metadata_descriptions, which uses Python notation like NotRequired[str].
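Because the same name must appear in three places, a quick local consistency check can catch typos before you deploy. The snippet below is an illustrative helper, not a platform tool; `check_field_names` is an invented name, and it assumes you have the two JSON configurations loaded as Python dicts:

```python
# Hypothetical local check: cross-reference field names between the agent's
# bot_configuration and the content source's connector configuration.
def check_field_names(bot_configuration, connector_config):
    agent = {i["field_name"]
             for i in bot_configuration["metadata_descriptions"]["items"]}
    extract = {f["field_name"] for f in connector_config["fields_to_extract"]}
    mapped = {m["content_source_field_name"]
              for m in connector_config["field_mappings"]}
    return {
        "not_extracted": agent - extract,   # described to the agent, never read
        "not_described": extract - agent,   # expected by the pipeline, never produced
        "not_mapped": extract - mapped,     # extracted but stored nowhere
    }

bot = {"metadata_descriptions": {"items": [{"field_name": "document_category"}]}}
src = {"fields_to_extract": [{"field_name": "document_category"}],
       "field_mappings": [{"content_source_field_name": "document_category",
                           "index_field_name": "main_applications"}]}
print(check_field_names(bot, src))
# all three sets are empty when the configurations agree
```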
Add the Field Mapping
The field_mappings array maps the content_source_field_name (the name the agent returns) to the index_field_name (where it gets stored in the index). Add a mapping for the new field:
```json
{
  "content_source_field_name": "document_category",
  "index_field_name": "main_applications"
}
```
Note how the agent's field name (document_category) differs from the index field name (main_applications). The field mapping is what connects the two — you can freely choose descriptive names for the agent that make sense for LLM extraction, independent of your index schema.
Understanding Field Mappings
Field mappings connect three naming conventions:
```text
Agent output field name  →  content_source_field_name  →  index_field_name
"document_category"         "document_category"            "main_applications"
```
- The agent output field name is the JSON key returned by the LLM (set by `field_name` in `metadata_descriptions`).
- The `content_source_field_name` must match the agent output field name exactly.
- The `index_field_name` is the field path in the search index. This can be any valid field name in your index.
Nested Field Mappings
For fields with complex types like authors, use inner_field_mappings to map nested structures:
```json
{
  "content_source_field_name": "authors",
  "index_field_name": "DCMI.creator",
  "inner_field_mappings": [
    {
      "content_source_field_name": "name",
      "index_field_name": "full_name"
    }
  ]
}
```
This maps the agent's output {"authors": [{"name": "John Doe"}]} to the index structure {"DCMI.creator": [{"full_name": "John Doe"}]}.
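The nested translation can be sketched in Python as follows. This is an illustrative recursion under the stated mapping structure; `map_field` is an invented name, not a platform function:

```python
def map_field(value, mapping):
    """Apply one field mapping, recursing into inner_field_mappings
    for list-of-object values."""
    inner = mapping.get("inner_field_mappings")
    if not inner:
        return value  # scalar or no nested mapping: pass through unchanged
    by_name = {m["content_source_field_name"]: m for m in inner}
    return [
        {by_name[k]["index_field_name"]: map_field(v, by_name[k])
         for k, v in item.items() if k in by_name}
        for item in value
    ]

authors_mapping = {
    "content_source_field_name": "authors",
    "index_field_name": "DCMI.creator",
    "inner_field_mappings": [
        {"content_source_field_name": "name", "index_field_name": "full_name"}
    ],
}
print(map_field([{"name": "John Doe"}], authors_mapping))
# [{'full_name': 'John Doe'}]
```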
Full Content Source Example
After adding the document_category field, the content source configuration looks like this:
```json
{
  "name": "user_documents_metadata_enhancement",
  "is_indexable": true,
  "connector": "metadata_extractor_enhancement",
  "workflow_name_overrides": {
    "ingest": "enhance-metadata",
    "reingest": "enhance-metadata"
  },
  "connector_configuration": {
    "metadata_extractor_enhancement": {
      "enhancement_id": "metadata",
      "extractor_backend": "agent",
      "extractor_backend_configuration": {
        "agent": {
          "agent_identifier": "generic_metadata_extractor"
        }
      },
      "fields_to_extract": [
        { "field_name": "authors", "field_type": "string" },
        { "field_name": "title", "field_type": "string" },
        { "field_name": "summary", "field_type": "string" },
        { "field_name": "date", "field_type": "string" },
        { "field_name": "year", "field_type": "integer" },
        { "field_name": "source", "field_type": "string" },
        { "field_name": "document_category", "field_type": "string" }
      ],
      "field_mappings": [
        { "content_source_field_name": "title", "index_field_name": "DCMI.title" },
        { "content_source_field_name": "summary", "index_field_name": "DCMI.abstract" },
        { "content_source_field_name": "source", "index_field_name": "DCMI.source" },
        { "content_source_field_name": "date", "index_field_name": "DCMI.created" },
        { "content_source_field_name": "year", "index_field_name": "DCMI.date" },
        {
          "content_source_field_name": "authors",
          "index_field_name": "DCMI.creator",
          "inner_field_mappings": [
            { "content_source_field_name": "name", "index_field_name": "full_name" }
          ]
        },
        { "content_source_field_name": "document_category", "index_field_name": "main_applications" }
      ]
    }
  }
}
```
Apply the Change
Update the content source via the Platform UI:
- Navigate to your tenant → target index → Content Sources
- Find the metadata enhancement content source and click Edit
- Update the JSON configuration with the new field in both `fields_to_extract` and `field_mappings`
- Save the changes
Step 4: Verify the Workflow Configuration
In most cases, the existing workflow does not need changes when adding a new field. However, verify these settings on the metadata_extractor task:
```json
{
  "name": "metadata_extractor_daily",
  "local_settings": {
    "default_ai_extraction": true,
    "always_extract_metadata": true,
    "preserved_fields": ["source", "title"]
  }
}
```
Key Settings to Check
| Setting | Impact on new fields |
|---|---|
| `always_extract_metadata` | When true, extracts all fields on every processing run — including the new field on already-processed documents. When false, only extracts fields that are missing. |
| `preserved_fields` | Lists fields that should not be overwritten by AI extraction if they already have a value. Uses content source field names (i.e., the same names used in the agent's `metadata_descriptions` and `content_source_field_name` in field mappings), not index field names. For example, use `"source"` not `"DCMI.source"`. If your new field should be protected, add its content source field name here. |
| `default_ai_extraction` | When true, enables AI-based extraction. Must be true for agent-backed extraction. |
If preserved_fields contains "*", all mapped fields are preserved when they already have values. This means the new field will only be extracted for documents where it doesn't already exist — which is usually the desired behavior for reprocessing.
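The interaction between always_extract_metadata and preserved_fields can be summarized with a small decision sketch. This is illustrative Python under the semantics described above; `should_extract` is an invented name, and the platform's exact behavior may differ in edge cases:

```python
def should_extract(field, existing_metadata, preserved_fields, always_extract):
    """Sketch: does the extractor (re)write this field on a processing run?"""
    has_value = existing_metadata.get(field) not in (None, "", [])
    preserved = "*" in preserved_fields or field in preserved_fields
    if preserved and has_value:
        return False  # preserved fields with a value are never overwritten
    # Otherwise: always re-extract, or fill in only when the field is missing.
    return always_extract or not has_value

existing = {"source": "Internal wiki"}
# New field: extracted even though "source" is preserved.
print(should_extract("document_category", existing, ["source"], True))  # True
# Preserved field that already has a value: left alone.
print(should_extract("source", existing, ["source"], True))             # False
```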
Step 5: Extract the New Field from Existing Documents
New documents will automatically have the new field extracted during ingestion. For documents already in the index, you need to trigger reprocessing.
Create a Reprocessing Workflow
If you don't already have one, create a reprocessing workflow that runs the metadata extractor and re-indexes the results:
```json
{
  "name": "reprocess-metadata",
  "steps": [
    { "service": "start", "next_services": ["metadata_extractor"] },
    { "service": "metadata_extractor", "next_services": ["index_updater"] },
    { "service": "index_updater", "next_services": ["pipeline_source"] }
  ],
  "tasks": [
    {
      "name": "metadata_extractor",
      "local_settings": {
        "default_ai_extraction": true,
        "always_extract_metadata": true
      }
    },
    {
      "name": "index_updater",
      "processor_settings": { "always_run": true, "skip_deleted": false }
    },
    {
      "name": "pipeline_source",
      "processor_settings": { "always_run": true, "skip_deleted": false }
    }
  ]
}
```
Trigger Reprocessing
- Navigate to your tenant → target index → Content Sources
- Select the content source whose documents you want to reprocess
- Click View under Ingested Documents
- Click Create Reingestion
- Enter
reprocess-metadataas the Workflow Name Override - Click Submit
If you only want to extract the new field without re-extracting existing fields, set always_extract_metadata to false in the reprocessing workflow. The extractor will then only fill in fields that are currently missing — including your new field.
Complete Checklist
Use this checklist when adding a new metadata field:
- Index: Field exists in `document_fields_configuration` (if it's a new index field)
- Agent: Field description added to `metadata_descriptions.items` in the tenant's `chat_bot_setups`
- Content source: Field added to `fields_to_extract` (matching the agent's `field_name`)
- Content source: Field mapping added to `field_mappings` (mapping agent field name → index field name)
- Workflow: `preserved_fields` updated if the new field should be protected (uses content source field names)
- Reprocessing: Existing documents reprocessed to extract the new field
Troubleshooting
Field is not being extracted
- Verify the `field_name` in the agent's `metadata_descriptions` matches exactly with the `content_source_field_name` in `field_mappings` and the `field_name` in `fields_to_extract`. All three must use the same name.
- Check that `default_ai_extraction` is set to `true` in the workflow task.
- Check that the field is not listed in `preserved_fields` (which would skip extraction if the field already has a value). Remember, `preserved_fields` uses content source field names, not index field names.
Field is extracted but not appearing in search results
- Confirm the field exists in the index's `document_fields_configuration`.
- Verify that the `index_field_name` in the field mapping matches the field name in the index configuration (without the `metadata.` prefix).
- Check that `is_returned_in_search_results` is set to `true` in the field's `search_options`.
Field is extracted but has wrong values
- Review the `field_description` in the agent configuration — it may need more specific instructions.
- Consider increasing `max_document_chars` if the relevant content appears late in the document.
- Enable `use_reflection_prompt` to have the agent self-check its output.
Extraction is not triggered for existing documents
- Ensure you triggered reprocessing with `always_extract_metadata: true` to force re-extraction of all fields.
- If using `always_extract_metadata: false`, the extractor only extracts fields that are missing — if the document already went through extraction before, fields that were previously extracted (even as empty) won't be retried.
See Also
- Extract Metadata using AI Agents — Initial setup guide
- Content Source Custom Metadata — Static metadata attached to all documents in a content source
- Agent Processor Enhancement — Custom AI processing beyond metadata extraction