Extract Metadata using AI Agents

You can extract metadata from document content using AI agents. This extracted information is considered an enhancement since the metadata is not part of the original document content.

This guide shows you how to configure a metadata extractor for your data ingestion workflows.

Note: The extracted information may override the metadata of the document in the index. The original metadata will still be visible in the "Ingested Documents" section, and any extracted data will be attached under the enhancements section of each document.

Prerequisites

Before you begin, ensure you have:

Access to the Zeta Alpha Platform UI
A tenant created
An index created

Step 1: Define the Metadata Items to Extract

First, create a configuration that describes the metadata items to extract and their types. Each item is a configuration object containing the following fields:

field_name: (string) The name of the field to extract.
field_type: (string) The type of the field to extract (use Python type hints notation).
field_description: (string) A description of the field and instructions on how to extract it. This field is passed to the AI agent to guide extraction.

Example Metadata Items

[
  {
    "field_name": "title",
    "field_type": "NotRequired[str]",
    "field_description": "The title of the given document. If you cannot confidently determine the title, don't return this field."
  },
  {
    "field_name": "authors",
    "field_type": "NotRequired[List[Dict[str, str]]]",
    "field_description": "List of authors associated with the paper. Each author is a dictionary with the following fields: name and affiliation."
  },
  {
    "field_name": "summary",
    "field_type": "NotRequired[str]",
    "field_description": "Sometimes the summary is explicitly marked in the text, then extract it exactly without any paraphrasing. Otherwise, produce a summary that is about 10 sentences long. If you cannot confidently determine the summary, don't return this field."
  },
  {
    "field_name": "date",
    "field_type": "NotRequired[date]",
    "field_description": "The publication date of the document. Ensure dates are not ambiguous and are in the format YYYY-MM-DD. If the date is ambiguous or in the future, do not return this field."
  }
]

Step 2: Add the Agent to the Tenant Configuration

To add the agent to the tenant configuration, create a configuration that describes the agent and add it to the chat_bot_setups field of the tenant configuration. For complete configuration details, see the Tenant Configuration Reference.

The agent configuration contains the following fields:

bot_identifier: (string) The identifier of the agent.
llm_configuration_name: (string) A reference to the LLM configuration in the tenant configuration.
llm_tracing_configuration_name: (string, optional) A reference to the LLM tracing configuration.
agent_name: (string) The name of the agent. Set it to metadata_extractor or affiliations_extractor.
serving_url: (string) The URL of the agent serving endpoint.
bot_configuration: (object) The configuration of the agent containing:
- agent_config: (object) Defines how the agent exchanges information with the Zeta Alpha platform:
  - max_document_chars: (integer) The maximum number of characters of the document content that the agent will handle.
  - max_response_tokens: (integer, optional) The maximum number of tokens that the agent will return.
  - use_reflection_prompt: (boolean, optional) Whether the agent should check the output format to ensure correctness. This may slow down the agent and make an extra call.
  - normalization_config: (array, optional, specific to affiliations_extractor) List of fields and normalization functions to apply:
    - field: The field to normalize
    - normalization_fn: The normalization function (currently only scientific_paper is supported, which normalizes affiliation names extracted from scientific papers)
- metadata_descriptions: (object) Contains the metadata items to extract:
  - items: The list of items to extract (from Step 1)

Example Tenant Configuration

Here is an example tenant configuration for the metadata_extractor and affiliations_extractor agents:

{
  ...
  "features": {
    "chat_bot_setups": [
      {
        "bot_identifier": "metadata_extractor",
        "llm_configuration_name": "name_of_the_llm_configuration",
        "agent_name": "metadata_extractor",
        "bot_configuration": {
          "agent_config": {
            "max_document_chars": 24000,
            "max_response_tokens": 4096,
            "use_reflection_prompt": true
          },
          "metadata_descriptions": {
            "items": [
              {
                "field_name": "title",
                "field_type": "NotRequired[str]",
                "field_description": "The title of the given document. If you cannot confidently determine the title, don't return this field."
              },
              ...
            ]
          }
        },
        "serving_url": "http://chat-service.production:8080"
      },
      {
        "bot_identifier": "affiliations_extractor",
        "llm_configuration_name": "name_of_the_llm_configuration",
        "agent_name": "affiliations_extractor",
        "bot_configuration": {
          "agent_config": {
            "max_document_chars": 24000,
            "max_response_tokens": 4096,
            "use_reflection_prompt": true,
            "normalization_config": [
              {
                "field": "affiliations",
                "normalization_fn": "scientific_paper"
              }
            ]
          },
          "metadata_descriptions": {
            "items": [
              {
                "field_name": "affiliations",
                "field_type": "NotRequired[str]",
                "field_description": "The list of authors affiliations."
              },
              ...
            ]
          }
        },
        "serving_url": "http://chat-service.production:8080"
      },
      ...
    ]
  },
  ...
}

Step 3: Create a Metadata Enhancement Workflow

To extract metadata from documents during ingestion, you need to use the metadata extractor in the document workflow. The processor in the main document workflow will trigger an enhancement workflow. Since the document is already in the pipeline, the enhanced metadata will eventually be indexed by the main document workflow.

To avoid reprocessing the document with the enhancement workflow, ensure the following workflow exists:

{
    "name": "enhance-metadata",
    "steps": [
        {
            "next_services": [
                "pipeline_source"
            ],
            "service": "start"
        }
    ],
    "tasks": [
        {
            "name": "pipeline-source",
            "processor_settings": {
                "always_run": true,
                "skip_deleted": false
            }
        }
    ]
}

This workflow should be set as the workflow override for the metadata enhancement content source in the next step.

Step 4: Create the Metadata Enhancement Content Source

To create the content source, define a configuration file containing the following fields:

connector: (string) The connector type. Set it to metadata_extractor_enhancement.
name: (string) The name of the enhancement.
description: (string) A description of the enhancement.
is_indexable: (boolean) Whether the enhancement is indexable.
workflow_name_overrides: (object) The workflow name overrides for the enhancement:
- ingest: The workflow name for the ingestion of the enhancement (set to enhance-metadata)
- reingest: The workflow name for the re-ingestion of the enhancement (set to enhance-metadata)
connector_configuration: (object) The configuration of the enhancement:
- metadata_extractor_enhancement: (object) The metadata extractor enhancement configuration:
  - enhancement_id: (string) The identifier of the enhancement (set to metadata)
  - extractor_backend: (string) The backend of the extractor (set to agent)
  - extractor_backend_configuration: (object) The agent configuration:
    - agent: (object) The agent configuration:
      - agent_identifier: (string) The ID of the agent to use for metadata extraction (should match one of the agents configured in the tenant configuration, e.g., metadata_extractor)
  - fields_to_extract: (array) The fields to extract from the document content:
    - field_name: (string) The name of the field to extract
    - field_type: (string) The type of the field to extract
  - field_mappings: (array) The field mappings between extracted fields and index field names

Example Configuration

Here is an example metadata enhancement content source configuration:

{
  "name": "Metadata Extractor Enhancement",
  "description": "Metadata extractor enhancement for documents",
  "is_indexable": true,
  "connector": "metadata_extractor_enhancement",
  "workflow_name_overrides": {
    "ingest": "enhance-metadata",
    "reingest": "enhance-metadata"
  },
  "connector_configuration": {
    "metadata_extractor_enhancement": {
      "enhancement_id": "metadata",
      "extractor_backend": "agent",
      "extractor_backend_configuration": {
        "agent": {
          "agent_identifier": "metadata_extractor"
        }
      },
      "fields_to_extract": [
        {
          "field_name": "title",
          "field_type": "NotRequired[str]"
        },
        {
          "field_name": "authors",
          "field_type": "NotRequired[List[str]]"
        },
        {
          "field_name": "summary",
          "field_type": "NotRequired[str]"
        },
        {
          "field_name": "date",
          "field_type": "NotRequired[date]"
        }
      ],
      "field_mappings": [
        {
          "content_source_field_name": "date",
          "index_field_name": "DCMI.created"
        },
        {
          "content_source_field_name": "summary",
          "index_field_name": "DCMI.abstract"
        },
        {
          "content_source_field_name": "title",
          "index_field_name": "DCMI.title"
        },
        {
          "content_source_field_name": "authors",
          "index_field_name": "DCMI.creator",
          "inner_field_mappings": [
            {
              "content_source_field_name": ".",
              "index_field_name": "full_name"
            }
          ]
        }
      ]
    }
  }
}

Create the Content Source

To create the metadata enhancement content source in the Zeta Alpha Platform UI:

Navigate to your tenant and click View next to your target index
Click View under Content Sources for the index
Click Create Content Source
Paste your JSON configuration
Click Submit

Metadata Enhancement

Step 5: Add the Metadata Processor to the Documents Workflow

To extract information using the metadata enhancement, add the metadata processor to the documents workflow. This processor extracts information from document content according to the configuration described in the metadata enhancement content source.

Configuration Steps

Identify the content source that ingests the documents from which you want to extract metadata
Create or modify the workflow that content source uses by adding the metadata_extractor processor
Ensure the processor is added after the text representation is created and before the index_updater processor
Add the following task to the list of workflow tasks:

    {
      "name": "metadata_extractor",
      "local_settings": {
        "default_ai_extraction": true,
        "always_extract_metadata": true,
        "preserved_fields": null
      }
    }

Configuration Parameters

always_extract_metadata: (boolean) When set to true, always extracts metadata from document content, even when the document is reprocessed. Set to false to extract metadata only on first ingestion.
preserved_fields: (array or null) Specifies which metadata fields should be protected from being overwritten during extraction. This optional parameter accepts:
- A list of field names (using content source field names) that should be preserved if they already exist in the original document metadata
- A special wildcard value ["*"] which preserves all fields defined in the field mappings configuration
- null to allow all fields to be overwritten
When a field is in the preserved_fields list, the metadata extractor will only extract and apply new values if the field is completely missing from the original metadata. This is particularly useful when certain metadata fields contain trusted data that should not be automatically replaced by AI-extracted values.

Step 6 (Optional): Extract Metadata from Existing Documents

To extract metadata from documents that have already been ingested, you need to reprocess them using a workflow that includes the metadata extractor and index updater processors.

Example Reprocessing Workflow

{
  "name": "reprocess-metadata",
  "steps": [
    {
      "next_services": [
        "metadata_extractor"
      ],
      "service": "start"
    },
    {
      "next_services": [
        "index_updater"
      ],
      "service": "metadata_extractor"
    },
    {
      "next_services": [
        "pipeline_source"
      ],
      "service": "index_updater"
    }
  ],
  "tasks": [
    {
      "name": "metadata_extractor",
      "local_settings": {
        "default_ai_extraction": true,
        "always_extract_metadata": true
      }
    },
    {
      "name": "index_updater",
      "processor_settings": {
        "always_run": true,
        "skip_deleted": false
      }
    },
    {
      "name": "pipeline_source",
      "processor_settings": {
        "always_run": true,
        "skip_deleted": false
      }
    }
  ]
}

Trigger Reprocessing

To trigger reprocessing for a specific content source:

Navigate to your tenant in the Zeta Alpha Platform UI and click View next to your target index
Click View under Content Sources for the index
Click View under Ingested Documents for the content source you want to reprocess
Click Create Reingestion
Enter the workflow name reprocess-metadata under the Workflow Name Override field
Click Submit

Reprocess Documents

Prerequisites​

Step 1: Define the Metadata Items to Extract​

Example Metadata Items​

Step 2: Add the Agent to the Tenant Configuration​

Example Tenant Configuration​

Step 3: Create a Metadata Enhancement Workflow​

Step 4: Create the Metadata Enhancement Content Source​

Example Configuration​

Create the Content Source​

Step 5: Add the Metadata Processor to the Documents Workflow​

Configuration Steps​

Configuration Parameters​

Step 6 (Optional): Extract Metadata from Existing Documents​

Example Reprocessing Workflow​

Trigger Reprocessing​