How to Test Agents with Specification-Based Testing
In this guide, we explore how to use the new specification-based testing functionality in the Agents SDK to validate agent behavior through structured YAML test specifications. The framework lets you define expected behaviors for your agents and automatically check their responses against those expectations.
Prerequisites
Before you begin, make sure you have completed the Getting Started with the Agents SDK tutorial and have a working agents project set up.
Overview of Specification-Based Testing
Specification-based testing in the Agents SDK allows you to:
- Define test scenarios in human-readable YAML files
- Specify expectations for agent responses (text content, tool calls, citations, etc.)
- Run automated tests against multiple agents or configurations
- Get detailed feedback on which expectations passed or failed
The testing framework supports four main types of expectations:
- Text Includes: Verify that the response contains specific text
- Tool Calls: Validate that specific tools were called with expected parameters
- Citations: Check that the agent provided appropriate citations
- JSON Structure: Validate JSON response structure and content
Running Tests
The test command follows this basic syntax:
rag_agents test <specs_path> [options]
Basic Usage
To run all test specifications in a directory:
rag_agents test agents/specs
This will:
- Load all YAML specification files from the specified directory
- Initialize your agents project
- Execute each test specification
- Report results with timing information
Command Options
- --pattern: Glob pattern to match specific spec files (default: *.yaml)
- --setup-src: Path to agent setup configuration file
- --secret-setup-src: Path to secret agent setup configuration file
Example with custom pattern:
rag_agents test agents/specs --project-dir agents --pattern "*_test.yaml"
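You can also point a run at explicit setup files; for example, using the same paths shown in the sample output below:
rag_agents test agents/specs --project-dir agents --setup-src agents/agent_setups.json --secret-setup-src agents/env/agent_setups.json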
Example Output
When you run the test command, you'll see output similar to this:
➜ rag_agents test agents/specs/examples --project-dir agents
─────────────────────────────────────────────────── Agent Spec Test Run ────────────────────────────────────────────────────
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ collected 4 items │
│ specs path agents/specs/examples │
│ pattern *.yaml │
│ project dir agents │
│ setup src agents/agent_setups.json │
│ secret setup src agents/env/agent_setups.json │
│ SDK version 0.0.1 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[01] sample_combi
Complex expectations: string match, tool calls, and a citation
Spec: agents/specs/examples/sample_combi.yaml
PASSED 17.22s
[02] sample_tool_call
Check that the `search` tool has been invoked
Spec: agents/specs/examples/sample_tool_call.yaml
PASSED 8.94s
[03] sample_citation
Make sure the response has a citation
Spec: agents/specs/examples/sample_citation.yaml
PASSED 5.72s
[04] sample_text
Simple check for substring match
Spec: agents/specs/examples/sample_text.yaml
PASSED 0.98s
───────────────────────────────────────────────────────── SUMMARY ──────────────────────────────────────────────────────────
4 passed, 4 total in 32.85s
Writing Test Specifications
Test specifications are YAML files that define a conversation scenario and expected outcomes. Here's the basic structure:
id: unique_test_identifier
description: "Human-readable description of what this test validates"
agent_identifier: "name_of_your_agent"
messages:
  - role: user
    content: "User message content"
  - role: bot
    content: "Expected bot response (optional)"
expectations:
  - type: expectation_type
    # expectation-specific parameters
bot_params:
  # optional agent parameters
conversation_context:
  # optional context setup
Basic Test Example
Here's a simple test that verifies the agent responds with specific text:
id: simple_greeting_test
description: "Verify agent introduces itself correctly"
agent_identifier: "my_chat_agent"
messages:
  - role: user
    content: "What's your name?"
expectations:
  - type: text_includes
    value: "Zeta Alpha"
Types of Expectations
1. Text Includes Expectations
Verify that the agent's response contains specific text strings.
expectations:
  - type: text_includes
    value: "machine learning"
  - type: text_includes
    value: "neural network"
Use cases:
- Checking that key terms appear in responses
- Validating greeting messages
- Ensuring specific information is mentioned
2. Tool Call Expectations
Validate that the agent calls specific tools with expected parameters.
expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer architecture"
        filters:
          category: "research_papers"
Key features:
- name: The exact name of the tool/function that should be called
- params_partial: Partial matching of parameters (supports nested objects)
Use cases:
- Ensuring agents use tool functionality appropriately
- Validating API calls with correct parameters
- Testing tool integration workflows
3. Citation Expectations
Verify that the agent provides appropriate citations in its responses.
expectations:
  - type: citation
    min_citations: 2
    cited_docs:
      - "948d9d226a999cf6a4bb4f8a821218f3192e2ea2"
      - "abc123def456789012345678901234567890abcd"
Parameters:
- min_citations: Minimum number of citations required
- cited_docs: Specific document IDs that should be cited (optional)
Use cases:
- Ensuring proper grounding of responses
- Validating that specific documents are referenced
- Testing citation quality and quantity
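Since cited_docs is optional, a minimal expectation that only requires the response to be grounded somewhere could look like this:
expectations:
  - type: citation
    min_citations: 1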
4. JSON Structure Expectations
Validate JSON response structure and specific key-value pairs.
expectations:
  - type: json_includes
    items:
      status: "success"
      result:
        confidence: 0.95
        category: "research"
Features:
- Partial matching: only specified keys need to match
- Nested object support
- Exact value matching for primitives
Use cases:
- Testing API-style agents that return structured data
- Validating classification results
- Checking metadata in responses
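As an illustration, a hypothetical agent response like the one below would satisfy the json_includes expectation above: the listed keys match exactly, and the extra explanation field is simply ignored by the partial match.
{
  "status": "success",
  "result": {
    "confidence": 0.95,
    "category": "research",
    "explanation": "Classified based on document metadata"
  }
}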
Advanced Test Specifications
Multi-Turn Conversations
Test complex conversations with multiple human/agent exchanges:
id: multi_turn_research_test
description: "Test agent's ability to handle follow-up questions"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "What is BERT?"
  - role: bot
    content: "BERT is a transformer-based model developed by Google for natural language understanding tasks."
  - role: user
    content: "How does it compare to GPT?"
expectations:
  - type: text_includes
    value: "Bidirectional Encoder"
  - type: text_includes
    value: "autoregressive"
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT vs GPT"
Complex Expectations
Combine multiple expectation types to thoroughly validate agent behavior:
id: comprehensive_research_test
description: "Complete research workflow validation"
agent_identifier: "advanced_research_agent"
messages:
  - role: user
    content: "Explain transformer architecture and provide sources"
expectations:
  # Content validation
  - type: text_includes
    value: "attention mechanism"
  - type: text_includes
    value: "encoder-decoder"
  # Tool usage validation
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer"
  - type: tool_call
    name: cite
    params_partial:
      citations:
        - document_id: "some-paper-id"
  # Citation validation
  - type: citation
    min_citations: 2
    cited_docs:
      - "attention-is-all-you-need-paper-id"
Agent Configuration Testing
Test different agent configurations using bot_params:
id: temperature_test
description: "Test agent with specific temperature setting"
agent_identifier: "configurable_agent"
bot_params:
  temperature: 0.1
  max_tokens: 200
messages:
  - role: user
    content: "Explain quantum computing"
expectations:
  - type: text_includes
    value: "quantum"
  - type: text_includes
    value: "superposition"
Context-Aware Testing
Test agents with specific document or custom context:
id: context_aware_test
description: "Test agent with document context"
agent_identifier: "context_agent"
conversation_context:
  document_context:
    documents:
      - id: "research-paper-123"
        content: "Quantum computing leverages quantum mechanics..."
        metadata:
          title: "Quantum Computing Fundamentals"
          authors: ["Dr. Smith", "Prof. Johnson"]
  custom_context:
    domain: "quantum_physics"
    expertise_level: "intermediate"
messages:
  - role: user
    content: "Summarize the key concepts from the provided document"
expectations:
  - type: text_includes
    value: "quantum mechanics"
  - type: citation
    min_citations: 1
    cited_docs:
      - "research-paper-123"
Best Practices
1. Start Simple
Begin with basic text inclusion tests before moving to complex multi-expectation scenarios:
# Good: Simple, focused test
id: basic_search_test
description: "Verify agent can perform basic search"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Search for machine learning papers"
expectations:
  - type: tool_call
    name: search
2. Use Descriptive IDs and Descriptions
Make your tests self-documenting:
# Good: Clear and descriptive
id: citation_quality_with_multiple_sources
description: "Verify agent provides high-quality citations when multiple relevant sources are available"
# Avoid: Vague descriptions
id: test1
description: "Test stuff"
3. Test Edge Cases
Include tests for boundary conditions and error scenarios:
id: empty_search_results_handling
description: "Test agent behavior when search returns no results"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Find information about 'xyzqwertynonexistentterm123'"
expectations:
  - type: text_includes
    value: "no results found"
  - type: tool_call
    name: search
4. Use Partial Parameter Matching Effectively
Leverage partial matching for flexible tool call validation:
expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT"  # Only check that query contains "BERT"
        # Don't need to specify all parameters
5. Group Related Tests
Create comprehensive test suites for specific functionality:
# File: citation_behavior_suite.yaml
id: citation_comprehensive_test
description: "Complete citation behavior validation"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "Tell me about attention mechanisms in transformers with sources"
expectations:
  - type: citation
    min_citations: 1
  - type: tool_call
    name: cite
  - type: text_includes
    value: "attention"
Integration with CI/CD
Integrate specification-based testing into your development workflow:
GitHub Actions Example
name: Agent Testing
on: [push, pull_request]
jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run agent tests
        run: |
          rag_agents test agents/specs --project-dir agents
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Local Development Workflow
- Write tests first: Define expected behavior before implementing
- Run tests frequently: Validate changes during development
- Create regression tests: Add tests for fixed bugs
- Update tests: Modify expectations when behavior intentionally changes
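For quick iteration during local development, you can narrow a run to the specs you are currently working on with --pattern; the file name pattern below is just an example:
rag_agents test agents/specs --project-dir agents --pattern "citation_*.yaml"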
Conclusion
Specification-based testing provides a powerful way to ensure your agents behave consistently and correctly. By writing comprehensive test specifications, you can:
- Catch regressions early in development
- Document expected agent behavior
- Validate complex interactions automatically
- Ensure consistency across different agent configurations
Start with simple text inclusion tests and gradually build up to complex multi-expectation scenarios as you become more familiar with the framework. Remember to organize your tests logically, use descriptive names, and test both happy paths and edge cases.
For more advanced testing scenarios and integration with evaluation workflows, see How to Evaluate Agent Quality Using RAGElo.