How to Test Agents with Specification-Based Testing

In this guide, we will explore how to use the new specification-based testing functionality in the Agents SDK to validate agent behavior through structured YAML test specifications. This testing framework allows you to define expected behaviors for your agents and automatically validate their responses against these expectations.

Prerequisites

Before you begin, make sure you have completed the Getting Started with the Agents SDK tutorial and have a working agents project set up.

Overview of Specification-Based Testing

Specification-based testing in the Agents SDK allows you to:

  1. Define test scenarios in human-readable YAML files
  2. Specify expectations for agent responses (text content, tool calls, citations, etc.)
  3. Run automated tests against multiple agents or configurations
  4. Get detailed feedback on which expectations passed or failed

The testing framework supports four main types of expectations:

  • Text Includes: Verify that the response contains specific text
  • Tool Calls: Validate that specific tools were called with expected parameters
  • Citations: Check that the agent provided appropriate citations
  • JSON Structure: Validate JSON response structure and content

Running Tests

The test command follows this basic syntax:

rag_agents test <specs_path> [options]

Basic Usage

To run all test specifications in a directory:

rag_agents test agents/specs

This will:

  1. Load all YAML specification files from the specified directory
  2. Initialize your agents project
  3. Execute each test specification
  4. Report results with timing information
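
For reference, the example run shown later in this guide uses a project layout like this (reconstructed from the paths in the sample output):

agents/
├── agent_setups.json
├── env/
│   └── agent_setups.json
└── specs/
    └── examples/
        ├── sample_citation.yaml
        ├── sample_combi.yaml
        ├── sample_text.yaml
        └── sample_tool_call.yaml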

Command Options

  • --pattern: Glob pattern to match specific spec files (default: *.yaml)
  • --project-dir: Path to your agents project directory
  • --setup-src: Path to agent setup configuration file
  • --secret-setup-src: Path to secret agent setup configuration file

Example with custom pattern:

rag_agents test agents/specs --project-dir agents --pattern "*_test.yaml"

Example Output

When you run the test command, you'll see output similar to this:

➜ rag_agents test agents/specs/examples --project-dir agents 
─────────────────────────────────────────────────── Agent Spec Test Run ────────────────────────────────────────────────────
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ collected 4 items │
│ specs path agents/specs/examples │
│ pattern *.yaml │
│ project dir agents │
│ setup src agents/agent_setups.json │
│ secret setup src agents/env/agent_setups.json │
│ SDK version 0.0.1 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[01] sample_combi
Complex expectations: string match, tool calls, and a citation
Spec: agents/specs/examples/sample_combi.yaml
PASSED 17.22s

[02] sample_tool_call
Check that the `search` tool has been invoked
Spec: agents/specs/examples/sample_tool_call.yaml
PASSED 8.94s

[03] sample_citation
Make sure the response has a citation
Spec: agents/specs/examples/sample_citation.yaml
PASSED 5.72s

[04] sample_text
Simple check for substring match
Spec: agents/specs/examples/sample_text.yaml
PASSED 0.98s

───────────────────────────────────────────────────────── SUMMARY ──────────────────────────────────────────────────────────
4 passed, 4 total in 32.85s

Writing Test Specifications

Test specifications are YAML files that define a conversation scenario and expected outcomes. Here's the basic structure:

id: unique_test_identifier
description: "Human-readable description of what this test validates"
agent_identifier: "name_of_your_agent"
messages:
  - role: user
    content: "User message content"
  - role: bot
    content: "Expected bot response (optional)"
expectations:
  - type: expectation_type
    # expectation-specific parameters
bot_params:
  # optional agent parameters
conversation_context:
  # optional context setup

Basic Test Example

Here's a simple test that verifies the agent responds with specific text:

id: simple_greeting_test
description: "Verify agent introduces itself correctly"
agent_identifier: "my_chat_agent"
messages:
  - role: user
    content: "What's your name?"
expectations:
  - type: text_includes
    value: "Zeta Alpha"

Types of Expectations

1. Text Includes Expectations

Verify that the agent's response contains specific text strings.

expectations:
  - type: text_includes
    value: "machine learning"
  - type: text_includes
    value: "neural network"

Use cases:

  • Checking that key terms appear in responses
  • Validating greeting messages
  • Ensuring specific information is mentioned

2. Tool Call Expectations

Validate that the agent calls specific tools with expected parameters.

expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer architecture"
        filters:
          category: "research_papers"

Key features:

  • name: The exact name of the tool/function that should be called
  • params_partial: Partial matching of parameters (supports nested objects)
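
For instance, the expectation above would be satisfied by a hypothetical runtime call like the one below: every key listed under params_partial is checked against the call, while extra parameters the agent happens to pass are ignored.

# Hypothetical search call captured at runtime
name: search
params:
  search_request:
    query: "transformer architecture"
    filters:
      category: "research_papers"
    page_size: 10  # not in the spec, so ignored by partial matching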

Use cases:

  • Ensuring agents use their tools appropriately
  • Validating API calls with correct parameters
  • Testing tool integration workflows

3. Citation Expectations

Verify that the agent provides appropriate citations in its responses.

expectations:
  - type: citation
    min_citations: 2
    cited_docs:
      - "948d9d226a999cf6a4bb4f8a821218f3192e2ea2"
      - "abc123def456789012345678901234567890abcd"

Parameters:

  • min_citations: Minimum number of citations required
  • cited_docs: Specific document IDs that should be cited (optional)
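
Since cited_docs is optional, the minimal form of this expectation enforces only a citation count:

expectations:
  - type: citation
    min_citations: 1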

Use cases:

  • Ensuring proper grounding of responses
  • Validating that specific documents are referenced
  • Testing citation quality and quantity

4. JSON Structure Expectations

Validate JSON response structure and specific key-value pairs.

expectations:
  - type: json_includes
    items:
      status: "success"
      result:
        confidence: 0.95
        category: "research"

Features:

  • Partial matching: only specified keys need to match
  • Nested object support
  • Exact value matching for primitives
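
To illustrate the partial matching, a hypothetical agent response like the one below would satisfy the expectation above: the specified keys match exactly, and the additional fields are simply ignored.

{
  "status": "success",
  "result": {
    "confidence": 0.95,
    "category": "research",
    "label_source": "classifier"
  },
  "request_id": "a1b2c3"
}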

Use cases:

  • Testing API-style agents that return structured data
  • Validating classification results
  • Checking metadata in responses

Advanced Test Specifications

Multi-Turn Conversations

Test complex conversations with multiple user and bot turns:

id: multi_turn_research_test
description: "Test agent's ability to handle follow-up questions"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "What is BERT?"
  - role: bot
    content: "BERT is a transformer-based model developed by Google for natural language understanding tasks."
  - role: user
    content: "How does it compare to GPT?"
expectations:
  - type: text_includes
    value: "Bidirectional Encoder"
  - type: text_includes
    value: "autoregressive"
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT vs GPT"

Complex Expectations

Combine multiple expectation types to thoroughly validate agent behavior:

id: comprehensive_research_test
description: "Complete research workflow validation"
agent_identifier: "advanced_research_agent"
messages:
  - role: user
    content: "Explain transformer architecture and provide sources"
expectations:
  # Content validation
  - type: text_includes
    value: "attention mechanism"
  - type: text_includes
    value: "encoder-decoder"

  # Tool usage validation
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer"
  - type: tool_call
    name: cite
    params_partial:
      citations:
        - document_id: "some-paper-id"

  # Citation validation
  - type: citation
    min_citations: 2
    cited_docs:
      - "attention-is-all-you-need-paper-id"

Agent Configuration Testing

Test different agent configurations using bot_params:

id: temperature_test
description: "Test agent with specific temperature setting"
agent_identifier: "configurable_agent"
bot_params:
  temperature: 0.1
  max_tokens: 200
messages:
  - role: user
    content: "Explain quantum computing"
expectations:
  - type: text_includes
    value: "quantum"
  - type: text_includes
    value: "superposition"

Context-Aware Testing

Test agents with specific document or custom context:

id: context_aware_test
description: "Test agent with document context"
agent_identifier: "context_agent"
conversation_context:
  document_context:
    documents:
      - id: "research-paper-123"
        content: "Quantum computing leverages quantum mechanics..."
        metadata:
          title: "Quantum Computing Fundamentals"
          authors: ["Dr. Smith", "Prof. Johnson"]
  custom_context:
    domain: "quantum_physics"
    expertise_level: "intermediate"
messages:
  - role: user
    content: "Summarize the key concepts from the provided document"
expectations:
  - type: text_includes
    value: "quantum mechanics"
  - type: citation
    min_citations: 1
    cited_docs:
      - "research-paper-123"

Best Practices

1. Start Simple

Begin with basic text inclusion tests before moving to complex multi-expectation scenarios:

# Good: Simple, focused test
id: basic_search_test
description: "Verify agent can perform basic search"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Search for machine learning papers"
expectations:
  - type: tool_call
    name: search

2. Use Descriptive IDs and Descriptions

Make your tests self-documenting:

# Good: Clear and descriptive
id: citation_quality_with_multiple_sources
description: "Verify agent provides high-quality citations when multiple relevant sources are available"

# Avoid: Vague descriptions
id: test1
description: "Test stuff"

3. Test Edge Cases

Include tests for boundary conditions and error scenarios:

id: empty_search_results_handling
description: "Test agent behavior when search returns no results"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Find information about 'xyzqwertynonexistentterm123'"
expectations:
  - type: text_includes
    value: "no results found"
  - type: tool_call
    name: search

4. Use Partial Parameter Matching Effectively

Leverage partial matching for flexible tool call validation:

expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT"  # only checks that the query contains "BERT"
        # no need to specify all parameters

5. Organize Tests into Suites

Create comprehensive test suites for specific functionality:

# File: citation_behavior_suite.yaml
id: citation_comprehensive_test
description: "Complete citation behavior validation"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "Tell me about attention mechanisms in transformers with sources"
expectations:
  - type: citation
    min_citations: 1
  - type: tool_call
    name: cite
  - type: text_includes
    value: "attention"

Integration with CI/CD

Integrate specification-based testing into your development workflow:

GitHub Actions Example

name: Agent Testing
on: [push, pull_request]

jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run agent tests
        run: |
          rag_agents test agents/specs --project-dir agents
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Local Development Workflow

  1. Write tests first: Define expected behavior before implementing
  2. Run tests frequently: Validate changes during development (see the example run below)
  3. Create regression tests: Add tests for fixed bugs
  4. Update tests: Modify expectations when behavior intentionally changes
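
For example, while iterating on citation behavior, you can scope a run to just the relevant spec files with the --pattern option described earlier:

rag_agents test agents/specs --project-dir agents --pattern "*citation*.yaml"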

Conclusion

Specification-based testing provides a powerful way to ensure your agents behave consistently and correctly. By writing comprehensive test specifications, you can:

  • Catch regressions early in development
  • Document expected agent behavior
  • Validate complex interactions automatically
  • Ensure consistency across different agent configurations

Start with simple text inclusion tests and gradually build up to complex multi-expectation scenarios as you become more familiar with the framework. Remember to organize your tests logically, use descriptive names, and test both happy paths and edge cases.

For more advanced testing scenarios and integration with evaluation workflows, see How to Evaluate Agent Quality Using RAGElo.