How to Test Agents with Specification-Based Testing
In this guide, we explore how to use the new specification-based testing functionality in the Agents SDK to validate agent behavior through structured YAML test specifications. The framework lets you define expected behaviors for your agents and automatically check their responses against those expectations.
Prerequisites
Before you begin, make sure you have completed the Getting Started with the Agents SDK tutorial and have a working agents project set up.
Overview of Specification-Based Testing
Specification-based testing in the Agents SDK allows you to:
- Define test scenarios in human-readable YAML files
- Specify expectations for agent responses (text content, tool calls, citations, etc.)
- Run automated tests against multiple agents or configurations
- Get detailed feedback on which expectations passed or failed
The testing framework supports four main types of expectations:
- Text Includes: Verify that the response contains specific text
- Tool Calls: Validate that specific tools were called with expected parameters
- Citations: Check that the agent provided appropriate citations
- JSON Structure: Validate JSON response structure and content
Running Tests
The test command follows this basic syntax:
rag_agents test <specs_path> [options]
Basic Usage
To run all test specifications in a directory:
rag_agents test agents/specs
This will:
- Load all YAML specification files from the specified directory
- Initialize your agents project
- Execute each test specification
- Report results with timing information
Command Options
- --pattern: Glob pattern to match specific spec files (default: *.yaml)
- --setup-src: Path to agent setup configuration file
- --secret-setup-src: Path to secret agent setup configuration file
Example with custom pattern:
rag_agents test agents/specs --project-dir agents --pattern "*_test.yaml"
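You can also point a run at explicit setup files; for example, using the same paths shown in the sample output below:
rag_agents test agents/specs --project-dir agents --setup-src agents/agent_setups.json --secret-setup-src agents/env/agent_setups.json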
Example Output
When you run the test command, you'll see output similar to this:
➜ rag_agents test agents/specs/examples --project-dir agents
─────────────────────────────────────────────────── Agent Spec Test Run ────────────────────────────────────────────────────
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ collected 4 items │
│ specs path agents/specs/examples │
│ pattern *.yaml │
│ project dir agents │
│ setup src agents/agent_setups.json │
│ secret setup src agents/env/agent_setups.json │
│ SDK version 0.0.1 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[01] sample_combi
Complex expectations: string match, tool calls, and a citation
Spec: agents/specs/examples/sample_combi.yaml
PASSED 17.22s
[02] sample_tool_call
Check that the `search` tool has been invoked
Spec: agents/specs/examples/sample_tool_call.yaml
PASSED 8.94s
[03] sample_citation
Make sure the response has a citation
Spec: agents/specs/examples/sample_citation.yaml
PASSED 5.72s
[04] sample_text
Simple check for substring match
Spec: agents/specs/examples/sample_text.yaml
PASSED 0.98s
───────────────────────────────────────────────────────── SUMMARY ──────────────────────────────────────────────────────────
4 passed, 4 total in 32.85s
Writing Test Specifications
Test specifications are YAML files that define a conversation scenario and expected outcomes. Here's the basic structure:
id: unique_test_identifier
description: "Human-readable description of what this test validates"
agent_identifier: "name_of_your_agent"
messages:
  - role: user
    content: "User message content"
  - role: bot
    content: "Expected bot response (optional)"
expectations:
  - type: expectation_type
    # expectation-specific parameters
bot_params:
  # optional agent parameters
conversation_context:
  # optional context setup
Basic Test Example
Here's a simple test that verifies the agent responds with specific text:
id: simple_greeting_test
description: "Verify agent introduces itself correctly"
agent_identifier: "my_chat_agent"
messages:
  - role: user
    content: "What's your name?"
expectations:
  - type: text_includes
    value: "Zeta Alpha"
Types of Expectations
1. Text Includes Expectations
Verify that the agent's response contains specific text strings.
expectations:
  - type: text_includes
    value: "machine learning"
  - type: text_includes
    value: "neural network"
Use cases:
- Checking that key terms appear in responses
- Validating greeting messages
- Ensuring specific information is mentioned
2. Tool Call Expectations
Validate that the agent calls specific tools with expected parameters.
expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer architecture"
        filters:
          category: "research_papers"
Key features:
- name: The exact name of the tool/function that should be called
- params_partial: Partial matching of parameters (supports nested objects)
Use cases:
- Ensuring agents use tool functionality appropriately
- Validating API calls with correct parameters
- Testing tool integration workflows
3. Citation Expectations
Verify that the agent provides appropriate citations in its responses.
expectations:
  - type: citation
    min_citations: 2
    cited_docs:
      - "948d9d226a999cf6a4bb4f8a821218f3192e2ea2"
      - "abc123def456789012345678901234567890abcd"
Parameters:
- min_citations: Minimum number of citations required
- cited_docs: Specific document IDs that should be cited (optional)
Use cases:
- Ensuring proper grounding of responses
- Validating that specific documents are referenced
- Testing citation quality and quantity
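Since cited_docs is optional, a minimal expectation that only requires the response to be grounded somewhere could look like this:
expectations:
  - type: citation
    min_citations: 1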
4. JSON Structure Expectations
Validate JSON response structure and specific key-value pairs.
expectations:
  - type: json_includes
    items:
      status: "success"
      result:
        confidence: 0.95
        category: "research"
Features:
- Partial matching: only specified keys need to match
- Nested object support
- Exact value matching for primitives
Use cases:
- Testing API-style agents that return structured data
- Validating classification results
- Checking metadata in responses
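As an illustration, a hypothetical agent response like the one below would satisfy the json_includes expectation above: the listed keys match exactly, and the extra explanation field is simply ignored by the partial match.
{
  "status": "success",
  "result": {
    "confidence": 0.95,
    "category": "research",
    "explanation": "Classified based on document metadata"
  }
}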
Advanced Test Specifications
Multi-Turn Conversations
Test complex conversations with multiple human/agent exchanges:
id: multi_turn_research_test
description: "Test agent's ability to handle follow-up questions"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "What is BERT?"
  - role: bot
    content: "BERT is a transformer-based model developed by Google for natural language understanding tasks."
  - role: user
    content: "How does it compare to GPT?"
expectations:
  - type: text_includes
    value: "Bidirectional Encoder"
  - type: text_includes
    value: "autoregressive"
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT vs GPT"
Complex Expectations
Combine multiple expectation types to thoroughly validate agent behavior:
id: comprehensive_research_test
description: "Complete research workflow validation"
agent_identifier: "advanced_research_agent"
messages:
  - role: user
    content: "Explain transformer architecture and provide sources"
expectations:
  # Content validation
  - type: text_includes
    value: "attention mechanism"
  - type: text_includes
    value: "encoder-decoder"
  # Tool usage validation
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "transformer"
  - type: tool_call
    name: cite
    params_partial:
      citations:
        - document_id: "some-paper-id"
  # Citation validation
  - type: citation
    min_citations: 2
    cited_docs:
      - "attention-is-all-you-need-paper-id"
Agent Configuration Testing
Test different agent configurations using bot_params:
id: temperature_test
description: "Test agent with specific temperature setting"
agent_identifier: "configurable_agent"
bot_params:
  temperature: 0.1
  max_tokens: 200
messages:
  - role: user
    content: "Explain quantum computing"
expectations:
  - type: text_includes
    value: "quantum"
  - type: text_includes
    value: "superposition"
Context-Aware Testing
Test agents with specific document or custom context:
id: context_aware_test
description: "Test agent with document context"
agent_identifier: "context_agent"
conversation_context:
  document_context:
    documents:
      - id: "research-paper-123"
        content: "Quantum computing leverages quantum mechanics..."
        metadata:
          title: "Quantum Computing Fundamentals"
          authors: ["Dr. Smith", "Prof. Johnson"]
  custom_context:
    domain: "quantum_physics"
    expertise_level: "intermediate"
messages:
  - role: user
    content: "Summarize the key concepts from the provided document"
expectations:
  - type: text_includes
    value: "quantum mechanics"
  - type: citation
    min_citations: 1
    cited_docs:
      - "research-paper-123"
Best Practices
1. Start Simple
Begin with basic text inclusion tests before moving to complex multi-expectation scenarios:
# Good: Simple, focused test
id: basic_search_test
description: "Verify agent can perform basic search"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Search for machine learning papers"
expectations:
  - type: tool_call
    name: search
2. Use Descriptive IDs and Descriptions
Make your tests self-documenting:
# Good: Clear and descriptive
id: citation_quality_with_multiple_sources
description: "Verify agent provides high-quality citations when multiple relevant sources are available"
# Avoid: Vague descriptions
id: test1
description: "Test stuff"
3. Test Edge Cases
Include tests for boundary conditions and error scenarios:
id: empty_search_results_handling
description: "Test agent behavior when search returns no results"
agent_identifier: "search_agent"
messages:
  - role: user
    content: "Find information about 'xyzqwertynonexistentterm123'"
expectations:
  - type: text_includes
    value: "no results found"
  - type: tool_call
    name: search
4. Use Partial Parameter Matching Effectively
Leverage partial matching for flexible tool call validation:
expectations:
  - type: tool_call
    name: search
    params_partial:
      search_request:
        query: "BERT"  # Only check that query contains "BERT"
        # Don't need to specify all parameters
5. Group Related Tests
Create comprehensive test suites for specific functionality:
# File: citation_behavior_suite.yaml
id: citation_comprehensive_test
description: "Complete citation behavior validation"
agent_identifier: "research_agent"
messages:
  - role: user
    content: "Tell me about attention mechanisms in transformers with sources"
expectations:
  - type: citation
    min_citations: 1
  - type: tool_call
    name: cite
  - type: text_includes
    value: "attention"
Integration with CI/CD
Integrate specification-based testing into your development workflow:
GitHub Actions Example
name: Agent Testing
on: [push, pull_request]
jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run agent tests
        run: |
          rag_agents test agents/specs --project-dir agents
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Local Development Workflow
- Write tests first: Define expected behavior before implementing
- Run tests frequently: Validate changes during development
- Create regression tests: Add tests for fixed bugs
- Update tests: Modify expectations when behavior intentionally changes
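For quick iteration during local development, you can narrow a run to the specs you are currently working on with --pattern; the file name pattern below is just an example:
rag_agents test agents/specs --project-dir agents --pattern "citation_*.yaml"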
Conclusion
Specification-based testing provides a powerful way to ensure your agents behave consistently and correctly. By writing comprehensive test specifications, you can:
- Catch regressions early in development
- Document expected agent behavior
- Validate complex interactions automatically
- Ensure consistency across different agent configurations
Start with simple text inclusion tests and gradually build up to complex multi-expectation scenarios as you become more familiar with the framework. Remember to organize your tests logically, use descriptive names, and test both happy paths and edge cases.
For more advanced testing scenarios and integration with evaluation workflows, see How to Evaluate Agent Quality Using RAGElo.