Running Benchmarks
This guide covers how to execute benchmarks using Spring AI Bench.
1. Quick Start
2. Environment Setup
2.1. Local Development Requirements
For AI agent integration testing, you’ll need to build and install spring-ai-agents locally:
# Build spring-ai-agents first
git clone https://github.com/spring-ai-community/spring-ai-agents.git
cd spring-ai-agents
./mvnw clean install -DskipTests
cd ..
# Then build spring-ai-bench
git clone https://github.com/spring-ai-community/spring-ai-bench.git
cd spring-ai-bench
./mvnw clean install
2.2. API Keys
For live agent testing, configure your environment:
# Claude Code
export ANTHROPIC_API_KEY=your-anthropic-key
# Gemini
export GEMINI_API_KEY=your-google-key
# GitHub (for private repositories)
export GITHUB_TOKEN=your-github-token
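Before kicking off live tests, it can help to confirm the variables are actually exported in the current shell. A minimal bash sketch (the variable names are simply the ones listed above):
# Sanity check: report which of the keys listed above are set in this shell
for var in ANTHROPIC_API_KEY GEMINI_API_KEY GITHUB_TOKEN; do
  if [ -n "${!var}" ]; then echo "$var is set"; else echo "$var is MISSING"; fi
done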
2.3. Agent CLI Tools
Install required CLI tools for your agents:
- Claude Code
- Gemini
# Install Claude CLI
npm install -g @anthropic-ai/claude-code
# Verify installation
claude --version
# Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
# Initialize and authenticate
gcloud init
gcloud auth application-default login
3. Running Specific Benchmarks
3.1. By Test Class
# Claude Code integration
./mvnw test -Dtest=ClaudeCodeIntegrationSuccessTest -pl bench-agents
# Gemini integration
./mvnw test -Dtest=GeminiIntegrationTest -pl bench-agents
# HelloWorld (mock agent)
./mvnw test -Dtest=HelloWorldIntegrationTest -pl bench-agents
# HelloWorld AI agent (requires spring-ai-agents built locally)
./mvnw test -Dtest=HelloWorldAIIntegrationTest -pl bench-agents
# Multi-agent comparison (deterministic + Claude + Gemini)
./mvnw test -Dtest=HelloWorldMultiAgentTest -pl bench-agents
3.2. By Profile
# Live agents only
./mvnw test -Pagents-live
# Core framework only
./mvnw test -Pdefault
3.3. Custom Benchmark Files
To run benchmarks from YAML specifications:
# Single benchmark file
java -jar bench-app/target/bench-app.jar \
  --benchmark src/test/resources/samples/calculator-sqrt-bug.yaml
# Multiple benchmarks
java -jar bench-app/target/bench-app.jar \
  --benchmark-dir src/test/resources/samples/
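The sample files under src/test/resources/samples/ define the actual YAML schema. The sketch below is illustrative only; its field names (benchmarkId, agent, verification) are inferred from the report metadata shown in section 4.4, not taken from the real spec format:
# Illustrative only: check src/test/resources/samples/ for the real schema
benchmarkId: calculator-sqrt-bug
agent:
  kind: claude-code
  timeout: PT10M
verification:
  checks:
    - exists
    - content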
4. Understanding Results
4.1. Console Output
During execution, you’ll see structured logging:
[INFO] ADAPTER - Starting ClaudeCodeAgentModel
[INFO] SETUP - Workspace: /tmp/bench-workspace-123
[INFO] SETUP - Run root: /tmp/bench-reports/456
[INFO] WORKSPACE - Workspace cleaned successfully
[INFO] AGENT - Executing agent task
[INFO] AGENT - Agent call completed. Results: 1
[INFO] VERIFIER - Starting verification
[INFO] VERIFIER - exists:PASS content:PASS
[INFO] RESULT - SUCCESS: All checks passed
[INFO] FINAL - Exit code: 0, Duration: 15432ms
4.2. HTML Reports
After execution, HTML reports are generated:
bench-reports/
└── {run-id}/
    ├── run.log        # Detailed execution log
    ├── report.html    # Human-readable report
    ├── report.json    # Machine-readable metadata
    └── workspace/     # Final workspace state
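Each run lands in its own {run-id} directory, so one convenient way to open the most recent report is to sort the run directories by modification time (assuming reports are written under bench-reports/ as shown above):
# Open the HTML report of the most recent run (newest directory first)
open "$(ls -td bench-reports/*/ | head -1)report.html"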
4.3. Interactive Dashboard
Generate a comprehensive dashboard from all benchmark reports:
# Generate interactive site from all benchmark reports
jbang jbang/site.java --reportsDir /tmp/bench-reports --siteDir /tmp/bench-site
# View results in browser
open file:///tmp/bench-site/index.html
The dashboard provides:
- Run Overview - Table of all benchmark executions with status and timing
- Agent Comparison - Side-by-side performance comparison across agents
- Detailed Reports - Click-through to individual run details
- Performance Metrics - Duration tracking and success rates
4.4. JSON Metadata
The report.json file contains structured results:
{
  "runId": "123e4567-e89b-12d3-a456-426614174000",
  "benchmarkId": "calculator-sqrt-bug",
  "success": true,
  "exitCode": 0,
  "durationMs": 15432,
  "startTime": "2024-01-15T10:30:00Z",
  "endTime": "2024-01-15T10:30:15Z",
  "agent": {
    "kind": "claude-code",
    "model": "claude-3-5-sonnet"
  },
  "verification": {
    "success": true,
    "checks": [
      {"name": "exists", "pass": true, "detail": "ok"},
      {"name": "content", "pass": true, "detail": "ok"}
    ]
  }
}
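Because report.json is plain JSON, results from many runs can be aggregated with standard tooling, for example with jq (path layout assumed to match the tree in section 4.2):
# One line per run: benchmark id, success flag, duration in ms
jq -r '[.benchmarkId, .success, .durationMs] | @tsv' bench-reports/*/report.json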
5. Configuration Options
5.1. Timeout Settings
# Custom timeout (in seconds)
./mvnw test -Dtest.timeout=1200
# Per-agent timeout in YAML
agent:
  timeout: PT10M   # 10 minutes
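The system property can also be combined with the test-selection flags from section 3.1, which is useful when only one slow integration test needs a longer budget; for example:
# Give a single live-agent test a 30-minute budget
./mvnw test -Dtest=GeminiIntegrationTest -Dtest.timeout=1800 -pl bench-agents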
6. Troubleshooting
6.1. Common Issues
7. Next Steps
- Writing Custom Benchmarks - Create your own benchmarks
- Agent Configuration - Configure agents for optimal performance
- Verification System - Understand success criteria and verification