Running Benchmarks

This guide covers how to execute benchmarks using Spring AI Bench.

1. Quick Start

1.1. Run a Single Benchmark

# Run the end-to-end test with the built-in benchmark
./mvnw test -Dtest=BenchHarnessE2ETest -pl bench-core

# Run with a specific agent
./mvnw test -Pagents-live -Dtest=ClaudeCodeIntegrationSuccessTest -pl bench-agents

1.2. Run Benchmark Suite

# All core benchmarks (no live agents)
./mvnw test

# All agent integration benchmarks
./mvnw test -Pagents-live

2. Environment Setup

2.1. Local Development Requirements

For AI agent integration testing, you’ll need to build and install spring-ai-agents locally:

# Build spring-ai-agents first
git clone https://github.com/spring-ai-community/spring-ai-agents.git
cd spring-ai-agents
./mvnw clean install -DskipTests
cd ..

# Then build spring-ai-bench
git clone https://github.com/spring-ai-community/spring-ai-bench.git
cd spring-ai-bench
./mvnw clean install
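
To confirm the install, check that the spring-ai-agents artifacts landed in your local Maven repository (the org/springaicommunity path below is an assumption based on this project's package names):

# List locally installed spring-ai-agents artifacts (group id path is an assumption)
ls ~/.m2/repository/org/springaicommunity/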

2.2. API Keys

For live agent testing, configure your environment:

# Claude Code
export ANTHROPIC_API_KEY=your-anthropic-key

# Gemini
export GEMINI_API_KEY=your-google-key

# GitHub (for private repositories)
export GITHUB_TOKEN=your-github-token

2.3. Agent CLI Tools

Install required CLI tools for your agents:

Claude Code:

# Install the Claude Code CLI
npm install -g @anthropic-ai/claude-code

# Verify installation
claude --version

Gemini:

# Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash

# Initialize and authenticate
gcloud init
gcloud auth application-default login

3. Running Specific Benchmarks

3.1. By Test Class

# Claude Code integration
./mvnw test -Dtest=ClaudeCodeIntegrationSuccessTest -pl bench-agents

# Gemini integration
./mvnw test -Dtest=GeminiIntegrationTest -pl bench-agents

# HelloWorld (mock agent)
./mvnw test -Dtest=HelloWorldIntegrationTest -pl bench-agents

# HelloWorld AI agent (requires spring-ai-agents built locally)
./mvnw test -Dtest=HelloWorldAIIntegrationTest -pl bench-agents

# Multi-agent comparison (deterministic + Claude + Gemini)
./mvnw test -Dtest=HelloWorldMultiAgentTest -pl bench-agents

3.2. By Profile

# Live agents only
./mvnw test -Pagents-live

# Core framework only
./mvnw test -Pdefault

3.3. Custom Benchmark Files

To run benchmarks from YAML specifications:

# Single benchmark file
java -jar bench-app/target/bench-app.jar \
  --benchmark src/test/resources/samples/calculator-sqrt-bug.yaml

# Multiple benchmarks
java -jar bench-app/target/bench-app.jar \
  --benchmark-dir src/test/resources/samples/
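
The exact schema for these YAML specifications is defined by bench-core. As a purely hypothetical sketch (the field names below are assumptions, not the project's verified schema), a specification combines a benchmark id, an agent block like those in Section 5.3, and the verification checks that appear in report.json:

# Hypothetical sketch only - field names are assumptions
id: calculator-sqrt-bug
agent:
  kind: claude-code
  model: claude-3-5-sonnet
  timeout: PT10M
verification:
  checks:
    - exists
    - content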

4. Understanding Results

4.1. Console Output

During execution, you’ll see structured logging:

[INFO] ADAPTER - Starting ClaudeCodeAgentModel
[INFO] SETUP - Workspace: /tmp/bench-workspace-123
[INFO] SETUP - Run root: /tmp/bench-reports/456
[INFO] WORKSPACE - Workspace cleaned successfully
[INFO] AGENT - Executing agent task
[INFO] AGENT - Agent call completed. Results: 1
[INFO] VERIFIER - Starting verification
[INFO] VERIFIER - exists:PASS content:PASS
[INFO] RESULT - SUCCESS: All checks passed
[INFO] FINAL - Exit code: 0, Duration: 15432ms

4.2. HTML Reports

After execution, an HTML report is generated for each run, alongside the run log and JSON metadata:

bench-reports/
└── {run-id}/
    ├── run.log           # Detailed execution log
    ├── report.html       # Human-readable report
    ├── report.json       # Machine-readable metadata
    └── workspace/        # Final workspace state

4.3. Interactive Dashboard

Generate a comprehensive dashboard from all benchmark reports:

# Generate interactive site from all benchmark reports
jbang jbang/site.java --reportsDir /tmp/bench-reports --siteDir /tmp/bench-site

# View results in browser
open file:///tmp/bench-site/index.html
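
If your browser restricts local file:// access, the generated site is a set of static files and can be served with any local HTTP server instead, for example:

# Serve the generated site over HTTP
cd /tmp/bench-site
python3 -m http.server 8000
# then browse to http://localhost:8000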

The dashboard provides:

  • Run Overview - Table of all benchmark executions with status and timing

  • Agent Comparison - Side-by-side performance comparison across agents

  • Detailed Reports - Click-through to individual run details

  • Performance Metrics - Duration tracking and success rates

4.4. JSON Metadata

The report.json file contains structured results:

{
  "runId": "123e4567-e89b-12d3-a456-426614174000",
  "benchmarkId": "calculator-sqrt-bug",
  "success": true,
  "exitCode": 0,
  "durationMs": 15432,
  "startTime": "2024-01-15T10:30:00Z",
  "endTime": "2024-01-15T10:30:15Z",
  "agent": {
    "kind": "claude-code",
    "model": "claude-3-5-sonnet"
  },
  "verification": {
    "success": true,
    "checks": [
      {"name": "exists", "pass": true, "detail": "ok"},
      {"name": "content", "pass": true, "detail": "ok"}
    ]
  }
}
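
Because the metadata is plain JSON, it is easy to inspect or script against with standard tools such as jq (assuming jq is installed):

# Pull out the headline fields from a single run
jq '{benchmarkId, success, durationMs}' /tmp/bench-reports/{run-id}/report.json

# List any failed verification checks
jq '.verification.checks[] | select(.pass == false)' /tmp/bench-reports/{run-id}/report.json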

5. Configuration Options

5.1. Timeout Settings

# Custom timeout (in seconds)
./mvnw test -Dtest.timeout=1200

# Per-agent timeout in YAML
agent:
  timeout: PT10M  # 10 minutes
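
The timeout value is an ISO-8601 duration (presumably parsed with java.time.Duration, given the Java stack), so longer windows follow the same notation:

# ISO-8601 duration example
agent:
  timeout: PT1H30M  # 1 hour 30 minutes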

5.2. Workspace Management

# Keep workspace after execution (for debugging)
./mvnw test -Dkeep.workspace=true

# Custom workspace root
./mvnw test -Dworkspace.root=/custom/path

5.3. Agent-Specific Settings

Claude Code:

agent:
  kind: claude-code
  model: claude-3-5-sonnet
  autoApprove: true
  extras:
    yolo: true
    max_steps: 10

Gemini:

agent:
  kind: gemini
  model: gemini-2.0-flash-exp
  autoApprove: true
  extras:
    yolo: true
    temperature: 0.7

6. Troubleshooting

6.1. Common Issues

6.1.1. Authentication Failures

# Verify API keys are set
echo $ANTHROPIC_API_KEY
echo $GEMINI_API_KEY

# Test agent availability
claude --version
gcloud auth list

6.1.2. Timeout Errors

# Increase timeout for complex benchmarks
./mvnw test -Dtest.timeout=3600

6.1.3. Workspace Conflicts

# Clean all workspaces
rm -rf /tmp/bench-*

# Use custom workspace location
./mvnw test -Dworkspace.root=/custom/location

6.2. Debug Mode

Enable detailed logging:

# Maven debug output
./mvnw test -X

# Spring Boot debug logging
./mvnw test -Dlogging.level.org.springaicommunity=DEBUG

6.3. Log Analysis

Check log files for detailed execution traces:

# Find recent benchmark runs
ls -lt /tmp/bench-reports/

# View detailed log
cat /tmp/bench-reports/{run-id}/run.log

# Search for errors
grep -i error /tmp/bench-reports/{run-id}/run.log
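
To summarize results across all runs, you can combine find with jq over the per-run report.json files (assuming jq is installed):

# Print benchmark id, success flag, and duration for every run
find /tmp/bench-reports -name report.json \
  -exec jq -r '"\(.benchmarkId)\t\(.success)\t\(.durationMs)ms"' {} +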

7. Next Steps