Getting Started
This guide will get you up and running with Spring AI Bench quickly.
1. Prerequisites
Before you begin, ensure you have:
- Java 17+ - Required for building and running
- Maven 3.6+ - Build system
- Docker - For DockerSandbox testing (optional)
- GitHub Token - For repository access (set the GITHUB_TOKEN environment variable)
- Agent API Keys - For agent integration tests:
  - ANTHROPIC_API_KEY - Claude Code agent
  - GEMINI_API_KEY - Gemini agent
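A small pre-flight script is a quick way to confirm the prerequisites above. This is a sketch, not something shipped with Spring AI Bench. `check_prereqs` and `java_major` are hypothetical names. Maven is only checked for presence (either `mvn` or the `./mvnw` wrapper), not for version 3.6+. Docker, GITHUB_TOKEN and the agent keys only produce warnings, so tighten those checks if your workflow always needs them.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the Spring AI Bench prerequisites (not part of the project).

# Print the Java major version from a `java -version` line, e.g.
#   'openjdk version "17.0.9" 2023-10-17' -> 17
#   'java version "1.8.0_381"'            -> 8
java_major() {
  local v
  v=$(printf '%s\n' "$1" | sed -E 's/.*version "([^"]+)".*/\1/')
  case "$v" in
    1.*) v=${v#1.}; echo "${v%%.*}" ;;   # legacy 1.x scheme (Java 8 and older)
    *)   echo "${v%%[.+-]*}" ;;          # Java 9+ scheme: major is the first field
  esac
}

# Report OK/FAIL/WARN per prerequisite; returns non-zero if Java 17+ or Maven is missing.
check_prereqs() {
  local status=0 major
  if command -v java >/dev/null 2>&1; then
    major=$(java_major "$(java -version 2>&1 | awk '/version/ { print; exit }')")
    if [ "$major" -ge 17 ] 2>/dev/null; then
      echo "OK    java $major"
    else
      echo "FAIL  java 17+ required (found: ${major:-unknown})"; status=1
    fi
  else
    echo "FAIL  java not found on PATH"; status=1
  fi
  # Presence only; the ./mvnw wrapper fetches its own configured Maven version.
  if command -v mvn >/dev/null 2>&1 || [ -x ./mvnw ]; then
    echo "OK    maven (mvn or ./mvnw)"
  else
    echo "FAIL  neither mvn nor ./mvnw found"; status=1
  fi
  if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    echo "OK    docker daemon reachable"
  else
    echo "WARN  docker unavailable (only needed for DockerSandbox tests)"
  fi
  [ -n "${GITHUB_TOKEN:-}" ]      || echo "WARN  GITHUB_TOKEN not set (repository access)"
  [ -n "${ANTHROPIC_API_KEY:-}" ] || echo "WARN  ANTHROPIC_API_KEY not set (Claude Code agent tests)"
  [ -n "${GEMINI_API_KEY:-}" ]    || echo "WARN  GEMINI_API_KEY not set (Gemini agent tests)"
  return "$status"
}
```

To use it, save the script under a name such as `check-prereqs.sh` (hypothetical). Then run `source check-prereqs.sh && check_prereqs` from the project root so that the `./mvnw` check finds the wrapper.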
2. Installation
3. Running Your First Benchmark
3.1. Core Tests (No API Keys Required)
# All core tests (infrastructure, sandboxes, framework)
./mvnw test
# Specific test categories
./mvnw test -Dtest='*IntegrationTest' # All integration tests (pattern quoted so the shell does not expand it)
./mvnw test -Dtest=BenchHarnessE2ETest # End-to-end benchmark test
./mvnw test -Dtest=LocalSandboxIntegrationTest # Local execution tests
./mvnw test -Dtest=DockerSandboxTest # Docker container tests
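When one of the commands above fails, Maven Surefire writes per-class reports. By default these go under `target/surefire-reports` in each module. The sketch below is a hypothetical helper named `failed_test_reports` that lists the plain-text reports recording at least one failure or error. It assumes Surefire's default report directory and its `Tests run: N, Failures: N, Errors: N` summary line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Surefire .txt reports with at least one failure or error.
# Assumes the default report directory (target/surefire-reports) in each module.
# Usage: failed_test_reports [project_root]   (defaults to the current directory)
failed_test_reports() {
  find "${1:-.}" -type f -path '*/target/surefire-reports/*.txt' \
    -exec grep -l -E '(Failures|Errors): [1-9]' {} +
}
```

Run it from the project root after a failing `./mvnw test`. Surefire also writes XML reports next to the `.txt` files if you need the full stack traces.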
4. Test Profiles
Spring AI Bench uses Maven profiles to control test execution:
- Default profile: Runs core infrastructure tests (no API keys required)
- agents-live profile: Runs live agent integration tests (requires API keys)
./mvnw test -Pagents-live
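You can guard the profile behind a key check so that a live run never starts without the keys it needs. This sketch assumes both ANTHROPIC_API_KEY and GEMINI_API_KEY should be set before running `-Pagents-live`. `run_agents_live` and the `MVNW` override are hypothetical and not part of the project. The override defaults to `./mvnw`, and setting `MVNW=echo` gives a dry run.

```shell
#!/usr/bin/env bash
# Hypothetical guard: run the agents-live profile only when both agent keys are set.
# MVNW defaults to ./mvnw; set MVNW=echo to print the Maven arguments instead.
run_agents_live() {
  local missing=""
  [ -n "${ANTHROPIC_API_KEY:-}" ] || missing="$missing ANTHROPIC_API_KEY"
  [ -n "${GEMINI_API_KEY:-}" ]    || missing="$missing GEMINI_API_KEY"
  if [ -n "$missing" ]; then
    echo "agents-live not started; missing:$missing" >&2
    return 1
  fi
  # Extra arguments (e.g. -Dtest=...) are passed through to Maven.
  "${MVNW:-./mvnw}" test -Pagents-live "$@"
}
```

With both keys exported, `MVNW=echo run_agents_live` prints `test -Pagents-live` instead of starting Maven.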
5. Configuration
5.1. Basic YAML Configuration
Create a benchmark specification file:
repo:
  owner: rd-1-2022
  name: simple-calculator
  ref: 93da3b1847ed67f3bc7d8a84e1e6afd737f1a555

agent:
  kind: claude-code
  model: claude-4-sonnet
  autoApprove: true
  prompt: |
    Fix the failing JUnit tests in this project.
    Run "./mvnw test" until all tests pass, then commit.

success:
  cmd: mvn test

timeoutSec: 600
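Before handing a spec like this to an agent, you can reproduce its starting state by hand. Check out the pinned `ref` and run the `success` command. Given the prompt, that command presumably fails at this ref. The sketch below assumes that `repo.owner` and `repo.name` resolve to a GitHub repository at `https://github.com/<owner>/<name>`. The GITHUB_TOKEN prerequisite suggests this, but the guide does not state it. `spec_clone_url` and `reproduce_baseline` are hypothetical helpers, not part of Spring AI Bench.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Spring AI Bench) for checking a spec's starting state by hand.

# Assumption: repo.owner/repo.name refer to a GitHub repository.
spec_clone_url() {
  printf 'https://github.com/%s/%s.git\n' "$1" "$2"
}

# Clone the repo into a temporary directory, check out the pinned ref,
# and run the spec's success command (mvn test in the example above).
# Usage: reproduce_baseline OWNER NAME REF
reproduce_baseline() {
  local owner=$1 name=$2 ref=$3 dir
  dir=$(mktemp -d) || return 1
  git clone --quiet "$(spec_clone_url "$owner" "$name")" "$dir/$name" || return 1
  git -C "$dir/$name" checkout --quiet "$ref" || return 1
  echo "Checked out $ref in $dir/$name" >&2
  (cd "$dir/$name" && mvn test)
}
```

For the example spec, the call is `reproduce_baseline rd-1-2022 simple-calculator 93da3b1847ed67f3bc7d8a84e1e6afd737f1a555`. The clone is left in the temporary directory so you can inspect it afterwards.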
6. Verification Commands
# Verify everything builds and core tests pass
./mvnw clean verify
# Quick benchmark test
./mvnw test -Dtest=BenchHarnessTest
# Full verification including agent tests (requires API keys)
ANTHROPIC_API_KEY=your_anthropic_key GEMINI_API_KEY=your_gemini_key ./mvnw clean verify -Pagents-live
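Passing keys inline, as in the last command above, leaves them in your shell history. One alternative is to read each key without echoing it and export it for the current session. The sketch below assumes bash, because `read -s` is not POSIX, and `prompt_secret` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a secret without echoing it (bash `read -s`),
# store it in the named variable, and export it to child processes.
# Usage: prompt_secret VAR_NAME
prompt_secret() {
  printf 'Enter %s: ' "$1" >&2
  IFS= read -rs "$1" || return 1
  printf '\n' >&2
  export "$1"
}
```

Call `prompt_secret ANTHROPIC_API_KEY` and `prompt_secret GEMINI_API_KEY`, then run `./mvnw clean verify -Pagents-live` in the same shell. The trade-off is that the exported keys stay visible to every process started from that shell until you `unset` them.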
7. Next Steps
Now that you have Spring AI Bench running:
- Learn about benchmark concepts - Understand how benchmarks are structured
- Set up your first agent - Connect Claude Code or Gemini
- Write custom benchmarks - Create benchmarks for your own projects
- Explore execution environments - Understand isolation and security