Implementation Details
Spring AI Bench is an execution framework and benchmarking platform for AI agents in Java environments. It provides isolated sandboxes, customizable command execution, and automated success-criteria evaluation.
1. Architecture Overview
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  BenchHarness   │    │   AgentRunner   │    │ SuccessVerifier │
│                 │    │                 │    │                 │
│ Orchestrates    │────│ Executes agent  │────│ Validates       │
│ benchmark runs  │    │ in workspace    │    │ success criteria│
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │     Sandbox     │
                       │   (Interface)   │
                       └─────────────────┘
                                │
                ┌───────────────┼───────────────┐
                ▼               ▼               ▼
         ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
         │LocalSandbox │ │DockerSandbox│ │CloudSandbox │
         │             │ │             │ │  (Future)   │
         │Process exec │ │Containers   │ │Distributed  │
         └─────────────┘ └─────────────┘ └─────────────┘
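The Sandbox abstraction is what the three execution backends share. As a rough sketch of the contract implied by the diagram and by the usage examples in section 2 (method names mirror those examples; ExecResult is an assumed name, and the actual interface in the repository may differ):

// Illustrative sketch of the Sandbox contract implied by the diagram and
// the usage examples in section 2; not the verbatim project interface.
public interface Sandbox extends AutoCloseable {

    // Execute the command described by an ExecSpec and return its result;
    // the examples below read result.exitCode() and result.mergedLog().
    // "ExecResult" is an assumed name for that result type.
    ExecResult exec(ExecSpec spec);

    // AutoCloseable, so sandboxes work in try-with-resources and can tear
    // down processes or containers deterministically.
    @Override
    void close();
}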
2. Usage Examples
2.1. Basic Local Execution
// Execute a simple command
try (var sandbox = LocalSandbox.builder().build()) {
    var spec = ExecSpec.of("echo", "Hello World");
    var result = sandbox.exec(spec);
    System.out.println("Exit code: " + result.exitCode());
    System.out.println("Output: " + result.mergedLog());
}
2.2. Docker Sandbox with Custom Environment
// Execute in an isolated Docker container
try (var sandbox = new DockerSandbox("openjdk:17-jdk")) {
    var spec = ExecSpec.builder()
            .command("java", "-version")
            .env("JAVA_OPTS", "-Xmx512m")
            .timeout(Duration.ofSeconds(30))
            .build();
    var result = sandbox.exec(spec);
    System.out.println("Java version: " + result.mergedLog());
}
2.3. Running a Complete Benchmark
// Load and execute a benchmark
Path benchmarkFile = Path.of("benchmarks/hello-world.yaml");
BenchCase benchCase = BenchCaseLoader.load(benchmarkFile);

// `github` is a previously configured GitHub client, used for repository checkout
var harness = new BenchHarness(github, Duration.ofMinutes(10));
BenchResult result = harness.run(benchCase);

System.out.println("Success: " + result.success());
System.out.println("Duration: " + result.duration());
3. Benchmark Specification Format
Benchmarks are defined in YAML files with the following structure:
# hello-world.yaml
id: hello-world
category: coding
version: 0.5
repo:
  owner: example-org
  name: example-repo
  ref: main
agent:
  kind: hello-world-ai
  provider: claude
  autoApprove: true
  prompt: |
    Create a hello world file with the specified content.
success:
  cmd: ls hello.txt
  timeoutSec: 600
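BenchCaseLoader.load (section 2.3) deserializes this file. For orientation, a plausible record mapping is sketched below; field names mirror the YAML keys, but the project's actual model classes may be shaped differently:

// Sketch: YAML keys mapped onto Java records. RepoSpec matches its use in
// section 6.3; the other record shapes are assumptions, not confirmed API.
public record BenchCase(String id, String category, String version,
                        RepoSpec repo, AgentSpec agent, SuccessSpec success) {}

public record RepoSpec(String owner, String name, String ref) {}

public record AgentSpec(String kind, String provider, boolean autoApprove, String prompt) {}

public record SuccessSpec(String cmd, int timeoutSec) {}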
4. Development Setup
4.1. Prerequisites
- Java 17+ - Required for building and running
- Maven 3.6+ - Build system
- Docker - For DockerSandbox testing (optional)
- GitHub Token - For repository access (set the GITHUB_TOKEN env var)
- Agent API Keys - For agent integration tests:
  - ANTHROPIC_API_KEY - Claude Code agent
  - GEMINI_API_KEY - Gemini agent
4.2. Building from Source
# Full build with tests
./mvnw clean install
# Quick build (skip tests)
./mvnw clean install -DskipTests
# Run specific test categories
./mvnw test -Dtest=*IntegrationTest
./mvnw test -Dtest=BenchHarnessE2ETest
4.3. Running Tests
# All tests
./mvnw test
# Integration tests only
./mvnw test -Dtest=*IntegrationTest
# Specific sandbox tests
./mvnw test -Dtest=LocalSandboxIntegrationTest
./mvnw test -Dtest=DockerSandboxTest
# Agent integration tests (requires API keys)
ANTHROPIC_API_KEY=your_key GEMINI_API_KEY=your_key ./mvnw test -Pagents-live
# Or run specific agent tests:
ANTHROPIC_API_KEY=your_key ./mvnw test -Dtest=ClaudeCodeIntegrationSuccessTest -pl bench-agents
GEMINI_API_KEY=your_key ./mvnw test -Dtest=GeminiIntegrationTest -pl bench-agents
5. Configuration
6. Advanced Features
6.1. Custom Execution Customizers
// Create a customizer for Claude CLI integration
ExecSpecCustomizer claudeCustomizer = new ClaudeCliCustomizer();

// Build sandbox with customizers
var sandbox = LocalSandbox.builder()
        .customizer(claudeCustomizer)
        .workingDirectory(workspace)
        .build();
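ClaudeCliCustomizer ships with the project; writing your own customizer means implementing ExecSpecCustomizer. The sketch below assumes the interface exposes a single customize(ExecSpec) method and that ExecSpec has a toBuilder() copy method; both are assumptions, not confirmed API:

// Hypothetical customizer that injects an environment variable into every
// command. Assumes ExecSpecCustomizer declares customize(ExecSpec) and that
// ExecSpec offers toBuilder(); both are assumptions for illustration.
public class EnvInjectingCustomizer implements ExecSpecCustomizer {

    @Override
    public ExecSpec customize(ExecSpec spec) {
        return spec.toBuilder()
                .env("BENCH_RUN", "true")
                .build();
    }
}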
6.2. MCP (Model Context Protocol) Integration
// Configure MCP tools
var mcpConfig = McpConfig.of("brave", "filesystem", "github");
var spec = ExecSpec.builder()
        .command("claude-code", "agent.py")
        .mcp(mcpConfig) // Automatically adds --tools=brave,filesystem,github
        .build();
6.3. Workspace Management
// Automatic repository cloning and cleanup
try (var manager = new RepoWorkspaceManager(github)) {
    var repoSpec = new RepoSpec("owner", "repo", "main");
    var workspace = manager.checkout(repoSpec, Duration.ofMinutes(5));

    // Use workspace for agent execution
    var sandbox = LocalSandbox.builder()
            .workingDirectory(workspace.repoDir())
            .build();
}
7. Testing Strategy
Spring AI Bench uses a comprehensive testing approach:
- Unit Tests - Individual component testing
- Integration Tests - Real process execution validation (live-agent variants are gated on API keys; see the sketch after this list)
- E2E Tests - Complete benchmark workflow testing
- Smoke Tests - Basic functionality validation
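Because the live-agent tests need real credentials, they are the natural place for environment-based gating. One idiomatic JUnit 5 pattern for this, shown as a sketch rather than the project's actual wiring:

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;

// Sketch: skip the live-agent test unless an API key is present, so plain
// ./mvnw test runs stay green without credentials. Class and method names
// are illustrative, not the project's actual tests.
class ClaudeAgentLiveTestSketch {

    @Test
    @EnabledIfEnvironmentVariable(named = "ANTHROPIC_API_KEY", matches = ".+")
    void runsHelloWorldBenchmark() {
        // ... drive BenchHarness and assert on BenchResult.success() ...
    }
}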
8. Recent Updates
8.1. Current Implementation (September 2024)
- ✅ Spring AI Agents Integration - Seamless JBang launcher pattern
- ✅ Multi-agent comparison - Deterministic vs AI-powered agents
- ✅ Improved timeout handling - Automatic process destruction and better error messages
- ✅ Enhanced platform compatibility - Windows/Unix shell command abstraction
- ✅ Comprehensive reporting - HTML and JSON reports with timing data
8.2. Key Improvements
- Cleaner Process Execution - Robust process management with proper cleanup
- Better Error Handling - Detailed error messages with command output
- Robust Timeout Management - Built-in process cleanup and timeout exceptions
- Future-Ready Logging - Native SLF4J integration for observability
9. Integration Patterns
9.1. HelloWorld vs HelloWorldAI Distinction
- hello-world: Deterministic mock agent (no AI, ~100ms)
- hello-world-ai: AI-powered agent via Spring AI Agents JBang integration
  - Uses the provider parameter to select claude or gemini
  - Performance varies: Gemini ~5s, Claude ~18s+
9.2. JBang Integration Architecture
// Pattern: spring-ai-bench → JBang → spring-ai-agents → AI provider
List<String> command = Arrays.asList(
        "jbang",
        "/absolute/path/to/spring-ai-agents/jbang/launcher.java",
        "hello-world-agent-ai",
        "path=hello.txt",
        "content=Hello World!",
        "provider=" + provider // "claude" or "gemini"
);
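That command list can then be executed through the sandbox layer from section 2, for example as sketched below (reusing the ExecSpec/LocalSandbox API shown earlier; the timeout value is illustrative):

// Sketch: run the JBang launcher through LocalSandbox, reusing the API
// from section 2. Assumes `command` is the List<String> built above.
try (var sandbox = LocalSandbox.builder().build()) {
    var spec = ExecSpec.builder()
            .command(command.toArray(new String[0]))
            .timeout(Duration.ofMinutes(5))
            .build();
    var result = sandbox.exec(spec);
    System.out.println("Agent exit code: " + result.exitCode());
}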
9.3. Multi-Agent Test Implementation
Comparative Benchmarking: The HelloWorldMultiAgentTest runs identical tasks across:
- hello-world (deterministic baseline)
- hello-world-ai with provider=gemini
- hello-world-ai with provider=claude
Performance Characteristics Observed (a timing-loop sketch follows this list):
- Deterministic: ~114ms (baseline)
- Gemini AI: ~5,600ms
- Claude AI: ~18,000ms
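At its core, such a comparison just times the same bench case once per agent. Sketched below with a hypothetical runAgent helper standing in for the real harness call:

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of a comparative timing loop in the spirit of
// HelloWorldMultiAgentTest; runAgent(...) is a hypothetical stand-in for
// driving BenchHarness or the JBang launcher.
class MultiAgentTimingSketch {

    void compareAgents() {
        for (String agent : List.of("hello-world", "hello-world-ai (gemini)", "hello-world-ai (claude)")) {
            Instant start = Instant.now();
            runAgent(agent); // hypothetical: run the same bench case with this agent
            Duration elapsed = Duration.between(start, Instant.now());
            System.out.printf("%-26s %6d ms%n", agent, elapsed.toMillis());
        }
    }

    private void runAgent(String agent) {
        // stand-in for the real harness invocation
    }
}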
9.4. Report Generation
Multi-Format Output: Each agent run generates:
- Console summary with performance ratios
- Individual HTML reports in /tmp/bench-reports/<uuid>/
- Aggregated dashboard via jbang jbang/site.java
Site Generation Best Practice:
# 1. Run multi-agent test
ANTHROPIC_API_KEY=key GEMINI_API_KEY=key ./mvnw test -Dtest=HelloWorldMultiAgentTest -pl bench-agents
# 2. Generate comprehensive site
jbang jbang/site.java --reportsDir /tmp/bench-reports --siteDir /tmp/bench-site
# 3. View results
open file:///tmp/bench-reports/index.html # Better formatted table
open file:///tmp/bench-site/index.html # Aggregated view