Architecture

Spring AI Bench is an execution framework that runs AI agents in isolated sandboxes and benchmarks the results. It supports runtime customization of commands and basic reporting; monitoring and observability are listed under future development (section 9.3).

1. System Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Spring AI Bench                         │
├─────────────────┬───────────────────┬──────────────────────────┤
│   bench-core    │   bench-agents    │       bench-app          │
│                 │                   │                          │
│ • Execution     │ • Agent Runners   │ • CLI Interface          │
│ • Sandboxes     │ • Integration     │ • Report Viewing         │
│ • Verification  │ • Auto-Config     │ • Batch Processing       │
│ • Specifications│ • Spring AI       │ • Site Generation        │
│                 │   Agents          │                          │
└─────────────────┴───────────────────┴──────────────────────────┘

2. Core Architecture

2.1. Execution Framework

The system is built around a Sandbox abstraction that provides isolated execution environments:

Sandbox (interface)
├── LocalSandbox - Process exec implementation
├── DockerSandbox - TestContainers implementation
└── [Future: CloudSandbox - Distributed execution]

Key Components:

ExecSpec - Command specification with timeout, environment variables, MCP config
ExecResult - Execution results with exit codes, logs, duration
TimeoutException - Timeout handling for long-running processes
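
The relationship between these types can be sketched in a few lines of Java. This is a minimal sketch, not the bench-core source: the field lists, the 30-second default timeout, the unchecked TimeoutException, and the EchoSandbox stand-in are assumptions made for illustration. Only ExecSpec.of(...), exitCode(), and mergedLog() appear elsewhere in this document.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;

// Hypothetical shapes; the real bench-core types may expose different members.
record ExecSpec(List<String> command, Map<String, String> env, Duration timeout) {
    // Convenience factory: command only, no extra environment, assumed default timeout.
    static ExecSpec of(String... command) {
        return new ExecSpec(List.of(command), Map.of(), Duration.ofSeconds(30));
    }
}

// Exit code, combined stdout/stderr, and wall-clock duration of one execution.
record ExecResult(int exitCode, String mergedLog, Duration duration) { }

// Assumed to be unchecked here; a backend raises it when a command exceeds its timeout.
class TimeoutException extends RuntimeException {
    TimeoutException(String message) { super(message); }
}

// One contract, several backends (local process, Docker, possibly cloud later).
interface Sandbox extends AutoCloseable {
    ExecResult exec(ExecSpec spec);

    @Override
    void close();
}

// Stand-in backend that runs nothing and reports the command line as its log.
final class EchoSandbox implements Sandbox {
    @Override
    public ExecResult exec(ExecSpec spec) {
        return new ExecResult(0, String.join(" ", spec.command()), Duration.ZERO);
    }

    @Override
    public void close() { }
}
```

Because callers depend only on the Sandbox interface, a test can substitute a stand-in like EchoSandbox for a real backend without touching the calling code.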

2.2. Execution Backends

2.2.1. Local Process Execution (`LocalSandbox`)

Purpose: Execute commands in local processes within isolated directories
Security: Directory isolation only - commands run with the same operating-system privileges as the JVM process
Features:
- Customizable working directories
- Environment variable support
- Timeout handling
- MCP (Model Context Protocol) integration
- Automatic cleanup of temporary directories
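
One way to provide a working directory, environment variables, and a timeout with the JDK's ProcessBuilder is sketched below. LocalExecSketch, Result, and javaVersion() are hypothetical names, and the real LocalSandbox may work differently (for example, it raises its own TimeoutException, where this sketch uses IllegalStateException). Output is redirected to a temporary file rather than read from a pipe, so a chatty process cannot block on a full pipe buffer while the caller waits for the timeout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

final class LocalExecSketch {

    record Result(int exitCode, String mergedLog) { }

    // Runs a command in workDir with extra environment variables and a hard timeout.
    static Result run(Path workDir, List<String> command, Map<String, String> env, Duration timeout) {
        try {
            Path log = Files.createTempFile("exec-", ".log");
            try {
                ProcessBuilder pb = new ProcessBuilder(command)
                        .directory(workDir.toFile())
                        .redirectErrorStream(true)      // stderr joins stdout
                        .redirectOutput(log.toFile());  // both land in one file
                pb.environment().putAll(env);
                Process process = pb.start();
                if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                    process.destroyForcibly();
                    throw new IllegalStateException("Timed out after " + timeout);
                }
                return new Result(process.exitValue(), Files.readString(log));
            } finally {
                Files.deleteIfExists(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for process", e);
        }
    }

    // Usage: run `java -version` with the launcher of the current JVM installation.
    static Result javaVersion() {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        Path workDir = Path.of(System.getProperty("java.io.tmpdir"));
        return run(workDir, List.of(javaBin, "-version"), Map.of("DEMO_VAR", "1"), Duration.ofSeconds(30));
    }
}
```

Note that nothing here restricts what the child process can touch: it inherits the privileges of the JVM, which is the limitation the Security line above describes.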

2.2.2. Docker/TestContainers (`DockerSandbox`)

Purpose: Execute commands in Docker containers for strong isolation
Features:
- Uses TestContainers library (v1.21.0)
- Long-lived containers with "sleep infinity" pattern
- Multiple command executions within same container environment
- Automatic container lifecycle management
- Working directory: /work

2.2.3. Future: Distributed/Cloud Implementation

Not yet implemented. The Spring Cloud Deployer SPI integration (section 3.1) lays the groundwork for distributed task execution and a future cloud-based sandbox.

2.3. Customization Framework

The ExecSpecCustomizer pattern allows execution specifications to be modified at runtime:

ExecSpecCustomizer (interface) - Base customization contract
ClaudeCliCustomizer - Specialized for Claude CLI integration
- Automatically injects MCP tools via --tools flag
- Transforms: ["claude-cli", "agent.py"] → ["claude-cli", "agent.py", "--tools=brave,filesystem"]
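
The transformation above can be reproduced with a small sketch. It is simplified to operate on plain command lists instead of ExecSpec objects, and the names CommandCustomizer and ClaudeToolsCustomizer are hypothetical, so it should be read as an illustration of the pattern rather than the actual ExecSpecCustomizer contract.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal contract: take a command line, return a possibly modified copy.
@FunctionalInterface
interface CommandCustomizer {
    List<String> customize(List<String> command);

    // Customizers compose left to right.
    default CommandCustomizer andThen(CommandCustomizer next) {
        return command -> next.customize(customize(command));
    }
}

// Appends a --tools flag to claude-cli invocations and leaves other commands alone.
final class ClaudeToolsCustomizer implements CommandCustomizer {
    private final List<String> tools;

    ClaudeToolsCustomizer(List<String> tools) {
        this.tools = List.copyOf(tools);
    }

    @Override
    public List<String> customize(List<String> command) {
        if (tools.isEmpty() || command.isEmpty() || !command.get(0).equals("claude-cli")) {
            return command;
        }
        List<String> customized = new ArrayList<>(command);
        customized.add("--tools=" + String.join(",", tools));
        return List.copyOf(customized);
    }
}
```

Returning a new list instead of mutating the input keeps each customizer easy to test in isolation and safe to chain with andThen.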

2.4. Agent Integration

2.4.1. Agent Implementations

Spring AI Bench currently supports:

hello-world: Deterministic mock agent for infrastructure testing
hello-world-ai: AI-powered agent via Spring AI Agents integration
- Claude provider support
- Gemini provider support
- JBang launcher pattern

2.4.2. Spring AI Agents Integration

The integration with Spring AI Agents follows this pattern:

# spring-ai-bench → JBang → spring-ai-agents → AI provider
jbang /path/to/spring-ai-agents/jbang/launcher.java \
  hello-world-agent-ai \
  path=hello.txt \
  content="Hello World!" \
  provider=claude

Because the benchmark invokes the same CLI that end users run, a passing benchmark is direct evidence that the end-user experience works.

2.5. Benchmarking System

2.5.1. Benchmark Specifications

BenchSpec - Top-level benchmark specification
BenchCase - Individual benchmark case with:
- ID, category ("coding", "project-mgmt", "version-upgrade")
- Repository specification (RepoSpec)
- Agent specification (AgentSpec)
- Success criteria (SuccessSpec)
- Timeout configuration

2.5.2. Agent Support

AgentSpec supports multiple agent types:

"hello-world" - Deterministic mock agent
"hello-world-ai" - AI-powered agent via Spring AI Agents
Configurable models, prompts, generation parameters

2.5.3. Execution Harness

BenchHarness - End-to-end benchmark execution
AgentRunner - Agent execution interface
HelloWorldAgentRunner - Deterministic implementation
HelloWorldAIAgentRunner - AI-powered implementation
SuccessVerifier - Validation of benchmark results (temporary implementation that is evolving into the judge concept in spring-ai-agents)
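
How these pieces cooperate for the deterministic hello-world case can be sketched as follows. The interfaces are deliberately reduced to a single workspace path, and every name ending in Sketch is hypothetical; the real AgentRunner and SuccessVerifier take richer specification objects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

// Hypothetical, simplified contracts.
interface AgentRunnerSketch {
    void run(Path workspace);
}

interface SuccessVerifierSketch {
    boolean verify(Path workspace);
}

final class HarnessSketch {

    // Deterministic "hello-world" agent: writes a fixed file, no model involved.
    static AgentRunnerSketch helloWorldAgent(String fileName, String content) {
        return workspace -> {
            try {
                Files.writeString(workspace.resolve(fileName), content);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Verifier: the file must exist and hold exactly the expected content.
    static SuccessVerifierSketch fileEquals(String fileName, String expected) {
        return workspace -> {
            Path file = workspace.resolve(fileName);
            try {
                return Files.isRegularFile(file) && Files.readString(file).equals(expected);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Harness: create a workspace, run the agent, verify, then remove the workspace.
    static boolean runCase(AgentRunnerSketch agent, SuccessVerifierSketch verifier) {
        try {
            Path workspace = Files.createTempDirectory("bench-case-");
            try {
                agent.run(workspace);
                return verifier.verify(workspace);
            } finally {
                try (var paths = Files.walk(workspace)) {
                    paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the verifier separate from the runner means the same success check applies whether the file was written by the deterministic agent or by an AI-powered one.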

2.6. Repository & Workspace Management

RepoWorkspaceManager - GitHub repository operations
Workspace - Isolated workspace for agent execution
Automatic repository cloning and cleanup
GitHub API integration for repository access

3. Integration Components

3.1. Spring Cloud Deployer

SPI Integration: spring-cloud-deployer-spi (v2.9.5)
Local Implementation: spring-cloud-deployer-local (v2.9.5)
Purpose: Process management and distributed task execution
Usage: LocalTaskLauncher for process orchestration

3.2. Model Context Protocol (MCP)

McpConfig - Configuration for MCP tool integration
Environment variable injection: MCP_TOOLS
Integration with Claude CLI through customizers

3.3. TestContainers

Version: 1.21.0
Purpose: Docker-based sandbox isolation
Features: Container lifecycle management, port mapping, volume mounts

4. Key Design Decisions

4.1. Sandbox Abstraction

Rationale: Support multiple execution environments (local, Docker, future cloud)
Pattern: Interface-based design for extensibility
Trade-offs: Abstraction overhead vs. flexibility

4.2. Merged Log Output

Design: ExecResult combines stdout/stderr into mergedLog
Rationale: Optimized for AI analysis - preserves temporal ordering
Use Case: LLMs can analyze execution logs in chronological order

4.3. Customizer Pattern

Purpose: Last-mile command/environment customization
Benefits: Flexible, composable, testable
Example: Claude CLI tool injection without hardcoding

4.4. Resource Management

AutoCloseable: All sandboxes implement proper cleanup
Try-with-resources: Workspace management ensures cleanup
Timeout Handling: Prevents runaway processes
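
The AutoCloseable and try-with-resources combination can be illustrated with a temporary workspace that deletes itself on close. TempWorkspace and its methods are hypothetical names chosen for this sketch; the project's Workspace and sandbox classes may manage cleanup differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical name; illustrates the cleanup contract only.
final class TempWorkspace implements AutoCloseable {
    private final Path root;

    TempWorkspace() {
        try {
            this.root = Files.createTempDirectory("bench-ws-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path root() {
        return root;
    }

    // Writes a file relative to the workspace root.
    Path write(String name, String content) {
        try {
            return Files.writeString(root.resolve(name), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the directory tree, children before parents.
    @Override
    public void close() {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage: true if the workspace existed inside the block and is gone afterwards.
    static boolean demo() {
        TempWorkspace workspace = new TempWorkspace();
        try (workspace) {
            workspace.write("scratch.txt", "data");
            if (!Files.isDirectory(workspace.root())) {
                return false;
            }
        }
        return Files.notExists(workspace.root());
    }
}
```

Since close() runs even when the block exits with an exception, a failing agent run does not leave its workspace behind.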

5. Module Structure

spring-ai-bench/
├── bench-core/          # Core execution framework
│   ├── exec/            # Execution system (Sandbox, ExecSpec, etc.)
│   ├── spec/            # Benchmark specifications
│   ├── repo/            # Repository & workspace management
│   ├── run/             # Benchmark harness & execution
│   └── io/              # Configuration loading
├── bench-agents/        # Agent integration layer
│   ├── runner/          # Agent runners (Claude, Gemini, HelloWorld)
│   └── integration/     # Spring Boot auto-configuration
├── bench-app/           # Application CLI
├── bench-site/          # Static site generation
└── bench-tracks/        # Benchmark track definitions
    └── hello-world/     # Hello world track (current)

6. Dependencies & Technology Stack

6.1. Core Dependencies

Spring Framework: Core dependency injection and configuration
Spring Cloud Deployer: Distributed process management
Jackson: YAML/JSON configuration handling
TestContainers: Docker sandbox implementation
GitHub API: Repository operations
SLF4J: Logging framework

6.2. Build System

Maven: Build system with multi-module structure
Java 17+: Target runtime
Surefire: Test execution

7. Development Timeline

Implemented as of September 2024:

Complete execution framework with sandbox isolation
Spring AI Agents integration via JBang launcher
Agent implementations (hello-world deterministic and AI-powered)
Basic reporting and HTML generation
Docker and local sandbox support

8. Testing Strategy

Unit Tests: Individual component testing
Integration Tests: End-to-end sandbox execution
Smoke Tests: Basic functionality validation
E2E Tests: Complete benchmark execution flows

9. Future Development Areas

9.1. Cloud Implementation

Cloud-based sandbox implementations
Auto-scaling execution clusters
Distributed benchmark orchestration
Cost optimization strategies

9.2. Enhanced Agent Support

Additional agent integrations beyond current implementations
Agent-specific optimizations and customizations
Multi-agent benchmark scenarios

9.3. Monitoring & Observability

Execution metrics collection
Performance monitoring dashboards
Resource utilization tracking
Benchmark result analytics

9.4. Security Enhancements

Improved sandbox isolation
Resource limits and quotas
Security scanning integration
Audit logging

10. Getting Started

10.1. Prerequisites

Java 17+
Docker (for DockerSandbox)
Maven 3.6+
GitHub access token (for repository operations)

10.2. Basic Usage

// Local execution
try (var sandbox = LocalSandbox.builder().build()) {
    var spec = ExecSpec.of("echo", "Hello World");
    var result = sandbox.exec(spec);
    System.out.println("Exit code: " + result.exitCode());
    System.out.println("Output: " + result.mergedLog());
}

// Docker execution
try (var sandbox = new DockerSandbox("openjdk:17-jdk")) {
    var spec = ExecSpec.of("java", "-version");
    var result = sandbox.exec(spec);
    System.out.println("Java version: " + result.mergedLog());
}

This document reflects the current architecture as of September 2024.