Architecture

Spring AI Bench is an execution framework for running AI agents in isolated environments, with support for benchmarking, customization, and monitoring.

1. System Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Spring AI Bench                         │
├─────────────────┬───────────────────┬──────────────────────────┤
│   bench-core    │   bench-agents    │       bench-app          │
│                 │                   │                          │
│ • Execution     │ • Agent Runners   │ • CLI Interface          │
│ • Sandboxes     │ • Integration     │ • Report Viewing         │
│ • Verification  │ • Auto-Config     │ • Batch Processing       │
│ • Specifications│ • Spring AI       │ • Site Generation        │
│                 │   Agents          │                          │
└─────────────────┴───────────────────┴──────────────────────────┘

2. Core Architecture

2.1. Execution Framework

The system is built around a Sandbox abstraction that provides isolated execution environments:

Sandbox (interface)
├── LocalSandbox - Process exec implementation
├── DockerSandbox - TestContainers implementation
└── [Future: CloudSandbox - Distributed execution]

Key Components:

  • ExecSpec - Command specification with timeout, environment variables, MCP config

  • ExecResult - Execution results with exit codes, logs, duration

  • TimeoutException - Timeout handling for long-running processes
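
To make the relationship between these components concrete, here is a minimal sketch of how they could fit together. The accessor names exitCode() and mergedLog() and the factory ExecSpec.of(...) match the usage shown in section 10.2; everything else (field names, the 30-second default timeout, the success() helper, the omission of MCP config) is an assumption for illustration, not the project's actual source.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;

// Hypothetical shapes inferred from the component list; not the project's actual source.
final class SandboxTypesSketch {

    /** Command plus the settings the text lists: environment variables and a timeout. */
    record ExecSpec(List<String> command, Map<String, String> env, Duration timeout) {
        ExecSpec {
            command = List.copyOf(command); // defensive, immutable copies
            env = Map.copyOf(env);
        }

        // Convenience factory; the 30-second default timeout is an assumption.
        static ExecSpec of(String... command) {
            return new ExecSpec(List.of(command), Map.of(), Duration.ofSeconds(30));
        }
    }

    /** Outcome of one execution: exit code, merged log, wall-clock duration. */
    record ExecResult(int exitCode, String mergedLog, Duration duration) {
        // Hypothetical helper: treats exit code 0 as success.
        boolean success() {
            return exitCode == 0;
        }
    }

    /** An isolated execution environment that must be closed after use. */
    interface Sandbox extends AutoCloseable {
        ExecResult exec(ExecSpec spec) throws Exception;
    }
}
```

Keeping the specification and the result as immutable value types means the same ExecSpec can be handed to any Sandbox implementation without one run affecting the next.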

2.2. Execution Backends

2.2.1. Local Process Execution (LocalSandbox)

  • Purpose: Execute commands in local processes within isolated directories

  • Security: Directory isolation only - commands run with the same OS privileges as the JVM process

  • Features:

    • Customizable working directories

    • Environment variable support

    • Timeout handling

    • MCP (Model Context Protocol) integration

    • Automatic cleanup of temporary directories
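
The features above map closely onto the JDK's ProcessBuilder API. The following is a rough sketch of that mapping, not LocalSandbox itself: the class and method names are hypothetical, output is discarded for brevity, and java.util.concurrent.TimeoutException stands in for the project's own TimeoutException.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of local process execution; not the LocalSandbox source.
final class LocalExecSketch {

    /** Runs a command in workDir with extra environment variables and returns the exit code. */
    static int run(List<String> command, Path workDir, Map<String, String> env, Duration timeout)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(workDir.toFile());   // directory isolation only, no privilege drop
        pb.environment().putAll(env);     // added on top of the inherited environment
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD); // nothing to read, so no pipe can fill up

        Process process = pb.start();
        if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            process.destroyForcibly();    // stop the runaway process
            throw new TimeoutException("Command exceeded " + timeout);
        }
        return process.exitValue();
    }
}
```

Note that the child process can still read and write anywhere the JVM user can; the working directory only sets where relative paths resolve.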

2.2.2. Docker/TestContainers (DockerSandbox)

  • Purpose: Execute commands in Docker containers for strong isolation

  • Features:

    • Uses TestContainers library (v1.21.0)

    • Long-lived containers with "sleep infinity" pattern

    • Multiple command executions within same container environment

    • Automatic container lifecycle management

    • Working directory: /work

2.2.3. Future: Distributed/Cloud Implementation

  • Foundation laid with Spring Cloud Deployer SPI integration

  • Support for distributed task execution

  • Prepared for cloud-based distributed architecture

2.3. Customization Framework

ExecSpecCustomizer Pattern allows runtime modification of execution specifications:

  • ExecSpecCustomizer (interface) - Base customization contract

  • ClaudeCliCustomizer - Specialized for Claude CLI integration

    • Automatically injects MCP tools via --tools flag

    • Transforms: ["claude-cli", "agent.py"] → ["claude-cli", "agent.py", "--tools=brave,filesystem"]
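
The transformation can be sketched as a small function over the command list. This is an illustration under assumptions: the real ExecSpecCustomizer operates on an ExecSpec, whereas the hypothetical CommandCustomizer below is reduced to the command alone, and the rule "only touch commands that start with claude-cli" is a guess at the intended behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the customizer idea; not the project's ExecSpecCustomizer.
final class CustomizerSketch {

    /** Stand-in for a customizer, reduced to a function over the command list. */
    interface CommandCustomizer extends UnaryOperator<List<String>> {}

    /** Appends --tools=<a,b,...> when the command targets claude-cli. */
    static CommandCustomizer claudeTools(List<String> tools) {
        return command -> {
            if (tools.isEmpty() || command.isEmpty() || !command.get(0).equals("claude-cli")) {
                return command; // leave unrelated commands untouched
            }
            List<String> out = new ArrayList<>(command);
            out.add("--tools=" + String.join(",", tools));
            return List.copyOf(out);
        };
    }
}
```

Because each customizer returns a new list instead of mutating its input, several of them can be chained (for example with andThen) and tested in isolation.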

2.4. Agent Integration

2.4.1. Agent Implementations

Spring AI Bench currently supports:

  • hello-world: Deterministic mock agent for infrastructure testing

  • hello-world-ai: AI-powered agent via Spring AI Agents integration

    • Claude provider support

    • Gemini provider support

    • JBang launcher pattern

2.4.2. Spring AI Agents Integration

The integration with Spring AI Agents follows this pattern:

# spring-ai-bench → JBang → spring-ai-agents → AI provider
jbang /path/to/spring-ai-agents/jbang/launcher.java \
  hello-world-agent-ai \
  path=hello.txt \
  content="Hello World!" \
  provider=claude

Because the benchmark invokes the same CLI that end users would run, a passing benchmark is direct evidence that the end-user path works.

2.5. Benchmarking System

2.5.1. Benchmark Specifications

  • BenchSpec - Top-level benchmark specification

  • BenchCase - Individual benchmark case with:

    • ID, category ("coding", "project-mgmt", "version-upgrade")

    • Repository specification (RepoSpec)

    • Agent specification (AgentSpec)

    • Success criteria (SuccessSpec)

    • Timeout configuration

2.5.2. Agent Support

AgentSpec supports multiple agent types:

  • "hello-world" - Deterministic mock agent

  • "hello-world-ai" - AI-powered agent via Spring AI Agents

  • Configurable models, prompts, generation parameters

2.5.3. Execution Harness

  • BenchHarness - End-to-end benchmark execution

  • AgentRunner - Agent execution interface

  • HelloWorldAgentRunner - Deterministic implementation

  • HelloWorldAIAgentRunner - AI-powered implementation

  • SuccessVerifier - Validation of benchmark results (temporary implementation, evolving into the judge concept in spring-ai-agents)
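
The division of labor between runner and verifier can be shown in miniature. The interface signatures below are assumptions chosen for brevity (the real AgentRunner and SuccessVerifier contracts are not given in this document); the file name and content mirror the hello-world example from section 2.4.2.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the run-then-verify flow; not the project's harness code.
final class HarnessSketch {

    /** Assumed runner contract: perform the task inside the workspace. */
    interface AgentRunner {
        void run(Path workspace) throws IOException;
    }

    /** Assumed verifier contract: inspect the workspace afterwards. */
    interface SuccessVerifier {
        boolean verify(Path workspace) throws IOException;
    }

    /** Deterministic agent: writes a fixed file, no model call involved. */
    static AgentRunner helloWorldRunner(String fileName, String content) {
        return workspace -> Files.writeString(workspace.resolve(fileName), content);
    }

    /** Passes only if the file exists with exactly the expected content. */
    static SuccessVerifier fileEquals(String fileName, String expected) {
        return workspace -> {
            Path file = workspace.resolve(fileName);
            return Files.isRegularFile(file) && Files.readString(file).equals(expected);
        };
    }

    /** Runs the agent, then verifies the result. */
    static boolean runCase(AgentRunner runner, SuccessVerifier verifier, Path workspace)
            throws IOException {
        runner.run(workspace);
        return verifier.verify(workspace);
    }
}
```

A deterministic runner like this lets the harness, workspace handling, and verification be tested without any AI provider; swapping in an AI-powered runner then changes only the first argument to runCase.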

2.6. Repository & Workspace Management

  • RepoWorkspaceManager - GitHub repository operations

  • Workspace - Isolated workspace for agent execution

  • Automatic repository cloning and cleanup

  • GitHub API integration for repository access

3. Integration Components

3.1. Spring Cloud Deployer

  • SPI Integration: spring-cloud-deployer-spi (v2.9.5)

  • Local Implementation: spring-cloud-deployer-local (v2.9.5)

  • Purpose: Process management and distributed task execution

  • Usage: LocalTaskLauncher for process orchestration

3.2. Model Context Protocol (MCP)

  • McpConfig - Configuration for MCP tool integration

  • Environment variable injection: MCP_TOOLS

  • Integration with Claude CLI through customizers

3.3. TestContainers

  • Version: 1.21.0

  • Purpose: Docker-based sandbox isolation

  • Features: Container lifecycle management, port mapping, volume mounts

4. Key Design Decisions

4.1. Sandbox Abstraction

  • Rationale: Support multiple execution environments (local, Docker, future cloud)

  • Pattern: Interface-based design for extensibility

  • Trade-offs: Abstraction overhead vs. flexibility

4.2. Merged Log Output

  • Design: ExecResult combines stdout/stderr into mergedLog

  • Rationale: Optimized for AI analysis - preserves temporal ordering

  • Use Case: LLMs can analyze execution logs in chronological order
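
One way to obtain a merged, chronologically ordered log with the plain JDK is ProcessBuilder.redirectErrorStream(true), which makes the child write stderr into the same pipe as stdout. The sketch below shows the technique only; whether ExecResult is populated this way is an assumption, and the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration of merging stdout and stderr; not the project's ExecResult code.
final class MergedLogSketch {

    /** Runs the command and returns stdout and stderr as one interleaved string. */
    static String mergedLog(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // stderr is written into the stdout pipe
                .start();
        // Read to end-of-stream before waiting, so a chatty process cannot block on a full pipe.
        String log = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return log;
    }
}
```

Reading the two streams separately and concatenating them afterwards would lose the interleaving, which is exactly the information an LLM needs to relate an error message to the step that produced it.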

4.3. Customizer Pattern

  • Purpose: Last-mile command/environment customization

  • Benefits: Flexible, composable, testable

  • Example: Claude CLI tool injection without hardcoding

4.4. Resource Management

  • AutoCloseable: All sandboxes implement proper cleanup

  • Try-with-resources: Workspace management ensures cleanup

  • Timeout Handling: Prevents runaway processes
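
The cleanup guarantee rests on the standard AutoCloseable and try-with-resources mechanism. A minimal sketch of a self-deleting workspace follows; the class name and the temporary-directory prefix are hypothetical, and the project's Workspace type may do more (for example, cloning a repository first).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration of AutoCloseable cleanup; not the project's Workspace class.
final class TempWorkspaceSketch implements AutoCloseable {

    private final Path root;

    TempWorkspaceSketch() throws IOException {
        this.root = Files.createTempDirectory("bench-workspace-");
    }

    Path root() {
        return root;
    }

    /** Deletes the directory tree, children before parents. */
    @Override
    public void close() throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse path order visits each directory after its contents.
            for (Path p : paths.sorted(Comparator.reverseOrder()).toList()) {
                Files.deleteIfExists(p);
            }
        }
    }
}
```

Used as try (var ws = new TempWorkspaceSketch()) { ... }, the directory is removed when the block exits, whether normally or through an exception.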

5. Module Structure

spring-ai-bench/
├── bench-core/           # Core execution framework
│   ├── exec/            # Execution system (Sandbox, ExecSpec, etc.)
│   ├── spec/            # Benchmark specifications
│   ├── repo/            # Repository & workspace management
│   ├── run/             # Benchmark harness & execution
│   └── io/              # Configuration loading
├── bench-agents/         # Agent integration layer
│   ├── runner/          # Agent runners (Claude, Gemini, HelloWorld)
│   └── integration/     # Spring Boot auto-configuration
├── bench-app/           # Application CLI
├── bench-site/          # Static site generation
└── bench-tracks/        # Benchmark track definitions
    └── hello-world/     # Hello world track (current)

6. Dependencies & Technology Stack

6.1. Core Dependencies

  • Spring Framework: Core dependency injection and configuration

  • Spring Cloud Deployer: Distributed process management

  • Jackson: YAML/JSON configuration handling

  • TestContainers: Docker sandbox implementation

  • GitHub API: Repository operations

  • SLF4J: Logging framework

6.2. Build System

  • Maven: Build system with multi-module structure

  • Java 17+: Target runtime

  • Surefire: Test execution

7. Development Timeline

September 2024 Implementation:

  • Complete execution framework with sandbox isolation

  • Spring AI Agents integration via JBang launcher

  • Agent implementations (hello-world deterministic and AI-powered)

  • Basic reporting and HTML generation

  • Docker and local sandbox support

8. Testing Strategy

  • Unit Tests: Individual component testing

  • Integration Tests: End-to-end sandbox execution

  • Smoke Tests: Basic functionality validation

  • E2E Tests: Complete benchmark execution flows

9. Future Development Areas

9.1. Cloud Implementation

  • Cloud-based sandbox implementations

  • Auto-scaling execution clusters

  • Distributed benchmark orchestration

  • Cost optimization strategies

9.2. Enhanced Agent Support

  • Additional agent integrations beyond current implementations

  • Agent-specific optimizations and customizations

  • Multi-agent benchmark scenarios

9.3. Monitoring & Observability

  • Execution metrics collection

  • Performance monitoring dashboards

  • Resource utilization tracking

  • Benchmark result analytics

9.4. Security Enhancements

  • Improved sandbox isolation

  • Resource limits and quotas

  • Security scanning integration

  • Audit logging

10. Getting Started

10.1. Prerequisites

  • Java 17+

  • Docker (for DockerSandbox)

  • Maven 3.6+

  • GitHub access token (for repository operations)

10.2. Basic Usage

// Local execution
try (var sandbox = LocalSandbox.builder().build()) {
    var spec = ExecSpec.of("echo", "Hello World");
    var result = sandbox.exec(spec);
    System.out.println("Exit code: " + result.exitCode());
    System.out.println("Output: " + result.mergedLog());
}

// Docker execution
try (var sandbox = new DockerSandbox("openjdk:17-jdk")) {
    var spec = ExecSpec.of("java", "-version");
    var result = sandbox.exec(spec);
    System.out.println("Java version: " + result.mergedLog());
}

This document reflects the current architecture as of September 2024.