Spring AI Bench

Open benchmarking suite for Java-centric AI developer agents.

1. What & Why

The Problem: Existing benchmarks (SWE-bench) measure:

  • Yesterday’s agents: Academic scaffolding (SWE-agent: ~11,300 lines)

  • Yesterday’s tasks: Static 2023 Python patches

  • One architecture: Can’t measure Claude, Gemini, Amazon Q (the agents teams actually use)

Spring AI Bench measures:

  • Modern agents: Claude, Gemini, Amazon Q, Amp—any agent via AgentModel abstraction

  • Enterprise Java workflows: Issue triage, PR review, coverage, compliance—not just patches

  • Your code: Run benchmarks on YOUR repos, measure YOUR scenarios

If agents have evolved, benchmarks must evolve too.

2. Why SWE-bench Falls Short: At a Glance

| Issue | SWE-bench Problem | Spring AI Bench Solution |
|---|---|---|
| Scope | Patch loops only (bug fixes) | Full dev lifecycle (triage, PR review, coverage, compliance, deps, migrations) |
| Contamination | 60%+ on Verified → ~19% on Live (top scores drop roughly 3x on fresh tasks) | Golden benchmark set (curated) + the same benchmarks on YOUR repos |
| Language Bias | Python-only: ~75% top scores; Java: ~7-10% (an order-of-magnitude gap from training-data bias) | Java-first with Maven, Spring Boot, complex dependency trees |
| Reproducibility | No disclosure required; leaderboard claims can't be verified | One-click Docker containers + open scaffolding + transparent judges |

Real code. Real runs. Real Java. Not static Python patches from 2023.

→ See detailed evidence & analysis below (Section 8)

3. Spring AI Bench: What You Get

Built with reproducibility as the north star (following BetterBench):

Multi-Agent Support:

  • Any agent: Claude Code, Gemini CLI, Amazon Q Developer, Amp, Codex, custom implementations

  • Not locked in: the AgentModel abstraction lets you measure the agents your team actually uses (see the sketch after this list)

  • Real-world: Measure production agents, not academic research artifacts
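
For illustration, here is a minimal sketch of what an agent-agnostic abstraction can look like in Java. The interface and type names are assumptions made for this sketch, not the published Spring AI Agents API; each real agent (Claude Code, Gemini CLI, Amazon Q, a custom CLI) would sit behind its own adapter.

```java
// Hypothetical sketch only: names and signatures are illustrative,
// not the actual Spring AI Agents API.
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;

public interface AgentModel {

    /** Run the agent against a workspace with a declarative goal. */
    AgentResult run(AgentGoal goal, Path workspace);

    /** A declarative goal: what to achieve, not which steps to take. */
    record AgentGoal(String instruction, Map<String, String> parameters) {}

    /** Outcome handed back to the harness for judging and reporting. */
    record AgentResult(boolean completed, String transcript, Duration elapsed) {}
}
```

With an abstraction of this shape, a Claude Code adapter, a Gemini adapter, and an in-house agent all plug into the same harness and are judged against the same criteria.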

Enterprise Java Workflows:

  • Full lifecycle: Issue triage → PR review → coverage uplift → compliance validation → dependency upgrades

  • Real complexity: Maven, JDK versions, Spring Boot, complex dependency trees

  • Your repos: Not static 2023 GitHub data—run on YOUR code

Modern Standards:

  • One-click reproducible: Docker containers with pre-configured environments (no "works on my machine")

  • Multi-dimensional: Success rate + cost + speed + reliability + quality, not just pass/fail (see the sketch after this list)

  • Open & transparent: GitHub repo, Apache 2.0 license, reproducible evaluation code

  • Judge framework: Sophisticated verification beyond "tests pass" (deterministic + AI-powered judges)
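
As a hedged sketch of what "multi-dimensional" could look like in practice, one run might be summarized by a record like the one below. The record and field names are illustrative assumptions, not a defined report schema.

```java
// Illustrative result shape: field names are assumptions, not a defined schema.
import java.time.Duration;

public class RunMetricsSketch {

    /** One benchmark run summarized along several dimensions, not just pass/fail. */
    record RunMetrics(
            String agent,          // e.g. "claude-code", "gemini-cli"
            String benchmark,      // e.g. "coverage-uplift"
            boolean succeeded,     // did the judge accept the outcome?
            double costUsd,        // API spend attributed to the run
            Duration wallClock,    // end-to-end time
            double reliability,    // success rate across repeated runs (0..1)
            double qualityScore) { // judge's quality assessment (0..1)
    }

    public static void main(String[] args) {
        RunMetrics run = new RunMetrics(
                "claude-code", "coverage-uplift", true, 1.42, Duration.ofMinutes(7), 0.9, 0.8);
        System.out.println(run);
    }
}
```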

Reproducibility First:

  • One-click setup: Docker containers, pre-configured environments

  • Open methodology: Published evaluation code, transparent scoring

  • Golden set + YOUR repos: Standardized comparison + real-world validation

  • BetterBench inspiration: Following Stanford’s quality framework principles (starting with reproducibility)

Run benchmarks on YOUR code. Measure YOUR scenarios. See which agent wins for YOU.

4. How Spring AI Bench Compares

| Dimension | SWE-bench (2023-2024) | Spring AI Bench (2025) |
|---|---|---|
| Scope | Patch loops for bug fixing | Full enterprise development lifecycle |
| Language | Python-only (12 repos) | Java-first (extensible to Kotlin, Groovy) |
| Agent Support | One architecture (SWE-agent: ~11,300 lines) | Any agent (Claude, Gemini, Q, custom via AgentModel) |
| Reproducibility | No disclosure required (can't verify claims) | One-click Docker + open scaffolding + transparent judges |
| Agent Paradigm | Built for 2024 patch-loop agents | Built for 2025 declarative goal agents |
| Standards | Pre-BetterBench | Following BetterBench principles (reproducibility first) |

Historical value: SWE-bench pioneered agent evaluation

Modern needs: Spring AI Bench measures what enterprise teams care about

5. Architecture Overview

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Agent Types   │    │  Execution Core  │    │   Sandboxes     │
├─────────────────┤    ├──────────────────┤    ├─────────────────┤
│ ✅ Claude Code  │────│ BenchHarness     │────│LocalSandbox     │
│ ✅ Gemini       │    │ AgentRunner      │    │DockerSandbox    │
│ ✅ Amazon Q     │    │ SpecLoader       │    │CloudSandbox     │
│ ✅ Amp          │    │ ReportGenerator  │    │   (Future)      │
│ ✅ Custom       │    │ Judge Framework  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Skills, not just tools: Benchmarks encode skills (context + actions + success criteria). Tools matter, but the plan and verification criteria matter just as much. Where possible, we align with Model Context Protocol (MCP) to keep tool use portable across agents.
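
As a rough, hedged illustration of how the components in the diagram above fit together, the self-contained sketch below runs a hello-world style task end to end: load a goal, let an agent act inside a workspace, then let a judge verify the outcome. All types and signatures here are assumptions for the sketch, not the project's actual API.

```java
// Illustrative wiring of the execution core: all names and signatures below
// are assumptions for this sketch, not the real Spring AI Bench API.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class BenchRunSketch {

    /** A benchmark spec: a declarative goal plus how success is judged. */
    record BenchSpec(String goal, Judge judge) {}

    /** Any agent: Claude Code, Gemini CLI, Amazon Q, or a custom one. */
    interface AgentModel {
        AgentResult run(String goal, Path workspace);
    }

    record AgentResult(boolean completed, Duration elapsed) {}

    /** A judge verifies the outcome, beyond "the agent says it is done". */
    interface Judge {
        boolean evaluate(Path workspace, AgentResult result);
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("bench-workspace");

        // Hello-world style spec: create a file; judge by checking it exists.
        BenchSpec spec = new BenchSpec(
                "Create hello.txt containing the text 'Hello, bench!'",
                (ws, result) -> result.completed() && Files.exists(ws.resolve("hello.txt")));

        // Trivial stand-in agent; a real adapter would shell out to Claude Code, Gemini, etc.
        AgentModel agent = (goal, ws) -> {
            try {
                Files.writeString(ws.resolve("hello.txt"), "Hello, bench!");
                return new AgentResult(true, Duration.ofSeconds(1));
            } catch (IOException e) {
                return new AgentResult(false, Duration.ZERO);
            }
        };

        AgentResult result = agent.run(spec.goal(), workspace);
        boolean pass = spec.judge().evaluate(workspace, result);
        System.out.println("hello-world benchmark: " + (pass ? "PASS" : "FAIL")
                + " (elapsed " + result.elapsed() + ")");
    }
}
```

In the real suite, a sandbox (local or Docker) isolates the workspace and the report generator records multi-dimensional metrics; this sketch elides both.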

6. Benchmark Tracks: The Vision

✅ Available Now:

  • hello-world: File creation and infrastructure validation

🚧 In Active Development:

  • Test Coverage Uplift: Generate tests to achieve coverage thresholds while keeping builds green (a deterministic judge for this track is sketched after this list)

  • Issue Analysis & Labeling: Automated triage with domain-specific classification

  • Pull Request Review: Comprehensive PR analysis with risk assessment and policy compliance

  • Static Analysis Remediation: Fix checkstyle violations while preserving functionality
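
As referenced above, here is a hedged sketch of a deterministic judge for the coverage-uplift track: it reads the standard Maven JaCoCo CSV report and checks a line-coverage threshold. The 80% threshold, the workspace layout, and the judge's shape are assumptions for this example; the real judge framework may work differently.

```java
// Sketch of a deterministic coverage judge: did line coverage reach the threshold?
// Assumes the default jacoco-maven-plugin report at target/site/jacoco/jacoco.csv.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CoverageJudgeSketch {

    /** Returns true if jacoco.csv reports aggregate line coverage >= threshold. */
    static boolean lineCoverageReached(Path jacocoCsv, double threshold) throws IOException {
        List<String> lines = Files.readAllLines(jacocoCsv);
        List<String> header = Arrays.asList(lines.get(0).split(","));
        int missedIdx = header.indexOf("LINE_MISSED");
        int coveredIdx = header.indexOf("LINE_COVERED");

        long missed = 0, covered = 0;
        for (String row : lines.subList(1, lines.size())) {
            String[] cols = row.split(",");
            missed += Long.parseLong(cols[missedIdx]);
            covered += Long.parseLong(cols[coveredIdx]);
        }
        double coverage = covered == 0 ? 0.0 : (double) covered / (missed + covered);
        return coverage >= threshold;
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        Path report = workspace.resolve("target/site/jacoco/jacoco.csv");

        boolean coverageOk = Files.exists(report) && lineCoverageReached(report, 0.80);
        // "Build stays green" would additionally be checked by the harness via the
        // exit code of `mvn verify`; omitted here for brevity.
        System.out.println(coverageOk ? "PASS: coverage >= 80%" : "FAIL: coverage below 80%");
    }
}
```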

📋 Future Roadmap:

Integration testing, dependency upgrades, API migrations, compliance validation, performance optimization, documentation generation

This breadth sets Spring AI Bench apart—measuring the full spectrum of enterprise Java development.

7. Next Steps

Ready to get started?

Have questions? See detailed evidence and analysis below.


8. The Evidence: Why SWE-bench Falls Short

This section provides detailed evidence for the claims in the summary tables above.

8.1. The Paradigm Shift: From Scaffolding to Declarative Agents

The software development agent landscape has fundamentally changed:

2023-2024: The Scaffolding Era

  • Agents required complex client-side engineering (SWE-agent: ~11,300 lines of code)

  • Benchmarks designed for patch-based workflows: edit → test → repeat

  • SWE-bench pioneered agent evaluation for code fixes

2025: The Declarative Era

  • Reasoning models internalize planning (GPT-4o, Claude Opus 4, Gemini 2.0)

  • Model Context Protocol (MCP) standardizes tool and context management

  • Agents accept declarative goals ("Raise coverage to 80%") rather than procedural steps

Spring AI Agents embodies this shift:

"The shift: from imperative (code every workflow step) to declarative (describe the goal and let the model plan the steps)."

SWE-bench measured yesterday’s agents (academic SWE-agent scaffolding) with yesterday’s tasks (static 2023 Python patches).

Spring AI Bench measures modern agents (Claude, Gemini, Amazon Q—the ones enterprises actually use) on enterprise Java workflows.

8.2. Contamination & Overfitting: The Evidence

First, the baseline: recent agents achieve impressive scores on SWE-bench’s static datasets:

  • SWE-bench Verified (static 2023 Python issues): Top agents exceed 60% resolved rate

  • Best published result: Claude Opus 4.1 at ~75% (Anthropic 2025)

  • Community consensus: SWE-bench Verified is the "gold standard" for agent evaluation

But then, something doesn’t add up:

When the exact same agents run on fresh, unseen issues (SWE-bench Live, 2025), performance collapses:

SWE-bench Live Results (arxiv:2505.23419):

  • OpenHands + Claude 3.7 Sonnet on Verified (static 2023 data): ~43%

  • OpenHands + Claude 3.7 Sonnet on Live (new 2025 issues): ~19%

  • Same agent, same settings, ~2x performance drop

The paper states: "Recent state-of-the-art agents and models report a resolved rate exceeding 60% on the SWE-bench Verified subset. In contrast, the highest resolved rate on SWE-bench-Live is only 19%."

This fails a common-sense test: Imagine a student preparing for a standardized exam (SAT, A-levels, Gaokao):

  • Practice tests: Scores ~75% (top tier)

  • Real exam: Scores ~19% (bottom half)

  • Same student, same preparation, roughly a 4x score collapse

What’s the logical explanation?

  1. The practice tests leaked the answers - questions and solutions were in study materials

  2. The student memorized patterns, not concepts - optimized for specific question types

  3. The practice tests weren’t representative - easier or structurally different from real exam

No credible educator would accept this as measuring genuine knowledge. The same logic applies here: The evidence points to overfitting and contamination, not genuine capability.

The SWE-bench Live paper itself states: "This raises concerns about potential overfitting to SWE-bench."

The mini-swe-agent paradox: mini-swe-agent claims to achieve 68% on SWE-bench Verified with just ~100 lines of code (vs ~11,300 for full SWE-agent). If a trivial bash-only agent with linear history matches complex scaffolding performance, this suggests:

  1. Either: The benchmark is too easy (agents memorized solutions)

  2. Or: Complex scaffolding was never needed (the problem was simpler than claimed)

Tellingly: mini-swe-agent has not published SWE-bench-Live results. We can infer the likely outcome from the pattern: simple approaches work on static data but fail on fresh tasks.

SWE-bench+ Analysis (OpenReview):

  • ~33% of "successful" patches had solution leakage (answer in issue text)

  • ~31% passed due to weak test suites (not truly fixed)

  • GPT-4 + SWE-agent: ~12% → ~4% after filtering

  • Benchmark has structural evaluation weaknesses

SWE-bench Illusion (arxiv:2506.12286):

  • "When State-of-the-Art LLMs Remember Instead of Reason"

  • Documents contamination across multiple coding benchmarks

BetterBench Recommendation: Dynamic task generation, live updates, contamination resistance

Spring AI Bench Approach: Run on YOUR repos with fresh goals (not static 2023 GitHub data)

8.3. Language Bias: The Python-Only Problem

SWE-bench only measures Python. All 2,294 tasks in the original dataset come from 12 Python repositories. This creates a fundamental problem for enterprise Java teams.

Why this matters: When researchers finally tested agents on other languages (Java, TypeScript, Go), the results were shocking:

The Numbers (exact citations):

| Benchmark | Language | Top Score | Citation |
|---|---|---|---|
| SWE-bench Verified | Python | ~75% (Claude Opus 4.1) | Anthropic 2025 |
| SWE-bench-Java Verified | Java | ~10% (DeepSeek-V2, 9/91 tasks) | arxiv:2408.14354 |
| SWE-bench-Java Verified | Java | ~7% (GPT-4o, 6/91 tasks) | arxiv:2408.14354 |
| SWE-PolyBench | Python | ~24% | arxiv:2504.08703 |
| SWE-PolyBench | TypeScript | ~5% | arxiv:2504.08703 |

Order of magnitude gap: Python ~75% vs Java ~7-10%

Why? The SWE-PolyBench paper investigated this and found: "This performance distribution cannot be explained by complexity metrics alone… pass rates stem from… language-specific factors that likely reflect the distribution of programming languages and structural patterns in LLMs' pretraining data."

In plain terms: Models were trained predominantly on Python code. When tested on Java, TypeScript, or Go, they struggle—not because these languages are harder, but because the models have seen far less training data in these languages.

This aligns with real-world experience: developers using AI coding tools (like Amp) report that agents work better on Python than on enterprise Java workflows. The paradigm shift to declarative agents doesn’t eliminate the training data bias problem.

Enterprise reality: Most critical systems use Java, Kotlin, C#, Go (polyglot stacks) - precisely the languages where current benchmarks show agents struggling

Spring AI Bench: Java-first design handles Maven, JDK, Spring Boot complexity from day one

8.4. Scaffolding Opacity: The Reproducibility Problem

Modern benchmark standards require reproducibility (BetterBench criteria).

The problem: SWE-bench doesn’t require submissions to disclose their architecture or provide reproduction scripts. This isn’t Anthropic’s fault—it’s a benchmark design flaw.

What we observe:

  • Anthropic’s blog describes an "agent system": scaffolding, prompt formatting, iteration, error recovery

  • Blog doesn’t disclose exact prompts, retry strategies, selection logic

  • Community projects (augmentcode/augment-swebench-agent) had to reverse-engineer planning tools

  • Same model, different scaffolds = different scores (but we can’t compare because architectures aren’t disclosed)

BetterBench verdict: A benchmark that doesn’t require reproducible submissions fails a fundamental quality criterion.

But there’s a deeper issue: Why should we measure only one agent architecture?

SWE-bench measures one academic agent (SWE-agent: ~11,300 lines, developed part-time by researchers). Meanwhile:

  • Google, Amazon, OpenAI, Anthropic, Microsoft: Teams of dozens working full-time on production agents

  • Startups (Poolside, Magic, Cognition): Venture-funded teams building commercial agents

  • The market reality: Enterprise teams don’t use SWE-agent—they use Claude, Gemini, GitHub Copilot, Cursor, etc.

Spring AI Agents embraces this reality: Our AgentModel abstraction lets you measure ANY agent:

  • Claude Code (Anthropic’s production agent)

  • Gemini with code execution

  • Amazon Q Developer

  • Amp, Codex, or custom implementations

Spring AI Bench follows suit: Measure the agents teams actually use, not just academic research artifacts. That’s what BetterBench calls for—benchmarks aligned with real-world usage.

And we provide one-click reproducibility: Every benchmark runs in a Docker container with a pre-configured environment. No "works on my machine" excuses. Clone, run, verify. That’s the BetterBench standard.

8.5. BetterBench Alignment Roadmap

For our roadmap toward alignment with Stanford’s 46-criteria framework, see BetterBench Alignment Roadmap.

Current focus areas:

  • Design: User personas, domain experts, multi-dimensional metrics, judge framework

  • Implementation: Open source, one-click Docker reproducibility, multi-provider support

  • Documentation: Architecture docs, getting started, limitations, Apache 2.0 license

  • Maintenance: Active development, GitHub feedback channels, maintainer team

Note: This is a self-assessment roadmap, not a validated BetterBench certification. We’re working toward alignment with these quality standards.

9. References

9.1. BetterBench

  • Reuel et al. (2024), "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices," NeurIPS 2024 Datasets & Benchmarks Track

  • Paper: arxiv.org/abs/2411.12990

  • Website: betterbench.stanford.edu/

  • Framework: 46 criteria across design, implementation, documentation, and maintenance stages

9.2. SWE-bench Research

Original SWE-bench:

  • Jimenez et al. (2023), "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", ICLR 2024

  • Paper: arxiv.org/abs/2310.06770

SWE-bench-Java:

  • Dakhel et al. (2024), "SWE-bench-Java: A Multi-Language Benchmark for Issue Resolving"

  • Paper: arxiv.org/abs/2408.14354

  • Key Results: GPT-4o ~7% (6/91), DeepSeek-V2 ~10% (9/91) on Verified Java tasks

SWE-bench-Live:

  • Jimenez et al. (2025), "SWE-bench Goes Live!"

  • Paper: arxiv.org/abs/2505.23419

  • Key Finding: Highest resolved rate only ~19% on Live vs 60%+ on Verified for top agents (same agent, OpenHands + Claude 3.7 Sonnet, drops ~43% → ~19%)

SWE-PolyBench:

  • AWS (2025), "SWE-PolyBench: A Multilingual Benchmark for Code Agents"

  • Paper: arxiv.org/abs/2504.08703

  • Key Results: Python ~24%, TypeScript ~5%, Java moderate performance

SWE-bench-Illusion:

  • Authors (2025), "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"

  • Paper: arxiv.org/abs/2506.12286

  • Focus: Contamination analysis across coding benchmarks

SWE-bench+:

  • Authors (2024), "SWE-Bench+: Enhanced Coding Benchmark for LLMs"

  • Paper: openreview.net/forum?id=pwIGnH2LHJ

  • Key Findings: ~33% solution leakage, ~31% weak test suites, GPT-4 12%→4% after filtering

9.3. Model Performance

Anthropic Claude Opus 4.1:

Anthropic SWE-bench Engineering:

  • Anthropic (2024), "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet"

  • Blog: www.anthropic.com/research/swe-bench-sonnet

  • Description: Agent system engineering with scaffolding, prompt formatting, iteration, error recovery

9.4. Community Analysis

Runloop Blog:

Community Implementations:

Spring AI Agents: