Spring AI Bench

Open benchmarking suite for Java-centric AI developer agents.

1. What & Why

The Problem: Existing benchmarks (SWE-bench) measure:

  • Yesterday’s agents: Academic scaffolding (SWE-agent: ~11,300 lines)

  • Yesterday’s tasks: Static 2023 Python patches

  • One architecture: Can’t measure Claude, Gemini, Amazon Q (the agents teams actually use)

Spring AI Bench measures:

  • Modern agents: Claude, Gemini, Amazon Q, Amp—any agent via AgentModel abstraction

  • Enterprise Java workflows: Issue triage, PR review, coverage, compliance—not just patches

  • Your code: Run benchmarks on YOUR repos, measure YOUR scenarios

If agents have evolved, benchmarks must evolve too.

2. Why SWE-bench Falls Short: At a Glance

| Issue | SWE-bench Problem | Spring AI Bench Solution |
|---|---|---|
| Scope | Patch loops only (bug fixes) | Full dev lifecycle (triage, PR review, coverage, compliance, deps, migrations) |
| Contamination | 60%+ on Verified → ~19% on Live (top scores drop roughly 3x on fresh tasks) | Golden benchmark set (curated) + the same benchmarks on YOUR repos |
| Language Bias | Python-only: ~75% top scores; Java: ~7-10% (an order-of-magnitude gap from training-data bias) | Java-first with Maven, Spring Boot, complex dependency trees |
| Reproducibility | No disclosure required; leaderboard claims can't be verified | One-click Docker containers + open scaffolding + transparent judges |

Real code. Real runs. Real Java. Not static Python patches from 2023.

→ See detailed evidence & analysis below (Section 8)

3. Spring AI Bench: What You Get

Built with reproducibility as the north star (following BetterBench):

Multi-Agent Support:

  • Any agent: Claude Code, Gemini CLI, Amazon Q Developer, Amp, Codex, custom implementations

  • Not locked in: the AgentModel abstraction lets you measure the agents your team actually uses (see the sketch after this list)

  • Real-world: Measure production agents, not academic research artifacts
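
For illustration, here is a minimal sketch of what an agent-agnostic abstraction can look like in Java. The interface and type names are assumptions made for this sketch, not the published Spring AI Agents API; each real agent (Claude Code, Gemini CLI, Amazon Q, a custom CLI) would sit behind its own adapter.

```java
// Hypothetical sketch only: names and signatures are illustrative,
// not the actual Spring AI Agents API.
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;

public interface AgentModel {

    /** Run the agent against a workspace with a declarative goal. */
    AgentResult run(AgentGoal goal, Path workspace);

    /** A declarative goal: what to achieve, not which steps to take. */
    record AgentGoal(String instruction, Map<String, String> parameters) {}

    /** Outcome handed back to the harness for judging and reporting. */
    record AgentResult(boolean completed, String transcript, Duration elapsed) {}
}
```

With an abstraction of this shape, a Claude Code adapter, a Gemini adapter, and an in-house agent all plug into the same harness and are judged against the same criteria.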

Enterprise Java Workflows:

  • Full lifecycle: Issue triage → PR review → coverage uplift → compliance validation → dependency upgrades

  • Real complexity: Maven, JDK versions, Spring Boot, complex dependency trees

  • Your repos: Not static 2023 GitHub data—run on YOUR code

Modern Standards:

  • One-click reproducible: Docker containers with pre-configured environments (no "works on my machine")

  • Multi-dimensional: Success rate + cost + speed + reliability + quality, not just pass/fail (see the sketch after this list)

  • Open & transparent: GitHub repo, Apache 2.0 license, reproducible evaluation code

  • Judge framework: Sophisticated verification beyond "tests pass" (deterministic + AI-powered judges)
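
As a hedged sketch of what "multi-dimensional" could look like in practice, one run might be summarized by a record like the one below. The record and field names are illustrative assumptions, not a defined report schema.

```java
// Illustrative result shape: field names are assumptions, not a defined schema.
import java.time.Duration;

public class RunMetricsSketch {

    /** One benchmark run summarized along several dimensions, not just pass/fail. */
    record RunMetrics(
            String agent,          // e.g. "claude-code", "gemini-cli"
            String benchmark,      // e.g. "coverage-uplift"
            boolean succeeded,     // did the judge accept the outcome?
            double costUsd,        // API spend attributed to the run
            Duration wallClock,    // end-to-end time
            double reliability,    // success rate across repeated runs (0..1)
            double qualityScore) { // judge's quality assessment (0..1)
    }

    public static void main(String[] args) {
        RunMetrics run = new RunMetrics(
                "claude-code", "coverage-uplift", true, 1.42, Duration.ofMinutes(7), 0.9, 0.8);
        System.out.println(run);
    }
}
```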

Reproducibility First:

  • One-click setup: Docker containers, pre-configured environments

  • Open methodology: Published evaluation code, transparent scoring

  • Golden set + YOUR repos: Standardized comparison + real-world validation

  • BetterBench inspiration: Following Stanford’s quality framework principles (starting with reproducibility)

Run benchmarks on YOUR code. Measure YOUR scenarios. See which agent wins for YOU.

4. How Spring AI Bench Compares

| Dimension | SWE-bench (2023-2024) | Spring AI Bench (2025) |
|---|---|---|
| Scope | Patch loops for bug fixing | Full enterprise development lifecycle |
| Language | Python-only (12 repos) | Java-first (extensible to Kotlin, Groovy) |
| Agent Support | One architecture (SWE-agent: ~11,300 lines) | Any agent (Claude, Gemini, Q, custom via AgentModel) |
| Reproducibility | No disclosure required (can't verify claims) | One-click Docker + open scaffolding + transparent judges |
| Agent Paradigm | Built for 2024 patch-loop agents | Built for 2025 declarative goal agents |
| Standards | Pre-BetterBench | Following BetterBench principles (reproducibility first) |

Historical value: SWE-bench pioneered agent evaluation

Modern needs: Spring AI Bench measures what enterprise teams care about

5. Architecture Overview

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Agent Types   │    │  Execution Core  │    │   Sandboxes     │
├─────────────────┤    ├──────────────────┤    ├─────────────────┤
│ ✅ Claude Code  │────│ BenchHarness     │────│LocalSandbox     │
│ ✅ Gemini       │    │ AgentRunner      │    │DockerSandbox    │
│ ✅ Amazon Q     │    │ SpecLoader       │    │CloudSandbox     │
│ ✅ Amp          │    │ ReportGenerator  │    │   (Future)      │
│ ✅ Custom       │    │ Judge Framework  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Skills, not just tools: Benchmarks encode skills (context + actions + success criteria). Tools matter, but the plan and verification criteria matter just as much. Where possible, we align with Model Context Protocol (MCP) to keep tool use portable across agents.
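
As a rough, hedged illustration of how the components in the diagram above fit together, the self-contained sketch below runs a hello-world style task end to end: load a goal, let an agent act inside a workspace, then let a judge verify the outcome. All types and signatures here are assumptions for the sketch, not the project's actual API.

```java
// Illustrative wiring of the execution core: all names and signatures below
// are assumptions for this sketch, not the real Spring AI Bench API.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class BenchRunSketch {

    /** A benchmark spec: a declarative goal plus how success is judged. */
    record BenchSpec(String goal, Judge judge) {}

    /** Any agent: Claude Code, Gemini CLI, Amazon Q, or a custom one. */
    interface AgentModel {
        AgentResult run(String goal, Path workspace);
    }

    record AgentResult(boolean completed, Duration elapsed) {}

    /** A judge verifies the outcome, beyond "the agent says it is done". */
    interface Judge {
        boolean evaluate(Path workspace, AgentResult result);
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("bench-workspace");

        // Hello-world style spec: create a file; judge by checking it exists.
        BenchSpec spec = new BenchSpec(
                "Create hello.txt containing the text 'Hello, bench!'",
                (ws, result) -> result.completed() && Files.exists(ws.resolve("hello.txt")));

        // Trivial stand-in agent; a real adapter would shell out to Claude Code, Gemini, etc.
        AgentModel agent = (goal, ws) -> {
            try {
                Files.writeString(ws.resolve("hello.txt"), "Hello, bench!");
                return new AgentResult(true, Duration.ofSeconds(1));
            } catch (IOException e) {
                return new AgentResult(false, Duration.ZERO);
            }
        };

        AgentResult result = agent.run(spec.goal(), workspace);
        boolean pass = spec.judge().evaluate(workspace, result);
        System.out.println("hello-world benchmark: " + (pass ? "PASS" : "FAIL")
                + " (elapsed " + result.elapsed() + ")");
    }
}
```

In the real suite, a sandbox (local or Docker) isolates the workspace and the report generator records multi-dimensional metrics; this sketch elides both.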

6. Benchmark Tracks: The Vision

✅ Available Now:

  • hello-world: File creation and infrastructure validation

🚧 In Active Development:

  • Test Coverage Uplift: Generate tests to achieve coverage thresholds while keeping builds green (a deterministic judge for this track is sketched after this list)

  • Issue Analysis & Labeling: Automated triage with domain-specific classification

  • Pull Request Review: Comprehensive PR analysis with risk assessment and policy compliance

  • Static Analysis Remediation: Fix checkstyle violations while preserving functionality
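
As referenced above, here is a hedged sketch of a deterministic judge for the coverage-uplift track: it reads the standard Maven JaCoCo CSV report and checks a line-coverage threshold. The 80% threshold, the workspace layout, and the judge's shape are assumptions for this example; the real judge framework may work differently.

```java
// Sketch of a deterministic coverage judge: did line coverage reach the threshold?
// Assumes the default jacoco-maven-plugin report at target/site/jacoco/jacoco.csv.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CoverageJudgeSketch {

    /** Returns true if jacoco.csv reports aggregate line coverage >= threshold. */
    static boolean lineCoverageReached(Path jacocoCsv, double threshold) throws IOException {
        List<String> lines = Files.readAllLines(jacocoCsv);
        List<String> header = Arrays.asList(lines.get(0).split(","));
        int missedIdx = header.indexOf("LINE_MISSED");
        int coveredIdx = header.indexOf("LINE_COVERED");

        long missed = 0, covered = 0;
        for (String row : lines.subList(1, lines.size())) {
            String[] cols = row.split(",");
            missed += Long.parseLong(cols[missedIdx]);
            covered += Long.parseLong(cols[coveredIdx]);
        }
        double coverage = covered == 0 ? 0.0 : (double) covered / (missed + covered);
        return coverage >= threshold;
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        Path report = workspace.resolve("target/site/jacoco/jacoco.csv");

        boolean coverageOk = Files.exists(report) && lineCoverageReached(report, 0.80);
        // "Build stays green" would additionally be checked by the harness via the
        // exit code of `mvn verify`; omitted here for brevity.
        System.out.println(coverageOk ? "PASS: coverage >= 80%" : "FAIL: coverage below 80%");
    }
}
```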

📋 Future Roadmap:

Integration testing, dependency upgrades, API migrations, compliance validation, performance optimization, documentation generation

This breadth sets Spring AI Bench apart—measuring the full spectrum of enterprise Java development.

7. Next Steps

Ready to get started?

Have questions? See detailed evidence and analysis below.


8. The Evidence: Why SWE-bench Falls Short

This section provides detailed evidence for the claims in the summary tables above.

8.1. The Paradigm Shift: From Scaffolding to Declarative Agents

The software development agent landscape has fundamentally changed:

2023-2024: The Scaffolding Era

  • Agents required complex client-side engineering (SWE-agent: ~11,300 lines of code)

  • Benchmarks designed for patch-based workflows: edit → test → repeat

  • SWE-bench pioneered agent evaluation for code fixes

2025: The Declarative Era

  • Reasoning models internalize planning (GPT-4o, Claude Opus 4, Gemini 2.0)

  • Model Context Protocol (MCP) standardizes tool and context management

  • Agents accept declarative goals ("Raise coverage to 80%") rather than procedural steps

Spring AI Agents embodies this shift:

"The shift: from imperative (code every workflow step) to declarative (describe the goal and let the model plan the steps)."

SWE-bench measured yesterday’s agents (academic SWE-agent scaffolding) with yesterday’s tasks (static 2023 Python patches).

Spring AI Bench measures modern agents (Claude, Gemini, Amazon Q—the ones enterprises actually use) on enterprise Java workflows.

8.2. Contamination & Overfitting: The Evidence

First, the baseline: recent agents achieve impressive scores on SWE-bench’s static datasets:

  • SWE-bench Verified (static 2023 Python issues): Top agents exceed 60% resolved rate

  • Best published result: Claude Opus 4.1 at ~75% (Anthropic 2025)

  • Community consensus: SWE-bench Verified is the "gold standard" for agent evaluation

But then, something doesn’t add up:

When the exact same agents run on fresh, unseen issues (SWE-bench Live, 2025), performance collapses:

SWE-bench Live Results (arxiv:2505.23419):

  • OpenHands + Claude 3.7 Sonnet on Verified (static 2023 data): ~43%

  • OpenHands + Claude 3.7 Sonnet on Live (new 2025 issues): ~19%

  • Same agent, same settings, ~2x performance drop

The paper states: "Recent state-of-the-art agents and models report a resolved rate exceeding 60% on the SWE-bench Verified subset. In contrast, the highest resolved rate on SWE-bench-Live is only 19%."

This fails a common-sense test: Imagine a student preparing for a standardized exam (SAT, A-levels, Gaokao):

  • Practice tests: Scores ~75% (top tier)

  • Real exam: Scores ~19% (bottom half)

  • Same student, same preparation, roughly a 4x score collapse

What’s the logical explanation?

  1. The practice tests leaked the answers - questions and solutions were in study materials

  2. The student memorized patterns, not concepts - optimized for specific question types

  3. The practice tests weren’t representative - easier or structurally different from real exam

No credible educator would accept this as measuring genuine knowledge. The same logic applies here: The evidence points to overfitting and contamination, not genuine capability.

The SWE-bench Live paper itself states: "This raises concerns about potential overfitting to SWE-bench."

The mini-swe-agent paradox: mini-swe-agent claims to achieve 68% on SWE-bench Verified with just ~100 lines of code (vs ~11,300 for full SWE-agent). If a trivial bash-only agent with linear history matches complex scaffolding performance, this suggests:

  1. Either: The benchmark is too easy (agents memorized solutions)

  2. Or: Complex scaffolding was never needed (the problem was simpler than claimed)

Tellingly: mini-swe-agent has not published SWE-bench-Live results. We can infer the likely outcome from the pattern: simple approaches work on static data but fail on fresh tasks.

SWE-bench+ Analysis (OpenReview):

  • ~33% of "successful" patches had solution leakage (answer in issue text)

  • ~31% passed due to weak test suites (not truly fixed)

  • GPT-4 + SWE-agent: ~12% → ~4% after filtering

  • Benchmark has structural evaluation weaknesses

SWE-bench Illusion (arxiv:2506.12286):

  • "When State-of-the-Art LLMs Remember Instead of Reason"

  • Documents contamination across multiple coding benchmarks

BetterBench Recommendation: Dynamic task generation, live updates, contamination resistance

Spring AI Bench Approach: Run on YOUR repos with fresh goals (not static 2023 GitHub data)

8.3. Language Bias: The Python-Only Problem

SWE-bench only measures Python. All 2,294 tasks in the original dataset come from 12 Python repositories. This creates a fundamental problem for enterprise Java teams.

Why this matters: When researchers finally tested agents on other languages (Java, TypeScript, Go), the results were shocking:

The Numbers (exact citations):

| Benchmark | Language | Top Score | Citation |
|---|---|---|---|
| SWE-bench Verified | Python | ~75% (Claude Opus 4.1) | Anthropic 2025 |
| SWE-bench-Java Verified | Java | ~10% (DeepSeek-V2, 9/91 tasks) | arxiv:2408.14354 |
| SWE-bench-Java Verified | Java | ~7% (GPT-4o, 6/91 tasks) | arxiv:2408.14354 |
| SWE-PolyBench | Python | ~24% | arxiv:2504.08703 |
| SWE-PolyBench | TypeScript | ~5% | arxiv:2504.08703 |

Order of magnitude gap: Python ~75% vs Java ~7-10%

Why? The SWE-PolyBench paper investigated this and found: "This performance distribution cannot be explained by complexity metrics alone… pass rates stem from… language-specific factors that likely reflect the distribution of programming languages and structural patterns in LLMs' pretraining data."

In plain terms: Models were trained predominantly on Python code. When tested on Java, TypeScript, or Go, they struggle—not because these languages are harder, but because the models have seen far less training data in these languages.

This aligns with real-world experience: developers using AI coding tools (like Amp) report that agents work better on Python than on enterprise Java workflows. The paradigm shift to declarative agents doesn’t eliminate the training data bias problem.

Enterprise reality: Most critical systems use Java, Kotlin, C#, Go (polyglot stacks) - precisely the languages where current benchmarks show agents struggling

Spring AI Bench: Java-first design handles Maven, JDK, Spring Boot complexity from day one

8.4. Scaffolding Opacity: The Reproducibility Problem

Modern benchmark standards require reproducibility (BetterBench criteria).

The problem: SWE-bench doesn’t require submissions to disclose their architecture or provide reproduction scripts. This isn’t Anthropic’s fault—it’s a benchmark design flaw.

What we observe:

  • Anthropic’s blog describes an "agent system": scaffolding, prompt formatting, iteration, error recovery

  • Blog doesn’t disclose exact prompts, retry strategies, selection logic

  • Community projects (augmentcode/augment-swebench-agent) had to reverse-engineer planning tools

  • Same model, different scaffolds = different scores (but we can’t compare because architectures aren’t disclosed)

BetterBench verdict: A benchmark that doesn’t require reproducible submissions fails a fundamental quality criterion.

But there’s a deeper issue: Why should we measure only one agent architecture?

SWE-bench measures one academic agent (SWE-agent: ~11,300 lines, developed part-time by researchers). Meanwhile:

  • Google, Amazon, OpenAI, Anthropic, Microsoft: Teams of dozens working full-time on production agents

  • Startups (Poolside, Magic, Cognition): Venture-funded teams building commercial agents

  • The market reality: Enterprise teams don’t use SWE-agent—they use Claude, Gemini, GitHub Copilot, Cursor, etc.

Spring AI Agents embraces this reality: Our AgentModel abstraction lets you measure ANY agent:

  • Claude Code (Anthropic’s production agent)

  • Gemini with code execution

  • Amazon Q Developer

  • Amp, Codex, or custom implementations

Spring AI Bench follows suit: Measure the agents teams actually use, not just academic research artifacts. That’s what BetterBench calls for—benchmarks aligned with real-world usage.

And we provide one-click reproducibility: Every benchmark runs in a Docker container with a pre-configured environment. No "works on my machine" excuses. Clone, run, verify. That’s the BetterBench standard.

8.5. BetterBench Alignment Roadmap

For our roadmap toward alignment with Stanford’s 46-criteria framework, see BetterBench Alignment Roadmap.

Current focus areas:

  • Design: User personas, domain experts, multi-dimensional metrics, judge framework

  • Implementation: Open source, one-click Docker reproducibility, multi-provider support

  • Documentation: Architecture docs, getting started, limitations, Apache 2.0 license

  • Maintenance: Active development, GitHub feedback channels, maintainer team

Note: This is a self-assessment roadmap, not a validated BetterBench certification. We’re working toward alignment with these quality standards.

9. References

9.1. BetterBench

  • Reuel et al. (2024), "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices," NeurIPS 2024 Datasets & Benchmarks Track

  • Paper: arxiv.org/abs/2411.12990

  • Website: betterbench.stanford.edu/

  • Framework: 46 criteria across design, implementation, documentation, and maintenance stages

9.2. SWE-bench Research

Original SWE-bench:

  • Jimenez et al. (2023), "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", ICLR 2024

  • Paper: arxiv.org/abs/2310.06770

SWE-bench-Java:

  • Dakhel et al. (2024), "SWE-bench-Java: A Multi-Language Benchmark for Issue Resolving"

  • Paper: arxiv.org/abs/2408.14354

  • Key Results: GPT-4o ~7% (6/91), DeepSeek-V2 ~10% (9/91) on Verified Java tasks

SWE-bench-Live:

  • Jimenez et al. (2025), "SWE-bench Goes Live!"

  • Paper: arxiv.org/abs/2505.23419

  • Key Finding: Highest resolved rate only ~19% on Live vs 60%+ on Verified for top agents (same agent, OpenHands + Claude 3.7 Sonnet, drops ~43% → ~19%)

SWE-PolyBench:

  • AWS (2025), "SWE-PolyBench: A Multilingual Benchmark for Code Agents"

  • Paper: arxiv.org/abs/2504.08703

  • Key Results: Python ~24%, TypeScript ~5%, Java moderate performance

SWE-bench-Illusion:

  • Authors (2025), "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"

  • Paper: arxiv.org/abs/2506.12286

  • Focus: Contamination analysis across coding benchmarks

SWE-bench+:

  • Authors (2024), "SWE-Bench+: Enhanced Coding Benchmark for LLMs"

  • Paper: openreview.net/forum?id=pwIGnH2LHJ

  • Key Findings: ~33% solution leakage, ~31% weak test suites, GPT-4 12%→4% after filtering

9.3. Model Performance

Anthropic Claude Opus 4.1:

Anthropic SWE-bench Engineering:

  • Anthropic (2024), "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet"

  • Blog: www.anthropic.com/research/swe-bench-sonnet

  • Description: Agent system engineering with scaffolding, prompt formatting, iteration, error recovery

9.4. Community Analysis

Runloop Blog:

Community Implementations:

Spring AI Agents: