Spring AI Bench
Open benchmarking suite for Java-centric AI developer agents.
1. What & Why
The Problem: Existing benchmarks (SWE-bench) measure:
- Yesterday’s agents: Academic scaffolding (SWE-agent: ~11,300 lines)
- Yesterday’s tasks: Static 2023 Python patches
- One architecture: Can’t measure Claude, Gemini, Amazon Q (the agents teams actually use)
Spring AI Bench measures:
- Modern agents: Claude, Gemini, Amazon Q, Amp—any agent via the AgentModel abstraction
- Enterprise Java workflows: Issue triage, PR review, coverage, compliance—not just patches
- Your code: Run benchmarks on YOUR repos, measure YOUR scenarios
If agents have evolved, benchmarks must evolve too.
2. Why SWE-bench Falls Short: At a Glance
Issue | SWE-bench Problem | Spring AI Bench Solution |
---|---|---|
Scope | Patch loops only (bug fixes) | Full dev lifecycle (triage, PR review, coverage, compliance, deps, migrations) |
Contamination | 60%+ Verified → 19% Live (same agent, 3x drop on fresh tasks) | Golden benchmark set (curated) + run same benchmarks on YOUR repos |
Language Bias | Python-only: ~75% scores; Java: ~7-10% (order of magnitude gap from training data bias) | Java-first with Maven, Spring Boot, complex dependency trees |
Reproducibility | No disclosure required; can’t verify leaderboard claims | One-click Docker containers + open scaffolding + transparent judges |
Real code. Real runs. Real Java. Not static Python patches from 2023.
→ See detailed evidence & analysis below (Section 8)
3. Spring AI Bench: What You Get
Built for Reproducibility (BetterBench as north star):
Multi-Agent Support:
- ✅ Any agent: Claude Code, Gemini CLI, Amazon Q Developer, Amp, Codex, custom implementations
- ✅ Not locked in: AgentModel abstraction lets you measure the agents your team uses
- ✅ Real-world: Measure production agents, not academic research artifacts
Enterprise Java Workflows:
- ✅ Full lifecycle: Issue triage → PR review → coverage uplift → compliance validation → dependency upgrades
- ✅ Real complexity: Maven, JDK versions, Spring Boot, complex dependency trees
- ✅ Your repos: Not static 2023 GitHub data—run on YOUR code
Modern Standards:
- ✅ One-click reproducible: Docker containers with pre-configured environments (no "works on my machine")
- ✅ Multi-dimensional: Success rate + cost + speed + reliability + quality (not just pass/fail)
- ✅ Open & transparent: GitHub repo, Apache 2.0 license, reproducible evaluation code
- ✅ Judge framework: Sophisticated verification beyond "tests pass" (deterministic + AI-powered judges; see the sketch at the end of this section)
Reproducibility First:
- ✅ One-click setup: Docker containers, pre-configured environments
- ✅ Open methodology: Published evaluation code, transparent scoring
- ✅ Golden set + YOUR repos: Standardized comparison + real-world validation
- ✅ BetterBench inspiration: Following Stanford’s quality framework principles (starting with reproducibility)
Run benchmarks on YOUR code. Measure YOUR scenarios. See which agent wins for YOU.
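To make the judge idea concrete, here is a minimal sketch of what a deterministic judge might look like. The names (Judge, Verdict, BenchRun, CoverageJudge) are illustrative assumptions, not the project’s actual API; an AI-powered judge would implement the same contract but delegate its verdict to a model.

```java
// Hypothetical sketch only: the names Judge, Verdict, BenchRun, and CoverageJudge
// are illustrative assumptions, not the actual Spring AI Bench API.
import java.util.List;

interface Judge {
    Verdict evaluate(BenchRun run);
}

record Verdict(boolean passed, String reason) {}

record BenchRun(boolean buildSucceeded, double lineCoverage, List<String> checkstyleViolations) {}

// A deterministic judge: the build must be green and coverage must meet a threshold.
class CoverageJudge implements Judge {
    private final double threshold;

    CoverageJudge(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public Verdict evaluate(BenchRun run) {
        if (!run.buildSucceeded()) {
            return new Verdict(false, "Build failed");
        }
        if (run.lineCoverage() < threshold) {
            return new Verdict(false, "Coverage %.2f is below threshold %.2f"
                    .formatted(run.lineCoverage(), threshold));
        }
        return new Verdict(true, "Build green, coverage %.2f".formatted(run.lineCoverage()));
    }
}
```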
4. How Spring AI Bench Compares
Dimension | SWE-bench (2023-2024) | Spring AI Bench (2025) |
---|---|---|
Scope | Patch loops for bug fixing | Full enterprise development lifecycle |
Language | Python-only (12 repos) | Java-first (extensible to Kotlin, Groovy) |
Agent Support | One architecture (SWE-agent: ~11,300 lines) | Any agent (Claude, Gemini, Q, custom via AgentModel) |
Reproducibility | No disclosure required (can’t verify claims) | One-click Docker + open scaffolding + transparent judges |
Agent Paradigm | Built for 2024 patch-loop agents | Built for 2025 declarative goal agents |
Standards | Pre-BetterBench | Following BetterBench principles (reproducibility first) |
Historical value: SWE-bench pioneered agent evaluation
Modern needs: Spring AI Bench measures what enterprise teams care about
5. Architecture Overview
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Agent Types │ │ Execution Core │ │ Sandboxes │
├─────────────────┤ ├──────────────────┤ ├─────────────────┤
│ ✅ Claude Code │────│ BenchHarness │────│LocalSandbox │
│ ✅ Gemini │ │ AgentRunner │ │DockerSandbox │
│ ✅ Amazon Q │ │ SpecLoader │ │CloudSandbox │
│ ✅ Amp │ │ ReportGenerator │ │ (Future) │
│ ✅ Custom │ │ Judge Framework │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Skills, not just tools: Benchmarks encode skills (context + actions + success criteria). Tools matter, but the plan and verification criteria matter just as much. Where possible, we align with Model Context Protocol (MCP) to keep tool use portable across agents.
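As a rough illustration of "skills, not just tools", a benchmark spec could bundle context, a declarative goal, and explicit success criteria in one place. The types below are assumptions made for explanation only, not the published Spring AI Bench classes.

```java
// Illustrative sketch only: these types are assumptions made to explain the idea of
// "skills" (context + actions + success criteria); they are not the published
// Spring AI Bench classes behind SpecLoader, AgentRunner, or the judge framework.
import java.nio.file.Path;
import java.util.List;

record BenchSpec(
        String id,                    // benchmark identifier
        Path repository,              // context: the repository the agent works in
        String goal,                  // declarative goal handed to the agent
        List<String> successCriteria  // checked by judges, not just "tests pass"
) {}

class BenchSpecExample {
    public static void main(String[] args) {
        BenchSpec spec = new BenchSpec(
                "coverage-uplift-demo",
                Path.of("workspace/demo-repo"),
                "Raise line coverage to 80% while keeping the Maven build green",
                List.of("mvn verify succeeds",
                        "line coverage >= 0.80",
                        "no new checkstyle violations"));
        System.out.println(spec);
    }
}
```

In this style, a spec loader would read such a definition, an agent runner would hand the goal to an agent inside a sandbox, and the judge framework would check the success criteria before results are reported.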
6. Benchmark Tracks: The Vision
✅ Available Now:
- hello-world: File creation and infrastructure validation
🚧 In Active Development:
- Test Coverage Uplift: Generate tests to achieve coverage thresholds while keeping builds green
- Issue Analysis & Labeling: Automated triage with domain-specific classification
- Pull Request Review: Comprehensive PR analysis with risk assessment and policy compliance
- Static Analysis Remediation: Fix checkstyle violations while preserving functionality
📋 Future Roadmap:
Integration testing, dependency upgrades, API migrations, compliance validation, performance optimization, documentation generation
This breadth sets Spring AI Bench apart—measuring the full spectrum of enterprise Java development.
7. Next Steps
Ready to get started?
- Try it: Getting Started Guide - Quick setup and first benchmark
- Understand it: Architecture Overview - System design and components
- Verify it: BetterBench Alignment - Our commitment to quality standards
- Integrate it: Agent Integration - Connect Claude, Gemini, Amazon Q, or custom agents
Have questions? See detailed evidence and analysis below.
8. The Evidence: Why SWE-bench Falls Short
This section provides detailed evidence for the claims in the summary tables above.
8.1. The Paradigm Shift: From Scaffolding to Declarative Agents
The software development agent landscape has fundamentally changed:
2023-2024: The Scaffolding Era
- Agents required complex client-side engineering (SWE-agent: ~11,300 lines of code):
  - Multi-step loops (while not step_output.done)
  - Prompt orchestration (Jinja2 templates for system, instance, next-step)
  - Error recovery (retry loops, exception handling)
- Benchmarks designed for patch-based workflows: edit → test → repeat
- SWE-bench pioneered agent evaluation for code fixes
2025: The Declarative Era
- Reasoning models internalize planning (GPT-4o, Claude Opus 4, Gemini 2.0)
- Model Context Protocol (MCP) standardizes tool and context management
- Agents accept declarative goals: "Raise coverage to 80%" vs procedural steps
Spring AI Agents embodies this shift:
"The shift: from imperative (code every workflow step) to declarative (describe the goal and let the model plan the steps)."
SWE-bench measured yesterday’s agents (academic SWE-agent scaffolding) with yesterday’s tasks (static 2023 Python patches).
Spring AI Bench measures modern agents (Claude, Gemini, Amazon Q—the ones enterprises actually use) on enterprise Java workflows.
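To make the imperative-to-declarative contrast concrete, here is a minimal, purely illustrative Java sketch. Every interface in it is a hypothetical stand-in, not SWE-agent or Spring AI Agents code.

```java
// Purely illustrative: every type here is a hypothetical stand-in, not SWE-agent
// or Spring AI Agents code. It contrasts the two eras described above.
import java.util.function.Supplier;

interface ScaffoldedAgent {              // 2023-2024: the caller drives every step
    AgentStep start(String task);
    AgentStep next(String prompt);
}

record AgentStep(boolean done, String output) {}

interface GoalAgent {                    // 2025: the agent plans its own steps
    void achieve(String goal);
}

class ParadigmShift {

    // Scaffolding era: multi-step loop, prompt orchestration, error recovery in client code
    static void imperative(ScaffoldedAgent agent) {
        AgentStep step = agent.start("Fix the failing test in FooService");
        while (!step.done()) {
            String prompt = "Previous output: " + step.output() + "\nWhat is the next step?";
            step = retry(() -> agent.next(prompt));
        }
    }

    // Declarative era: describe the goal and let the model plan the steps
    static void declarative(GoalAgent agent) {
        agent.achieve("Raise line coverage to 80% while keeping the Maven build green");
    }

    // Stand-in for the retry logic that scaffolds carried themselves
    private static AgentStep retry(Supplier<AgentStep> call) {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // swallow and retry
            }
        }
        throw new IllegalStateException("Agent call failed after retries");
    }
}
```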
8.2. Contamination & Overfitting: The Evidence
First, establish the baseline: Recent agents achieve impressive scores on SWE-bench’s static datasets:
- SWE-bench Verified (static 2023 Python issues): Top agents exceed 60% resolved rate
- Best published result: Claude Opus 4.1 at ~75% (Anthropic 2025)
- Community consensus: SWE-bench Verified is the "gold standard" for agent evaluation
But then, something doesn’t add up:
When the exact same agents run on fresh, unseen issues (SWE-bench Live, 2025), performance collapses:
SWE-bench Live Results (arxiv:2505.23419):
- OpenHands + Claude 3.7 Sonnet on Verified (static 2023 data): ~43%
- OpenHands + Claude 3.7 Sonnet on Live (new 2025 issues): ~19%
- Same agent, same settings, ~2x performance drop
The paper states: "Recent state-of-the-art agents and models report a resolved rate exceeding 60% on the SWE-bench Verified subset. In contrast, the highest resolved rate on SWE-bench-Live is only 19%."
This fails a common-sense test: Imagine a student preparing for a standardized exam (SAT, A-levels, Gaokao):
- Practice tests: Scores ~75% (top tier)
- Real exam: Scores ~19% (bottom half)
- Same student, same preparation, roughly a 4x score collapse
What’s the logical explanation?
- The practice tests leaked the answers - questions and solutions were in the study materials
- The student memorized patterns, not concepts - optimized for specific question types
- The practice tests weren’t representative - easier or structurally different from the real exam
No credible educator would accept this as measuring genuine knowledge. The same logic applies here: The evidence points to overfitting and contamination, not genuine capability.
The SWE-bench Live paper itself states: "This raises concerns about potential overfitting to SWE-bench."
The mini-swe-agent paradox: mini-swe-agent claims to achieve 68% on SWE-bench Verified with just ~100 lines of code (vs ~11,300 for full SWE-agent). If a trivial bash-only agent with linear history matches complex scaffolding performance, this suggests:
- Either: The benchmark is too easy (agents memorized solutions)
- Or: Complex scaffolding was never needed (the problem was simpler than claimed)
Tellingly, mini-swe-agent has not published SWE-bench-Live results. The pattern suggests the likely outcome: simple approaches work on static data but fail on fresh tasks.
SWE-bench+ Analysis (OpenReview):
- ~33% of "successful" patches had solution leakage (answer in issue text)
- ~31% passed due to weak test suites (not truly fixed)
- GPT-4 + SWE-agent: ~12% → ~4% after filtering
- Benchmark has structural evaluation weaknesses
SWE-bench Illusion (arxiv:2506.12286):
- "When State-of-the-Art LLMs Remember Instead of Reason"
- Documents contamination across multiple coding benchmarks
BetterBench Recommendation: Dynamic task generation, live updates, contamination resistance
Spring AI Bench Approach: Run on YOUR repos with fresh goals (not static 2023 GitHub data)
8.3. Language Bias: The Python-Only Problem
SWE-bench only measures Python. All 2,294 tasks in the original dataset come from 12 Python repositories. This creates a fundamental problem for enterprise Java teams.
Why this matters: When researchers finally tested agents on other languages (Java, TypeScript, Go), the results were shocking:
The Numbers (exact citations):
Benchmark | Language | Top Score | Citation |
---|---|---|---|
SWE-bench Verified | Python | ~75% (Claude Opus 4.1) | Anthropic (2025); see References 9.3 |
SWE-bench-Java Verified | Java | ~10% (DeepSeek-V2, 9/91 tasks) | arxiv.org/abs/2408.14354 |
SWE-bench-Java Verified | Java | ~7% (GPT-4o, 6/91 tasks) | arxiv.org/abs/2408.14354 |
SWE-PolyBench | Python | ~24% | arxiv.org/abs/2504.08703 |
SWE-PolyBench | TypeScript | ~5% | arxiv.org/abs/2504.08703 |
Order of magnitude gap: Python ~75% vs Java ~7-10%
Why? The SWE-PolyBench paper investigated this and found: "This performance distribution cannot be explained by complexity metrics alone… pass rates stem from… language-specific factors that likely reflect the distribution of programming languages and structural patterns in LLMs' pretraining data."
In plain terms: Models were trained predominantly on Python code. When tested on Java, TypeScript, or Go, they struggle—not because these languages are harder, but because the models have seen far less training data in these languages.
This aligns with real-world experience: developers using AI coding tools (like Amp) report that agents work better on Python than on enterprise Java workflows. The paradigm shift to declarative agents doesn’t eliminate the training data bias problem.
Enterprise reality: Most critical systems use Java, Kotlin, C#, Go (polyglot stacks) - precisely the languages where current benchmarks show agents struggling
Spring AI Bench: Java-first design handles Maven, JDK, Spring Boot complexity from day one
8.4. Scaffolding Opacity: The Reproducibility Problem
Modern benchmark standards require reproducibility (BetterBench criteria).
The problem: SWE-bench doesn’t require submissions to disclose their architecture or provide reproduction scripts. This isn’t Anthropic’s fault—it’s a benchmark design flaw.
What we observe:
- Anthropic’s blog describes an "agent system": scaffolding, prompt formatting, iteration, error recovery
- The blog doesn’t disclose exact prompts, retry strategies, or selection logic
- Community projects (augmentcode/augment-swebench-agent) had to reverse-engineer planning tools
- Same model, different scaffolds = different scores (but we can’t compare because architectures aren’t disclosed)
BetterBench verdict: A benchmark that doesn’t require reproducible submissions fails a fundamental quality criterion.
But there’s a deeper issue: Why should we measure only one agent architecture?
SWE-bench measures one academic agent (SWE-agent: ~11,300 lines, developed part-time by researchers). Meanwhile:
- Google, Amazon, OpenAI, Anthropic, Microsoft: Teams of dozens working full-time on production agents
- Startups (Poolside, Magic, Cognition): Venture-funded teams building commercial agents
- The market reality: Enterprise teams don’t use SWE-agent—they use Claude, Gemini, GitHub Copilot, Cursor, etc.
Spring AI Agents embraces this reality: Our AgentModel abstraction lets you measure ANY agent:
- Claude Code (Anthropic’s production agent)
- Gemini with code execution
- Amazon Q Developer
- Amp, Codex, or custom implementations
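As a rough sketch of the idea, assuming hypothetical names (the real AgentModel interface lives in Spring AI Agents and may differ), the abstraction reduces to a single seam that any agent, CLI-based or otherwise, can implement:

```java
// Rough sketch of the idea behind an AgentModel-style seam. Method names and types
// are assumptions for illustration; consult Spring AI Agents for the real interface.
import java.nio.file.Path;

interface AgentModel {
    AgentResult run(String goal, Path workspace);
}

record AgentResult(boolean succeeded, String summary) {}

// Hypothetical adapter that shells out to a CLI agent (e.g. a Claude Code or Gemini CLI binary).
class CliAgentModel implements AgentModel {
    private final String command;

    CliAgentModel(String command) {
        this.command = command;
    }

    @Override
    public AgentResult run(String goal, Path workspace) {
        try {
            Process process = new ProcessBuilder(command, goal)
                    .directory(workspace.toFile())
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            return new AgentResult(exitCode == 0, command + " exited with code " + exitCode);
        } catch (Exception e) {
            return new AgentResult(false, "Failed to run " + command + ": " + e.getMessage());
        }
    }
}
```

Because the benchmark harness would depend only on that seam, switching from one agent to another is configuration, not new scaffolding.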
Spring AI Bench follows suit: Measure the agents teams actually use, not just academic research artifacts. That’s what BetterBench calls for—benchmarks aligned with real-world usage.
And we provide one-click reproducibility: Every benchmark runs in a Docker container with a pre-configured environment. No "works on my machine" excuses. Clone, run, verify. That’s the BetterBench standard.
8.5. BetterBench Alignment Roadmap
For our roadmap toward alignment with Stanford’s 46-criteria framework, see BetterBench Alignment Roadmap.
Current focus areas:
- Design: User personas, domain experts, multi-dimensional metrics, judge framework
- Implementation: Open source, one-click Docker reproducibility, multi-provider support
- Documentation: Architecture docs, getting started, limitations, Apache 2.0 license
- Maintenance: Active development, GitHub feedback channels, maintainer team
Note: This is a self-assessment roadmap, not a validated BetterBench certification. We’re working toward alignment with these quality standards.
9. References
9.1. BetterBench
- Reuel et al. (2024), "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices," NeurIPS 2024 Datasets & Benchmarks Track
- Paper: arxiv.org/abs/2411.12990
- Website: betterbench.stanford.edu/
- Framework: 46 criteria across design, implementation, documentation, and maintenance stages
9.2. SWE-bench Research
Original SWE-bench:
- Jimenez et al. (2024), "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Paper: arxiv.org/abs/2310.06770
- Website: www.swebench.com/
SWE-bench-Java:
- Dakhel et al. (2024), "SWE-bench-Java: A Multi-Language Benchmark for Issue Resolving"
- Paper: arxiv.org/abs/2408.14354
- Key Results: GPT-4o ~7% (6/91), DeepSeek-V2 ~10% (9/91) on Verified Java tasks
SWE-bench-Live:
- Jimenez et al. (2025), "SWE-bench Goes Live!"
- Paper: arxiv.org/abs/2505.23419
- Key Finding: Best agent ~19% on Live vs 60%+ on Verified (same agent, 3x drop)
SWE-PolyBench:
- AWS (2025), "SWE-PolyBench: A Multilingual Benchmark for Code Agents"
- Paper: arxiv.org/abs/2504.08703
- Key Results: Python ~24%, TypeScript ~5%, Java moderate performance
SWE-bench-Illusion:
- Authors (2025), "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"
- Paper: arxiv.org/abs/2506.12286
- Focus: Contamination analysis across coding benchmarks
SWE-bench+:
- Authors (2024), "SWE-Bench+: Enhanced Coding Benchmark for LLMs"
- Key Findings: ~33% solution leakage, ~31% weak test suites, GPT-4 12% → 4% after filtering
9.3. Model Performance
Anthropic Claude Opus 4.1:
- Anthropic (2025), "Claude Opus 4.1"
- SWE-bench Verified: ~75% (Python)
Anthropic SWE-bench Engineering:
- Anthropic (2024), "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet"
- Description: Agent system engineering with scaffolding, prompt formatting, iteration, error recovery
9.4. Community Analysis
Runloop Blog:
- Runloop (2024), "SWE-bench Deep Dive: Unmasking the Limitations"
Community Implementations:
- augmentcode/augment-swebench-agent: Community fork based on Anthropic’s architecture
9.5. Related Frameworks
Spring AI Agents:
- Spring AI Community, "Spring AI Agents: The pragmatic integration layer for autonomous agents in Java enterprise development"
- Documentation: spring-ai-community.github.io/spring-ai-agents
- Key Concept: Paradigm shift from imperative (code every step) to declarative (describe the goal)