Judge Framework Integration
Automated agent evaluation using the Judge framework from Spring AI Agents.
1. Overview
Spring AI Bench uses the Judge framework from Spring AI Agents to verify agent execution results. The Judge framework provides deterministic, LLM-powered, and ensemble evaluation capabilities.
2. Architecture
Spring AI Bench follows a driver program pattern:
- spring-ai-agents provides the Judge framework (reusable evaluation library)
- spring-ai-bench uses the Judge framework to evaluate benchmark results
This separation allows the Judge framework to be used by any evaluation system, not just benchmarks.
3. Benchmark Modules
3.1. bench-agents Module
The bench-agents module uses the Judge framework for live agent integration testing:
// Example from ClaudeAgentModelIntegrationTest
Judge judge = Judges.allOf(
    new FileExistsJudge("hello.txt"),
    new FileContentJudge("hello.txt", "Hello World!",
        FileContentJudge.MatchMode.EXACT)
);

JudgmentContext context = JudgmentContext.builder()
    .workspace(workspaceDir)
    .build();

Judgment judgment = judge.judge(context);

if (judgment.pass()) {
    // Agent succeeded
} else {
    // Agent failed: judgment.reasoning()
}
Key features:
- Direct use of Judge framework judges (FileExistsJudge, FileContentJudge)
- Compose judges using Judges.allOf() for multi-criteria evaluation
- Create JudgmentContext from workspace and execution metadata
- Get structured Judgment results with pass/fail status and reasoning
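To make the composition idea concrete, the following is a minimal, self-contained sketch of how an all-of judge could combine individual checks. The types Context, Judgment, and Judge below are simplified stand-ins written for this example; they are assumptions, not the actual Spring AI Agents classes, whose signatures may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; the real Judge, Judgment and
// JudgmentContext live in Spring AI Agents and may look different.
public class AllOfSketch {

    record Context(Path workspace) {}

    record Judgment(boolean pass, String reasoning) {}

    interface Judge {
        Judgment judge(Context context);
    }

    // Passes when the named file exists in the workspace.
    static Judge fileExists(String name) {
        return ctx -> {
            boolean exists = Files.exists(ctx.workspace().resolve(name));
            return new Judgment(exists, name + (exists ? " exists" : " is missing"));
        };
    }

    // Passes when the file's content equals the expected text exactly.
    static Judge fileContentEquals(String name, String expected) {
        return ctx -> {
            try {
                String actual = Files.readString(ctx.workspace().resolve(name));
                boolean same = actual.equals(expected);
                return new Judgment(same, name + (same ? " matches" : " differs from expected content"));
            } catch (IOException e) {
                return new Judgment(false, name + " could not be read");
            }
        };
    }

    // Passes only if every judge passes; keeps each judge's reasoning.
    static Judge allOf(Judge... judges) {
        return ctx -> {
            boolean allPass = true;
            List<String> reasons = new ArrayList<>();
            for (Judge j : judges) {
                Judgment result = j.judge(ctx);
                allPass &= result.pass();
                reasons.add(result.reasoning());
            }
            return new Judgment(allPass, String.join("; ", reasons));
        };
    }

    // Runs the composed judge against a fresh temp workspace,
    // optionally creating hello.txt first.
    static Judgment runDemo(boolean createFile) {
        try {
            Path workspace = Files.createTempDirectory("allof-sketch");
            if (createFile) {
                Files.writeString(workspace.resolve("hello.txt"), "Hello World!");
            }
            Judge judge = allOf(
                fileExists("hello.txt"),
                fileContentEquals("hello.txt", "Hello World!"));
            return judge.judge(new Context(workspace));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Judgment judgment = runDemo(true);
        System.out.println(judgment.pass() + ": " + judgment.reasoning());
    }
}
```

In this sketch every judge is evaluated even after one fails, so the combined reasoning lists all failed criteria at once. Whether the real framework short-circuits is not stated here.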
3.2. bench-core CLI Runner
The bench-core module uses Spring Dependency Injection to configure judges:
@Configuration
public class JudgeConfiguration {

    @Bean(name = "hello-world")
    public Judge helloWorldJudge() {
        return Judges.allOf(
            new FileExistsJudge("hello.txt"),
            new FileContentJudge("hello.txt", "Hello World!",
                FileContentJudge.MatchMode.EXACT)
        );
    }
}
BenchRunner integration:
public class BenchRunner {

    private final Map<String, Judge> judges;

    public BenchRunner(Map<String, Judge> judges) {
        this.judges = judges;
    }

    public void execute(RunSpec runSpec) throws Exception {
        // ... execute agent ...

        // Look up judge by case ID
        Judge judge = judges.get(runSpec.getCaseId());

        // Create judgment context
        JudgmentContext context = JudgmentContext.builder()
            .workspace(workspace)
            .build();

        // Execute judgment
        Judgment judgment = judge.judge(context);
        String status = judgment.pass() ? "success" : "failure";

        // Generate reports with judgment
        generateReports(runSpec, judgment);
    }
}
Key patterns:
- Spring DI as factory - @Bean methods create judge instances
- Judge lookup by case ID - judges.get(runSpec.getCaseId())
- Manual DI for CLI mode - BenchMain creates judges map when running without Spring context
- Structured results - Judgment object with pass/fail, reasoning, and checks
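The manual wiring and lookup-by-case-ID patterns can be sketched without Spring as follows. This is an illustration under stated assumptions: the Judge interface here is a stand-in reduced to a predicate over output text, and createJudges and findJudge are hypothetical helper names, not methods of BenchMain or BenchRunner.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

public class ManualJudgeRegistrySketch {

    // Stand-in for the framework's Judge: a yes/no check over the agent's output text.
    interface Judge extends Predicate<String> {}

    // Builds the case-ID -> judge map by hand, playing the role that
    // named @Bean methods play when a Spring context is available.
    static Map<String, Judge> createJudges() {
        Map<String, Judge> judges = new LinkedHashMap<>();
        judges.put("hello-world", output -> output.equals("Hello World!"));
        judges.put("my-benchmark", output -> output.contains("Expected content"));
        return judges;
    }

    // Looks up a judge by case ID; an empty Optional signals an unregistered case
    // instead of letting a null escape into the caller.
    static Optional<Judge> findJudge(Map<String, Judge> judges, String caseId) {
        return Optional.ofNullable(judges.get(caseId));
    }

    public static void main(String[] args) {
        Map<String, Judge> judges = createJudges();
        String caseId = "hello-world";
        Judge judge = findJudge(judges, caseId)
            .orElseThrow(() -> new IllegalStateException("No judge registered for case ID: " + caseId));
        System.out.println(judge.test("Hello World!") ? "success" : "failure");
    }
}
```

Returning an Optional from the lookup makes the "no judge for this case" situation explicit, which a plain Map.get call would surface only as a later NullPointerException.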
4. Report Generation
Reports use the Judge framework's Judgment results (pass/fail status, reasoning, and checks).
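As a rough illustration of how a report might be derived from a judgment, the sketch below renders a plain-text summary. The Judgment and Check records and the render method are hypothetical stand-ins invented for this example; the actual report format and the real Judgment accessors are not shown in this document.

```java
import java.util.List;

public class JudgmentReportSketch {

    // Hypothetical stand-ins for the framework's result types.
    record Check(String name, boolean passed) {}

    record Judgment(boolean pass, String reasoning, List<Check> checks) {}

    // Renders one benchmark case as a short plain-text report.
    static String render(String caseId, Judgment judgment) {
        StringBuilder sb = new StringBuilder();
        sb.append("case: ").append(caseId).append('\n');
        sb.append("status: ").append(judgment.pass() ? "success" : "failure").append('\n');
        sb.append("reasoning: ").append(judgment.reasoning()).append('\n');
        for (Check check : judgment.checks()) {
            sb.append("  [").append(check.passed() ? "PASS" : "FAIL").append("] ")
              .append(check.name()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Judgment judgment = new Judgment(
            false,
            "hello.txt is missing",
            List.of(new Check("file-exists", false), new Check("file-content", false)));
        System.out.print(render("hello-world", judgment));
    }
}
```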
5. Judge Types Available
Spring AI Bench uses deterministic judges from the Judge framework:
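Judge | Purpose | Example |
---|---|---|
FileExistsJudge | Verify file creation | new FileExistsJudge("hello.txt") |
FileContentJudge | Verify file contents | new FileContentJudge("hello.txt", "Hello World!", FileContentJudge.MatchMode.EXACT) |
Judges.allOf() | Compose multiple judges | Judges.allOf(judge1, judge2) |
 | Any judge must pass | |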
See the Spring AI Agents documentation for the complete Judge API.
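The difference between "all judges must pass" and "any judge must pass" can be sketched as below. The Judge interface and the anyOf combinator here are hypothetical stand-ins written for this example; only Judges.allOf() is named in this document, so consult the Spring AI Agents documentation for the real any-of API.

```java
import java.util.Arrays;

public class AnyOfSketch {

    // Hypothetical stand-in: a judge reduced to a yes/no check over some text.
    interface Judge {
        boolean passes(String output);
    }

    // Passes only when every judge passes (an empty list passes vacuously).
    static Judge allOf(Judge... judges) {
        return output -> Arrays.stream(judges).allMatch(j -> j.passes(output));
    }

    // Passes when at least one judge passes (an empty list never passes).
    static Judge anyOf(Judge... judges) {
        return output -> Arrays.stream(judges).anyMatch(j -> j.passes(output));
    }

    public static void main(String[] args) {
        Judge saysHello = output -> output.contains("Hello");
        Judge saysGoodbye = output -> output.contains("Goodbye");

        String output = "Hello World!";
        System.out.println("allOf: " + allOf(saysHello, saysGoodbye).passes(output));
        System.out.println("anyOf: " + anyOf(saysHello, saysGoodbye).passes(output));
    }
}
```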
6. Adding New Benchmark Cases
To add a new benchmark case with judge evaluation:
1. Create a judge bean in JudgeConfiguration:
@Bean(name = "my-benchmark")
public Judge myBenchmarkJudge() {
    return Judges.allOf(
        new FileExistsJudge("output.txt"),
        new FileContentJudge("output.txt", "Expected content",
            FileContentJudge.MatchMode.CONTAINS)
    );
}
2. The bean name must match the case ID from your YAML configuration.
3. Create your benchmark YAML in bench-tracks/{case-id}/cases/{case-id}.yaml
4. Run the benchmark:
bench run --case my-benchmark
The runner will automatically look up your judge bean and execute it after the agent completes.
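Because the lookup depends on the bean name matching the case ID exactly, a small consistency check can catch typos before a run. The following is a sketch only: missingJudges is a hypothetical helper, and how the set of case IDs is obtained (for example, by scanning the YAML files) is left out as an assumption.

```java
import java.util.Set;
import java.util.TreeSet;

public class CaseIdCheckSketch {

    // Returns the case IDs that have no judge registered under the same name,
    // sorted so that error messages are stable.
    static Set<String> missingJudges(Set<String> caseIds, Set<String> judgeNames) {
        Set<String> missing = new TreeSet<>(caseIds);
        missing.removeAll(judgeNames);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> caseIds = Set.of("hello-world", "my-benchmark");
        Set<String> judgeNames = Set.of("hello-world", "my_benchmark"); // typo: underscore

        Set<String> missing = missingJudges(caseIds, judgeNames);
        if (!missing.isEmpty()) {
            System.out.println("Cases without a matching judge: " + missing);
        }
    }
}
```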
7. Testing with Judges
Integration tests use the Judge framework directly:
@Test
void agentCreatesCorrectFile() throws Exception {
    Path workspace = Files.createTempDirectory("test");

    // Run agent
    agentModel.execute(AgentTaskRequest.builder()
        .goal("Create hello.txt with content: Hello World!")
        .workingDirectory(workspace)
        .build());

    // Verify with judge
    Judge judge = Judges.allOf(
        new FileExistsJudge("hello.txt"),
        new FileContentJudge("hello.txt", "Hello World!",
            FileContentJudge.MatchMode.EXACT)
    );

    JudgmentContext context = JudgmentContext.builder()
        .workspace(workspace)
        .build();

    Judgment judgment = judge.judge(context);

    assertThat(judgment.pass()).isTrue();
}
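The examples in this document use two match modes, EXACT and CONTAINS. The sketch below shows one plausible reading of the difference, using a stand-in MatchMode enum and a hypothetical matches helper. It assumes EXACT means plain string equality with no trimming; the real FileContentJudge may normalize whitespace differently.

```java
public class MatchModeSketch {

    // Stand-in for FileContentJudge.MatchMode; only the two modes used in this document.
    enum MatchMode { EXACT, CONTAINS }

    // Assumed semantics: EXACT is string equality, CONTAINS is substring search.
    static boolean matches(String actual, String expected, MatchMode mode) {
        return switch (mode) {
            case EXACT -> actual.equals(expected);
            case CONTAINS -> actual.contains(expected);
        };
    }

    public static void main(String[] args) {
        // Agents often end files with a newline, which matters under strict equality.
        String fileContent = "Hello World!\n";
        System.out.println("EXACT:    " + matches(fileContent, "Hello World!", MatchMode.EXACT));
        System.out.println("CONTAINS: " + matches(fileContent, "Hello World!", MatchMode.CONTAINS));
    }
}
```

Under these assumed semantics, a trailing newline written by the agent would fail an exact comparison but pass a contains comparison, which is worth keeping in mind when choosing a mode for a new case.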
8. Migration Notes
Spring AI Bench has fully migrated from temporary verifier infrastructure to the Judge framework:
- bench-agents: ✅ Complete - Uses Judge framework directly
- bench-core: ✅ Complete - Uses Spring DI for judge configuration
All verification now uses the Judge framework from Spring AI Agents.
9. Next Steps
- AgentSpec API - Agent configuration
- AgentRunner API - Agent execution
- Spring AI Agents Judge Framework - Complete Judge API documentation