Judge API: Automated Agent Evaluation
The Judge API provides automated evaluation and verification of agent task execution. Instead of manually checking if your agent succeeded, judges provide programmatic verification with detailed feedback.
1. Why Evaluate Agent Execution?
Agents are non-deterministic. The same goal might succeed today and fail tomorrow due to:
- Network issues, file permissions, resource constraints
- LLM reasoning variations across runs
- Environmental differences (dependencies, configuration)
- Complexity of multi-step tasks
Manual verification doesn’t scale. You need automated, reliable evaluation built into your agent workflow.
2. The Judge Pattern
A Judge evaluates whether an agent achieved its goal by examining the agent's output, workspace state, and execution context.
// Without judge - manual checking
AgentClientResponse response = agentClientBuilder
.goal("Create a REST API")
.call();
// ❓ Did it work? Check manually?
// ❓ Are tests passing?
// ❓ Is the code correct?
// With judge - automated verification
AgentClientResponse response = agentClientBuilder
.goal("Create a REST API")
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
Judgment judgment = response.getJudgment();
if (judgment.pass()) {
deploy(); // Safe - judge verified success
} else {
alert("Agent failed: " + judgment.reasoning());
}
Real-world example: The code coverage agent uses a CoverageJudge to verify test coverage targets:
CoverageJudge judge = new CoverageJudge(80.0); // Target 80% coverage
AgentClientResponse response = agentClient
.goal("Increase JaCoCo test coverage to 80%")
.advisors(JudgeAdvisor.builder().judge(judge).build())
.run();
Judgment judgment = response.getJudgment();
if (judgment.pass() && judgment.score() instanceof NumericalScore score) {
System.out.println("Coverage achieved: " + score.value() + "%");
}
// Real results: 0% → 71.4% coverage on Spring gs-rest-service
3. Core Abstractions
The Judge API consists of four primary abstractions:
3.1. Judge Interface
The core evaluation interface:
public interface Judge {
Judgment judge(JudgmentContext context);
default CompletableFuture<Judgment> judgeAsync(JudgmentContext context) {
return CompletableFuture.supplyAsync(() -> judge(context));
}
}
Key characteristics:
- Single responsibility - One method: judge()
- Async support - Default async implementation via CompletableFuture
- Stateless - Judges don't maintain state between calls
- Composable - Judges can be combined (see the sketch below and the Jury Pattern)
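Composability in practice: the Judges.allOf helper (used again in the configuration example in Section 5) combines several judges into one; as the name suggests, all must pass. A minimal sketch:
Judge combined = Judges.allOf(
    new FileExistsJudge("hello.txt"),  // the file must exist
    new BuildSuccessJudge()            // and the build must succeed
);
Judgment judgment = combined.judge(context);  // context: a JudgmentContext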
3.2. JudgmentContext
Contains all information needed for evaluation:
public record JudgmentContext(
String goal, // What the agent was asked to do
Path workspace, // Where the agent worked
Optional<AgentOutput> agentOutput, // What the agent produced
Optional<AgentOutput> expectedOutput, // Expected result (if known)
Optional<List<String>> referenceData, // Context documents
Instant startedAt, // Execution start time
Duration executionTime, // How long execution took
Map<String, Object> metadata // Additional context
) {}
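In normal use the framework assembles this context for you. Purely to show what each field carries, here is a hand-built instance via the canonical record constructor above (the goal and workspace values are illustrative):
JudgmentContext context = new JudgmentContext(
    "Create hello.txt containing 'Hello World!'",  // goal
    Path.of("/tmp/agent-workspace"),               // workspace
    Optional.empty(),                              // agentOutput: none captured here
    Optional.empty(),                              // expectedOutput: unknown
    Optional.empty(),                              // referenceData
    Instant.now(),                                 // startedAt
    Duration.ofSeconds(30),                        // executionTime
    Map.of()                                       // metadata
);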
3.3. Judgment
The evaluation result:
public record Judgment(
JudgmentStatus status, // PASS, FAIL, ABSTAIN, ERROR
Score score, // Detailed score (boolean, numerical, categorical)
String reasoning, // Explanation of judgment
List<Check> checks, // Individual verification checks
Optional<Duration> elapsed, // Judgment duration
Map<String, Object> metadata // Additional metadata
) {
public boolean pass() {
return status == JudgmentStatus.PASS;
}
}
Convenience methods:
// Quick checks
if (judgment.pass()) { /* ... */ }
if (judgment.fail()) { /* ... */ }
if (judgment.abstain()) { /* ... */ }
// Score access
if (judgment.score() instanceof NumericalScore numerical) {
double value = numerical.normalized(); // 0.0 to 1.0
}
3.4. Score Types
Type-safe scoring with sealed interfaces:
public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {
Object value();
ScoreType type();
}
// Boolean: pass/fail
BooleanScore pass = new BooleanScore(true);
// Numerical: scored metrics
NumericalScore quality = new NumericalScore(8.5, 0, 10);
double normalized = quality.normalized(); // 0.85
// Categorical: classification
CategoricalScore level = new CategoricalScore(
"excellent",
List.of("poor", "good", "excellent")
);
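Because Score is sealed, a switch with Java 21 pattern matching can cover every variant exhaustively, with no default branch (a sketch over the types above):
Score score = judgment.score();
String summary = switch (score) {
    case BooleanScore b -> "pass/fail: " + b.value();
    case NumericalScore n -> "normalized: " + n.normalized();  // 0.0 to 1.0
    case CategoricalScore c -> "category: " + c.value();
};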
4. Judge Types
Judges fall into three main categories:
4.1. Deterministic Judges
Rule-based evaluation using file system checks, command execution, or assertions:
Judge | Purpose
---|---
FileExistsJudge | Verify file creation
FileContentJudge | Verify file contents
CommandJudge | Verify command success
BuildSuccessJudge | Verify build success
AssertJJudge | Custom assertions
See Deterministic Judges for details.
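Deterministic judges are also easy to hand-roll against the Judge interface from Section 3.1. The following sketch (a hypothetical NonEmptyFileJudge, not a framework class) checks that a workspace file exists and is non-empty:
public class NonEmptyFileJudge implements Judge {
    private final String relativePath;

    public NonEmptyFileJudge(String relativePath) {
        this.relativePath = relativePath;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Path file = context.workspace().resolve(relativePath);
        boolean pass;
        try {
            pass = Files.exists(file) && Files.size(file) > 0;
        } catch (IOException e) {
            pass = false;
        }
        return new Judgment(
            pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL,
            new BooleanScore(pass),
            pass ? "File exists and is non-empty: " + file
                 : "File missing or empty: " + file,
            List.of(),         // no fine-grained checks in this sketch
            Optional.empty(),  // elapsed time not measured
            Map.of());
    }
}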
4.2. LLM-Powered Judges
AI-based evaluation using language models:
Judge | Purpose
---|---
CorrectnessJudge | Semantic correctness
CriteriaJudge | Custom criteria evaluation
FaithfulnessJudge | Ground output in context
SimpleCriteriaJudge | Simple yes/no criteria
See LLM-Powered Judges for details.
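Usage mirrors the deterministic judges. The constructor forms below are the ones that appear elsewhere on this page (the threshold overload is shown in Section 8.2):
// Pass threshold of 0.8: the judge passes only at or above that score
Judge correctness = new CorrectnessJudge(chatClient, 0.8);
Judgment judgment = correctness.judge(context);
System.out.println(judgment.reasoning());  // the model's explanation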
4.3. Agent as Judge
Use an agent to evaluate another agent’s work:
AgentJudge codeReviewer = AgentJudge.builder()
.agentClient(agentClient)
.goal("Review the code for bugs, security issues, and code quality")
.build();
Judgment review = codeReviewer.judge(context);
See Agent as Judge for details.
5. Configuration for Driver Programs
Driver programs (like spring-ai-bench) can use JudgeSpec for YAML-based judge configuration:
// JudgeSpec - Pure data class for configuration
public class JudgeSpec {
private String type; // "file-exists", "file-content", etc.
private String path; // File path
private String expected; // Expected content
private String matchMode; // "EXACT", "CONTAINS", etc.
private Map<String, Object> config; // Additional configuration
// Getters and setters
}
YAML configuration example:
judge:
type: file-content
path: hello.txt
expected: "Hello World!"
matchMode: EXACT
Driver program instantiation pattern:
Driver programs load JudgeSpec from YAML and instantiate judges using their preferred dependency injection mechanism:
// Spring DI pattern (recommended)
@Configuration
public class JudgeConfiguration {
@Bean(name = "hello-world")
public Judge helloWorldJudge() {
return Judges.allOf(
new FileExistsJudge("hello.txt"),
new FileContentJudge("hello.txt", "Hello World!",
FileContentJudge.MatchMode.EXACT)
);
}
}
// Or manual instantiation from JudgeSpec
JudgeSpec spec = loadFromYaml("judge.yaml");
Judge judge = switch (spec.getType()) {
case "file-exists" -> new FileExistsJudge(spec.getPath());
case "file-content" -> new FileContentJudge(
spec.getPath(),
spec.getExpected(),
FileContentJudge.MatchMode.valueOf(spec.getMatchMode())
);
default -> throw new IllegalArgumentException("Unknown judge type: " + spec.getType());
};
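The loadFromYaml helper above is deliberately left to the driver program. One possible implementation, sketched with Jackson's YAML module (com.fasterxml.jackson.dataformat.yaml; an assumption, since any YAML binder that maps the keys onto JudgeSpec works):
ObjectMapper yaml = new ObjectMapper(new YAMLFactory());
JsonNode root = yaml.readTree(Files.newInputStream(Path.of("judge.yaml")));
// Bind the nested "judge:" block onto the JudgeSpec data class
JudgeSpec spec = yaml.treeToValue(root.get("judge"), JudgeSpec.class);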
Design principles:
- JudgeSpec is a pure data class with no behavior
- Judge instantiation is left to driver programs
- Supports any dependency injection mechanism
- Framework-agnostic configuration
6. Integration with AgentClient
Judges integrate via the JudgeAdvisor:
// Single judge
AgentClientResponse response = agentClientBuilder
.goal("Build and test the application")
.workingDirectory(projectRoot)
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
// Multiple judges
AgentClientResponse response = agentClientBuilder
.goal("Generate documentation")
.advisors(
JudgeAdvisor.builder()
.judge(new FileExistsJudge("README.md"))
.build(),
JudgeAdvisor.builder()
.judge(new CorrectnessJudge(chatClient))
.build()
)
.call();
See JudgeAdvisor for integration details.
7. Ensemble Evaluation: The Jury Pattern
Combine multiple judges for robust evaluation:
Jury qualityJury = Juries.builder()
.addJudge("build", new BuildSuccessJudge())
.addJudge("correctness", new CorrectnessJudge(chatClient))
.addJudge("quality", new CodeQualityJudge(chatClient))
.votingStrategy(VotingStrategies.weightedAverage(Map.of(
"build", 0.5,
"correctness", 0.3,
"quality", 0.2
)))
.build();
Verdict verdict = qualityJury.vote(context);
// Examine overall result
if (verdict.aggregated().pass()) {
System.out.println("Quality bar met!");
}
// Examine individual judges
verdict.individual().forEach(judgment -> {
System.out.println(judgment.score());
});
See Jury Pattern for ensemble evaluation.
8. Research Foundations
The Spring AI Agents Judge API synthesizes design patterns from leading AI evaluation frameworks. This ensures production-grade architecture informed by real-world usage.
8.1. Framework Influences
Framework | Language | Key Contribution
---|---|---
judges | Python | Core abstraction, jury ensemble pattern
deepeval | Python | G-Eval, threshold-based success, metrics
ragas | Python | Multi-step evaluation, faithfulness, self-consistency
evals | Python | Systematic evaluation, reproducibility, recording
JudgeLM | Python | Judge type taxonomy, pairwise comparison, prompt templates
langfuse | TypeScript | Observability as cross-cutting concern
8.2. Key Patterns Adopted
From these frameworks, we adopted:
1. Clean Core Interface (from judges): Single judge() method with async support. A jury is itself a judge, enabling recursive composition.
2. Flexible Scoring (from judges, deepeval): Type-safe score variants (boolean, numerical, categorical) with normalization support.
3. Ensemble Pattern (from judges): Jury extends Judge with multiple voting strategies and parallel execution.
4. Multi-Step Evaluation (from ragas): Break complex evaluation into stages: Decompose → Verify → Aggregate (e.g., FaithfulnessJudge).
5. Self-Consistency (from ragas): Run judgment N times with majority voting for robustness (SimpleCriteriaJudge with strictness parameter; see the sketch after this list).
6. G-Eval Pattern (from deepeval): Auto-generate evaluation steps from criteria using an LLM, then execute structured chain-of-thought reasoning.
7. Threshold-Based Success (from deepeval): Metrics have configurable thresholds determining pass/fail (e.g., new CorrectnessJudge(chatClient, 0.8)).
8. Pairwise Comparison (from JudgeLM): Compare two agent outputs to determine which is better (PairwiseJudge).
9. Reproducibility (from evals): Deterministic evaluation via timestamps, metadata, and structured recording.
10. Observability as Cross-Cutting (from langfuse): Don't couple the judge interface to observability; use the decorator pattern or AOP for tracing.
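To make the self-consistency pattern concrete, here is a sketch in terms of the public Judge API. SimpleCriteriaJudge implements this internally via its strictness parameter; this hypothetical helper only illustrates the idea:
static Judgment selfConsistent(Judge judge, JudgmentContext context, int runs) {
    long passes = java.util.stream.IntStream.range(0, runs)
        .mapToObj(i -> judge.judge(context))  // re-run the same judgment
        .filter(Judgment::pass)
        .count();
    boolean majority = passes * 2 > runs;     // strict majority vote
    return new Judgment(
        majority ? JudgmentStatus.PASS : JudgmentStatus.FAIL,
        new NumericalScore(passes, 0, runs),  // passes out of runs
        passes + " of " + runs + " runs passed",
        List.of(), Optional.empty(), Map.of());
}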
8.3. Unique Spring AI Agents Contributions
Beyond synthesizing existing patterns, we added:
1. AssertJ Integration: Leverage 2000+ AssertJ assertions with AssertJJudge and SoftAssertions for declarative testing.
AssertJJudge.create(context -> judge -> {
    String output = context.agentOutput().get().asText();
    judge.assertThat(output).contains("Hello");
    judge.assertThat(output).hasLineCount(5);
});
2. Agent-as-Judge: Use AgentClient for judgment; agents evaluate other agents with structured reasoning.
3. Workspace-Centric Context: Agent-specific evaluation with Path workspace, file system operations, and build integration.
4. Rich Agent Output: Beyond string output, a sealed AgentOutput interface with TextOutput, StructuredOutput, and MultimodalOutput.
5. Spring Integration: Native Spring Boot integration with ChatClient from Spring AI, bean-based configuration, and future auto-configuration support.
9. Production Patterns
9.1. Pattern 1: CI/CD Integration
Verify builds and tests before deployment:
@Service
public class ContinuousIntegration {
private final AgentClient.Builder agentClientBuilder;
public boolean fixAndDeploy(Path projectRoot) {
AgentClientResponse response = agentClientBuilder
.goal("Fix failing tests and run 'mvn clean install'")
.workingDirectory(projectRoot)
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
Judgment judgment = response.getJudgment();
if (judgment.pass()) {
deploy(projectRoot);
return true;
} else {
alertTeam("Build failed: " + judgment.reasoning());
return false;
}
}
}
9.2. Pattern 2: Quality Gates
Enforce quality standards:
Jury qualityGate = Juries.builder()
.addJudge("build", new BuildSuccessJudge())
.addJudge("coverage", new CoverageJudge(80.0))
.addJudge("correctness", new CorrectnessJudge(chatClient))
.votingStrategy(VotingStrategies.allMustPass())
.build();
Verdict verdict = qualityGate.vote(context);
if (!verdict.aggregated().pass()) {
throw new QualityGateException("Quality standards not met");
}
9.3. Pattern 3: Self-Healing Systems
Agents verify and retry:
int maxRetries = 3;
Judgment judgment = null;
for (int attempt = 0; attempt < maxRetries; attempt++) {
AgentClientResponse response = agentClientBuilder
.goal("Fix the failing tests")
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
judgment = response.getJudgment();
if (judgment.pass()) {
break; // Success!
}
logger.warn("Attempt {} failed: {}", attempt + 1, judgment.reasoning());
}
if (!judgment.pass()) {
escalateToHuman(judgment);
}
10. Next Steps
Explore the Judge API in depth:
- Start here: JudgeAdvisor - Integration with AgentClient (primary entry point)
- Deterministic: Deterministic Judges - Rule-based evaluation
- LLM-Powered: LLM-Powered Judges - AI-based evaluation
- Agent as Judge: Agent as Judge - Agents evaluating agents
- Ensemble: Jury Pattern - Combine judges for robust evaluation
- Research: Research Foundations (coming soon) - Complete design rationale
11. Further Reading
- Anthropic SDK Blog: Building Agents with the Claude Agent SDK - Agent architecture foundations
- Your First Judge - Practical introduction
- CLI Agents - Understanding autonomous agents
The Judge API transforms agents from "fire and forget" tools into production-grade, self-verifying systems with automated quality assurance built into every execution.