Judge API: Automated Agent Evaluation
The Judge API provides automated evaluation and verification of agent task execution. Instead of manually checking if your agent succeeded, judges provide programmatic verification with detailed feedback.
1. Why Evaluate Agent Execution?
Agents are non-deterministic. The same goal might succeed today and fail tomorrow due to:
- Network issues, file permissions, resource constraints
- LLM reasoning variations across runs
- Environmental differences (dependencies, configuration)
- Complexity of multi-step tasks
Manual verification doesn’t scale. You need automated, reliable evaluation built into your agent workflow.
2. The Judge Pattern
A Judge evaluates whether an agent achieved its goal by examining the agent's output, workspace state, and execution context.
// Without judge - manual checking
AgentClientResponse response = agentClientBuilder
.goal("Create a REST API")
.call();
// ❓ Did it work? Check manually?
// ❓ Are tests passing?
// ❓ Is the code correct?
// With judge - automated verification
AgentClientResponse response = agentClientBuilder
.goal("Create a REST API")
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
Judgment judgment = response.getJudgment();
if (judgment.pass()) {
deploy(); // Safe - judge verified success
} else {
alert("Agent failed: " + judgment.reasoning());
}
Real-world example: The code coverage agent uses a CoverageJudge to verify test coverage targets:
CoverageJudge judge = new CoverageJudge(80.0); // Target 80% coverage
AgentClientResponse response = agentClient
.goal("Increase JaCoCo test coverage to 80%")
.advisors(JudgeAdvisor.builder().judge(judge).build())
.run();
Judgment judgment = response.getJudgment();
if (judgment.pass() && judgment.score() instanceof NumericalScore score) {
System.out.println("Coverage achieved: " + score.value() + "%");
}
// Real results: 0% → 71.4% coverage on Spring gs-rest-service
3. Core Abstractions
The Judge API consists of four primary abstractions:
3.1. Judge Interface
The core evaluation interface:
public interface Judge {
Judgment judge(JudgmentContext context);
default CompletableFuture<Judgment> judgeAsync(JudgmentContext context) {
return CompletableFuture.supplyAsync(() -> judge(context));
}
}
Key characteristics:
- Single responsibility - One method: judge()
- Async support - Default async implementation via CompletableFuture
- Stateless - Judges don't maintain state between calls
- Composable - Judges can be combined (see the sketch below and the Jury Pattern)
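Composability in practice: the Judges.allOf helper (used again in the configuration example in Section 5) combines several judges into one; as the name suggests, all must pass. A minimal sketch:
Judge combined = Judges.allOf(
    new FileExistsJudge("hello.txt"),  // the file must exist
    new BuildSuccessJudge()            // and the build must succeed
);
Judgment judgment = combined.judge(context);  // context: a JudgmentContext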
3.2. JudgmentContext
Contains all information needed for evaluation:
public record JudgmentContext(
String goal, // What the agent was asked to do
Path workspace, // Where the agent worked
Optional<AgentOutput> agentOutput, // What the agent produced
Optional<AgentOutput> expectedOutput, // Expected result (if known)
Optional<List<String>> referenceData, // Context documents
Instant startedAt, // Execution start time
Duration executionTime, // How long execution took
Map<String, Object> metadata // Additional context
) {}
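In normal use the framework assembles this context for you. Purely to show what each field carries, here is a hand-built instance via the canonical record constructor above (the goal and workspace values are illustrative):
JudgmentContext context = new JudgmentContext(
    "Create hello.txt containing 'Hello World!'",  // goal
    Path.of("/tmp/agent-workspace"),               // workspace
    Optional.empty(),                              // agentOutput: none captured here
    Optional.empty(),                              // expectedOutput: unknown
    Optional.empty(),                              // referenceData
    Instant.now(),                                 // startedAt
    Duration.ofSeconds(30),                        // executionTime
    Map.of()                                       // metadata
);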
3.3. Judgment
The evaluation result:
public record Judgment(
JudgmentStatus status, // PASS, FAIL, ABSTAIN, ERROR
Score score, // Detailed score (boolean, numerical, categorical)
String reasoning, // Explanation of judgment
List<Check> checks, // Individual verification checks
Optional<Duration> elapsed, // Judgment duration
Map<String, Object> metadata // Additional metadata
) {
public boolean pass() {
return status == JudgmentStatus.PASS;
}
}
Convenience methods:
// Quick checks
if (judgment.pass()) { /* ... */ }
if (judgment.fail()) { /* ... */ }
if (judgment.abstain()) { /* ... */ }
// Score access
if (judgment.score() instanceof NumericalScore numerical) {
double value = numerical.normalized(); // 0.0 to 1.0
}
3.4. Score Types
Type-safe scoring with sealed interfaces:
public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {
Object value();
ScoreType type();
}
// Boolean: pass/fail
BooleanScore pass = new BooleanScore(true);
// Numerical: scored metrics
NumericalScore quality = new NumericalScore(8.5, 0, 10);
double normalized = quality.normalized(); // 0.85
// Categorical: classification
CategoricalScore level = new CategoricalScore(
"excellent",
List.of("poor", "good", "excellent")
);
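Because Score is sealed, a switch with Java 21 pattern matching can cover every variant exhaustively, with no default branch (a sketch over the types above):
Score score = judgment.score();
String summary = switch (score) {
    case BooleanScore b -> "pass/fail: " + b.value();
    case NumericalScore n -> "normalized: " + n.normalized();  // 0.0 to 1.0
    case CategoricalScore c -> "category: " + c.value();
};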
4. Judge Types
Judges fall into three main categories:
4.1. Deterministic Judges
Rule-based evaluation using file system checks, command execution, or assertions:
Judge | Purpose
---|---
FileExistsJudge | Verify file creation
FileContentJudge | Verify file contents
CommandJudge | Verify command success
BuildSuccessJudge | Verify build success
AssertJJudge | Custom assertions
See Deterministic Judges for details.
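Deterministic judges are also easy to hand-roll against the Judge interface from Section 3.1. The following sketch (a hypothetical NonEmptyFileJudge, not a framework class) checks that a workspace file exists and is non-empty:
public class NonEmptyFileJudge implements Judge {
    private final String relativePath;

    public NonEmptyFileJudge(String relativePath) {
        this.relativePath = relativePath;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Path file = context.workspace().resolve(relativePath);
        boolean pass;
        try {
            pass = Files.exists(file) && Files.size(file) > 0;
        } catch (IOException e) {
            pass = false;
        }
        return new Judgment(
            pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL,
            new BooleanScore(pass),
            pass ? "File exists and is non-empty: " + file
                 : "File missing or empty: " + file,
            List.of(),         // no fine-grained checks in this sketch
            Optional.empty(),  // elapsed time not measured
            Map.of());
    }
}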
4.2. LLM-Powered Judges
AI-based evaluation using language models:
Judge | Purpose
---|---
CorrectnessJudge | Semantic correctness
CriteriaJudge | Custom criteria evaluation
FaithfulnessJudge | Ground output in context
SimpleCriteriaJudge | Simple yes/no criteria
See LLM-Powered Judges for details.
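Usage mirrors the deterministic judges. The constructor forms below are the ones that appear elsewhere on this page (the threshold overload is shown in Section 8.2):
// Pass threshold of 0.8: the judge passes only at or above that score
Judge correctness = new CorrectnessJudge(chatClient, 0.8);
Judgment judgment = correctness.judge(context);
System.out.println(judgment.reasoning());  // the model's explanation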
4.3. Agent as Judge
Use an agent to evaluate another agent’s work:
AgentJudge codeReviewer = AgentJudge.builder()
.agentClient(agentClient)
.goal("Review the code for bugs, security issues, and code quality")
.build();
Judgment review = codeReviewer.judge(context);
See Agent as Judge for details.
5. Configuration for Driver Programs
Driver programs (like spring-ai-bench) can use JudgeSpec for YAML-based judge configuration:
// JudgeSpec - Pure data class for configuration
public class JudgeSpec {
private String type; // "file-exists", "file-content", etc.
private String path; // File path
private String expected; // Expected content
private String matchMode; // "EXACT", "CONTAINS", etc.
private Map<String, Object> config; // Additional configuration
// Getters and setters
}
YAML configuration example:
judge:
type: file-content
path: hello.txt
expected: "Hello World!"
matchMode: EXACT
Driver program instantiation pattern:
Driver programs load JudgeSpec from YAML and instantiate judges using their preferred dependency injection mechanism:
// Spring DI pattern (recommended)
@Configuration
public class JudgeConfiguration {
@Bean(name = "hello-world")
public Judge helloWorldJudge() {
return Judges.allOf(
new FileExistsJudge("hello.txt"),
new FileContentJudge("hello.txt", "Hello World!",
FileContentJudge.MatchMode.EXACT)
);
}
}
// Or manual instantiation from JudgeSpec
JudgeSpec spec = loadFromYaml("judge.yaml");
Judge judge = switch (spec.getType()) {
case "file-exists" -> new FileExistsJudge(spec.getPath());
case "file-content" -> new FileContentJudge(
spec.getPath(),
spec.getExpected(),
FileContentJudge.MatchMode.valueOf(spec.getMatchMode())
);
default -> throw new IllegalArgumentException("Unknown judge type: " + spec.getType());
};
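The loadFromYaml helper above is deliberately left to the driver program. One possible implementation, sketched with Jackson's YAML module (com.fasterxml.jackson.dataformat.yaml; an assumption, since any YAML binder that maps the keys onto JudgeSpec works):
ObjectMapper yaml = new ObjectMapper(new YAMLFactory());
JsonNode root = yaml.readTree(Files.newInputStream(Path.of("judge.yaml")));
// Bind the nested "judge:" block onto the JudgeSpec data class
JudgeSpec spec = yaml.treeToValue(root.get("judge"), JudgeSpec.class);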
Design principles:
- JudgeSpec is a pure data class with no behavior
- Judge instantiation is left to driver programs
- Supports any dependency injection mechanism
- Framework-agnostic configuration
6. Integration with AgentClient
Judges integrate via the JudgeAdvisor:
// Single judge
AgentClientResponse response = agentClientBuilder
.goal("Build and test the application")
.workingDirectory(projectRoot)
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
// Multiple judges
AgentClientResponse response = agentClientBuilder
.goal("Generate documentation")
.advisors(
JudgeAdvisor.builder()
.judge(new FileExistsJudge("README.md"))
.build(),
JudgeAdvisor.builder()
.judge(new CorrectnessJudge(chatClient))
.build()
)
.call();
See JudgeAdvisor for integration details.
7. Ensemble Evaluation: The Jury Pattern
Combine multiple judges for robust evaluation:
Jury qualityJury = Juries.builder()
.addJudge("build", new BuildSuccessJudge())
.addJudge("correctness", new CorrectnessJudge(chatClient))
.addJudge("quality", new CodeQualityJudge(chatClient))
.votingStrategy(VotingStrategies.weightedAverage(Map.of(
"build", 0.5,
"correctness", 0.3,
"quality", 0.2
)))
.build();
Verdict verdict = qualityJury.vote(context);
// Examine overall result
if (verdict.aggregated().pass()) {
System.out.println("Quality bar met!");
}
// Examine individual judges
verdict.individual().forEach(judgment -> {
System.out.println(judgment.score());
});
See Jury Pattern for ensemble evaluation.
8. Research Foundations
The Spring AI Agents Judge API synthesizes design patterns from leading AI evaluation frameworks. This ensures production-grade architecture informed by real-world usage.
8.1. Framework Influences
Framework | Language | Key Contribution
---|---|---
judges | Python | Core abstraction, jury ensemble pattern
deepeval | Python | G-Eval, threshold-based success, metrics
ragas | Python | Multi-step evaluation, faithfulness, self-consistency
evals | Python | Systematic evaluation, reproducibility, recording
JudgeLM | Python | Judge type taxonomy, pairwise comparison, prompt templates
langfuse | TypeScript | Observability as cross-cutting concern
8.2. Key Patterns Adopted
From these frameworks, we adopted:
1. Clean Core Interface (from judges): Single judge() method with async support. A jury is itself a judge, enabling recursive composition.
2. Flexible Scoring (from judges, deepeval): Type-safe score variants (boolean, numerical, categorical) with normalization support.
3. Ensemble Pattern (from judges): Jury extends Judge with multiple voting strategies and parallel execution.
4. Multi-Step Evaluation (from ragas): Break complex evaluation into stages: Decompose → Verify → Aggregate (e.g., FaithfulnessJudge).
5. Self-Consistency (from ragas): Run judgment N times with majority voting for robustness (SimpleCriteriaJudge with strictness parameter; see the sketch after this list).
6. G-Eval Pattern (from deepeval): Auto-generate evaluation steps from criteria using an LLM, then execute structured chain-of-thought reasoning.
7. Threshold-Based Success (from deepeval): Metrics have configurable thresholds determining pass/fail (e.g., new CorrectnessJudge(chatClient, 0.8)).
8. Pairwise Comparison (from JudgeLM): Compare two agent outputs to determine which is better (PairwiseJudge).
9. Reproducibility (from evals): Deterministic evaluation via timestamps, metadata, and structured recording.
10. Observability as Cross-Cutting (from langfuse): Don't couple the judge interface to observability; use the decorator pattern or AOP for tracing.
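To make the self-consistency pattern concrete, here is a sketch in terms of the public Judge API. SimpleCriteriaJudge implements this internally via its strictness parameter; this hypothetical helper only illustrates the idea:
static Judgment selfConsistent(Judge judge, JudgmentContext context, int runs) {
    long passes = java.util.stream.IntStream.range(0, runs)
        .mapToObj(i -> judge.judge(context))  // re-run the same judgment
        .filter(Judgment::pass)
        .count();
    boolean majority = passes * 2 > runs;     // strict majority vote
    return new Judgment(
        majority ? JudgmentStatus.PASS : JudgmentStatus.FAIL,
        new NumericalScore(passes, 0, runs),  // passes out of runs
        passes + " of " + runs + " runs passed",
        List.of(), Optional.empty(), Map.of());
}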
8.3. Unique Spring AI Agents Contributions
Beyond synthesizing existing patterns, we added:
1. AssertJ Integration: Leverage 2000+ AssertJ assertions with AssertJJudge and SoftAssertions for declarative testing.
AssertJJudge.create(context -> judge -> {
    String output = context.agentOutput().get().asText();
    judge.assertThat(output).contains("Hello");
    judge.assertThat(output).hasLineCount(5);
});
2. Agent-as-Judge: Use AgentClient for judgment; agents evaluate other agents with structured reasoning.
3. Workspace-Centric Context: Agent-specific evaluation with Path workspace, file system operations, and build integration.
4. Rich Agent Output: Beyond string output, a sealed AgentOutput interface with TextOutput, StructuredOutput, and MultimodalOutput.
5. Spring Integration: Native Spring Boot integration with ChatClient from Spring AI, bean-based configuration, and future auto-configuration support.
9. Production Patterns
9.1. Pattern 1: CI/CD Integration
Verify builds and tests before deployment:
@Service
public class ContinuousIntegration {
private final AgentClient.Builder agentClientBuilder;
public boolean fixAndDeploy(Path projectRoot) {
AgentClientResponse response = agentClientBuilder
.goal("Fix failing tests and run 'mvn clean install'")
.workingDirectory(projectRoot)
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
Judgment judgment = response.getJudgment();
if (judgment.pass()) {
deploy(projectRoot);
return true;
} else {
alertTeam("Build failed: " + judgment.reasoning());
return false;
}
}
}
9.2. Pattern 2: Quality Gates
Enforce quality standards:
Jury qualityGate = Juries.builder()
.addJudge("build", new BuildSuccessJudge())
.addJudge("coverage", new CoverageJudge(80.0))
.addJudge("correctness", new CorrectnessJudge(chatClient))
.votingStrategy(VotingStrategies.allMustPass())
.build();
Verdict verdict = qualityGate.vote(context);
if (!verdict.aggregated().pass()) {
throw new QualityGateException("Quality standards not met");
}
9.3. Pattern 3: Self-Healing Systems
Agents verify and retry:
int maxRetries = 3;
Judgment judgment = null;
for (int attempt = 0; attempt < maxRetries; attempt++) {
AgentClientResponse response = agentClientBuilder
.goal("Fix the failing tests")
.advisors(JudgeAdvisor.builder()
.judge(new BuildSuccessJudge())
.build())
.call();
judgment = response.getJudgment();
if (judgment.pass()) {
break; // Success!
}
logger.warn("Attempt {} failed: {}", attempt + 1, judgment.reasoning());
}
if (!judgment.pass()) {
escalateToHuman(judgment);
}
10. Next Steps
Explore the Judge API in depth:
- Start here: JudgeAdvisor - Integration with AgentClient (primary entry point)
- Deterministic: Deterministic Judges - Rule-based evaluation
- LLM-Powered: LLM-Powered Judges - AI-based evaluation
- Agent as Judge: Agent as Judge - Agents evaluating agents
- Ensemble: Jury Pattern - Combine judges for robust evaluation
- Research: Research Foundations (coming soon) - Complete design rationale
11. Further Reading
- Anthropic SDK Blog: Building Agents with the Claude Agent SDK - Agent architecture foundations
- Your First Judge - Practical introduction
- CLI Agents - Understanding autonomous agents
The Judge API transforms agents from "fire and forget" tools into production-grade, self-verifying systems with automated quality assurance built into every execution.