LLM-Powered Judges

LLM-powered judges use language models to evaluate agent execution. They provide semantic understanding and nuanced evaluation that’s difficult to express in deterministic rules.

1. What Are LLM Judges?

LLM judges send evaluation prompts to language models and parse the responses into judgments. They excel at subjective or semantic criteria:

  • Code quality - "Is this code maintainable?"

  • Correctness - "Did the agent accomplish the goal?"

  • Helpfulness - "Is this documentation useful?"

  • Creativity - "Is this solution elegant?"

LLM-as-a-Judge is a widely adopted pattern in AI evaluation frameworks. Python frameworks like deepeval and ragas pioneered this approach for evaluating RAG systems and LLM outputs. Spring AI Agents brings this pattern to Java developers with Spring-native integration.

2. When to Use LLM Judges

Use LLM judges for subjective or semantic evaluation that deterministic judges can’t capture:

Criteria Type         | Deterministic Judge       | LLM Judge
File exists           | FileExistsJudge           | ❌ Overkill
Build succeeds        | BuildSuccessJudge         | ❌ Overkill
Code quality          | ❌ Too subjective         | CodeQualityJudge
Semantic correctness  | ❌ Requires understanding | CorrectnessJudge
Documentation helpful | ❌ Subjective             | ✅ Custom LLM judge

Best practice: Use deterministic judges for objective criteria, LLM judges for subjective assessment.

3. LLMJudge Base Class

All LLM judges extend LLMJudge, which uses the template method pattern:

public abstract class LLMJudge implements Judge {

    protected final ChatClient chatClient;

    // Template method - orchestrates evaluation
    @Override
    public Judgment judge(JudgmentContext context) {
        String prompt = buildPrompt(context);           // 1. Build prompt
        String response = chatClient                     // 2. Call LLM
            .prompt()
            .user(prompt)
            .call()
            .content();
        return parseResponse(response, context);         // 3. Parse response
    }

    // Subclasses implement these
    protected abstract String buildPrompt(JudgmentContext context);
    protected abstract Judgment parseResponse(String response, JudgmentContext context);
}

Design rationale: The template method pattern separates the evaluation flow (build → call → parse) from judge-specific logic. This approach is common in Python evaluation frameworks—https://github.com/confident-ai/deepeval[deepeval] uses similar abstractions for metrics like G-Eval and faithfulness.

4. Creating Custom LLM Judges

Extend LLMJudge and implement two methods:

4.1. Example: Code Quality Judge

import org.springframework.ai.chat.client.ChatClient;
import org.springaicommunity.agents.judge.llm.LLMJudge;
import org.springaicommunity.agents.judge.context.JudgmentContext;
import org.springaicommunity.agents.judge.result.Judgment;
import org.springaicommunity.agents.judge.result.JudgmentStatus;
import org.springaicommunity.agents.judge.score.NumericalScore;

public class CodeQualityJudge extends LLMJudge {

    public CodeQualityJudge(ChatClient.Builder chatClientBuilder) {
        super("CodeQuality", "Evaluates code quality 0-10", chatClientBuilder);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        String goal = context.goal();
        String output = context.agentOutput().orElse("No output");

        return String.format("""
            Evaluate the code quality for this task:

            Goal: %s
            Agent Output: %s

            Rate the code quality on a scale of 0-10 considering:
            - Readability and clarity
            - Proper naming conventions
            - Code organization
            - Best practices adherence

            Format:
            Score: [0-10]
            Reasoning: [Your detailed explanation]
            """, goal, output);
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        // Extract score
        double score = extractScore(response);

        // Extract reasoning
        String reasoning = extractReasoning(response);

        // Pass if score >= 7
        boolean pass = score >= 7.0;

        return Judgment.builder()
            .score(new NumericalScore(score, 0, 10))
            .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(reasoning)
            .build();
    }

    private double extractScore(String response) {
        // Find "Score: X" in response
        String[] lines = response.split("\n");
        for (String line : lines) {
            if (line.startsWith("Score:")) {
                String scoreStr = line.substring("Score:".length()).trim();
                try {
                    return Double.parseDouble(scoreStr);
                } catch (NumberFormatException e) {
                    return 0.0;
                }
            }
        }
        return 0.0;
    }

    private String extractReasoning(String response) {
        int index = response.indexOf("Reasoning:");
        if (index >= 0) {
            return response.substring(index + "Reasoning:".length()).trim();
        }
        return response;
    }
}

4.2. Usage

Judge qualityJudge = new CodeQualityJudge(chatClientBuilder);

AgentClientResponse response = agentClientBuilder
    .goal("Refactor UserService for better maintainability")
    .workingDirectory(projectRoot)
    .advisors(JudgeAdvisor.builder()
        .judge(qualityJudge)
        .build())
    .call();

Judgment judgment = response.getJudgment();

if (judgment.pass()) {
    System.out.println("✓ Code quality meets standards");
    if (judgment.score() instanceof NumericalScore numerical) {
        System.out.println("Quality score: " + numerical.value() + "/10");
    }
} else {
    System.out.println("✗ Code quality below threshold");
    System.out.println("Reasoning: " + judgment.reasoning());
}

5. Prompt Engineering for Judges

Effective LLM judges require well-crafted prompts:

5.1. Pattern 1: Structured Output

Request specific format for easy parsing:

@Override
protected String buildPrompt(JudgmentContext context) {
    return """
        Evaluate if the agent accomplished this goal: """ + context.goal() + """

        Agent output: """ + context.agentOutput().orElse("") + """

        Answer in this EXACT format:
        Answer: YES or NO
        Confidence: [1-10]
        Reasoning: [Your explanation]
        """;
}
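
One way to parse this format is to scan the labelled lines and map the answer to a pass/fail judgment. The sketch below is illustrative, reusing only the Judgment, BooleanScore, and JudgmentStatus types shown elsewhere on this page:

@Override
protected Judgment parseResponse(String response, JudgmentContext context) {
    boolean pass = false;
    String reasoning = response;

    for (String line : response.split("\n")) {
        String trimmed = line.trim();
        if (trimmed.startsWith("Answer:")) {
            // "Answer: YES" counts as a pass; anything else fails
            pass = trimmed.substring("Answer:".length()).trim().toUpperCase().startsWith("YES");
        } else if (trimmed.startsWith("Reasoning:")) {
            reasoning = trimmed.substring("Reasoning:".length()).trim();
        }
    }

    return Judgment.builder()
        .score(new BooleanScore(pass))
        .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
        .reasoning(reasoning)
        .build();
}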

5.2. Pattern 2: Chain-of-Thought

Request step-by-step reasoning (inspired by deepeval’s G-Eval approach):

@Override
protected String buildPrompt(JudgmentContext context) {
    return """
        Evaluate code quality using these steps:

        1. Read the code carefully
        2. Check naming conventions (are variables/methods well-named?)
        3. Assess code organization (is it structured logically?)
        4. Verify best practices (does it follow Java conventions?)
        5. Provide final score 0-10

        Code to evaluate:
        """ + context.agentOutput().orElse("") + """

        Provide your step-by-step evaluation, then conclude with:
        Final Score: [0-10]
        """;
}

Why this works: Asking the LLM to "show its work" often produces more accurate and consistent judgments. This is a core technique in evaluation frameworks—https://github.com/explodinggradients/ragas[ragas] uses similar multi-step evaluation for faithfulness metrics.
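
Because the score arrives at the end of a longer step-by-step answer, the parsing side can search for the last "Final Score:" label rather than the first line. A small helper sketch in the style of Section 4's extractScore (not part of the framework):

private double extractFinalScore(String response) {
    // The label follows the step-by-step reasoning, so search from the end
    int index = response.lastIndexOf("Final Score:");
    if (index < 0) {
        return 0.0;
    }
    // Take only the remainder of that line
    String tail = response.substring(index + "Final Score:".length());
    String firstLine = tail.split("\n", 2)[0].trim();
    try {
        return Double.parseDouble(firstLine);
    } catch (NumberFormatException e) {
        return 0.0;
    }
}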

5.3. Pattern 3: Few-Shot Examples

Provide examples of good/bad outputs:

@Override
protected String buildPrompt(JudgmentContext context) {
    return """
        Evaluate documentation quality. Here are examples:

        GOOD (Score: 9):
        "## Installation
        Run: `mvn install`
        This will download dependencies and build the project."

        BAD (Score: 3):
        "Just run the build command."

        Now evaluate this documentation:
        """ + context.agentOutput().orElse("") + """

        Score: [0-10]
        Reasoning: [Explanation]
        """;
}

6. Built-in LLM Judges

Spring AI Agents provides production-ready LLM judges:

6.1. CorrectnessJudge

Evaluates if the agent accomplished its goal:

Judge judge = new CorrectnessJudge(chatClientBuilder);

AgentClientResponse response = agentClientBuilder
    .goal("Write helpful installation documentation")
    .advisors(JudgeAdvisor.builder().judge(judge).build())
    .call();

// Returns YES/NO with reasoning
Judgment judgment = response.getJudgment();
System.out.println(judgment.reasoning());

See CorrectnessJudge for complete details.

7. Trade-offs: LLM vs Deterministic

Understanding when to use each type:

Aspect        | Deterministic Judges                          | LLM Judges
Speed         | Milliseconds                                  | Seconds (LLM inference)
Cost          | Free                                          | $0.001-$0.01 per judgment
Reliability   | 100% deterministic                            | Non-deterministic (variance)
Capabilities  | Exact matches, rules                          | Semantic understanding, nuance
Use Cases     | File checks, build success, exact validation  | Quality, correctness, subjective criteria
Best Practice | Use for all objective criteria                | Use only for subjective assessment

Recommendation: Start with deterministic judges for objective checks, add LLM judges only for criteria that require semantic understanding.

8. Combining Deterministic + LLM Judges

The most robust evaluation uses both:

AgentClientResponse response = agentClientBuilder
    .goal("Create REST API with documentation")
    .workingDirectory(projectRoot)
    .advisors(
        // Fast deterministic checks first
        JudgeAdvisor.builder()
            .judge(new FileExistsJudge("README.md"))
            .order(100)
            .build(),

        JudgeAdvisor.builder()
            .judge(BuildSuccessJudge.maven("compile"))
            .order(200)
            .build(),

        // Expensive LLM check last (only if above passed)
        JudgeAdvisor.builder()
            .judge(new CorrectnessJudge(chatClientBuilder))
            .order(300)
            .build()
    )
    .call();

Why this ordering works:

  1. File check (< 5 ms, free) - fails immediately if README.md is missing.

  2. Build check (~30 s, free) - fails if the code doesn't compile.

  3. LLM check (~3 s plus API cost) - the only check that costs money, so it runs last and only if the checks above passed.

9. Cost and Performance

LLM judges have different cost/performance characteristics:

9.1. Typical Costs (OpenAI GPT-4)

  • Input: ~$0.03 per 1K tokens

  • Output: ~$0.06 per 1K tokens

  • Judgment: ~$0.01-0.05 each (varies by prompt complexity)
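
As a rough, illustrative calculation using the rates above (the token counts are assumptions, not measurements):

// Back-of-the-envelope cost for one judgment at the rates above
double inputTokens = 500;   // assumed prompt size
double outputTokens = 200;  // assumed response size
double costPerJudgment = (inputTokens / 1000) * 0.03 + (outputTokens / 1000) * 0.06;
// ≈ 0.015 + 0.012 = $0.027 per judgment, or roughly $2.70 per 100 judgments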

9.2. Optimization Strategies

9.2.1. 1. Use Smaller Models

// Expensive: GPT-4 for simple yes/no
ChatClient.Builder expensiveBuilder = ChatClient.builder(chatModel)
    .defaultOptions(ChatOptions.builder()
        .model("gpt-4-turbo")
        .build());

// Cheaper: GPT-3.5 for simple judgments
ChatClient.Builder cheaperBuilder = ChatClient.builder(chatModel)
    .defaultOptions(ChatOptions.builder()
        .model("gpt-3.5-turbo")
        .build());

// Use appropriate model for task complexity
Judge simpleJudge = new CorrectnessJudge(cheaperBuilder);
Judge complexJudge = new CodeQualityJudge(expensiveBuilder);

9.2.2. 2. Cache Judgments

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class CachedJudgmentService {

    private static final Logger logger = LoggerFactory.getLogger(CachedJudgmentService.class);

    private final Map<String, Judgment> cache = new ConcurrentHashMap<>();

    public Judgment judgeWithCache(Judge judge, JudgmentContext context) {
        String cacheKey = generateKey(context);

        return cache.computeIfAbsent(cacheKey, key -> {
            logger.info("Cache miss - calling LLM judge");
            return judge.judge(context);
        });
    }

    private String generateKey(JudgmentContext context) {
        // Cache key combines goal and agent output; consider hashing if outputs are large
        return context.goal() + "|" + context.agentOutput().orElse("");
    }
}
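
An alternative that slots directly into JudgeAdvisor is a caching decorator that itself implements Judge. This is a sketch under the assumption that Judge only requires the judge(JudgmentContext) method shown in the base class above; the class name is illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingJudge implements Judge {

    private final Judge delegate;
    private final Map<String, Judgment> cache = new ConcurrentHashMap<>();

    public CachingJudge(Judge delegate) {
        this.delegate = delegate;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        // Same key scheme as the service above: goal + agent output
        String key = context.goal() + "|" + context.agentOutput().orElse("");
        return cache.computeIfAbsent(key, k -> delegate.judge(context));
    }
}

It can then wrap any LLM judge, for example new CachingJudge(new CodeQualityJudge(chatClientBuilder)), and be passed to a JudgeAdvisor like any other judge.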

9.2.3. 3. Limit LLM Judges

Only use LLM judges for criteria that truly need semantic understanding:

// ❌ Wasteful: LLM for file existence
new CorrectnessJudge(chatClientBuilder) // Costs $0.01 per call

// ✅ Efficient: Deterministic judge
new FileExistsJudge("output.txt") // Free, < 5ms

10. Best Practices

10.1. 1. Structured Output Formats

// ✅ Good: Structured format
"""
Score: [0-10]
Reasoning: [Explanation]
"""

// ❌ Poor: Freeform (hard to parse)
"""
Tell me what you think about the code quality.
"""

10.2. 2. Clear Success Criteria

// ✅ Good: Specific threshold
boolean pass = score >= 7.0;

// ❌ Vague: Unclear what "good" means
boolean pass = response.contains("good");
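
If the threshold is something you expect to tune, it can be promoted to a constructor parameter rather than a literal buried in parseResponse. A hypothetical variant of Section 4's CodeQualityJudge:

// Variant of CodeQualityJudge with an explicit, configurable threshold
private final double passThreshold;

public CodeQualityJudge(ChatClient.Builder chatClientBuilder, double passThreshold) {
    super("CodeQuality", "Evaluates code quality 0-10", chatClientBuilder);
    this.passThreshold = passThreshold;
}

// ...then in parseResponse:
boolean pass = score >= passThreshold;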

10.3. 3. Robust Parsing

@Override
protected Judgment parseResponse(String response, JudgmentContext context) {
    try {
        // Attempt to extract structured data
        double score = extractScore(response);
        String reasoning = extractReasoning(response);

        return Judgment.builder()
            .score(new NumericalScore(score, 0, 10))
            .reasoning(reasoning)
            .build();

    } catch (Exception e) {
        // Fallback: return failure with full response
        return Judgment.builder()
            .score(new BooleanScore(false))
            .status(JudgmentStatus.ABSTAIN)
            .reasoning("Failed to parse LLM response: " + response)
            .build();
    }
}

10.4. 4. Combine with Deterministic Judges

// Use Jury for hybrid evaluation
Jury hybridJury = Juries.builder()
    .addJudge("build", new BuildSuccessJudge(), 0.4)      // Objective
    .addJudge("files", new FileExistsJudge("README.md"), 0.3)  // Objective
    .addJudge("quality", new CodeQualityJudge(chatClient), 0.3) // Subjective
    .votingStrategy(VotingStrategies.weightedAverage())
    .build();

See Jury Pattern for ensemble evaluation.


LLM judges bring semantic understanding and nuanced evaluation to agent workflows. Use them strategically for subjective criteria that deterministic judges can’t capture.