Agent as Judge: Meta-Evaluation
AgentJudge uses an autonomous agent to evaluate another agent's work. This enables sophisticated evaluation that leverages the full capabilities of CLI agents: code analysis, security auditing, multi-step reasoning, and iterative refinement.
1. Overview
Instead of using simple LLM prompts, Agent as Judge runs a complete agent task for evaluation. The judge agent can:
- Navigate code - Read multiple files, follow references
- Run commands - Execute linters, security scanners, tests
- Iterate - Refine analysis through multiple steps
- Use tools - Leverage grep, find, git for thorough inspection
When to use:
- Complex evaluation requires multiple steps
- Need to analyze multiple files or an entire codebase
- Evaluation benefits from running tools (linters, scanners)
- Subjective criteria benefit from deep code understanding
When NOT to use:
- Simple yes/no judgment (use CorrectnessJudge instead)
- Objective criteria (use deterministic judges instead)
- Cost/latency are major concerns
2. Basic Usage
import org.springaicommunity.agents.judge.agent.AgentJudge;

@Service
public class CodeReviewService {

    private final AgentClient agentClient;
    private final AgentClient.Builder agentClientBuilder;

    public void reviewPullRequest(Path projectRoot) {
        // Agent judge performs code review
        Judge reviewJudge = AgentJudge.builder()
            .agentClient(agentClient)
            .criteria("""
                Review the code for:
                - Code quality and maintainability
                - Potential bugs or logic errors
                - Security vulnerabilities
                - Best practices adherence
                """)
            .build();

        // Use as any other judge
        AgentClientResponse response = agentClientBuilder
            .goal("Implement user authentication feature")
            .workingDirectory(projectRoot)
            .advisors(JudgeAdvisor.builder()
                .judge(reviewJudge)
                .build())
            .call();

        Judgment judgment = response.getJudgment();
        if (judgment.pass()) {
            System.out.println("✓ Code review passed");
            approvePullRequest();
        } else {
            System.out.println("✗ Code review found issues");
            System.out.println("Reasoning: " + judgment.reasoning());
            requestChanges(judgment.reasoning());
        }
    }
}
3. How It Works
3.1. 1. Judge Agent Receives Goal
AgentJudge formats an evaluation goal for the judge agent:
Evaluate the following agent execution:
Original Goal: Implement user authentication feature
Workspace: /projects/my-app
Agent Output: Created UserController with login/logout endpoints
Execution Status: COMPLETED
Evaluation Criteria:
Review the code for:
- Code quality and maintainability
- Potential bugs or logic errors
- Security vulnerabilities
- Best practices adherence
Provide your judgment in the following format:
PASS: true/false
SCORE: X.X (0-10, optional)
REASONING: Your detailed explanation
Be thorough and specific in your reasoning.
3.2. 2. Judge Agent Executes
The judge agent autonomously:
- Navigates to the workspace
- Reads relevant files (UserController.java, tests, config)
- Analyzes code quality
- Checks for security issues
- Runs static analysis tools if needed
- Formulates a judgment
3.3. 3. Response Parsed into Judgment
The judge agent responds:
PASS: true
SCORE: 8.5
REASONING: The authentication implementation is well-structured with proper
password hashing (BCrypt) and JWT token management. Code follows Spring Security
best practices. Minor improvements suggested: add rate limiting on login endpoint
and implement refresh token rotation. Overall, code quality is high and ready for
production with the suggested enhancements.
This is parsed into:
Judgment {
    status = PASS
    score = NumericalScore(8.5, 0, 10)
    reasoning = "The authentication implementation is well-structured..."
}
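The exact parsing logic is internal to AgentJudge, but conceptually it is a line-oriented extraction of the three labeled fields. A minimal sketch of such a parser (the class and record names here are hypothetical, not the library's API):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative PASS/SCORE/REASONING extraction; the real
// AgentJudge parser may differ.
class JudgmentResponseParser {

    private static final Pattern PASS = Pattern.compile("(?m)^PASS:\\s*(true|false)");
    private static final Pattern SCORE = Pattern.compile("(?m)^SCORE:\\s*([0-9]+(?:\\.[0-9]+)?)");
    private static final Pattern REASONING = Pattern.compile("(?s)REASONING:\\s*(.*)");

    record ParsedJudgment(boolean pass, Double score, String reasoning) {}

    static ParsedJudgment parse(String response) {
        Matcher passMatcher = PASS.matcher(response);
        boolean pass = passMatcher.find() && Boolean.parseBoolean(passMatcher.group(1));

        Matcher scoreMatcher = SCORE.matcher(response);
        Double score = scoreMatcher.find() ? Double.valueOf(scoreMatcher.group(1)) : null; // SCORE is optional

        Matcher reasoningMatcher = REASONING.matcher(response);
        String reasoning = reasoningMatcher.find() ? reasoningMatcher.group(1).trim() : "";

        return new ParsedJudgment(pass, score, reasoning);
    }
}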
4. Factory Methods
Pre-configured judges for common tasks:
4.1. Code Review
Judge codeReview = AgentJudge.codeReview(agentClient);

AgentClientResponse response = agentClientBuilder
    .goal("Implement payment processing")
    .advisors(JudgeAdvisor.builder()
        .judge(codeReview)
        .build())
    .call();

Judgment judgment = response.getJudgment();
Evaluation criteria:
- Code quality and maintainability
- Correctness and potential bugs
- Best practices adherence
- Performance considerations
4.2. Security Audit
Judge securityAudit = AgentJudge.securityAudit(agentClient);

AgentClientResponse response = agentClientBuilder
    .goal("Add user registration endpoint")
    .advisors(JudgeAdvisor.builder()
        .judge(securityAudit)
        .build())
    .call();

Judgment judgment = response.getJudgment();
if (!judgment.pass()) {
    logger.error("Security vulnerabilities found: {}", judgment.reasoning());
    alertSecurityTeam(judgment.reasoning());
}
Evaluation criteria:
- Security vulnerabilities (SQL injection, XSS, etc.)
- Authentication and authorization flaws
- Data exposure risks
- Compliance issues
5. Custom Evaluation Criteria
Create domain-specific agent judges:
5.1. Example: API Design Review
Judge apiDesignJudge = AgentJudge.builder()
    .agentClient(agentClient)
    .name("APIDesignReview")
    .description("Evaluates REST API design quality")
    .criteria("""
        Evaluate the REST API design against these criteria:

        1. RESTful principles:
           - Proper HTTP methods (GET, POST, PUT, DELETE)
           - Resource-oriented URLs
           - Appropriate status codes

        2. API design:
           - Consistent naming conventions
           - Proper error handling and responses
           - Request/response structure

        3. Documentation:
           - OpenAPI/Swagger annotations
           - Clear endpoint descriptions

        4. Security:
           - Authentication requirements
           - Input validation
           - Rate limiting considerations

        Provide specific examples of issues found.
        """)
    .build();

AgentClientResponse response = agentClientBuilder
    .goal("Create REST API for Product management")
    .workingDirectory(projectRoot)
    .advisors(JudgeAdvisor.builder()
        .judge(apiDesignJudge)
        .build())
    .call();
5.2. Example: Test Coverage Review
Judge testCoverageJudge = AgentJudge.builder()
    .agentClient(agentClient)
    .name("TestCoverageReview")
    .description("Evaluates test coverage and quality")
    .criteria("""
        Review test coverage:

        1. Unit tests:
           - All public methods tested
           - Edge cases covered
           - Proper assertions

        2. Integration tests:
           - API endpoints tested
           - Database interactions tested

        3. Test quality:
           - Clear test names
           - Arrange-Act-Assert pattern
           - No test dependencies

        4. Coverage metrics:
           - Run 'mvn jacoco:report' and check coverage
           - Minimum 80% line coverage expected

        Report specific gaps in coverage.
        """)
    .build();
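As with the earlier examples, the judge is then attached through a JudgeAdvisor (the goal string below is illustrative):

AgentClientResponse response = agentClientBuilder
    .goal("Add unit and integration tests for the Product service")
    .workingDirectory(projectRoot)
    .advisors(JudgeAdvisor.builder()
        .judge(testCoverageJudge)
        .build())
    .call();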
6. Custom Goal Templates
Override the default evaluation prompt:
Judge customJudge = AgentJudge.builder()
    .agentClient(agentClient)
    .name("CustomEvaluation")
    .criteria("Your evaluation criteria")
    .goalTemplate("""
        CUSTOM EVALUATION TASK:

        The agent attempted to: {goal}
        Working in: {workspace}
        Result: {output}

        Your task: {criteria}

        Analyze thoroughly and respond with:
        PASS: true/false
        SCORE: 0-10
        REASONING: Detailed findings with specific file references
        """)
    .build();
Available placeholders:
- {goal} - Original agent goal
- {workspace} - Workspace path
- {output} - Agent output
- {status} - Execution status
- {criteria} - Evaluation criteria
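Conceptually, rendering the template is a plain placeholder substitution. A minimal sketch of the idea (this is not the library's internal code; the class name is illustrative):

import java.util.Map;

// Illustrative placeholder substitution; AgentJudge's internal
// rendering may differ.
class GoalTemplateRenderer {

    static String render(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // Replace every occurrence of {key} with its value
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }
}

// Usage:
// String goal = GoalTemplateRenderer.render(template, Map.of(
//     "goal", "Implement user authentication feature",
//     "workspace", "/projects/my-app",
//     "output", agentOutput,
//     "status", "COMPLETED",
//     "criteria", criteria));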
7. Agent Judge vs LLM Judge
Understanding the differences:
Aspect | LLM Judge | Agent Judge
---|---|---
Execution | Single LLM call | Full agent task (multi-step)
Capabilities | Text analysis only | File navigation, command execution, tool use
Latency | ~3 seconds | ~30-60 seconds
Cost | $0.01 per judgment | $0.10-0.50 per judgment
Use Cases | Simple semantic evaluation | Complex multi-file analysis
Example | "Is documentation helpful?" | "Audit codebase for security issues"
Best practice: Use LLM judges for simple semantic checks, Agent judges for complex multi-step evaluation.
8. Production Patterns
8.1. Pattern 1: Pre-Merge Code Review
@Service
public class PullRequestService {

    private final AgentClient agentClient;
    private final AgentClient.Builder agentClientBuilder;

    public boolean approvePullRequest(Path projectRoot) {
        // Run automated code review
        Judge codeReview = AgentJudge.codeReview(agentClient);

        AgentClientResponse response = agentClientBuilder
            .goal("Review code changes in the pull request")
            .workingDirectory(projectRoot)
            .advisors(JudgeAdvisor.builder()
                .judge(codeReview)
                .build())
            .call();

        Judgment judgment = response.getJudgment();

        if (judgment.pass()) {
            // Auto-approve if high score
            if (judgment.score() instanceof NumericalScore numerical
                    && numerical.value() >= 9.0) {
                approvePR();
                return true;
            }
            // Request human review for medium scores
            requestHumanReview(judgment.reasoning());
            return false;
        } else {
            // Block PR if failed
            rejectPR(judgment.reasoning());
            return false;
        }
    }
}
8.2. Pattern 2: Security-First Development
@Service
public class SecureDeployment {

    private final AgentClient agentClient;
    private final AgentClient.Builder agentClientBuilder;

    public void deployWithSecurity(Path projectRoot) {
        // Run security audit before deployment
        Judge securityAudit = AgentJudge.securityAudit(agentClient);

        AgentClientResponse response = agentClientBuilder
            .goal("Prepare application for production deployment")
            .workingDirectory(projectRoot)
            .advisors(
                // Build must succeed
                JudgeAdvisor.builder()
                    .judge(BuildSuccessJudge.maven("clean", "install"))
                    .order(100)
                    .build(),
                // Security audit (expensive, runs last)
                JudgeAdvisor.builder()
                    .judge(securityAudit)
                    .order(200)
                    .build()
            )
            .call();

        Judgment judgment = response.getJudgment();
        if (!judgment.pass()) {
            logger.error("Security audit failed: {}", judgment.reasoning());
            alertSecurityTeam(judgment.reasoning());
            throw new SecurityException("Deployment blocked due to security issues");
        }

        deploy(projectRoot);
    }
}
8.3. Pattern 3: Iterative Quality Improvement
@Service
public class QualityImprovement {

    private final AgentClient agentClient;
    private final AgentClient.Builder agentClientBuilder;

    public void improveUntilQuality(Path projectRoot, double targetScore) {
        int maxAttempts = 3;
        String feedback = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Run code review
            Judge codeReview = AgentJudge.codeReview(agentClient);

            // Feed the previous judgment's reasoning into the next goal
            String goal = (feedback == null)
                ? "Improve code quality"
                : "Improve code quality based on this feedback:\n" + feedback;

            AgentClientResponse response = agentClientBuilder
                .goal(goal)
                .workingDirectory(projectRoot)
                .advisors(JudgeAdvisor.builder()
                    .judge(codeReview)
                    .build())
                .call();

            Judgment judgment = response.getJudgment();

            if (judgment.score() instanceof NumericalScore numerical) {
                double score = numerical.value();
                logger.info("Attempt {}: Quality score = {}", attempt, score);

                if (score >= targetScore) {
                    logger.info("✓ Target quality achieved");
                    return;
                }

                // Use judgment reasoning to guide next iteration
                feedback = judgment.reasoning();
                logger.info("Feedback: {}", feedback);
            }
        }

        throw new QualityException("Failed to achieve target quality after " + maxAttempts + " attempts");
    }
}
9. Cost and Performance
Agent judges are more expensive than LLM judges:
9.1. Typical Costs
- Agent execution: $0.05-0.30 per judgment
- LLM judge: $0.01 per judgment
- Deterministic: Free
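To see how this adds up, a back-of-envelope estimate (the volume and the $0.20 per-judgment figure are illustrative assumptions, not measurements):

// Rough cost estimate for agent-judged PR reviews.
// All numbers here are assumptions for illustration.
int judgmentsPerDay = 50;          // e.g., one review per PR
double costPerJudgment = 0.20;     // mid-range agent judge cost

double dailyCost = judgmentsPerDay * costPerJudgment;    // $10.00
double monthlyCost = dailyCost * 30;                     // $300.00

System.out.printf("Daily: $%.2f, Monthly: $%.2f%n", dailyCost, monthlyCost);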
9.2. Performance
- Latency: 30-90 seconds (full agent task)
- Complexity: Handles multi-file analysis, command execution
- Thoroughness: Can review entire codebases
9.3. Optimization Strategies
9.3.1. 1. Use Sparingly
// ✅ Good: Agent judge for complex review
AgentJudge.codeReview(agentClient)

// ❌ Wasteful: Agent judge for simple check
AgentJudge.builder()
    .criteria("Check if file exists")
    .build()
// Use FileExistsJudge instead
9.3.2. 2. Layer Checks
// Fast checks first, expensive agent judge last
.advisors(
    JudgeAdvisor.builder()
        .judge(BuildSuccessJudge.maven("test"))       // ~60s
        .order(100)
        .build(),
    JudgeAdvisor.builder()
        .judge(AgentJudge.codeReview(agentClient))    // ~90s + $0.20
        .order(200)
        .build()
)
9.3.3. 3. Cache Results
@Service
public class CachedAgentJudge implements Judge {

    private static final Logger logger = LoggerFactory.getLogger(CachedAgentJudge.class);

    private final AgentJudge delegate;
    private final ConcurrentHashMap<String, Judgment> cache = new ConcurrentHashMap<>();

    public CachedAgentJudge(AgentJudge delegate) {
        this.delegate = delegate;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        String key = computeHash(context.workspace());
        return cache.computeIfAbsent(key, k -> {
            logger.info("Cache miss - running agent judge");
            return delegate.judge(context);
        });
    }
}
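computeHash is left undefined above; any fingerprint that changes when the workspace content changes will do. One possible sketch, hashing file paths and last-modified times (an assumption for illustration, not part of the library; it would live inside CachedAgentJudge):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.stream.Stream;

// One way to fingerprint a workspace: hash every file's path and
// last-modified time. Hashing file contents would be more precise
// but slower for large trees.
static String computeHash(Path workspace) {
    try (Stream<Path> files = Files.walk(workspace)) {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        files.filter(Files::isRegularFile)
             .sorted()
             .forEach(path -> {
                 try {
                     String entry = path + ":" + Files.getLastModifiedTime(path);
                     digest.update(entry.getBytes(StandardCharsets.UTF_8));
                 } catch (IOException e) {
                     throw new UncheckedIOException(e);
                 }
             });
        return HexFormat.of().formatHex(digest.digest());
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException(e);
    }
}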
10. Best Practices
10.1. 1. Clear, Specific Criteria
// ✅ Good: Specific checklist
.criteria("""
    Check for:
    1. SQL injection vulnerabilities
    2. XSS prevention
    3. CSRF token usage
    4. Password hashing (BCrypt)
    """)

// ❌ Vague: Hard for agent to evaluate
.criteria("Check security")
10.2. 2. Request Structured Output
.criteria("""
    Evaluate code quality and respond with:
    PASS: true/false
    SCORE: 0-10
    REASONING: Detailed explanation with file references

    Be specific about file names and line numbers.
    """)
10.3. 3. Combine with Deterministic Judges
// Hybrid approach: fast checks + thorough agent review
.advisors(
    JudgeAdvisor.builder().judge(BuildSuccessJudge.maven("test")).build(),
    JudgeAdvisor.builder().judge(new FileExistsJudge("README.md")).build(),
    JudgeAdvisor.builder().judge(AgentJudge.codeReview(agentClient)).build()
)
10.4. 4. Log Detailed Reasoning
Judgment judgment = response.getJudgment();

logger.info("Agent Judge Result:");
logger.info("  Status: {}", judgment.status());
logger.info("  Score: {}", judgment.score());
logger.info("  Reasoning:\n{}", judgment.reasoning());

// Parse for specific issues if needed
if (!judgment.pass()) {
    String reasoning = judgment.reasoning();
    if (reasoning.contains("security")) {
        alertSecurityTeam(reasoning);
    }
}
11. Limitations
- Cost: each judgment runs a full agent task ($0.05-0.30 per judgment vs $0.01 for an LLM judge)
- Latency: 30-90 seconds per evaluation, unsuitable for tight feedback loops
- Consistency: like all LLM-based evaluation, results can vary between runs; prefer deterministic judges for objective criteria
12. Next Steps
- Jury Pattern: Combine multiple judges for robust evaluation
- LLM Judges: Simpler semantic evaluation
- Deterministic Judges: Fast, free rule-based checks
- Judge Advisor: Integration with AgentClient
13. Further Reading
- Judge API Overview - Complete Judge API documentation
- CLI Agents - Understanding autonomous agents
- Your First Judge - Practical introduction
Agent as Judge brings the full power of autonomous agents to evaluation tasks. Use it strategically for complex, multi-step analysis that simpler judges can’t handle.