# Judges
Structured LLM-as-judge scoring that gates outputs against the artifact that launched the work.
Judges turn review into a repeatable system. ClosedLoop.ai ships 21 judge agents in the `judges` plugin, along with a context-manager-for-judges that builds compressed context packs and deterministic aggregators that validate output.
## The judge contract
Every judge reads a `judge-input.json` envelope (per `schemas/judge-input.schema.json`) with:

- `evaluation_type` — one of `plan`, `code`, `prd`
- `task` — the task or phase under review
- `primary_artifact` — the artifact being scored
- `supporting_artifacts[]` — references and context
- `source_of_truth[]` — the PRD or contract the output must match
- `fallback_mode` — how to proceed if context is missing
- `metadata` — run IDs, timestamps, and ticket references
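For orientation, a minimal envelope might look like the example below. The field names come from the schema list above; every value, and the exact shapes of `supporting_artifacts`, `source_of_truth`, and `metadata`, are illustrative assumptions rather than the schema's actual definitions:

```json
{
  "evaluation_type": "plan",
  "task": "Phase 2: implement the ingestion worker",
  "primary_artifact": "plan.json",
  "supporting_artifacts": ["architecture.md"],
  "source_of_truth": ["prd.md"],
  "fallback_mode": "best_effort",
  "metadata": { "run_id": "run-001", "timestamp": "2025-01-01T00:00:00Z", "ticket": "CL-123" }
}
```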
Each judge returns a `CaseScore`:

```json
{
  "type": "case_score",
  "case_id": "...",
  "final_status": 1,
  "metrics": [
    { "metric_name": "...", "threshold": 0.8, "score": 0.92, "justification": "..." }
  ]
}
```

`final_status`: 1 = pass, 2 = fail, 3 = error.
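By implication, a case with a metric under its threshold comes back with `final_status: 2`. This failing example is extrapolated from the passing one above, not taken from the schema:

```json
{
  "type": "case_score",
  "case_id": "...",
  "final_status": 2,
  "metrics": [
    { "metric_name": "...", "threshold": 0.8, "score": 0.61, "justification": "..." }
  ]
}
```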
## Judge rosters
### Plan judges (16 judges, 4 batches, max 4 concurrent)

The batches cover core principles, best practices and response quality, SOLID, and plan grounding and testing:

- `brownfield-accuracy-judge`
- `codebase-grounding-judge`
- `convention-adherence-judge`
- `code-organization-judge`
- `custom-best-practices-judge`
- `dry-judge`
- `goal-alignment-judge` (plan-only)
- `kiss-judge`
- `readability-judge`
- `solid-isp-dip-judge`
- `solid-liskov-substitution-judge`
- `solid-open-closed-judge`
- `ssot-judge`
- `technical-accuracy-judge`
- `test-judge`
- `verbosity-judge` (plan-only)
### Code judges (11 judges, 3 batches)

The plan-judges roster minus `goal-alignment-judge` and `verbosity-judge`.
### PRD judges (4 judges, single parallel batch)

- `prd-auditor` — structural completeness
- `prd-dependency-judge`
- `prd-scope-judge`
- `prd-testability-judge`
## How judges run

The `judges:run-judges` skill orchestrates the batches:
```bash
# Invoked implicitly by the code loop, or directly:
@judges:run-judges --artifact-type plan
@judges:run-judges --artifact-type code
@judges:run-judges --artifact-type prd
```

It:
- Builds the judge-input contract and the context pack.
- Takes an idempotent agent snapshot under `agents-snapshot/`.
- Runs the configured batches in parallel.
- Aggregates `CaseScore` outputs.
- Validates the aggregated report with `validate_judge_report.py` (Pydantic); see the sketch after this list.
- Writes `plan-judges.json`, `code-judges.json`, or `prd-judges.json`.
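As a rough sketch of what the Pydantic validation step might enforce: the `CaseScore` fields below are documented above, but the `Metric` and `JudgeReport` model names, and the wrapper shape of the aggregated file, are assumptions, not the actual contents of `validate_judge_report.py`.

```python
# Hedged sketch of the report validation, assuming Pydantic v2.
import json
import sys
from typing import Literal

from pydantic import BaseModel, Field


class Metric(BaseModel):
    metric_name: str
    threshold: float = Field(ge=0.0, le=1.0)  # default threshold is 0.8
    score: float = Field(ge=0.0, le=1.0)
    justification: str


class CaseScore(BaseModel):
    type: Literal["case_score"]
    case_id: str
    final_status: Literal[1, 2, 3]  # 1 = pass, 2 = fail, 3 = error
    metrics: list[Metric]


class JudgeReport(BaseModel):
    # Assumed wrapper for the aggregated plan/code/prd-judges.json file.
    case_scores: list[CaseScore]


if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. plan-judges.json
        JudgeReport.model_validate(json.load(f))
    print("report is structurally valid")
```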
## Thresholds and overrides

The default metric threshold is 0.8. Override it per judge or per metric with a JSON config using keys like:

```json
{ "code:test-judge": 0.75 }
```

## Caching
The `judges:eval-cache` skill short-circuits plan evaluation when `plan-evaluation.json` is newer than `plan.json`. Judges do not re-run when the plan hasn't changed.
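The rule this implies is a simple file-timestamp comparison. A minimal sketch, assuming plain mtime semantics (the skill's actual check may differ):

```python
# Skip judging when the cached evaluation postdates the plan.
import os


def evaluation_is_fresh(plan="plan.json", evaluation="plan-evaluation.json"):
    try:
        return os.path.getmtime(evaluation) >= os.path.getmtime(plan)
    except FileNotFoundError:
        return False  # no cached evaluation yet: the judges must run
```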
## Model selection

Judges use a mix of Opus (creative evaluators like `brownfield-accuracy-judge` and `codebase-grounding-judge`), Sonnet (complex structural judges), and Haiku (lightweight check judges). This balances cost and signal.
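As a mental model, a per-judge assignment might look like the map below. Only the two Opus examples are stated above; every other pairing here is a guess, and the map itself is not a documented config file:

```json
{
  "brownfield-accuracy-judge": "opus",
  "codebase-grounding-judge": "opus",
  "solid-open-closed-judge": "sonnet",
  "verbosity-judge": "haiku"
}
```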
## Fallback behavior

- Plan mode: probes `@code:pre-explorer`; falls back to an internal best-effort investigation log if unavailable (see the sketch after this list).
- Code mode: attempts a best-effort pre-explorer run; continues non-blocking on failure.
- Context-manager failure: one-run compatibility fallback using `plan.json` plus `prd.md` directly.
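A rough illustration of the plan-mode fallback chain; the function parameters here stand in for plugin internals that are not public, so treat this as a shape, not the implementation:

```python
# Hypothetical plan-mode context fallback: probe the pre-explorer
# first, then degrade to the best-effort investigation log.
def build_plan_context(probe_pre_explorer, best_effort_investigation_log):
    try:
        # Probe @code:pre-explorer for grounded codebase context.
        return probe_pre_explorer()
    except Exception:
        # Pre-explorer unavailable: fall back, non-blocking.
        return best_effort_investigation_log()
```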
## Why judges matter
Execution without judgment produces velocity without trust.
Judges make quality a first-class, structured output. You can:
- gate merges on judge scores
- surface judge summaries in review UIs
- track judge score trends over time as a leading indicator of plan quality
- tune judges per-organization by overriding thresholds