# Judges
Structured LLM-as-judge scoring that gates outputs against the artifact that launched the work.
Judges turn review into a repeatable system. ClosedLoop.ai ships 21 judge agents in the `judges` plugin, along with a context-manager-for-judges that builds compressed context packs and deterministic aggregators that validate output.
## The judge contract
Every judge reads a `judge-input.json` envelope (per `schemas/judge-input.schema.json`) with:

- `evaluation_type` — one of `plan`, `code`, `prd`
- `task` — the task or phase under review
- `primary_artifact` — the artifact being scored
- `supporting_artifacts[]` — references and context
- `source_of_truth[]` — the PRD or contract the output must match
- `fallback_mode` — how to proceed if context is missing
- `metadata` — run IDs, timestamps, and ticket references
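For orientation, a minimal envelope might look like the example below. The field names come from the schema list above; every value, and the exact shapes of `supporting_artifacts`, `source_of_truth`, and `metadata`, are illustrative assumptions rather than the schema's actual definitions:

```json
{
  "evaluation_type": "plan",
  "task": "Phase 2: implement the ingestion worker",
  "primary_artifact": "plan.json",
  "supporting_artifacts": ["architecture.md"],
  "source_of_truth": ["prd.md"],
  "fallback_mode": "best_effort",
  "metadata": { "run_id": "run-001", "timestamp": "2025-01-01T00:00:00Z", "ticket": "CL-123" }
}
```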
Each judge returns a `CaseScore`:

```json
{
  "type": "case_score",
  "case_id": "...",
  "final_status": 1,
  "metrics": [
    { "metric_name": "...", "threshold": 0.8, "score": 0.92, "justification": "..." }
  ]
}
```

`final_status`: 1 = pass, 2 = fail, 3 = error.
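By implication, a case with a metric under its threshold comes back with `final_status: 2`. This failing example is extrapolated from the passing one above, not taken from the schema:

```json
{
  "type": "case_score",
  "case_id": "...",
  "final_status": 2,
  "metrics": [
    { "metric_name": "...", "threshold": 0.8, "score": 0.61, "justification": "..." }
  ]
}
```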
## Judge rosters
### Plan judges (16 judges, 4 batches, max 4 concurrent)

The batches cover core principles, best practices and response quality, SOLID, and plan grounding and testing:

- `brownfield-accuracy-judge`
- `codebase-grounding-judge`
- `convention-adherence-judge`
- `code-organization-judge`
- `custom-best-practices-judge`
- `dry-judge`
- `goal-alignment-judge` (plan-only)
- `kiss-judge`
- `readability-judge`
- `solid-isp-dip-judge`
- `solid-liskov-substitution-judge`
- `solid-open-closed-judge`
- `ssot-judge`
- `technical-accuracy-judge`
- `test-judge`
- `verbosity-judge` (plan-only)
### Code judges (11 judges, 3 batches)

The plan-judges roster minus `goal-alignment-judge` and `verbosity-judge`.
### PRD judges (4 judges, single parallel batch)

- `prd-auditor` — structural completeness
- `prd-dependency-judge`
- `prd-scope-judge`
- `prd-testability-judge`
## How judges run

The `judges:run-judges` skill orchestrates the batches:
```bash
# Invoked implicitly by the code loop, or directly:
@judges:run-judges --artifact-type plan
@judges:run-judges --artifact-type code
@judges:run-judges --artifact-type prd
```

It:
- Builds the judge-input contract and the context pack.
- Takes an idempotent agent snapshot under `agents-snapshot/`.
- Runs the configured batches in parallel.
- Aggregates `CaseScore` outputs.
- Validates the aggregated report with `validate_judge_report.py` (Pydantic); see the sketch after this list.
- Writes `plan-judges.json`, `code-judges.json`, or `prd-judges.json`.
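As a rough sketch of what the Pydantic validation step might enforce: the `CaseScore` fields below are documented above, but the `Metric` and `JudgeReport` model names, and the wrapper shape of the aggregated file, are assumptions, not the actual contents of `validate_judge_report.py`.

```python
# Hedged sketch of the report validation, assuming Pydantic v2.
import json
import sys
from typing import Literal

from pydantic import BaseModel, Field


class Metric(BaseModel):
    metric_name: str
    threshold: float = Field(ge=0.0, le=1.0)  # default threshold is 0.8
    score: float = Field(ge=0.0, le=1.0)
    justification: str


class CaseScore(BaseModel):
    type: Literal["case_score"]
    case_id: str
    final_status: Literal[1, 2, 3]  # 1 = pass, 2 = fail, 3 = error
    metrics: list[Metric]


class JudgeReport(BaseModel):
    # Assumed wrapper for the aggregated plan/code/prd-judges.json file.
    case_scores: list[CaseScore]


if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. plan-judges.json
        JudgeReport.model_validate(json.load(f))
    print("report is structurally valid")
```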
## Thresholds and overrides

The default metric threshold is 0.8. Override it per judge or per metric with a JSON config using keys like:

```json
{ "code:test-judge": 0.75 }
```

## Caching
The `judges:eval-cache` skill short-circuits plan evaluation when `plan-evaluation.json` is newer than `plan.json`. Judges do not re-run when the plan hasn't changed.
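The rule this implies is a simple file-timestamp comparison. A minimal sketch, assuming plain mtime semantics (the skill's actual check may differ):

```python
# Skip judging when the cached evaluation postdates the plan.
import os


def evaluation_is_fresh(plan="plan.json", evaluation="plan-evaluation.json"):
    try:
        return os.path.getmtime(evaluation) >= os.path.getmtime(plan)
    except FileNotFoundError:
        return False  # no cached evaluation yet: the judges must run
```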
## Model selection

Judges use a mix of Opus (creative evaluators like `brownfield-accuracy-judge` and `codebase-grounding-judge`), Sonnet (complex structural judges), and Haiku (lightweight check judges). This balances cost and signal.
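As a mental model, a per-judge assignment might look like the map below. Only the two Opus examples are stated above; every other pairing here is a guess, and the map itself is not a documented config file:

```json
{
  "brownfield-accuracy-judge": "opus",
  "codebase-grounding-judge": "opus",
  "solid-open-closed-judge": "sonnet",
  "verbosity-judge": "haiku"
}
```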
## Fallback behavior

- Plan mode: probes `@code:pre-explorer`; falls back to an internal best-effort investigation log if unavailable (see the sketch after this list).
- Code mode: attempts a best-effort pre-explorer run; continues non-blocking on failure.
- Context-manager failure: one-run compatibility fallback using `plan.json` plus `prd.md` directly.
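A rough illustration of the plan-mode fallback chain; the function parameters here stand in for plugin internals that are not public, so treat this as a shape, not the implementation:

```python
# Hypothetical plan-mode context fallback: probe the pre-explorer
# first, then degrade to the best-effort investigation log.
def build_plan_context(probe_pre_explorer, best_effort_investigation_log):
    try:
        # Probe @code:pre-explorer for grounded codebase context.
        return probe_pre_explorer()
    except Exception:
        # Pre-explorer unavailable: fall back, non-blocking.
        return best_effort_investigation_log()
```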
## Why judges matter
Execution without judgment produces velocity without trust.
Judges make quality a first-class, structured output. You can:
- gate merges on judge scores
- surface judge summaries in review UIs
- track judge score trends over time as a leading indicator of plan quality
- tune judges per-organization by overriding thresholds