Reviewer
reviewer is the persona that second-signs every mutating operation. Unlike the other workers it is never spawned by the coordinator — the catalog deliberately excludes it. The only spawner is the ReviewGate decorator, which wraps every tool whose Class is "write" or "destructive".
The reviewer is the answer to one question: "who watches the watcher?"
Don't disable this.
The reviewer is what makes Ongrid safe to give host_restart_service or execute_skill at all. Skipping the gate is an explicit configuration choice and surfaces in the audit trail. If you fork the persona to "always approve", the audit log will show every decision — auditors will notice.
The ReviewGate decorator
Position in the tool decorator chain (chain.go):
tenant_bind → REVIEW_GATE → timeout → audit → ratelimit → metric → <inner tool>Why this exact order:
- Outside
timeout. The reviewer is itself agraph.Invokewith its own turn budget. Wrapping it inside the inner tool's 15s timeout would force the reviewer to finish in 15s — unrealistic. The gate carries its own independent 60s ceiling (DefaultReviewerTimeout). - Outside
audit. Rejections shouldn't write a synthetic execution row. The gate writes achat_mutating_proposalsrow instead. Audit only logs the execution of the inner tool, which only happens on approve. - Inside
tenant_bind. The proposal payload includes the operatoruser_id, whichtenant_bindresolves from ctx — the gate must run aftertenant_bindhas populated it.
Trigger condition
The decorator inspects the wrapped tool's Class:
| Class | Gate behavior |
|---|---|
"read" | Pass through. Inner tool runs. |
"write" | Intercept. Spawn reviewer. Inner runs only on approve. |
"destructive" | Same as "write". Both classes pass through the gate. |
Concrete Class: "write" tools in the bag today:
host_restart_service— the canonical mutating skill.execute_skill— the marketplace shim that runs any marketplace-installed skill (the gate can't read the skill body to classify, so it gates them all).AgentTool— yes, the dispatch primitive itself is"write"because it spawns a worker and therefore commits resources. The reviewer rarely gates AgentTool calls in practice though — see the persona's allowlist below.
The proposal payload
When intercepting, the gate builds:
{
"action": "host_restart_service",
"target": {"device_id": 7, "service": "nginx"},
"reason": "User reported 502; nginx error log shows OOM",
"blast_radius": "single_device",
"operator": "user_42",
"context_summary": "<coordinator's running summary of the conversation>"
}blast_radius is one of single_device | cluster | tenant_wide, derived from the tool's metadata. The reviewer uses it to weight the "is this reasonable" decision — a tenant_wide action faces higher scrutiny than a single_device one.
The reviewer persona
Frontmatter highlights (verbatim from agents/reviewer.md):
name: reviewer
description: SOP 二审 reviewer worker,对 mutating / destructive 提案做静态审查
permission_mode: read-only
max_turns: 5
model: anthropic/claude-opus-4-7 # 关键路径用最强
background: true # async: spawn returns immediately
tools:
- get_incident_detail
- get_edge_summary
- query_promql
- query_logql
- get_sop_text # planned (HLD-003 PR-D)
disallowed_tools:
- "*_skill" # wildcard — no skill execution
- run_shell
- execute_skill
- host_restart_service # explicitly: reviewer cannot itself act
- kill_process
critical_reminder: |
你是高危操作二审 reviewer。reject 是默认选项,approve 必须三条都满足:
1. 找得到对应 SOP 且明确覆盖此场景
2. 当前没有并行的同类操作(看告警 / 看运维窗口)
3. 回滚路径已知Key observations:
max_turns: 5. Reviewers don't beat around the bush. Five turns to decide; the persona prompt explicitly says "看不清就 reject".model: anthropic/claude-opus-4-7. The critical-path call gets the strongest model, regardless of the org's default. Reviews shouldn't be downgraded to a fast model — the cost of a wrong approval vastly outweighs the cost of a slow review.background: true. Reviewer runs async. The coordinator's UI doesn't block on the review; the review'stask_notificationposts back when done. (In current wiring, the inner tool blocks until the reviewer returns — but the UI doesn't, so the user sees "review pending" while it runs.)disallowed_toolswildcard*_skill. The reviewer cannot execute any skill. It can read context (get_incident_detail,get_edge_summary, telemetry queries) but never act.- No
run_shell, nohost_restart_service, nokill_process. Even if the persona body told it to, the runtime filter strips these tools. Defense in depth.
The 5-step workflow
From the persona body, the reviewer runs:
- Find the SOP.
get_sop_text(action)for the official procedure. No SOP → reject with"no SOP for action <X>". - Check target state.
get_edge_summary(device_id). Device offline / in a restart loop / last mutating action within 5min → reject. - Check for parallel operations.
query_logqlforaudit:entries touching the same target in the last 10 minutes. Parallel action → reject with"parallel operation detected; wait for X to complete". - Validate the reason against alerts.
get_incident_detailfor the referenced incident. Reason contradicts the alert content → reject ("operator may have misread the symptom"). - Decide.
- Approve when all three core gates pass. Output
{decision: approve, sop_id, rollback_path, gates_passed}. - Reject otherwise. Output
{decision: reject, reason, missing_gates}.
- Approve when all three core gates pass. Output
The three-of-three rule
The critical reminder is explicit:
reject 是默认选项,approve 必须三条都满足:
- 找得到对应 SOP 且明确覆盖此场景
- 当前没有并行的同类操作(看告警 / 看运维窗口)
- 回滚路径已知
"Not sure" is equivalent to "reject". This is a deliberate calibration — the cost asymmetry between a wrong approval (production damage) and a wrong rejection (one extra round-trip) heavily favors the latter.
Output format
The reviewer's final reply (posted back to the coordinator via <task-notification>):
**Decision: approve | reject**
**Gates**
- ✓ SOP-007 covers restart nginx
- ✓ node-01 status online; last mutating 17min ago
- ✓ No parallel operation
- ✓ Rollback: `systemctl start nginx`
**Notes**
{1-2 sentence risk note; included even on approve.}The decorator parses for Decision: approve (case-insensitive, must appear in the first non-blank line of the reply). Anything else — including ambiguous phrasing like "approve with caveats" — is treated as reject.
What happens on approve
The decorator:
- Writes a
chat_mutating_proposalsrow with status=approved, the reviewer's decision text, and the SOP id. - Lets the call fall through to the inner tool.
- The audit decorator (downstream in the chain) writes the execution row.
- The inner tool returns its normal result to the coordinator LLM.
The coordinator LLM sees a normal tool result and proceeds.
What happens on reject
The decorator:
- Writes a
chat_mutating_proposalsrow with status=rejected, the reviewer's reason, and the missing gates. - Returns
ErrReviewRejectedwrapping the reviewer's reason. - The audit decorator does not write an execution row (no execution happened).
The coordinator LLM sees an error message — review rejected: <reviewer reason> — and is expected to explain the situation to the user. It should not retry the same tool call; the dedupe protection plus a short-term LRU on rejection reasons keep the LLM from looping.
Edge cases
What if the user is the operator and admin?
Reviewer still runs. There is no "skip review because you're admin" flag. Auditability is the point — the reviewer's decision row is the paper trail that says "this action was approved against SOP-X with rollback Y at time T". Admin override would erase the trail.
What if no get_sop_text SOP exists?
Today the SOP corpus is sparse — get_sop_text is a planned tool (HLD-003 PR-D). In the current MVP the reviewer uses a placeholder and falls back to a heuristic: critical actions on production-tagged devices need explicit force_approve_no_sop: true in the proposal context, which the coordinator can only set after explicit user confirmation. This will be replaced by the SOP corpus once it's populated.
What if the reviewer times out?
The 60s gate timeout fires. The decorator returns ErrReviewTimeout. The coordinator LLM sees a timeout error and should report it to the user — not retry. Persistent reviewer timeouts are a manager-side problem (likely model latency); the audit log lets the SRE see the pattern.
What about read-only worker tools?
Workers' read-only tool calls (query_promql, get_host_load, etc.) have Class: "read" and never trip the gate. Only "write" / "destructive" tools do. The reviewer itself only carries Class: "read" tools, so a reviewer worker can't itself trip a nested ReviewGate.
Customizing the gate
Things you might change in a fork:
- The reviewer model. If your org doesn't have Anthropic access, rewrite the frontmatter
model:to the strongest model you do have. The gate uses whatever the persona declares. - The 60s ceiling. Override via
ReviewGateconstructor option if your reviewer regularly takes longer. Don't drop below 30s; the reviewer needs room for 5 turns × LLM + tool round-trips. - The reviewer persona body. Add team-specific checklists (PCI scope, change-freeze windows, on-call calendar). The gate doesn't care what's in the body as long as the final decision line is parseable.
Things you must not change:
- The
name: reviewer— the gate looks it up by name (DefaultReviewerAgent). - The
disallowed_toolswildcard*_skill— the reviewer must not itself perform actions; defense in depth says the runtime filter enforces this, not the prompt.
For a per-tool custom reviewer (e.g. a db-specific one), the decorator constructor accepts a reviewerAgent override so different tools can route to different reviewer personas. Use it sparingly — multiple reviewers fragment the audit trail.