Incident investigator
incident-investigator is the deepest of Ongrid's workers. The coordinator dispatches to it whenever the user wants root cause, not just a symptom summary. The persona walks the causal chain from the observed alert back to the originating source ("patient zero" — 0 号病人 in the persona prompt) and returns a structured report.
This is the persona behind RCA
The HLD-013 causal RCA pipeline (test env, May 2026 rollout) is built on this persona. When /incident/<id> → Get RCA runs, the manager spawns this worker with the incident id pre-filled into the prompt.
When the coordinator picks it
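The persona file declares the trigger patterns. The frontmatter, translated here from the Chinese original:
when_to_use: |
  The coordinator spawns this worker when the user asks things like:
  • "What is the root cause of this alert / who actually caused it?"
  • "How do I troubleshoot incident 123 / what is the blast radius / how long did it last?"
  • "Is this alert a false positive / is it related to the previous one?"
  • "Memory spiked on this machine, take a look"
In short: any time the user asks why rather than what. The investigator's deliverable is the causal chain back to the source, not a "current state" snapshot.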
Tool bag
Whitelisted in the persona — the runtime filter strips everything else:
tools:
- query_knowledge # KB / vault / uploads
- get_incident_detail
- query_incidents
- correlate_incident # metric+log+trace pulled together
- query_change_events # config / rule / device mutations
- query_promql
- query_logql
- query_traceql
- get_edge_summary
- query_alert_rules
- query_devices
- get_host_load
- get_host_processes
- expand_topology
- find_topology_node
- host_find_large_files
- host_du_summary
- host_stat_file
disallowed_tools:
- execute_skill
- host_restart_service
- run_shell
permission_mode: read-only
Key consequences:
- Read-only. No host_restart_service, no execute_skill, no run_shell. If the investigation concludes a service needs restarting, it returns that as a proposal to the coordinator; the coordinator dispatches specialist-ops with a mutating intent, and the reviewer gates it.
- Topology and change events. The unique edge it has over the specialists is expand_topology + find_topology_node (walk the service graph back to upstream sources) and query_change_events (correlate symptoms with recent config / rule / device mutations around the alert's fired_at).
- Cross-host bash and per-host probes are absent. Those live on the specialists (specialist-network, specialist-compute, specialist-disk); the investigator coordinates over the observability data plane.
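The whitelist behaviour described above can be pictured with a small filter. This is a minimal sketch, not Ongrid's runtime code: the function name filter_tools and the registered list are hypothetical, and only a subset of the persona's tools is shown.

```python
# Illustrative only: a whitelist/denylist filter in the spirit of the
# persona's `tools` and `disallowed_tools` lists. Names are hypothetical.
ALLOWED = {"query_knowledge", "get_incident_detail", "query_promql", "expand_topology"}
DISALLOWED = {"execute_skill", "host_restart_service", "run_shell"}


def filter_tools(registered, allowed, disallowed):
    """Keep only registered tools that are whitelisted and not explicitly denied."""
    return [name for name in registered if name in allowed and name not in disallowed]


# Assumed example: what the runtime might have registered globally.
registered = ["query_promql", "run_shell", "expand_topology",
              "host_restart_service", "query_knowledge"]
visible = filter_tools(registered, ALLOWED, DISALLOWED)
print(visible)
```

Checking the denylist in addition to the whitelist is redundant when the two sets are disjoint, but it keeps a mutating tool out even if it is whitelisted by mistake.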
The 5-step workflow
The persona body encodes the workflow. Every investigation runs:
0. KB first (mandatory). Once incident_id is in hand, call query_knowledge exactly once with the rule name + symptom as the natural-language query (e.g. "swap_high 告警怎么排查", i.e. "how to troubleshoot a swap_high alert"). A hit (score ≥ 0.6) means follow the playbook; the final answer carries a (参考 KB: <title>) citation ("reference KB"). A miss means proceed to step 1.
1. Symptom + blast radius. Call get_incident_detail for the rule name / severity / target / fired_at / labels. This is the end of the causal chain (the effect), not the root cause. Do not stop here.
2. Timeline. correlate_incident pulls metric + log + trace for the same incident window in one call. Sort by fired_at / first deviation time. The earliest signal is the source candidate; downstream high-CPU / high-latency is usually effect, not cause. "Loudest" ≠ "earliest".
3. One causal hop upstream. Pick one tool with a clear purpose:
   - What changed? → query_change_events(around_ts=fired_at). Product-side changes are often patient zero.
   - Dependencies? → expand_topology upstream (not downstream blast radius) / find_topology_node.
   - Trace caller chain? → query_traceql to find the slowest span's originator.
   - First error? → query_logql, grepping by device_id for the earliest ERROR / PANIC / OOM before fired_at.
   - Who moved first? → query_promql to find the metric that deviated first.
4. Recurse. Treat the upstream candidate as the new current point. Repeat step 3 until one of:
   - Hit bottom: no in-system upstream remains. The leaf is a process / a single change / an external dependency = patient zero.
   - Signal exhausted: can't go further. Report "deepest layer reached + what signal would let us continue".
5. Validate. The proposed root cause must explain the entire downstream chain: temporally prior to the symptom, magnitude and direction consistent. If not, downgrade to "hypothesis" and say so.
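The "earliest, not loudest" rule from the timeline step and the temporal part of the validation step can be sketched as follows. The Signal type, its fields, and the sample values are made up for illustration; the real correlate_incident payload is not shown in this page.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    first_deviation_ts: int  # epoch seconds; hypothetical field name
    magnitude: float         # distance from baseline, only used to find the "loudest"


def source_candidate(signals):
    """Pick the signal that deviated first, regardless of how large it is."""
    return min(signals, key=lambda s: s.first_deviation_ts)


def precedes_symptom(candidate, fired_at):
    """Temporal check only: a proposed cause must start before the alert fired."""
    return candidate.first_deviation_ts < fired_at


# Fabricated sample timeline for one incident window.
signals = [
    Signal("cpu_high", 1_700_000_600, 9.0),
    Signal("swap_used", 1_700_000_120, 2.5),
    Signal("latency_p99", 1_700_000_480, 6.0),
]
fired_at = 1_700_000_900

loudest = max(signals, key=lambda s: s.magnitude)
earliest = source_candidate(signals)
print(loudest.name, earliest.name, precedes_symptom(earliest, fired_at))
```

Here the loudest signal (cpu_high) is not the earliest one (swap_used), which is the case the workflow warns about. The magnitude and direction checks from the validation step are left out of this sketch.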
The 18-tool budget — dig deep, never spin
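The persona enforces a strict iteration discipline. From the body (translated from Chinese):
You have a budget of ~18 tool calls (enough to walk 4-6 layers upstream). Deep digging is allowed, but cut dead branches immediately:
- A tool returns empty (result:[] / streams:[]): on the first empty result you may change approach; on the second, stop this line at once and either change direction or end the upstream walk here.
- The same tool fails / returns empty ≥2 times → you must switch tools or direction; spinning by rephrasing the expression over and over is forbidden (this is where the v0.7.51-55 failures all went wrong).
- Every step must move "one more layer upstream". Before each call, ask yourself: "does this step get me closer to the source?"
- If you have walked 4-5 layers upstream without hitting bottom, or the budget is at ~15: stop and output "deepest layer so far + missing signal"; do not spin just to use up the budget.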
The max_turns: 40 cap in the frontmatter is the hard ceiling — eino's graph counts MaxStep = MaxIterations*2+2, so 40 → MaxStep=82 → roughly 41 ChatModel turns. The 18-tool budget is the softer guidance in the prompt; the cap is the runtime safety net.
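The arithmetic behind the cap can be checked directly. The formula is the one quoted above; the assumption that one ChatModel turn consumes roughly two graph steps is inferred from the "82 → roughly 41" figure, not from eino's source.

```python
def max_step(max_iterations: int) -> int:
    """MaxStep = MaxIterations*2+2, as quoted in the text."""
    return max_iterations * 2 + 2


def approx_chat_turns(max_iterations: int) -> int:
    """Assumption: each ChatModel turn uses about two graph steps."""
    return max_step(max_iterations) // 2


# 40 is the shipped max_turns; 30 and 25 are the tightened values discussed under Tuning.
for turns in (40, 30, 25):
    print(turns, max_step(turns), approx_chat_turns(turns))
```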
Dead branches are non-negotiable
The investigator's prompt makes this an explicit "do not do" rule because the early v0.7.51-55 evals saw the worker burn 30+ turns re-permuting PromQL expressions trying to make an empty range query return data. A single empty result is information; a second empty result means you've already learned what you'll learn — pivot or ascend.
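One way to picture the dead-branch rule is a small tracker that counts calls and per-tool empty results. This is an illustrative sketch only: in Ongrid the discipline lives in the prompt, and the class name, thresholds as parameters, and return values here are hypothetical.

```python
class BranchBudget:
    """Tracks total tool calls and empty results per tool (illustrative only)."""

    def __init__(self, stop_at=15, max_empty=2):
        self.stop_at = stop_at      # "budget at ~15: stop" from the prompt
        self.max_empty = max_empty  # second empty result kills the branch
        self.calls = 0
        self.empty = {}

    def record(self, tool, result):
        """Record one call and return 'continue', 'pivot', or 'stop'."""
        self.calls += 1
        is_empty = not result
        if is_empty:
            self.empty[tool] = self.empty.get(tool, 0) + 1
        if self.calls >= self.stop_at:
            return "stop"
        if is_empty and self.empty[tool] >= self.max_empty:
            return "pivot"
        return "continue"


budget = BranchBudget()
decisions = [
    budget.record("query_promql", []),              # first empty: may rephrase once
    budget.record("query_promql", []),              # second empty: abandon this line
    budget.record("query_logql", ["OOM killed"]),   # fabricated non-empty result
]
print(decisions)
```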
Output format
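The final reply to the coordinator is verbatim Markdown of this shape. The four headings are literal Chinese strings: 根因(0 号病人) = root cause (patient zero), 因果链 = causal chain, 现象 = symptom, 置信度与验证 = confidence and validation.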
**根因(0 号病人)**
{One-line patient zero — process / change / upstream service+node /
config. Concrete identifiers (pid + cmdline / service name / change
key); this is the source of pinpoint_target. If we didn't hit bottom:
"未触底,最深到 X;要继续需 Y 信号" ("did not hit bottom, deepest reached is X; continuing needs signal Y").}
**因果链**
{Source → … → alert symptom. One line per hop, each with "why this
caused the next" plus evidence (PromQL / LogQL / trace span /
process line).}
**现象**
{1-2 sentences: when did it start / which host / what crossed
threshold / for how long.}
**置信度与验证**
{High / medium / low + reason. Plus: "what query or action would
further validate or falsify this root cause".}
The coordinator synthesizes this into the user-facing reply (one language, one paragraph, the citation if KB hit). The investigator's raw Markdown is also persisted as the worker's Result in the session; the RCA UI surfaces it verbatim under the Reasoning disclosure.
The F1 e2e test
F1 is the end-to-end eval that exercises this persona against a seeded incident on the test environment. Shape:
- Seed a synthetic swap_high incident on node-01 (device_id=7) with a searxng process pinned at 95% RSS for 30 minutes.
- Wire the alert pipeline so the incident fires with the expected labels + fired_at.
- Dispatch the investigator with prompt = "rca incident 1234".
- Assert the final reply contains:
  - swap_high rule mentioned in 现象.
  - searxng process identified in 根因(0 号病人) with a pid + command line.
  - At least 2 causal hops in 因果链 (symptom → upstream).
  - (参考 KB: …) citation IF the seeded KB has a matching playbook.
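The assertions above could be expressed roughly as follows. This is a sketch, not the actual F1 harness: split_sections, check_f1, the pid regex, and the sample reply are all assumptions, and hops are counted as non-empty lines because the output format asks for one line per hop.

```python
import re

HEADINGS = ["根因(0 号病人)", "因果链", "现象", "置信度与验证"]


def split_sections(reply):
    """Split the report on its bold headings into {heading: body text}."""
    sections, current = {}, None
    for line in reply.splitlines():
        m = re.fullmatch(r"\*\*(.+?)\*\*", line.strip())
        if m and m.group(1) in HEADINGS:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}


def check_f1(reply, kb_has_playbook):
    """Return a list of failed expectations; an empty list means the reply passes."""
    s = split_sections(reply)
    failures = []
    if "swap_high" not in s.get("现象", ""):
        failures.append("rule missing from symptom section")
    root = s.get("根因(0 号病人)", "")
    if "searxng" not in root or not re.search(r"pid\s*[=:]?\s*\d+", root):
        failures.append("process or pid missing from root cause")
    hops = [ln for ln in s.get("因果链", "").splitlines() if ln.strip()]
    if len(hops) < 2:
        failures.append("fewer than 2 causal hops")
    if kb_has_playbook and "(参考 KB:" not in reply:
        failures.append("KB citation missing")
    return failures


# Fabricated reply for illustration; pid and cmdline are made up.
sample = "\n".join([
    "**根因(0 号病人)**",
    "searxng process, pid 4242, cmdline: searxng-run",
    "**因果链**",
    "searxng RSS growth -> memory pressure on node-01",
    "memory pressure -> swap usage crosses the swap_high threshold",
    "**现象**",
    "swap_high fired on node-01 and stayed above threshold for 30 minutes.",
    "**置信度与验证**",
    "High. Compare RSS of pid 4242 across the incident window.",
    "(参考 KB: swap_high playbook)",
])
print(check_f1(sample, kb_has_playbook=True))
```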
The first version of HLD-013 failed F1 because default_provider wasn't set in the DB — the resolver fell back to openai with a glm model name, the chat model errored, and the worker returned an empty analysis. The lesson: F1 also covers the LLM resolver as a side-effect, which is why it's the canonical "did RCA actually wire end-to-end?" gate.
Common reasons it stops short
The persona returns "patient zero not reached" honestly when:
- The cause is outside the cluster: DNS provider, upstream API, power. query_change_events doesn't see infra changes outside the manager's scope.
- Trace data is missing: TraceQL queries return no spans for the relevant service. The investigator can't walk a caller chain without traces; it reports "缺失 trace 信号" ("missing trace signal").
- The log line that would point to the trigger has rotated: Loki retention is shorter than the time-to-first-investigate. The investigator says so and recommends extending retention for repeat investigations.
This honesty is by design. A confident wrong answer is worse than "we hit signal X and need Y to continue".
Tuning
Things you'd realistically change in a fork of this persona:
- Add domain-specific KB hits: write playbooks for your common failure modes and ship them via the vault; the investigator's KB first step will discover them.
- Adjust the tool whitelist: add query_traceql variants if your tracing stack isn't Tempo, or remove host_du_summary if you don't want the investigator to run disk inspection inline (the default is to delegate to specialist-disk via the worker's own output, but the persona does carry the tools to inspect itself when fast).
- Tighten the budget: if your model's per-token cost matters, drop max_turns to 25-30. The persona body's 18-call soft budget already keeps most investigations under that.
See Custom agents for how to mount your fork over the built-in.