
Incident investigator

incident-investigator is the deepest of Ongrid's workers. The coordinator dispatches to it whenever the user wants root cause, not just a symptom summary. The persona walks the causal chain from the observed alert back to the originating source ("patient zero" — 0 号病人 in the persona prompt) and returns a structured report.

This is the persona behind RCA

The HLD-013 causal RCA pipeline (test env, May 2026 rollout) is built on this persona. When /incident/<id> → Get RCA runs, the manager spawns this worker with the incident id pre-filled into the prompt.

When the coordinator picks it

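The persona file declares the trigger patterns. The frontmatter, translated here from the Chinese original:

yaml
when_to_use: |
  The coordinator spawns this worker when the user asks things like:
    • "What is the root cause of this alert / who actually caused it?"
    • "How do I investigate incident 123 / what is the blast radius / how long has it lasted?"
    • "Is this alert a false positive / is it related to the previous one?"
    • "Memory spiked on this machine, take a look"

In short: any time the user asks why rather than what. The investigator's deliverable is the causal chain back to the source, not a "current state" snapshot.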

Tool bag

Whitelisted in the persona — the runtime filter strips everything else:

yaml
tools:
  - query_knowledge        # KB / vault / uploads
  - get_incident_detail
  - query_incidents
  - correlate_incident     # metric+log+trace pulled together
  - query_change_events    # config / rule / device mutations
  - query_promql
  - query_logql
  - query_traceql
  - get_edge_summary
  - query_alert_rules
  - query_devices
  - get_host_load
  - get_host_processes
  - expand_topology
  - find_topology_node
  - host_find_large_files
  - host_du_summary
  - host_stat_file

disallowed_tools:
  - execute_skill
  - host_restart_service
  - run_shell

permission_mode: read-only

Key consequences:

  • Read-only. No host_restart_service, no execute_skill, no run_shell. If the investigation concludes a service needs restarting, it returns that as a proposal to the coordinator; the coordinator dispatches specialist-ops with a mutating intent, and the reviewer gates it.
  • Topology and change events. The unique edge it has over the specialists is expand_topology + find_topology_node (walk the service graph back to upstream sources) and query_change_events (correlate symptoms with recent config / rule / device mutations around the alert's fired_at).
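  • Cross-host bash and deeper per-host probes are absent. The whitelist carries only a handful of read-only host tools (get_host_load, get_host_processes, host_find_large_files, host_du_summary, host_stat_file); anything beyond that lives on the specialists (specialist-network, specialist-compute, specialist-disk). The investigator coordinates over the observability data plane.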
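The whitelist behaviour described above can be sketched in a few lines. This is a minimal illustration, not the actual runtime filter: the `filter_tools` function, the registry shape (tool name mapped to a callable), and the rule that a disallowed name always wins are assumptions made for the example; only the tool names come from the frontmatter.

```python
# Tool names copied from the persona frontmatter (abbreviated list).
ALLOWED = {"query_knowledge", "get_incident_detail", "query_promql", "expand_topology"}
DISALLOWED = {"execute_skill", "host_restart_service", "run_shell"}


def filter_tools(registry: dict, allowed: set, disallowed: set) -> dict:
    """Keep only whitelisted tools; a disallowed name is always dropped.

    Hypothetical helper: the real runtime filter may work differently.
    """
    return {
        name: tool
        for name, tool in registry.items()
        if name in allowed and name not in disallowed
    }


# Hypothetical registry of everything the runtime knows about.
registry = {
    "query_promql": lambda expr: [],
    "run_shell": lambda cmd: "",
    "host_restart_service": lambda svc: None,
    "expand_topology": lambda node: [],
    "some_future_tool": lambda: None,
}

visible = filter_tools(registry, ALLOWED, DISALLOWED)
print(sorted(visible))
```

Under these assumptions the worker only ever sees `expand_topology` and `query_promql` from this registry: mutating tools and tools that are simply not listed are both stripped.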

The 6-step workflow

The persona body encodes the workflow. Every investigation runs:

  1. KB first (mandatory). Once incident_id is in hand, call query_knowledge exactly once, with the rule name plus symptom as the natural-language query (e.g. "swap_high 告警怎么排查", i.e. "how to troubleshoot the swap_high alert"). A hit (score ≥ 0.6) means follow the playbook, and the final answer carries a (参考 KB: <title>) citation ("Reference KB: <title>"). A miss means proceed to step 2.
  2. Symptom + blast radius. get_incident_detail for the rule name / severity / target / fired_at / labels. This is the end of the causal chain (the effect), not the root cause. Do not stop here.
  3. Timeline. correlate_incident pulls metric + log + trace for the same incident window in one call. Sort by fired_at / first deviation time. The earliest signal is the source candidate; downstream high-CPU / high-latency is usually effect, not cause. "Loudest" ≠ "earliest".
  4. One causal hop upstream. Pick one tool with a clear purpose:
    • What changed? query_change_events(around_ts=fired_at). Product-side changes are often patient zero.
    • Dependencies? expand_topology upstream (not downstream blast radius) / find_topology_node.
    • Trace caller chain? query_traceql to find the slowest span's originator.
    • First error? query_logql grep by device_id for the earliest ERROR / PANIC / OOM before fired_at.
    • Who moved first? query_promql to find the metric that deviated first.
  5. Recurse. Treat the upstream candidate as the new current point. Repeat step 4 until one of:
    • Hit bottom — no in-system upstream remains. The leaf is a process / a single change / an external dependency = patient zero.
    • Signal exhausted — can't go further. Report "deepest layer reached + what signal would let us continue".
  6. Validate. The proposed root cause must explain the entire downstream chain — temporally prior to the symptom, magnitude and direction consistent. If not, downgrade to "hypothesis" and say so.
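The timeline and recursion steps can be sketched as follows. This is an illustration under stated assumptions, not the persona's implementation: the `Signal` type, the `upstream_of` map (a node mapped to `None` means no in-system upstream remains, a node missing from the map means no signal is available), and the `max_hops=6` default (taken from the "4-6 layers" guidance in the prompt) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Signal:
    source: str             # service, host, or change key (hypothetical)
    first_deviation: float  # unix seconds when the signal first deviated
    magnitude: float        # how "loud" the signal looks


def earliest(signals: List[Signal]) -> Signal:
    """Timeline step: the source candidate is the earliest signal, not the loudest."""
    return min(signals, key=lambda s: s.first_deviation)


def walk_upstream(
    start: str,
    upstream_of: Dict[str, Optional[str]],
    max_hops: int = 6,
) -> Tuple[List[str], str]:
    """Hop upstream one layer at a time until a stop condition is met."""
    chain = [start]
    current = start
    for _ in range(max_hops):
        if current not in upstream_of:
            return chain, "signal_exhausted"  # cannot go further
        nxt = upstream_of[current]
        if nxt is None:
            return chain, "hit_bottom"        # current node is patient zero
        chain.append(nxt)
        current = nxt
    return chain, "signal_exhausted"          # hop limit reached


signals = [
    Signal("api-gateway latency", 1000.0, 9.0),   # loudest
    Signal("db connection errors", 940.0, 3.0),   # earliest
]
candidate = earliest(signals)

upstream_of = {
    "api-gateway latency": "db connection errors",
    "db connection errors": "config change",
    "config change": None,
}
chain, status = walk_upstream("api-gateway latency", upstream_of)
partial_chain, partial_status = walk_upstream("a", {"a": "b"})
```

Here `candidate` is the database signal even though the gateway latency is larger, and the walk ends with `hit_bottom` at the config change. The second call stops after one hop with `signal_exhausted`, which corresponds to the "deepest layer reached" report.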

The 18-tool budget — dig deep, never spin

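The persona enforces a strict iteration discipline. From the body (translated from the Chinese original):

You have a budget of ~18 tool calls (enough to walk 4-6 layers upstream). Digging deep is allowed, but cut dead branches immediately:

  • A tool returns empty (result:[] / streams:[]): on the first empty result you may change approach; on the second, stop this line immediately and either switch direction or end the ascent here.
  • The same tool fails or returns empty ≥2 times → you must switch tools or direction. Spinning by re-permuting expressions is forbidden (this is exactly where the v0.7.51-55 failures came from).
  • Every step must move one layer further upstream. Before each call, ask yourself: "Does this step get me closer to the source?"
  • If you are 4-5 layers up and still have not hit bottom, or the budget used reaches ~15: stop and output "deepest layer so far + missing signal". Do not spin just to use up the budget.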

The max_turns: 40 cap in the frontmatter is the hard ceiling — eino's graph counts MaxStep = MaxIterations*2+2, so 40 → MaxStep=82 → roughly 41 ChatModel turns. The 18-tool budget is the softer guidance in the prompt; the cap is the runtime safety net.
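The arithmetic behind the hard ceiling can be checked directly. The formula is the one quoted above; the helper names are hypothetical, and the "two graph steps per ChatModel turn" estimate is an assumption that simply mirrors the "roughly 41" figure.

```python
def max_step(max_iterations: int) -> int:
    """MaxStep = MaxIterations*2 + 2, as quoted for eino's graph."""
    return max_iterations * 2 + 2


def approx_chat_turns(max_iterations: int) -> int:
    """Rough estimate, assuming each ChatModel turn uses about two graph steps."""
    return max_step(max_iterations) // 2


for turns in (25, 30, 40):
    print(turns, max_step(turns), approx_chat_turns(turns))
```

With `max_turns: 40` this gives 82 steps and about 41 turns; lowering it to 25 gives 52 steps and about 26 turns, still above the 18-call soft budget.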

Dead branches are non-negotiable

The investigator's prompt makes this an explicit "do not do" rule because the early v0.7.51-55 evals saw the worker burn 30+ turns re-permuting PromQL expressions trying to make an empty range query return data. A single empty result is information; a second empty result means you've already learned what you'll learn — pivot or ascend.
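In the persona this discipline lives in the prompt, not in code. If you wanted to mirror it in a harness or an eval, a guard along these lines would do. It is a sketch under assumptions: the `BudgetGuard` class and its return values are hypothetical, empties are counted per tool name (the prompt also talks about a "line" of inquiry), and the thresholds 18 and 15 are the prompt's approximate figures.

```python
from collections import Counter


class BudgetGuard:
    """Hypothetical tracker for the soft tool budget and the dead-branch rule."""

    def __init__(self, soft_budget: int = 18, wrap_up_at: int = 15):
        self.soft_budget = soft_budget  # informational: the prompt's ~18 calls
        self.wrap_up_at = wrap_up_at    # the prompt's "~15: stop" threshold
        self.calls = 0
        self.empties = Counter()

    def record(self, tool: str, result) -> str:
        """Return 'continue', 'pivot', or 'stop' after a tool call."""
        self.calls += 1
        if self.calls >= self.wrap_up_at:
            return "stop"               # report deepest layer + missing signal
        if not result:                  # empty list / dict / None
            self.empties[tool] += 1
            if self.empties[tool] >= 2:
                return "pivot"          # second empty: switch tool or direction
        return "continue"


guard = BudgetGuard()
first = guard.record("query_promql", [])                  # first empty result
second = guard.record("query_promql", [])                 # second empty result
third = guard.record("query_logql", [{"line": "OOM"}])    # useful data
```

The first empty result still returns `continue` (a single empty result is information), the second returns `pivot`, and the guard returns `stop` once the fifteenth call is recorded regardless of what it returned.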

Output format

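The final reply to the coordinator is verbatim Markdown of this shape. The section headings stay in Chinese because they are literal strings in the persona's output: 根因(0 号病人) is "root cause (patient zero)", 因果链 is "causal chain", 现象 is "symptom", and 置信度与验证 is "confidence and validation". The fallback line "未触底,最深到 X;要继续需 Y 信号" reads "bottom not reached, deepest layer is X; continuing requires signal Y".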

markdown
**根因(0 号病人)**
{One-line patient zero — process / change / upstream service+node /
config. Concrete identifiers (pid + cmdline / service name / change
key) — this is the source of pinpoint_target. If we didn't hit bottom:
"未触底,最深到 X;要继续需 Y 信号".}

**因果链**
{Source → … → alert symptom. One line per hop, each with "why this
caused the next" plus evidence (PromQL / LogQL / trace span /
process line).}

**现象**
{1-2 sentences: when did it start / which host / what crossed
threshold / for how long.}

**置信度与验证**
{High / medium / low + reason. Plus: "what query or action would
further validate or falsify this root cause".}

The coordinator synthesizes this into the user-facing reply (one language, one paragraph, the citation if KB hit). The investigator's raw Markdown is also persisted as the worker's Result in the session — the RCA UI surfaces it verbatim under the Reasoning disclosure.
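To make the report shape concrete, here is a small sketch that renders the four sections in the order shown above. The `RcaReport` type and all the field values (pid, command line, wording of the hops) are invented for illustration; only the four headings come from the persona.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RcaReport:
    root_cause: str          # one-line patient zero with concrete identifiers
    causal_chain: List[str]  # one line per hop, source first
    symptom: str             # 1-2 sentences
    confidence: str          # level + reason + how to validate further

    def to_markdown(self) -> str:
        hops = "\n".join(self.causal_chain)
        return (
            f"**根因(0 号病人)**\n{self.root_cause}\n\n"
            f"**因果链**\n{hops}\n\n"
            f"**现象**\n{self.symptom}\n\n"
            f"**置信度与验证**\n{self.confidence}\n"
        )


# Hypothetical example values.
report = RcaReport(
    root_cause="searxng process, pid 4242, cmdline: python searx/webapp.py",
    causal_chain=[
        "searxng RSS growth -> memory pressure on node-01",
        "memory pressure -> swap usage crosses the swap_high threshold",
    ],
    symptom="swap_high fired on node-01 and stayed above threshold for 30 minutes.",
    confidence="medium: process RSS precedes the alert; confirm with get_host_processes.",
)
markdown = report.to_markdown()
print(markdown)
```

Keeping the headings as fixed strings in one place makes it easier for downstream consumers, such as the RCA UI or an eval, to locate each section.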

The F1 e2e test

F1 is the end-to-end eval that exercises this persona against a seeded incident on the test environment. Shape:

  1. Seed a synthetic swap_high incident on node-01 (device_id=7) with a searxng process pinned at 95% RSS for 30 minutes.
  2. Wire the alert pipeline so the incident fires with the expected labels + fired_at.
  3. Dispatch the investigator with prompt = "rca incident 1234".
  4. Assert the final reply contains:
    • swap_high rule mentioned in 现象.
    • searxng process identified in 根因(0 号病人) with a pid + command line.
    • At least 2 causal hops in 因果链 (symptom → upstream).
    • (参考 KB: …) citation IF the seeded KB has a matching playbook.
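The assertions in step 4 could be expressed roughly like this. It is a sketch, not the actual F1 harness: the `section` and `check_f1` helpers, the "one non-empty line per hop" counting rule, the pid regex, and the sample reply are all assumptions made for the example.

```python
import re
from typing import List


def section(reply: str, heading: str) -> str:
    """Return the text under a **heading** line, up to the next bold heading."""
    pattern = rf"\*\*{re.escape(heading)}\*\*\n(.*?)(?=\n\*\*|\Z)"
    match = re.search(pattern, reply, flags=re.S)
    return match.group(1).strip() if match else ""


def check_f1(reply: str, kb_has_playbook: bool) -> List[str]:
    """Return a list of failed checks; an empty list means the reply passes."""
    failures = []
    if "swap_high" not in section(reply, "现象"):
        failures.append("swap_high not mentioned in 现象")
    root = section(reply, "根因(0 号病人)")
    if "searxng" not in root or not re.search(r"pid\s*[=:]?\s*\d+", root, re.I):
        failures.append("searxng with pid not identified in 根因")
    hops = [line for line in section(reply, "因果链").splitlines() if line.strip()]
    if len(hops) < 2:
        failures.append("fewer than 2 causal hops")
    if kb_has_playbook and "(参考 KB:" not in reply:
        failures.append("missing KB citation")
    return failures


# Hypothetical reply used only to exercise the checks.
sample_reply = (
    "**根因(0 号病人)**\n"
    "searxng process, pid 4242, cmdline: python searx/webapp.py\n\n"
    "**因果链**\n"
    "searxng RSS growth -> memory pressure on node-01\n"
    "memory pressure -> swap usage crosses the swap_high threshold\n\n"
    "**现象**\n"
    "swap_high fired on node-01 and stayed above threshold for 30 minutes.\n\n"
    "**置信度与验证**\n"
    "medium\n"
)

no_kb = check_f1(sample_reply, kb_has_playbook=False)
with_kb = check_f1(sample_reply, kb_has_playbook=True)
```

The sample passes when no playbook is seeded and fails only the citation check when one is, which matches the conditional nature of the last assertion.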

The first version of HLD-013 failed F1 because default_provider wasn't set in the DB — the resolver fell back to openai with a glm model name, the chat model errored, and the worker returned an empty analysis. The lesson: F1 also covers the LLM resolver as a side-effect, which is why it's the canonical "did RCA actually wire end-to-end?" gate.

Common reasons it stops short

The persona returns "patient zero not reached" honestly when:

  • The cause is outside the cluster: DNS provider, upstream API, power. query_change_events doesn't see infra changes outside the manager's scope.
  • Trace data is missing: TraceQL queries return no spans for the relevant service. The investigator can't walk a caller chain without traces; it reports "缺失 trace 信号" ("missing trace signal").
  • The log line that would point to the trigger has rotated out, because Loki retention is shorter than the time between the incident and the first investigation. The investigator says so and recommends extending retention for repeat investigations.

This honesty is by design. A confident wrong answer is worse than "we hit signal X and need Y to continue".

Tuning

Things you'd realistically change in a fork of this persona:

  • Add domain-specific KB hits: write playbooks for your common failure modes and ship them via the vault, and the investigator's KB-first step will discover them.
  • Adjust the tool whitelist: add query_traceql variants if your tracing stack isn't Tempo, or remove host_du_summary if you don't want the investigator to run disk inspection inline. By default the persona hands disk work to specialist-disk through its own output, but it carries the host_* tools so it can inspect directly when that is faster.
  • Tighten the budget — if your model's per-token cost matters, drop max_turns to 25-30. The persona body's 18-call soft budget already keeps most investigations under that.

See Custom agents for how to mount your fork over the built-in.