RCA (root cause)
When an alert fires, Ongrid spawns an LLM worker that drives the graph-kernel ReAct agent on the incident-investigator persona, calls tools to gather evidence, and writes a structured report back into investigation_reports.
The report renders on the SPA's /alerts/incidents/:id page next to the firing series — the human SRE never has to start "from a blank prompt."
HLD-013
The current pipeline lands HLD-013's Phase 1+2 causal model. The naive "summarise what fired" approach (PR-2) was replaced once it became clear that operators wanted patient zero (the specific process, container, or line that started the cascade), not a recap of the alarm text.
Lifecycle
incident.fire (isNew=true)
└─ alert.Usecase.RecordFiring
└─ Investigator.InvestigateAsync(incident)
└─ Enqueue → Gate 1-3 (severity, inflight, semaphore)
└─ repo.Create(pending row) ← UI shows "investigating…"
└─ go run(reportID, incident, dedupKey, locale)
├─ spawner.SpawnWorker(incident-investigator persona)
├─ worker drives the ReAct loop with tools
├─ Pass-2 structured extraction (cheap model)
└─ repo.MarkReady(report fields) ← UI shows ready
InvestigateAsync is the public seam that alert.Usecase calls; see usecase.go:301.
Gates
Three gates run before a worker is ever spawned. A concurrency-cap rejection is persisted as a status=skipped row so the SPA shows the operator a reason instead of "not started forever"; the other two gates drop the request silently, as the table shows.
| Gate | Default | Behaviour on miss |
|---|---|---|
| Severity floor (Config.MinSeverity) | warning | Silent skip, no row written |
| In-process inflight (per incident_id) | always on | Silent coalesce |
| Concurrency cap (Config.MaxConcurrent) | 5 | Writes a skipped row: concurrency limit reached (N workers in flight) |
The concurrency cap protects against LLM provider rate limits and bounds RAM use when, say, 100 incidents fire at once. Over-cap callers get an explicit skipped row rather than a place in a queue.
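The three gates can be sketched as a small admission check. This is a minimal illustration, not the investigator's actual code: `Gatekeeper`, `Admit`, and `Release` are hypothetical names, and the buffered channel is just one common way to implement a non-blocking concurrency cap in Go.

```go
package main

import (
	"fmt"
	"sync"
)

// Severity is ordered so that a higher value means a more severe incident.
type Severity int

const (
	Info Severity = iota
	Warning
	Critical
)

// Gatekeeper is a hypothetical stand-in for the investigator's gate logic.
type Gatekeeper struct {
	minSeverity Severity
	mu          sync.Mutex
	inflight    map[string]bool // keyed by incident_id
	sem         chan struct{}   // buffered; capacity plays the role of MaxConcurrent
}

func NewGatekeeper(min Severity, maxConcurrent int) *Gatekeeper {
	return &Gatekeeper{
		minSeverity: min,
		inflight:    make(map[string]bool),
		sem:         make(chan struct{}, maxConcurrent),
	}
}

// Admit reports whether a worker may be spawned. A non-empty reason means the
// caller should persist a skipped row; ok=false with an empty reason is a
// silent drop.
func (g *Gatekeeper) Admit(incidentID string, sev Severity) (ok bool, reason string) {
	if sev < g.minSeverity {
		return false, "" // gate 1: silent skip, no row written
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inflight[incidentID] {
		return false, "" // gate 2: silent coalesce
	}
	select {
	case g.sem <- struct{}{}: // gate 3: take a slot without blocking
		g.inflight[incidentID] = true
		return true, ""
	default:
		return false, fmt.Sprintf("concurrency limit reached (%d workers in flight)", len(g.sem))
	}
}

// Release frees the slot and the inflight guard. Call it exactly once for
// every successful Admit.
func (g *Gatekeeper) Release(incidentID string) {
	g.mu.Lock()
	delete(g.inflight, incidentID)
	g.mu.Unlock()
	<-g.sem
}

func main() {
	g := NewGatekeeper(Warning, 1)
	fmt.Println(g.Admit("inc-1", Critical)) // admitted
	fmt.Println(g.Admit("inc-1", Critical)) // coalesced silently
	fmt.Println(g.Admit("inc-2", Critical)) // rejected with a reason
}
```

The `select` with a `default` branch is what turns the cap into an immediate rejection instead of a queue: a full channel falls through to the skipped path rather than blocking the caller.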
The worker
The runtime spawns a worker with chatruntime.SpawnRequest:
// internal/manager/biz/alert/investigator/usecase.go:571
worker, err := uc.spawner.SpawnWorker(ctx, chatruntime.SpawnRequest{
    AgentName:   uc.cfg.AgentName, // "incident-investigator"
    Prompt:      prompt,
    Background:  false,
    SessionKind: "investigation",
})
The prompt is rendered by renderAlertPrompt and includes:
- Incident metadata (rule, severity, device_id, value, threshold, summary).
- An explicit starting instruction: Start with correlate_incident to pull metrics + logs + traces + topology around the fire window.
- A hard budget: 10 tool calls max, must start writing the report by call #7. This budget lives in the user message because non-frontier models (GLM, DeepSeek) follow user-message constraints more reliably than system-message ones; without it, the eino MaxStep cap was hit on every other run.
- A locale directive that overrides the persona's implicit language with the operator's UI locale (see Models / Routing for how the locale propagates).
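As a rough illustration of how those four parts might be assembled into one user message, here is a minimal sketch. `renderPrompt`, the `Incident` struct, and the exact wording of each line are assumptions made for this example; the real renderAlertPrompt may format things differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Budget numbers quoted in the list above.
const (
	maxToolCalls      = 10
	startReportByCall = 7
)

// Incident carries only the metadata fields the prompt is said to include.
type Incident struct {
	Rule, Severity, DeviceID, Summary string
	Value, Threshold                  float64
}

// renderPrompt is a hypothetical stand-in for renderAlertPrompt.
func renderPrompt(inc Incident, locale string) string {
	var b strings.Builder
	// 1. Incident metadata.
	fmt.Fprintf(&b, "Incident: rule=%s severity=%s device_id=%s value=%g threshold=%g\n",
		inc.Rule, inc.Severity, inc.DeviceID, inc.Value, inc.Threshold)
	fmt.Fprintf(&b, "Summary: %s\n\n", inc.Summary)
	// 2. Explicit starting instruction.
	b.WriteString("Start with correlate_incident to pull metrics + logs + traces + topology around the fire window.\n")
	// 3. Hard budget, stated in the user message on purpose.
	fmt.Fprintf(&b, "Budget: at most %d tool calls; start writing the report by call #%d.\n",
		maxToolCalls, startReportByCall)
	// 4. Locale directive that overrides the persona's implicit language.
	fmt.Fprintf(&b, "Write the report in locale %q.\n", locale)
	return b.String()
}

func main() {
	inc := Incident{
		Rule: "cpu_high", Severity: "critical", DeviceID: "db-1",
		Summary: "CPU above threshold for 5m", Value: 97.5, Threshold: 90,
	}
	fmt.Print(renderPrompt(inc, "en"))
}
```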
Tool budget + salvage
The eino ReAct graph caps total steps. When a worker exhausts the cap without writing a final answer, the investigator salvages the partial trail:
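- MessageReader.ListMessages(sessionID, limit=100) pulls the session's turns (up to 100).
- Assistant + tool messages are concatenated into a synthetic "what we found" markdown.
- The salvage is fed through the same Pass-2 extractor.
- The report is marked ready with a low-confidence note prepended: 工作器超出最大步数预算(exceeds max steps);以下为根据已收集工具结果的局部分析,置信度偏低。 (In English: "The worker exceeded its max step budget (exceeds max steps); what follows is a partial analysis based on the tool results already collected, with low confidence.")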
Without salvage, the operator saw status=failed with no useful data, even though the worker had typically called 10+ tools and gathered the answer; it just never wrote the synthesis turn.
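The concatenation step of the salvage path can be sketched as follows. `Message`, `salvage`, and the markdown layout are hypothetical; the sketch only shows the idea of keeping assistant and tool turns and dropping everything else.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a simplified, assumed shape for one stored chat turn.
type Message struct {
	Role    string // "user", "assistant", "tool", ...
	Content string
}

// salvage builds a synthetic "what we found" markdown from the partial trail.
// Only assistant and tool turns with non-blank content are kept.
func salvage(msgs []Message) string {
	var b strings.Builder
	b.WriteString("## What we found\n")
	for _, m := range msgs {
		if m.Role != "assistant" && m.Role != "tool" {
			continue
		}
		content := strings.TrimSpace(m.Content)
		if content == "" {
			continue
		}
		fmt.Fprintf(&b, "\n### %s\n%s\n", m.Role, content)
	}
	return b.String()
}

func main() {
	trail := []Message{
		{Role: "user", Content: "Investigate incident inc-1."},
		{Role: "assistant", Content: "Checking metrics."},
		{Role: "tool", Content: "cpu=97%"},
	}
	// The result would then go through the Pass-2 extractor like any
	// normal final answer.
	fmt.Print(salvage(trail))
}
```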
Structured extraction (Pass 2)
The worker's final assistant message is markdown — the SPA needs structured fields. A second, cheap LLM call extracts:
- root_cause: one-paragraph TL;DR.
- affected_window: when did the symptom span start / stop.
- pinpointed_target: the specific process / container / file that changed.
- related_alerts: co-firing incidents (via RelatedAlertQuerier).
- evidence: bulleted source quotes (PromQL output, log lines, etc.).
- suggested_actions: operator-runnable next steps.
- confidence + confidence_factors.
- tool_call_count: read back from chat_messages so the UI shows how many tools the worker actually invoked (not a hardcoded 0).
The model and provider are configurable via Config.SummarizerProvider / Config.SummarizerModel. The default timeout is 30s, which is enough for a short prompt and a short reply with no tool loop.
When the extractor is unwired or errors, the fallback uses firstParagraphOneLine over the worker's markdown to fill root_cause and ships the whole markdown verbatim as findings_md.
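To make the extract-or-fall-back behaviour concrete, here is a minimal sketch that assumes the extractor replies with a JSON object. The `Report` struct, its field types, the JSON wire format, and `parseReport` are all assumptions for illustration; `firstNonEmptyLine` is a deliberately simplified placeholder for firstParagraphOneLine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Report mirrors the fields listed above. Field types are assumed.
type Report struct {
	RootCause         string   `json:"root_cause"`
	AffectedWindow    string   `json:"affected_window"`
	PinpointedTarget  string   `json:"pinpointed_target"`
	RelatedAlerts     []string `json:"related_alerts"`
	Evidence          []string `json:"evidence"`
	SuggestedActions  []string `json:"suggested_actions"`
	Confidence        string   `json:"confidence"`
	ConfidenceFactors []string `json:"confidence_factors"`
	ToolCallCount     int      `json:"tool_call_count"`
	FindingsMD        string   `json:"findings_md"`
}

// firstNonEmptyLine is a simplified placeholder for firstParagraphOneLine.
func firstNonEmptyLine(md string) string {
	for _, line := range strings.Split(md, "\n") {
		if t := strings.TrimSpace(line); t != "" {
			return t
		}
	}
	return ""
}

// parseReport decodes the extractor output, falling back to the worker's raw
// markdown when the output is unusable.
func parseReport(extractorOut, workerMD string) Report {
	var r Report
	if err := json.Unmarshal([]byte(extractorOut), &r); err != nil || r.RootCause == "" {
		return Report{RootCause: firstNonEmptyLine(workerMD), FindingsMD: workerMD}
	}
	return r
}

func main() {
	ok := parseReport(`{"root_cause":"disk full on db-1","tool_call_count":4}`, "ignored")
	fmt.Println(ok.RootCause, ok.ToolCallCount)

	fallback := parseReport("not json", "Disk filled up at 14:02.\nMore detail.")
	fmt.Println(fallback.RootCause)
}
```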
Bold-header trap
firstParagraphOneLine (usecase.go:846) skips pure markdown scaffolding (headings, dividers, fully-bold section titles) so root_cause reads as a sentence, not as **现象** ("Symptom"). A prior bug stripped only the leading ** and left the trailing pair; that is fixed in the same function.
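The skipping rule can be illustrated with a small sketch. This is not the code at usecase.go:846; `isScaffolding` and `firstProseLine` are hypothetical names, and the heuristics (a `#` prefix for headings, runs of `-`, `*`, `_` for dividers, a single `**…**` pair for bold titles) are assumptions about what counts as scaffolding.

```go
package main

import (
	"fmt"
	"strings"
)

// isScaffolding reports whether a line is a heading, a divider, or a
// fully-bold section title (optionally followed by a colon).
func isScaffolding(line string) bool {
	t := strings.TrimSpace(line)
	if strings.HasPrefix(t, "#") {
		return true // heading
	}
	if len(t) >= 3 && strings.Trim(t, "-*_ ") == "" {
		return true // divider such as --- or ***
	}
	t = strings.TrimRight(t, "::") // tolerate ASCII and full-width colons
	if len(t) > 4 && strings.HasPrefix(t, "**") && strings.HasSuffix(t, "**") {
		// Fully bold only if there is no other ** pair inside.
		return !strings.Contains(t[2:len(t)-2], "**")
	}
	return false
}

// firstProseLine returns the first real paragraph, joined onto one line.
func firstProseLine(md string) string {
	var para []string
	for _, line := range strings.Split(md, "\n") {
		t := strings.TrimSpace(line)
		if t == "" || isScaffolding(t) {
			if len(para) > 0 {
				break // the paragraph has ended
			}
			continue // still skipping leading scaffolding
		}
		para = append(para, t)
	}
	return strings.Join(para, " ")
}

func main() {
	md := "## Findings\n\n**现象**\n\nCPU on db-1 spiked\nafter the 14:02 deploy.\n\nMore text."
	fmt.Println(firstProseLine(md))
}
```

Checking both the leading and the trailing `**` in one condition is what avoids the half-stripped title described above.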
Manual re-trigger
POST /v1/alerts/incidents/{id}/investigation
Accept-Language: en
ForceEnqueue runs the manual path:
- Stop any currently-running worker for this incident (best-effort — warns and continues if the worker_id is stale post-restart).
- Hard-delete the prior investigation_reports row (covers soft-deleted rows from earlier force-enqueues so the unique incident_id index doesn't reject the next Create).
- Release the inflight guard.
- Call EnqueueWith with the locale parsed from Accept-Language so the regenerated report comes back in the operator's UI language.
The severity floor still applies: manual triggers on info-level incidents return a "severity below floor" error.
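The ordering of the manual path matters, so a compact sketch may help. The `forcer` type and its function fields are hypothetical and exist only to keep the example self-contained; the real ForceEnqueue talks to the worker runtime and the report repository.

```go
package main

import (
	"errors"
	"fmt"
)

// errStale simulates a worker_id that no longer exists after a restart.
var errStale = errors.New("stale worker_id")

// forcer bundles the collaborators as plain function fields.
// All names here are hypothetical.
type forcer struct {
	stopWorker func(incidentID string) error
	hardDelete func(incidentID string) error
	enqueue    func(incidentID, locale string) error
	inflight   map[string]bool
}

func (f *forcer) ForceEnqueue(incidentID, locale string) error {
	// 1. Best-effort stop: a stale worker must not block the re-run.
	if err := f.stopWorker(incidentID); err != nil {
		fmt.Printf("warn: stop worker for %s: %v (continuing)\n", incidentID, err)
	}
	// 2. Hard-delete so the unique incident_id index accepts the next Create.
	if err := f.hardDelete(incidentID); err != nil {
		return fmt.Errorf("delete prior report: %w", err)
	}
	// 3. Release the inflight guard so gate 2 does not coalesce the re-run.
	delete(f.inflight, incidentID)
	// 4. Re-enqueue with the operator's locale; the remaining gates,
	//    including the severity floor, are enforced inside enqueue.
	return f.enqueue(incidentID, locale)
}

func main() {
	f := &forcer{
		stopWorker: func(string) error { return errStale },
		hardDelete: func(string) error { return nil },
		enqueue: func(id, locale string) error {
			fmt.Printf("enqueued %s (locale=%s)\n", id, locale)
			return nil
		},
		inflight: map[string]bool{"inc-42": true},
	}
	if err := f.ForceEnqueue("inc-42", "en"); err != nil {
		fmt.Println("error:", err)
	}
}
```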
Boot-time backfill
Fresh installs hit a chicken-and-egg problem: the structured RCA chain only wires when at least one LLM provider is configured, but incidents can fire before the operator adds a provider. BackfillUnstartedIncidents runs at boot, walks ListIncidentsWithoutReport(since, limit), and re-enqueues anything that fired in the window. The normal gates still apply, so resolved incidents and below-floor incidents are skipped.
See cmd/ongrid/main.go for the wiring — it's run at startup with a 24-hour window.
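A boot-time backfill of this shape can be sketched in a few lines. `memSource`, `backfill`, `bootTime`, and `errGated` are hypothetical names for this example, and the in-memory slice stands in for the real incident repository; only the 24-hour window comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backfillWindow is the 24-hour window mentioned above.
const backfillWindow = 24 * time.Hour

// bootTime is fixed so the demo is deterministic.
var bootTime = time.Date(2024, 1, 2, 0, 0, 0, 0, time.UTC)

// errGated simulates a rejection by one of the normal gates.
var errGated = errors.New("rejected by gate")

type Incident struct {
	ID      string
	FiredAt time.Time
}

// memSource is an in-memory stand-in for the incident repository.
type memSource []Incident

// ListIncidentsWithoutReport returns up to limit incidents fired at or after since.
func (m memSource) ListIncidentsWithoutReport(since time.Time, limit int) []Incident {
	var out []Incident
	for _, inc := range m {
		if len(out) == limit {
			break
		}
		if !inc.FiredAt.Before(since) {
			out = append(out, inc)
		}
	}
	return out
}

// backfill re-enqueues every unreported incident in the window and returns
// how many investigations were actually started.
func backfill(src memSource, now time.Time, limit int, enqueue func(Incident) error) int {
	started := 0
	for _, inc := range src.ListIncidentsWithoutReport(now.Add(-backfillWindow), limit) {
		if err := enqueue(inc); err != nil {
			continue // a gate rejection is expected here, not fatal
		}
		started++
	}
	return started
}

func main() {
	src := memSource{
		{ID: "old", FiredAt: bootTime.Add(-48 * time.Hour)},
		{ID: "recent", FiredAt: bootTime.Add(-2 * time.Hour)},
	}
	n := backfill(src, bootTime, 100, func(inc Incident) error {
		fmt.Println("enqueue", inc.ID)
		return nil
	})
	fmt.Println("started:", n)
}
```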
See also
- Alerts: what fires.
- Topology: what expand_topology exposes for the investigator's blast-radius walk.
- Skills: correlate_incident, get_incident_detail, query_promql, search_logs, the tools the worker calls.
- Models / Routing: how the per-investigation locale + provider is resolved.
- Models / Budget: the global per-day token cap that bounds investigation cost.