RCA (root cause)

When an alert fires, Ongrid spawns an LLM worker that drives the graph-kernel ReAct agent on the incident-investigator persona, calls tools to gather evidence, and writes a structured report back into investigation_reports.

The report renders on the SPA's /alerts/incidents/:id page next to the firing series — the human SRE never has to start "from a blank prompt."

HLD-013

The current pipeline lands HLD-013's Phase 1+2 causal model. The naive "summarise what fired" approach (PR-2) was replaced once it became clear that operators wanted patient zero (the specific process / container / line that started the cascade), not a recap of the alarm text.

Lifecycle

```text
incident.fire (isNew=true)
    └─ alert.Usecase.RecordFiring
        └─ Investigator.InvestigateAsync(incident)
            └─ Enqueue → Gate 1-3 (severity, inflight, semaphore)
                └─ repo.Create(pending row)   ← UI shows "investigating…"
                └─ go run(reportID, incident, dedupKey, locale)
                    ├─ spawner.SpawnWorker(incident-investigator persona)
                    ├─ worker drives the ReAct loop with tools
                    ├─ Pass-2 structured extraction (cheap model)
                    └─ repo.MarkReady(report fields)  ← UI shows ready
```

InvestigateAsync is the public seam alert.Usecase calls — see usecase.go:301.

Gates

Three gates filter incidents before a worker is ever spawned. A concurrency-cap rejection is persisted as a status=skipped row so the SPA shows the operator a reason instead of "not started forever"; the severity and inflight gates skip silently and write no row.

| Gate | Default | Behaviour on miss |
| --- | --- | --- |
| Severity floor (Config.MinSeverity) | warning | Silent skip, no row written |
| In-process inflight (per incident_id) | always on | Silent coalesce |
| Concurrency cap (Config.MaxConcurrent) | 5 | skipped: concurrency limit reached (N workers in flight) row |

The concurrency cap protects against LLM provider rate limits and bounds RAM when 100 incidents fire at once. Over-cap callers get an explicit skipped row, not a place in a queue.
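The gate order can be sketched in a few lines of Go. This is a minimal illustration, not the investigator's actual code: the `Gates` type, the `Outcome` values, and the `Severity` encoding are all hypothetical names introduced here. It shows the one property the table depends on, namely that the concurrency cap uses a non-blocking acquire, so an over-cap caller is rejected immediately instead of waiting.

```go
package main

import "sync"

// Severity levels ordered from least to most severe (hypothetical encoding).
type Severity int

const (
	Info Severity = iota
	Warning
	Critical
)

// Outcome says what the gates decided for one enqueue attempt.
type Outcome string

const (
	Spawn        Outcome = "spawn"
	SilentSkip   Outcome = "silent-skip"    // below severity floor, no row written
	Coalesced    Outcome = "coalesced"      // already in flight, no row written
	SkippedAtCap Outcome = "skipped-at-cap" // caller persists a skipped row
)

// Gates is a hypothetical stand-in for the investigator's three checks.
type Gates struct {
	minSeverity Severity
	mu          sync.Mutex
	inflight    map[string]struct{} // keyed by incident_id
	sem         chan struct{}       // buffered; capacity is the concurrency cap
}

func NewGates(min Severity, maxConcurrent int) *Gates {
	return &Gates{
		minSeverity: min,
		inflight:    make(map[string]struct{}),
		sem:         make(chan struct{}, maxConcurrent),
	}
}

// Admit runs the gates in order and reserves a worker slot on success.
func (g *Gates) Admit(incidentID string, sev Severity) Outcome {
	if sev < g.minSeverity {
		return SilentSkip
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, busy := g.inflight[incidentID]; busy {
		return Coalesced
	}
	select {
	case g.sem <- struct{}{}: // non-blocking acquire: never queue
	default:
		return SkippedAtCap
	}
	g.inflight[incidentID] = struct{}{}
	return Spawn
}

// Release frees the slot and the inflight guard when a worker finishes.
func (g *Gates) Release(incidentID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.inflight[incidentID]; ok {
		delete(g.inflight, incidentID)
		<-g.sem
	}
}
```

With `NewGates(Warning, 1)`, a second distinct incident is rejected with `SkippedAtCap` until `Release` is called for the first one.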

The worker

The runtime spawns a worker with chatruntime.SpawnRequest:

```go
// internal/manager/biz/alert/investigator/usecase.go:571
worker, err := uc.spawner.SpawnWorker(ctx, chatruntime.SpawnRequest{
    AgentName:   uc.cfg.AgentName, // "incident-investigator"
    Prompt:      prompt,
    Background:  false,
    SessionKind: "investigation",
})
```

The prompt is rendered by renderAlertPrompt and includes:

  • Incident metadata (rule, severity, device_id, value, threshold, summary).
  • An explicit starting instruction: Start with correlate_incident to pull metrics + logs + traces + topology around the fire window.
  • A hard budget: 10 tool calls max, must start writing the report by call #7. This budget lives in the user message because non-frontier models (GLM, DeepSeek) follow user-message constraints more reliably than system-message ones — without it, the eino MaxStep cap was hit on every other run.
  • A locale directive that overrides the persona's implicit language with the operator's UI locale (see Models / Routing for how the locale propagates).
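As a rough illustration of how those four parts could be assembled into one user message, here is a sketch. It is not renderAlertPrompt itself: the `Incident` struct, the `renderPromptSketch` name, and the exact wording of the locale line are assumptions made for the example. Only the starting instruction and the budget numbers come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// Incident carries the metadata fields the prompt lists (hypothetical struct).
type Incident struct {
	Rule, Severity, DeviceID, Summary string
	Value, Threshold                  float64
}

// renderPromptSketch is an illustrative stand-in for renderAlertPrompt.
func renderPromptSketch(in Incident, locale string) string {
	var b strings.Builder
	// 1. Incident metadata.
	fmt.Fprintf(&b, "Incident: rule=%s severity=%s device_id=%s value=%g threshold=%g\n",
		in.Rule, in.Severity, in.DeviceID, in.Value, in.Threshold)
	fmt.Fprintf(&b, "Summary: %s\n\n", in.Summary)
	// 2. Explicit starting instruction.
	b.WriteString("Start with correlate_incident to pull metrics + logs + traces + topology around the fire window.\n")
	// 3. Hard budget, stated in the user message on purpose.
	b.WriteString("Budget: 10 tool calls max. Start writing the report by call #7.\n")
	// 4. Locale directive (wording is an assumption).
	if locale != "" {
		fmt.Fprintf(&b, "Write the final report in locale %q.\n", locale)
	}
	return b.String()
}
```

Keeping the budget line in the same string as the metadata is what puts it in the user message rather than the system message.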

Tool budget + salvage

The eino ReAct graph caps total steps. When a worker exhausts the cap without writing a final answer, the investigator salvages the partial trail:

  1. MessageReader.ListMessages(sessionID, limit=100) pulls the session's turns (up to 100).
  2. Assistant + tool messages are concatenated into a synthetic "what we found" markdown.
  3. The salvage is fed through the same Pass-2 extractor.
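  4. The report is marked ready with a low-confidence note prepended: 工作器超出最大步数预算(exceeds max steps);以下为根据已收集工具结果的局部分析,置信度偏低。 (In English: "The worker exceeded the maximum step budget (exceeds max steps); what follows is a partial analysis based on the tool results collected so far, with low confidence.")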

Without salvage the operator saw status=failed with no useful data — the worker had typically called 10+ tools and gathered the answer, it just never wrote the synthesis turn.
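Step 2 of the salvage path, concatenating assistant and tool turns into a synthetic markdown document, might look roughly like this. The `Message` struct, the `salvageMarkdown` name, and the heading layout are assumptions for illustration; the real message type and markdown shape may differ.

```go
package main

import "strings"

// Message mirrors the minimal shape of a stored chat turn (hypothetical).
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// salvageMarkdown joins assistant and tool turns into one markdown document
// that can be handed to the same Pass-2 extractor as a normal final answer.
func salvageMarkdown(msgs []Message) string {
	var b strings.Builder
	for _, m := range msgs {
		if m.Role != "assistant" && m.Role != "tool" {
			continue // other roles carry no gathered evidence
		}
		content := strings.TrimSpace(m.Content)
		if content == "" {
			continue // skip empty turns
		}
		b.WriteString("### " + m.Role + "\n\n" + content + "\n\n")
	}
	return b.String()
}
```

Because the output is plain markdown, the extractor does not need to know whether it is reading a finished report or a salvaged trail.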

Structured extraction (Pass 2)

The worker's final assistant message is markdown — the SPA needs structured fields. A second, cheap LLM call extracts:

  • root_cause — one-paragraph TL;DR.
  • affected_window — when did the symptom span start / stop.
  • pinpointed_target — the specific process / container / file that changed.
  • related_alerts — co-firing incidents (via RelatedAlertQuerier).
  • evidence — bulleted source quotes (PromQL output, log lines, etc.).
  • suggested_actions — operator-runnable next steps.
  • confidence + confidence_factors.
  • tool_call_count — read back from chat_messages so the UI shows how many tools the worker actually invoked (not a hardcoded 0).

The model and provider are configurable via Config.SummarizerProvider / Config.SummarizerModel. The call has a default 30s timeout (short prompt, short reply, no tool loop).

When the extractor is unwired or errors, the fallback uses firstParagraphOneLine over the worker's markdown to fill root_cause and ships the whole markdown verbatim as findings_md.
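The extracted fields and the fallback path can be pictured as a single struct plus one decode step. This sketch assumes the extractor replies with JSON. The `Report` struct, its Go types, the JSON tags, and the `buildReport` and `summarize` names are all hypothetical; only the field names come from the list above.

```go
package main

import "encoding/json"

// Report lists the Pass-2 fields named above; JSON tags and Go types are assumptions.
type Report struct {
	RootCause         string   `json:"root_cause"`
	AffectedWindow    string   `json:"affected_window"`
	PinpointedTarget  string   `json:"pinpointed_target"`
	RelatedAlerts     []string `json:"related_alerts"`
	Evidence          []string `json:"evidence"`
	SuggestedActions  []string `json:"suggested_actions"`
	Confidence        string   `json:"confidence"`
	ConfidenceFactors []string `json:"confidence_factors"`
	ToolCallCount     int      `json:"tool_call_count"`
	FindingsMD        string   `json:"findings_md"`
}

// buildReport parses the extractor's reply. If the reply is missing, invalid,
// or has no root cause, it falls back to a one-line summary of the worker's
// markdown and ships the markdown verbatim. summarize stands in for
// firstParagraphOneLine.
func buildReport(extractorJSON, workerMD string, toolCalls int, summarize func(string) string) Report {
	var r Report
	if err := json.Unmarshal([]byte(extractorJSON), &r); err != nil || r.RootCause == "" {
		r = Report{RootCause: summarize(workerMD), FindingsMD: workerMD}
	}
	// Counted from stored messages rather than taken from the model's reply.
	r.ToolCallCount = toolCalls
	return r
}
```

Overwriting `ToolCallCount` after decoding reflects the point made above: the count is read back from stored messages, so the UI never shows a model-supplied or hardcoded value.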

Bold-header trap

firstParagraphOneLine (usecase.go:846) skips pure markdown scaffolding (headings, dividers, fully-bold section titles) so root_cause reads as a sentence, not as a bare header such as **现象** ("Symptoms"). A prior bug stripped only the leading ** and left the trailing pair; it was fixed in the same function.
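A minimal sketch of that skipping rule follows. It is not the code at usecase.go:846; `isScaffolding` and `firstParagraphSketch` are hypothetical names, and the heuristics (a leading `#` is a heading, a line made only of `-`, `*`, `_` is a divider) are simplifying assumptions. The point it illustrates is that a fully-bold title is dropped as a whole line instead of having its markers stripped.

```go
package main

import "strings"

// isScaffolding reports whether a trimmed line is markdown structure, not prose.
func isScaffolding(line string) bool {
	switch {
	case line == "":
		return true
	case strings.HasPrefix(line, "#"): // heading
		return true
	case strings.Trim(line, "-*_ ") == "": // divider such as --- or ***
		return true
	}
	// Fully-bold title: **text**, optionally followed by an ASCII or full-width colon.
	t := strings.TrimRight(line, "::")
	return len(t) > 4 &&
		strings.HasPrefix(t, "**") &&
		strings.HasSuffix(t, "**") &&
		!strings.Contains(t[2:len(t)-2], "**")
}

// firstParagraphSketch returns the first prose paragraph joined onto one line.
func firstParagraphSketch(md string) string {
	var para []string
	for _, raw := range strings.Split(md, "\n") {
		line := strings.TrimSpace(raw)
		if isScaffolding(line) {
			if len(para) > 0 {
				break // the first paragraph has ended
			}
			continue // still skipping leading scaffolding
		}
		para = append(para, line)
	}
	return strings.Join(para, " ")
}
```

A line such as `**nginx** crashed after the deploy` is still treated as prose, because it does not end in `**`.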

Manual re-trigger

```http
POST /v1/alerts/incidents/{id}/investigation
Accept-Language: en
```

ForceEnqueue runs the manual path:

  1. Stop any currently-running worker for this incident (best-effort — warns and continues if the worker_id is stale post-restart).
  2. Hard-delete the prior investigation_reports row (covers soft-deleted rows from earlier force-enqueues so the unique incident_id index doesn't reject the next Create).
  3. Release the inflight guard.
  4. Call EnqueueWith with the locale parsed from Accept-Language so the regenerated report comes back in the operator's UI language.

The severity floor still applies: a manual trigger on an info-level incident returns a severity below floor error.
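The ordering of the manual path can be sketched with plain function fields standing in for the real collaborators. Everything named here (`deps`, `forceEnqueueSketch`, `errStaleWorker`) is hypothetical. The sketch only illustrates that a failed stop is logged and ignored, while a failed delete aborts before the guard is released.

```go
package main

import "errors"

// errStaleWorker models a worker_id that no longer exists after a restart.
var errStaleWorker = errors.New("stale worker_id")

// deps groups the collaborators as plain funcs; all names are hypothetical.
type deps struct {
	stopWorker   func(incidentID string) error
	hardDelete   func(incidentID string) error
	releaseGuard func(incidentID string)
	enqueueWith  func(incidentID, locale string) error
	warn         func(msg string)
}

// forceEnqueueSketch follows the four steps of the manual path in order.
func forceEnqueueSketch(d deps, incidentID, locale string) error {
	// 1. Best-effort stop: warn and continue on failure.
	if err := d.stopWorker(incidentID); err != nil {
		d.warn("stop worker: " + err.Error())
	}
	// 2. Hard-delete the prior row so the unique index accepts the next Create.
	if err := d.hardDelete(incidentID); err != nil {
		return err
	}
	// 3. Release the inflight guard.
	d.releaseGuard(incidentID)
	// 4. Re-enqueue with the operator's locale; gate errors surface here.
	return d.enqueueWith(incidentID, locale)
}
```

In this sketch a severity-floor rejection would surface as the error returned from step 4.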

Boot-time backfill

Fresh installs hit a chicken-and-egg problem: the structured RCA chain only wires when at least one LLM provider is configured, but incidents can fire before the operator adds a provider. BackfillUnstartedIncidents runs at boot, walks ListIncidentsWithoutReport(since, limit), and re-enqueues anything that fired in the window. The normal gates still apply, so resolved incidents and below-floor incidents are skipped.

See cmd/ongrid/main.go for the wiring — it's run at startup with a 24-hour window.
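The backfill loop reduces to "compute the cutoff, list, enqueue, tolerate gate rejections". A sketch under assumptions: `Incident`, `Lister`, `backfillSketch`, and `memLister` are hypothetical names, and `enqueue` returning a bool is a simplification of whatever the real enqueue path returns.

```go
package main

import "time"

// Incident is a hypothetical minimal view of an incident row.
type Incident struct {
	ID      string
	FiredAt time.Time
}

// Lister stands in for ListIncidentsWithoutReport(since, limit).
type Lister func(since time.Time, limit int) ([]Incident, error)

// backfillSketch re-enqueues report-less incidents that fired inside the window.
// enqueue returns false when a gate rejects the incident; that is not an error.
func backfillSketch(now time.Time, window time.Duration, limit int,
	list Lister, enqueue func(Incident) bool) (int, error) {
	incidents, err := list(now.Add(-window), limit)
	if err != nil {
		return 0, err
	}
	enqueued := 0
	for _, in := range incidents {
		if enqueue(in) {
			enqueued++
		}
	}
	return enqueued, nil
}

// memLister is an in-memory Lister for experimenting with the sketch.
func memLister(all []Incident) Lister {
	return func(since time.Time, limit int) ([]Incident, error) {
		var out []Incident
		for _, in := range all {
			if len(out) == limit {
				break
			}
			if !in.FiredAt.Before(since) {
				out = append(out, in)
			}
		}
		return out, nil
	}
}
```

A boot-time call matching the description above would pass `time.Now()` and `24*time.Hour`, with `limit` bounding how much work a restart can trigger.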

See also

  • Alerts — what fires.
  • Topology — what expand_topology exposes for the investigator's blast-radius walk.
  • Skills (correlate_incident, get_incident_detail, query_promql, search_logs): the tools the worker calls.
  • Models / Routing — how the per-investigation locale + provider is resolved.
  • Models / Budget — the global per-day token cap that bounds investigation cost.