Capabilities overview

Ongrid is organised as a four-layer stack. The split exists because each layer has a different update cadence, a different blast radius, and a different audit posture. The early prototypes collapsed the layers into one and failed by falling into the "AI agent that SSHes around" anti-pattern.

Why this matters for operators

Most of the cross-cutting features below (audit, role gating, hot provider swap, blast-radius walk) live at one specific layer. If you are chasing an "X doesn't work" question, the layer this page assigns X to is where its configuration lives.

The four layers

L1 — Cluster

The physical or virtual infrastructure: hosts, the manager process, the embedded MySQL / Prometheus / Loki / Tempo / Grafana / Qdrant stack, and the bidirectional geminio tunnel broker.

Ongrid does NOT abstract this layer — there is no inventory schema, no CMDB. Hosts are discovered when an edge agent dials home. The cluster layer is "everything the manager binary touches at runtime."

L2 — Edge tunnel + device-direct

Each host runs one ongrid-edge binary that establishes a single outbound geminio connection to the manager. The tunnel multiplexes:

Reverse RPCs — manager → edge calls invoking a skill on the host (Caller.Call(ctx, edgeID, method, body), internal/manager/biz/aiops/tools/registry.go:34).
WebSSH streams — interactive terminal traffic over a dedicated stream class, see WebShell.
Plugin signalling — a control channel that tells the edge which sub-plugins (promtail, otelcol, node-exporter) to spawn.

The "device-direct" idea is L2's defining bet: the manager addresses real hosts, not service abstractions. When the agent says "restart nginx on edge-prod-04," exactly one host runs the command.

L3 — Intelligence

The graph-kernel ReAct agent, the tool registry, the persona registry, the knowledge base, and the LLM provider router. This layer lives entirely manager-side and talks to L2 only through the tool bag.

Key files:

internal/manager/biz/aiops/tools/ — 30+ BaseTools, the LLM's hands.
internal/pkg/llm/ — MultiClient, RoutingChatModel, BudgetChecker.
internal/manager/biz/knowledge/ — Qdrant + vault + upload.

L4 — Alerting

Rule evaluation, incident lifecycle, auto-RCA fan-out, channel routing, inhibition. This layer is driven by the Prometheus, Loki, and Tempo signals that L1 collects, and it writes back through L3 (the investigator persona) when an incident fires.

Key files:

internal/manager/biz/alert/pipeline.go — the evaluator tick.
internal/manager/biz/alert/investigator/usecase.go — auto-RCA.
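
The hand-off from L4 to L3 described above can be sketched as an evaluator that calls an investigator hook only when a rule transitions into the firing state. This is a simplified illustration, not the contents of `pipeline.go`: the `Rule`, `Evaluator`, and `Investigator` types are hypothetical, and a plain threshold stands in for real rule predicates.

```go
package main

import "fmt"

// Rule is a hypothetical threshold rule used only for this sketch.
type Rule struct {
	Name      string
	Threshold float64
}

// Investigator is a hypothetical hook standing in for the L3 investigator persona.
type Investigator func(rule string, value float64)

// Evaluator remembers which rules were firing on the previous tick.
type Evaluator struct {
	rules  []Rule
	firing map[string]bool
	onFire Investigator
}

func NewEvaluator(rules []Rule, onFire Investigator) *Evaluator {
	return &Evaluator{rules: rules, firing: make(map[string]bool), onFire: onFire}
}

// Tick evaluates every rule once against the latest sample for that rule.
// The investigator runs only on the transition from not-firing to firing,
// so an incident that stays open does not trigger repeated investigations.
func (e *Evaluator) Tick(samples map[string]float64) {
	for _, r := range e.rules {
		v, ok := samples[r.Name]
		breached := ok && v > r.Threshold
		if breached && !e.firing[r.Name] {
			e.onFire(r.Name, v)
		}
		e.firing[r.Name] = breached
	}
}

func main() {
	ev := NewEvaluator(
		[]Rule{{Name: "cpu_high", Threshold: 0.9}},
		func(rule string, v float64) { fmt.Printf("investigate %s (value %.2f)\n", rule, v) },
	)
	ev.Tick(map[string]float64{"cpu_high": 0.95}) // fires, investigator runs
	ev.Tick(map[string]float64{"cpu_high": 0.97}) // still firing, no second run
	ev.Tick(map[string]float64{"cpu_high": 0.20}) // resolves
}
```

Keeping the firing state inside the evaluator is what lets the L3 call stay edge-triggered; a level-triggered hook would start a new investigation on every tick while the incident stays open.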

Capabilities matrix

Capability	Layer	Page
Alert rules (8 metric + 6 log/trace kinds)	L4	Alerts
Auto root-cause analysis on incident fire	L3 + L4	RCA
Prometheus + Grafana embed	L1	Monitoring
Loki log search + log alerts	L1 + L4	Logs
Tempo trace search + trace alerts	L1 + L4	Traces
Service / device graph with blast-radius walk	L3	Topology
RAG against vault + your own repos	L3	Knowledge
30+ host / observability / knowledge tools	L2 + L3	Skills
WebSSH with full session recording	L2	WebShell
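
For readers unfamiliar with the term, a blast-radius walk is, in general, a bounded traversal of a dependency graph starting from the affected node. The sketch below shows the generic breadth-first form of that idea; the `Graph` shape, the `BlastRadius` function, and the sample node names are assumptions for illustration and say nothing about how the Topology feature is actually implemented.

```go
package main

import "fmt"

// Graph maps a node to the nodes that depend on it (hypothetical shape).
type Graph map[string][]string

// BlastRadius returns every node reachable from start within maxHops,
// in breadth-first order, excluding start itself.
func BlastRadius(g Graph, start string, maxHops int) []string {
	seen := map[string]bool{start: true}
	frontier := []string{start}
	var out []string
	for hop := 0; hop < maxHops && len(frontier) > 0; hop++ {
		var next []string
		for _, n := range frontier {
			for _, dep := range g[n] {
				if !seen[dep] {
					seen[dep] = true // guards against cycles and duplicates
					out = append(out, dep)
					next = append(next, dep)
				}
			}
		}
		frontier = next
	}
	return out
}

func main() {
	g := Graph{
		"edge-prod-04": {"nginx"},
		"nginx":        {"checkout-api", "search-api"},
		"checkout-api": {"storefront"},
	}
	// Two hops from the host: nginx, then the two services behind it.
	fmt.Println(BlastRadius(g, "edge-prod-04", 2)) // prints [nginx checkout-api search-api]
}
```

The hop limit keeps the result readable on dense graphs: raising it from 2 to 3 in the example would also pull in `storefront`.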

What this page is not

This is the operator-facing overview. For the design rationale (why PromQL was kept as the canonical predicate, why edge dials out, why remote_write is preferred over scrape) see the ADR/HLD index in the GitHub repo's docs/ tree.
