Capabilities overview
Ongrid is organised as a four-layer stack. The split exists because each layer has a different update cadence, a different blast radius, and a different audit posture. The early prototypes collapsed the layers into one and ended up in the "AI agent that ssh's around" anti-pattern, which is why they failed.
Why this matters for operators
Most of the cross-cutting features below (audit, role gating, hot provider swap, blast-radius walk) live at one specific layer. If you have an "X doesn't work" question, the layer this page assigns it to is where its config lives.
The four layers
L1 — Cluster
The physical or virtual infrastructure: hosts, the manager process, the embedded MySQL / Prometheus / Loki / Tempo / Grafana / Qdrant stack, and the bidirectional geminio tunnel broker.
Ongrid does NOT abstract this layer — there is no inventory schema, no CMDB. Hosts are discovered when an edge agent dials home. The cluster layer is "everything the manager binary touches at runtime."
L2 — Edge tunnel + device-direct
Each host runs one ongrid-edge binary that establishes a single outbound geminio connection to the manager. The tunnel multiplexes:
- Reverse RPCs — manager → edge calls invoking a skill on the host (Caller.Call(ctx, edgeID, method, body), internal/manager/biz/aiops/tools/registry.go:34).
- WebSSH streams — interactive terminal traffic over a dedicated stream class, see WebShell.
- Plugin signalling — a control channel that tells the edge which sub-plugins (promtail, otelcol, node-exporter) to spawn.
The "device-direct" idea is L2's defining bet: the manager addresses real hosts, not service abstractions. When the agent says "restart nginx on edge-prod-04," exactly one host runs the command.
L3 — Intelligence
The graph-kernel ReAct agent, the tool registry, the persona registry, the knowledge base, and the LLM provider router. This layer lives entirely manager-side and talks to L2 only through the tool bag.
Key files:
- internal/manager/biz/aiops/tools/ — 30+ BaseTools, the LLM's hands.
- internal/pkg/llm/ — MultiClient, RoutingChatModel, BudgetChecker.
- internal/manager/biz/knowledge/ — Qdrant + vault + upload.
L4 — Alerting
Rule evaluation, incident lifecycle, auto-RCA fan-out, channel routing, inhibition. This layer is driven by the Prometheus, Loki, and Tempo signals that L1 collects, and it writes back through L3 (the investigator persona) when an incident fires.
Key files:
- internal/manager/biz/alert/pipeline.go — the evaluator tick.
- internal/manager/biz/alert/investigator/usecase.go — auto-RCA.
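As a rough illustration of how an evaluator tick can hand a newly fired incident to an investigator, consider the sketch below. It does not reproduce pipeline.go or usecase.go. Rule, Incident, Investigator, and Pipeline are hypothetical types, the rule is a simple threshold rather than a real PromQL predicate, and channel routing and inhibition are left out.

```go
package main

import "fmt"

// Rule is a hypothetical threshold rule: fire when Metric exceeds Threshold.
type Rule struct {
	Name      string
	Metric    string
	Threshold float64
}

// Incident is a hypothetical record opened when a rule starts firing.
type Incident struct {
	Rule  string
	Value float64
}

// Investigator stands in for the L3 investigator persona.
type Investigator interface {
	Investigate(inc Incident)
}

// Pipeline evaluates rules once per tick and remembers which are firing.
type Pipeline struct {
	Rules        []Rule
	Investigator Investigator
	firing       map[string]bool // rule name -> currently breached
}

// Tick evaluates every rule against one set of samples and returns the
// incidents opened on this tick. A rule that stays breached across ticks
// does not open a second incident until it has recovered first.
func (p *Pipeline) Tick(samples map[string]float64) []Incident {
	if p.firing == nil {
		p.firing = make(map[string]bool)
	}
	var opened []Incident
	for _, r := range p.Rules {
		v, ok := samples[r.Metric]
		breached := ok && v > r.Threshold
		if breached && !p.firing[r.Name] {
			inc := Incident{Rule: r.Name, Value: v}
			opened = append(opened, inc)
			if p.Investigator != nil {
				p.Investigator.Investigate(inc) // auto-RCA hand-off
			}
		}
		p.firing[r.Name] = breached
	}
	return opened
}

// recordingInvestigator only records which rules it was asked about.
type recordingInvestigator struct{ seen []string }

func (r *recordingInvestigator) Investigate(inc Incident) {
	r.seen = append(r.seen, inc.Rule)
}

func main() {
	inv := &recordingInvestigator{}
	p := &Pipeline{
		Rules:        []Rule{{Name: "high-cpu", Metric: "cpu_usage", Threshold: 0.9}},
		Investigator: inv,
	}
	fmt.Println(len(p.Tick(map[string]float64{"cpu_usage": 0.95}))) // breach starts
	fmt.Println(len(p.Tick(map[string]float64{"cpu_usage": 0.97}))) // still breached
	fmt.Println(inv.seen)
}
```

Tracking the firing state per rule is what keeps the investigator from being invoked on every tick while a breach persists.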
Capabilities matrix
| Capability | Layer | Page |
|---|---|---|
| Alert rules (8 metric + 6 log/trace kinds) | L4 | Alerts |
| Auto root-cause analysis on incident fire | L3 + L4 | RCA |
| Prometheus + Grafana embed | L1 | Monitoring |
| Loki log search + log alerts | L1 + L4 | Logs |
| Tempo trace search + trace alerts | L1 + L4 | Traces |
| Service / device graph with blast-radius walk | L3 | Topology |
| RAG against vault + your own repos | L3 | Knowledge |
| 30+ host / observability / knowledge tools | L2 + L3 | Skills |
| WebSSH with full session recording | L2 | WebShell |
What this page is not
This is the operator-facing overview. For the design rationale (why PromQL was kept as the canonical predicate, why edge dials out, why remote_write is preferred over scrape) see the ADR/HLD index in the GitHub repo's docs/ tree.
See also
- Architecture — the same 4-layer split expressed as a deployment diagram.
- Concepts — vocabulary (edge, device, incident, persona, scope).
- Alert rule schema — the wire format of the rule rows the Alerts page summarises.