
Capabilities overview

Ongrid is organised as a 4-layer stack. The split exists because each layer has a different update cadence, a different blast radius, and a different audit posture. Collapsing the layers produced the "AI agent that ssh's around" anti-pattern, which is where the early prototypes failed.

Why this matters for operators

Most of the cross-cutting features below (audit, role gating, hot provider swap, blast-radius walk) live at one specific layer. If you have an "X doesn't work" question, the layer this page assigns it to is where its config lives.

The four layers

L1 — Cluster

The physical or virtual infrastructure: hosts, the manager process, the embedded MySQL / Prometheus / Loki / Tempo / Grafana / Qdrant stack, and the bidirectional geminio tunnel broker.

Ongrid does NOT abstract this layer — there is no inventory schema, no CMDB. Hosts are discovered when an edge agent dials home. The cluster layer is "everything the manager binary touches at runtime."

L2 — Edge tunnel + device-direct

Each host runs one ongrid-edge binary that establishes a single outbound geminio connection to the manager. The tunnel multiplexes:

  • Reverse RPCs — manager → edge calls invoking a skill on the host (Caller.Call(ctx, edgeID, method, body), internal/manager/biz/aiops/tools/registry.go:34).
  • WebSSH streams — interactive terminal traffic over a dedicated stream class, see WebShell.
  • Plugin signalling — a control channel that tells the edge which sub-plugins (promtail, otelcol, node-exporter) to spawn.

The "device-direct" idea is L2's defining bet: the manager addresses real hosts, not service abstractions. When the agent says "restart nginx on edge-prod-04," exactly one host runs the command.

L3 — Intelligence

The graph-kernel ReAct agent, the tool registry, the persona registry, the knowledge base, and the LLM provider router. This layer lives entirely manager-side and talks to L2 only through the tool bag.

Key files:

L4 — Alerting

Rule evaluation, incident lifecycle, auto-RCA fan-out, channel routing, inhibition. This layer is driven by the Prometheus, Loki, and Tempo signals that L1 collects, and it writes back through L3 (the investigator persona) when an incident fires.

Key files:

Capabilities matrix

Capability                                     | Layer   | Page
Alert rules (8 metric + 6 log/trace kinds)     | L4      | Alerts
Auto root-cause analysis on incident fire      | L3 + L4 | RCA
Prometheus + Grafana embed                     | L1      | Monitoring
Loki log search + log alerts                   | L1 + L4 | Logs
Tempo trace search + trace alerts              | L1 + L4 | Traces
Service / device graph with blast-radius walk  | L3      | Topology
RAG against vault + your own repos             | L3      | Knowledge
30+ host / observability / knowledge tools     | L2 + L3 | Skills
WebSSH with full session recording             | L2      | WebShell
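
One way to picture the blast-radius walk listed in the matrix is as a breadth-first traversal over "depends on me" edges. The sketch below assumes a simple adjacency-map graph; the Graph type, the BlastRadius function, and the node names are hypothetical, and the actual Topology implementation may model and walk the graph differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph maps a node to the nodes that depend on it (hypothetical shape).
type Graph map[string][]string

// BlastRadius returns every node reachable from start by following
// dependent edges, excluding start itself, in sorted order.
func BlastRadius(g Graph, start string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var hit []string
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		for _, dep := range g[n] {
			if seen[dep] {
				continue // already counted; this also guards against cycles
			}
			seen[dep] = true
			hit = append(hit, dep)
			queue = append(queue, dep)
		}
	}
	sort.Strings(hit)
	return hit
}

func main() {
	g := Graph{
		"db-01": {"api", "reports"},
		"api":   {"web"},
		"web":   {"api"}, // a cycle, to show the walk still terminates
	}
	fmt.Println(BlastRadius(g, "db-01"))
}
```

The visited set is what keeps the walk finite on cyclic graphs, which real service topologies often are.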

What this page is not

This is the operator-facing overview. For the design rationale (why PromQL was kept as the canonical predicate, why edge dials out, why remote_write is preferred over scrape) see the ADR/HLD index in the GitHub repo's docs/ tree.

See also

  • Architecture — the same 4-layer split expressed as a deployment diagram.
  • Concepts — vocabulary (edge, device, incident, persona, scope).
  • Alert rule schema — the wire format of the rule rows the Alerts page summarises.