Architecture

Ongrid is a four-layer system. The edge runs on every monitored host; the manager runs in your own self-hosted cloud. They communicate over one outbound tunnel (the control plane) plus a separate auth-gated direct upload path for telemetry (the data plane).

The 4-layer model

┌──────────────────────────────────────────────────────────────────┐
│ L4  Alert / notification                                         │
│      built-in rules + custom kinds → channels (Slack / TG / IM)  │
├──────────────────────────────────────────────────────────────────┤
│ L3  Intelligence (graph-kernel ReAct agent)                      │
│      coordinator → specialist sub-agents → ~30 skills            │
├──────────────────────────────────────────────────────────────────┤
│ L2  Observability triad + edge direct path                       │
│      Prometheus · Loki · Tempo  +  push_host_metrics RPC         │
├──────────────────────────────────────────────────────────────────┤
│ L1  Cluster (signal collection)                                  │
│      ongrid-edge + plugins on every host                         │
└──────────────────────────────────────────────────────────────────┘
  • L1 — Cluster. Where signals come from. One ongrid-edge per host, plus its subprocess plugins (promtail, node_exporter, process_exporter, otelcol-contrib).
  • L2 — Observability triad. Prometheus / Loki / Tempo store the signals. They're shipped in the docker-compose; the same UI works against external managed equivalents (Grafana Cloud, Mimir, VictoriaMetrics) if you switch them out from Settings. A separate edge direct path carries push_host_metrics over the tunnel for low-cardinality closed-set host metrics (this is what powers the built-in CPU / mem / disk / load alerts even before Prom is configured).
  • L3 — Intelligence. The graph-kernel ReAct agent: coordinator decomposes the question, dispatches to specialist sub-agents, calls skills, synthesises an answer.
  • L4 — Alerts & notifications. Built-in rules + custom kinds evaluate against L2 / L1 streams, fire into channels (Slack, Telegram, Larksuite, DingTalk, WeCom, raw webhook).
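To make the "closed-set host metrics feed built-in alerts" idea concrete, here is a minimal sketch. The metric names, the `Rule` type, and the thresholds are hypothetical; the actual push_host_metrics payload and the built-in rule definitions are not specified on this page.

```python
from dataclasses import dataclass

# Hypothetical closed set of host metric names (assumption, not Ongrid's schema).
ALLOWED_METRICS = frozenset({"cpu_pct", "mem_pct", "disk_pct", "load1"})


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float


def validate(sample: dict[str, float]) -> dict[str, float]:
    """Reject any metric name outside the closed set."""
    unknown = set(sample) - ALLOWED_METRICS
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    return sample


def firing(rules: list[Rule], sample: dict[str, float]) -> list[str]:
    """Return the metrics whose value exceeds the rule threshold."""
    return [r.metric for r in rules
            if r.metric in sample and sample[r.metric] > r.threshold]


rules = [Rule("cpu_pct", 90.0), Rule("disk_pct", 85.0)]
sample = validate({"cpu_pct": 97.5, "mem_pct": 40.0, "disk_pct": 60.0})
print(firing(rules, sample))  # ['cpu_pct']
```

Because the set of names is fixed, cardinality stays bounded no matter how many hosts report, which is what lets these alerts work without a full metrics backend.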

The strategic bet is L1 + L2 edge direct path: a single tarball, one outbound tunnel, host metrics flowing without a Prom round-trip. That's what makes the 10-minute install actually 10 minutes.

Edge → frontier → manager

   host (yours)                  cloud (yours, self-hosted)
 ┌──────────────────┐
 │ ongrid-edge      │
 │ ├─ plugins/      │           ┌─────────────────────────────┐
 │ │  promtail      │           │ frontier (broker, port 40012)│
 │ │  node_exporter │           │  · multiplexed geminio       │
 │ │  process_exp.  │── one ───▶│  · auth: access/secret key   │
 │ │  otelcol       │  outbound │  · service-end → manager     │
 │ └─ runtime       │   TCP     └──────────────┬──────────────┘
 │    geminio client│   :40012                 │ service-end (40011)
 └──────────────────┘                          ▼
                                 ┌─────────────────────────────┐
                                 │ ongrid (manager)            │
                                 │  · http API (nginx 443)     │
                                 │  · bounded contexts         │
                                 │  · agent runtime            │
                                 └──────────────┬──────────────┘
                                                │
                                                ▼
                                       Prom / Loki / Tempo /
                                       MySQL / Qdrant /
                                       SearXNG / Grafana
  • One TCP connection per host. ongrid-edge dials frontier:40012 outbound. Nothing inbound. No port-forward, no jumpbox, no reverse-tunnel SaaS.
  • Geminio multiplex. Many logical RPC streams ride one TCP connection. Bidirectional: the manager can call into the edge (bash, host_probe_*, query_processes) and the edge can call into the manager (push_host_metrics, report_register).
  • Frontier is the broker. Upstream singchia/frontier, shipped in the release tarball. ADR-007 is the rationale (we don't reimplement what an external broker already does).
  • Manager is one Go binary. Ten or so bounded contexts (edges, alerts, incidents, agent, knowledge, channels, identity, audit…), all behind the nginx front-door.
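The multiplexing idea (many logical streams over one TCP connection) can be illustrated with a toy framing scheme. This is not geminio's wire format: the 6-byte header layout, `encode_frame`, and `decode_frames` are assumptions made up for the sketch.

```python
import struct

# Hypothetical frame header: stream id (uint32) + payload length (uint16), big-endian.
HEADER = struct.Struct(">IH")


def encode_frame(stream_id: int, payload: bytes) -> bytes:
    """Prefix a payload with its stream id and length."""
    return HEADER.pack(stream_id, len(payload)) + payload


def decode_frames(buf: bytes) -> dict[int, bytes]:
    """Split a byte stream into frames and reassemble each logical stream."""
    streams: dict[int, bytes] = {}
    offset = 0
    while offset < len(buf):
        stream_id, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        chunk = buf[offset:offset + length]
        streams[stream_id] = streams.get(stream_id, b"") + chunk
        offset += length
    return streams


# Two logical streams interleaved on one "connection".
wire = (encode_frame(1, b"push_host_")
        + encode_frame(2, b"query_processes")
        + encode_frame(1, b"metrics"))
print(decode_frames(wire))
# {1: b'push_host_metrics', 2: b'query_processes'}
```

Frames from different streams can interleave freely because each one carries its own stream id, so a slow RPC does not require a second socket.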

Data plane vs. control plane

Edges have two distinct egress paths to the cloud, by design:

 ┌──────── ongrid-edge ────────┐
 │                              │
 │  ┌──────── runtime ────────┐ │     ── control plane ──▶ frontier:40012
 │  │ geminio client (RPC)    │ │       (TLS-by-default if cert provided)
 │  └─────────────────────────┘ │       multiplex, bidirectional, low rate
 │                              │
 │  ┌──────── plugins ────────┐ │     ── data plane ──▶ nginx :443
 │  │ promtail   → Loki push  │ │       https POST per batch
 │  │ otelcol    → OTLP push  │ │       auth_request → manager edgeauth
 │  │ exporters  → /metrics   │ │       high rate, large payloads
 │  └─────────────────────────┘ │
 │                              │
 └──────────────────────────────┘

Why the split? See ADR-014. Logs and traces are high-volume, large-batch, and naturally HTTP. Multiplexing them onto the tunnel degrades throughput under load. Pushing directly through nginx auth_request keeps the security posture (only enrolled edges can push) while keeping the data plane fast.
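With nginx auth_request, the proxied push is allowed when the auth subrequest returns a 2xx status and denied when it returns 401 or 403. A minimal sketch of the decision such an edgeauth handler might make follows; the header names and the in-memory `ENROLLED` table are hypothetical, not Ongrid's actual implementation.

```python
import hmac

# Hypothetical enrollment table: access key -> secret key (assumption).
ENROLLED = {"edge-01": "s3cret"}


def edgeauth_status(headers: dict[str, str]) -> int:
    """Return the HTTP status an auth_request subrequest handler would send."""
    access = headers.get("X-Ongrid-Access-Key", "")  # hypothetical header name
    secret = headers.get("X-Ongrid-Secret-Key", "")  # hypothetical header name
    expected = ENROLLED.get(access)
    if expected is None:
        return 401  # unknown edge: nginx rejects the push
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(secret.encode(), expected.encode()):
        return 401
    return 204  # any 2xx lets nginx forward the request upstream


print(edgeauth_status({"X-Ongrid-Access-Key": "edge-01",
                       "X-Ongrid-Secret-Key": "s3cret"}))  # 204
print(edgeauth_status({"X-Ongrid-Access-Key": "edge-01",
                       "X-Ongrid-Secret-Key": "wrong"}))   # 401
```

The handler only answers yes or no; the large log or trace payload never passes through it, which is why the check stays cheap at high push rates.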

What still rides the tunnel?

  • Metrics — currently still push_host_metrics RPC over geminio. Direct remote_write from edge to Prometheus is on the roadmap once cluster sizes justify it. Until then the metric volume is tolerable.
  • All RPCs — query_processes, bash, query_logs_tail, host_probe_*, expand_topology, file reads, WebShell.

Container map (docker-compose)

What sudo ./install.sh brings up on the manager host:

| Container | Image | Host ports | Role |
|---|---|---|---|
| ongrid | ongrid:<version> | 9100 (metrics) | The manager. Go binary; HTTP API on :8080 proxied by nginx. |
| ongrid-nginx | ongrid-web:<version> | 443, 80 | TLS terminator + SPA + reverse proxy. Serves /api/*, /grafana/*, /install.sh, /edge/*. |
| ongrid-mysql | mysql:8.0 | 3306 | All operational state (edges, alerts, users, audit log, channel configs). |
| ongrid-frontier | singchia/frontier:1.2.5 | 40012 | Geminio broker. Edges dial 40012; manager dials 40011 over the compose net. |
| ongrid-prometheus | prom/prometheus:v2.54.0 | (none) | TSDB. Receives remote_write from manager, queries from the query_promql skill. |
| ongrid-loki | grafana/loki:3.4.0 | (none) | Logs backend. Push at /loki/api/v1/push via nginx auth-gate. |
| ongrid-tempo | grafana/tempo:2.5.0 | (none) | Traces backend. OTLP push at /v1/traces via nginx auth-gate. |
| ongrid-grafana | grafana/grafana-oss:11.1.4 | 3000 | Dashboards. Embedded as iframes under /grafana/ in the SPA. |
| ongrid-qdrant | qdrant/qdrant | (none) | Vector store for the knowledge base. |
| ongrid-searxng | searxng/searxng | (none) | Self-hosted meta-search for the web_search skill. |

All stateful services bind-mount to host paths under /var/lib/ongrid/, not docker named volumes — operators can back up and inspect files without docker gymnastics. Override the data root with ONGRID_DATA_DIR and the log root with ONGRID_LOG_DIR.

Manager — bounded contexts

The manager binary is a single Go process. Internally it's split into bounded contexts (each owns its own DB schema, its own HTTP routes, its own background workers). Roughly:

┌─────────┐   ┌─────────┐   ┌─────────────┐   ┌──────────────┐
│ identity│   │ edge    │   │ alert       │   │ incident     │
│ ┌─────┐ │   │ ┌─────┐ │   │ ┌─────────┐ │   │ ┌──────────┐ │
│ │users│ │   │ │edges│ │   │ │rules    │ │   │ │incidents │ │
│ │roles│ │   │ │tnls │ │   │ │events   │ │   │ │timeline  │ │
│ │JWT  │ │   │ │keys │ │   │ │evaluator│ │   │ │investig. │ │
│ └─────┘ │   │ └─────┘ │   │ └─────────┘ │   │ └──────────┘ │
└─────────┘   └─────────┘   └─────────────┘   └──────────────┘

┌─────────┐   ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌─────────┐
│ agent   │   │ knowledge│   │ channels│   │ audit    │   │ skills  │
│ kernel  │   │ vault+   │   │ slack/  │   │ events / │   │ registry│
│ + sub-  │   │ embed+   │   │ telegram│   │ chains   │   │ marketp.│
│ agents  │   │ search   │   │ /lark/dt│   │          │   │         │
└─────────┘   └──────────┘   └─────────┘   └──────────┘   └─────────┘

Inter-BC traffic is over Go function calls (biz layer) — there's no RPC inside the binary. The arch lint (make arch-lint) enforces the dependency direction (cmd → service → biz → data, no cycles).
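The dependency-direction rule can be sketched as a rank check over import edges. This is an illustration only, not what make arch-lint actually runs; it assumes that an import is legal only when it points to a strictly lower layer, so same-layer and upward imports are both reported.

```python
# Layer order taken from the text: cmd -> service -> biz -> data.
LAYERS = ["cmd", "service", "biz", "data"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def violations(imports: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs that do not point strictly downward."""
    return [(src, dst) for src, dst in imports if RANK[dst] <= RANK[src]]


edges = [
    ("cmd", "service"),   # ok
    ("service", "biz"),   # ok
    ("biz", "data"),      # ok
    ("data", "biz"),      # upward import: would close a cycle
]
print(violations(edges))  # [('data', 'biz')]
```

If every edge points strictly downward in a fixed order, a cycle cannot exist, so one rank comparison per import is enough to enforce "no cycles".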

Edge — plugin runtime

ongrid-edge is one Go binary that supervises a small fleet of subprocesses. Each subprocess is an off-the-shelf collector reskinned as a "plugin":

ongrid-edge (PID 1)
 ├─ geminio runtime  ← control plane to frontier
 ├─ plugin: logs
 │   └─ promtail subprocess
 │       config: /etc/ongrid-edge/promtail.yaml
 │       data plane: https://<manager>/loki/api/v1/push
 ├─ plugin: traces
 │   └─ otelcol-contrib subprocess
 │       config: /etc/ongrid-edge/otelcol.yaml
 │       data plane: https://<manager>/v1/traces
 ├─ plugin: hostmetrics
 │   └─ node_exporter subprocess
 │       /metrics scraped by Prometheus inside the manager
 └─ plugin: procmetrics
     └─ process_exporter subprocess

ADR-015 is the rationale: every collector is best-of-breed in its own ecosystem; reinventing them as Go libs is hopeless. So the edge owns the runtime contract (config delivery, healthcheck, log capture, upgrade) and shells out for the actual data work.
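The "own the runtime contract, shell out for the data work" split can be sketched as a tiny supervisor loop. This is a simplified illustration under stated assumptions: `run_plugin` and its bounded-restart policy are hypothetical, and real collectors are long-running processes rather than one-shot commands.

```python
import subprocess
import sys


def run_plugin(argv: list[str], max_restarts: int = 2) -> list[tuple[int, str]]:
    """Run a plugin subprocess, capturing its stdout and restarting on failure.

    Returns one (exit_code, stdout) pair per attempt, in order.
    """
    attempts = []
    for _ in range(max_restarts + 1):
        proc = subprocess.run(argv, capture_output=True, text=True)
        attempts.append((proc.returncode, proc.stdout.strip()))
        if proc.returncode == 0:
            break  # clean exit: stop restarting
    return attempts


# Stand-in "plugins": a Python one-liner plays the role of a collector binary.
ok = [sys.executable, "-c", "print('scrape ok')"]
bad = [sys.executable, "-c", "raise SystemExit(3)"]
print(run_plugin(ok))                   # [(0, 'scrape ok')]
print(run_plugin(bad, max_restarts=1))  # [(3, ''), (3, '')]
```

The supervisor never parses what the collector collects; it only starts the process, watches the exit code, and keeps the output.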

Upgrades — stage-then-swap

ADR-024 governs whole-bundle upgrades. The flow:

  1. Operator drops edge-bundle-<arch>-<ver>.tar.gz + .sha256 into /opt/ongrid/edge/ on the manager (the release tarball's install.sh and upgrade.sh both do this).
  2. Operator triggers "upgrade all edges" from the UI. Manager sends MethodFetchPackage over the tunnel.
  3. Edge downloads the bundle, verifies the sha256, stages files into /var/lib/ongrid-edge/.upgrade/incoming/.
  4. Edge writes a marker, exits clean. systemd restarts it.
  5. On restart, apply-pending-upgrade.sh runs as root (via ExecStartPre with +) — verifies every file's sha, backs up <dest> to <dest>.previous, atomically mvs the new file into place.
  6. If the new agent doesn't write a healthy_marker before the next restart, apply-pending-upgrade.sh rolls back each .previous automatically.

This is why an edge upgrade is just "restart the unit" — no fragile in-process rewire.
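The verify, back up, swap, and roll back steps above can be sketched for a single file as follows. This is a minimal illustration, not the contents of apply-pending-upgrade.sh: `swap_in` and `roll_back` are hypothetical helpers, and the healthy-marker check is left out.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def swap_in(staged: Path, dest: Path, expected_sha: str) -> None:
    """Verify the staged file, back up dest, then atomically replace it."""
    if sha256_of(staged) != expected_sha:
        raise ValueError(f"sha256 mismatch for {staged}")
    previous = dest.with_name(dest.name + ".previous")
    if dest.exists():
        # Copy rather than move, so dest is never missing during the swap.
        shutil.copy2(dest, previous)
    # Atomic on POSIX when both paths are on the same filesystem.
    os.replace(staged, dest)


def roll_back(dest: Path) -> bool:
    """Restore dest from its .previous copy; return False if none exists."""
    previous = dest.with_name(dest.name + ".previous")
    if not previous.exists():
        return False
    os.replace(previous, dest)
    return True


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "ongrid-edge"
    staged = Path(tmp) / "incoming"
    dest.write_text("v1")
    staged.write_text("v2")
    swap_in(staged, dest, hashlib.sha256(b"v2").hexdigest())
    print(dest.read_text())  # v2
    roll_back(dest)
    print(dest.read_text())  # v1
```

Staging first and renaming last means a crash at any point leaves either the old file or the new file in place, never a half-written one.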

Where things live on disk

On the manager:

| Path | What |
|---|---|
| /opt/ongrid/ | Compose file, configs, certs, .env, edge artifacts. |
| /opt/ongrid/.env | Secrets + tunables (mode 0600). |
| /opt/ongrid/certs/ | tls.crt, tls.key for nginx. Replace for prod. |
| /opt/ongrid/edge/ | Edge upgrade bundles + per-arch loose binaries. |
| /var/lib/ongrid/ | Bind-mount root for stateful containers. |
| /var/log/ongrid/ | Manager and nginx log files. |

On the edge:

| Path | What |
|---|---|
| /usr/local/bin/ongrid-edge | The agent binary. |
| /usr/local/lib/ongrid-edge/ | Plugin binaries + apply-pending-upgrade.sh. |
| /etc/ongrid-edge/ | ongrid-edge.env (access/secret keys), plugin configs. |
| /var/lib/ongrid-edge/.upgrade/ | Staged upgrade incoming + markers. |
| /var/log/ongrid-edge/ | Plugin stdout/stderr capture. Journal owns agent logs. |
| /etc/systemd/system/ongrid-edge.service | Systemd unit. |

Where to read more

  • Concepts — glossary of the nouns.
  • Server install — the docker-compose path.
  • Edge install — curl-pipe + verification.
  • Upgrade — apply-pending-upgrade.sh, bundle invariants, rollback.
  • Telemetry data plane (Reference) — exact endpoints, auth, and limits.