Architecture
Ongrid is a four-layer system. The edge runs on every monitored host; the manager is the cloud. They communicate over one outbound tunnel (the control plane) plus a separate auth-gated direct upload path for telemetry (the data plane).
The 4-layer model
┌──────────────────────────────────────────────────────────────────┐
│  L4  Alert / notification                                        │
│      built-in rules + custom kinds → channels (Slack / TG / IM)  │
├──────────────────────────────────────────────────────────────────┤
│  L3  Intelligence (graph-kernel ReAct agent)                     │
│      coordinator → specialist sub-agents → ~30 skills            │
├──────────────────────────────────────────────────────────────────┤
│  L2  Observability triad + edge direct path                      │
│      Prometheus · Loki · Tempo + push_host_metrics RPC           │
├──────────────────────────────────────────────────────────────────┤
│  L1  Cluster (signal collection)                                 │
│      ongrid-edge + plugins on every host                         │
└──────────────────────────────────────────────────────────────────┘
- L1: Cluster. Where signals come from. One `ongrid-edge` per host, plus its subprocess plugins (`promtail`, `node_exporter`, `process_exporter`, `otelcol-contrib`).
- L2: Observability triad. Prometheus / Loki / Tempo store the signals. They're shipped in the docker-compose; the same UI works against external managed equivalents (Grafana Cloud, Mimir, VictoriaMetrics) if you switch them out from Settings. A separate edge direct path carries `push_host_metrics` over the tunnel for low-cardinality closed-set host metrics (this is what powers the built-in CPU / mem / disk / load alerts even before Prom is configured).
- L3: Intelligence. The graph-kernel ReAct agent: coordinator decomposes the question, dispatches to specialist sub-agents, calls skills, synthesises an answer.
- L4: Alerts & notifications. Built-in rules + custom kinds evaluate against L2 / L1 streams, fire into channels (Slack, Telegram, Larksuite, DingTalk, WeCom, raw webhook).
The strategic bet is L1 + L2 edge direct path: a single tarball, one outbound tunnel, host metrics flowing without a Prom round-trip. That's what makes the 10-minute install actually 10 minutes.
Edge → frontier → manager
  host (yours)                      cloud (yours, self-hosted)
┌──────────────────┐
│ ongrid-edge      │
│ ├─ plugins/      │              ┌───────────────────────────────┐
│ │  promtail      │              │ frontier (broker, port 40012) │
│ │  node_exporter │              │  · multiplexed geminio        │
│ │  process_exp.  │──── one ────▶│  · auth: access/secret key    │
│ │  otelcol       │   outbound   │  · service-end → manager      │
│ └─ runtime       │   TCP        └───────────────┬───────────────┘
│    geminio client│   :40012                     │ service-end (40011)
└──────────────────┘                              ▼
                                  ┌───────────────────────────────┐
                                  │ ongrid (manager)              │
                                  │  · http API (nginx 443)       │
                                  │  · bounded contexts           │
                                  │  · agent runtime              │
                                  └───────────────┬───────────────┘
                                                  │
                                                  ▼
                                        Prom / Loki / Tempo /
                                        MySQL / Qdrant /
                                        SearXNG / Grafana
- One TCP connection per host. `ongrid-edge` dials `frontier:40012` outbound. Nothing inbound. No port-forward, no jumpbox, no reverse-tunnel SaaS.
- Geminio multiplex. Many logical RPC streams ride one TCP connection. Bidirectional: the manager can call into the edge (`bash`, `host_probe_*`, `query_processes`) and the edge can call into the manager (`push_host_metrics`, `report_register`).
- Frontier is the broker. Upstream singchia/frontier, shipped in the release tarball. ADR-007 is the rationale (we don't reimplement what an external broker already does).
- Manager is one Go binary. Ten or so bounded contexts (edges, alerts, incidents, agent, knowledge, channels, identity, audit…), all behind the nginx front-door.
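The "one outbound connection, nothing inbound" rule can be sketched as a dial loop with capped exponential backoff. This is a minimal sketch using only Go's standard `net` package; the real edge goes through the geminio client, whose API is not shown here. The names `backoff` and `dialFrontier`, the 30-second cap, and the timeouts are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// backoff returns the wait before retry number attempt (0-based):
// 1s, 2s, 4s, 8s, 16s, then capped at 30s. The cap is an illustrative choice.
func backoff(attempt int) time.Duration {
	const maxWait = 30 * time.Second
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= 5 { // 1s<<5 = 32s would already exceed the cap
		return maxWait
	}
	return time.Second << uint(attempt)
}

// dialFrontier keeps dialing addr outbound until it connects or ctx ends.
// The edge never listens; every connection is initiated from the host.
func dialFrontier(ctx context.Context, addr string) (net.Conn, error) {
	d := net.Dialer{Timeout: 5 * time.Second}
	for attempt := 0; ; attempt++ {
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("dial %s: %w", addr, ctx.Err())
		case <-time.After(backoff(attempt)):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	// "frontier:40012" is the address from the diagram; it only resolves
	// inside a deployment that defines that hostname.
	conn, err := dialFrontier(ctx, "frontier:40012")
	if err != nil {
		fmt.Println("not connected:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```

Because the edge only ever dials out, the host firewall needs no inbound rule; the manager reaches the edge by issuing RPCs back over the connection the edge opened.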
Data plane vs. control plane
Edges have two distinct egress paths to the cloud, by design:
┌──────── ongrid-edge ────────┐
│                             │
│ ┌──────── runtime ────────┐ │  ── control plane ──▶ frontier:40012
│ │ geminio client (RPC)    │ │     (TLS-by-default if cert provided)
│ └─────────────────────────┘ │     multiplex, bidirectional, low rate
│                             │
│ ┌──────── plugins ────────┐ │  ── data plane ──▶ nginx :443
│ │ promtail  → Loki push   │ │     https POST per batch
│ │ otelcol   → OTLP push   │ │     auth_request → manager edgeauth
│ │ exporters → /metrics    │ │     high rate, large payloads
│ └─────────────────────────┘ │
│                             │
└─────────────────────────────┘
Why the split? ADR-014. Logs + traces are high-volume, large-batch, naturally HTTP. Multiplexing them onto the tunnel kills throughput under load. Direct push over nginx `auth_request` keeps the security posture (only enrolled edges can push) while keeping the data plane fast.
What still rides the tunnel?
- Metrics: currently still the `push_host_metrics` RPC over geminio. Direct `remote_write` from edge to Prometheus is on the roadmap once cluster sizes justify it. Until then the metric volume is tolerable.
- All RPCs: `query_processes`, `bash`, `query_logs_tail`, `host_probe_*`, `expand_topology`, file reads, WebShell.
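The data-plane gate relies on how nginx `auth_request` works: nginx sends a subrequest to the auth endpoint, forwards the original push if the answer is 2xx, and rejects it on 401 or 403. A minimal sketch of what the manager's `edgeauth` side could look like follows. The source does not say how edges present their credentials, so the header names `X-Edge-Access-Key` and `X-Edge-Secret-Key`, the `edgeKeys` type, and the in-memory key map are all hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// edgeKeys maps access key -> secret key for enrolled edges.
// A real manager would look these up in its database (assumption).
type edgeKeys map[string]string

// authStatus returns the status the auth subrequest should answer with:
// 204 lets nginx forward the push, 401 makes nginx reject it.
func (k edgeKeys) authStatus(access, secret string) int {
	want, ok := k[access]
	if !ok || subtle.ConstantTimeCompare([]byte(want), []byte(secret)) != 1 {
		return http.StatusUnauthorized
	}
	return http.StatusNoContent
}

// ServeHTTP reads the (hypothetical) credential headers and replies with a
// bare status code; only the status of the subrequest matters to nginx.
func (k edgeKeys) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(k.authStatus(
		r.Header.Get("X-Edge-Access-Key"),
		r.Header.Get("X-Edge-Secret-Key"),
	))
}

func main() {
	keys := edgeKeys{"edge-01": "s3cret"}

	req := httptest.NewRequest(http.MethodGet, "/edgeauth", nil)
	req.Header.Set("X-Edge-Access-Key", "edge-01")
	req.Header.Set("X-Edge-Secret-Key", "s3cret")
	rec := httptest.NewRecorder()
	keys.ServeHTTP(rec, req)

	fmt.Println("enrolled edge:", rec.Code)                          // 204
	fmt.Println("unknown edge:", keys.authStatus("edge-99", "nope")) // 401
}
```

The large log and trace payloads never pass through the manager process in this arrangement; the manager only answers the small auth subrequest, and nginx streams the body to Loki or Tempo.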
Container map (docker-compose)
What `sudo ./install.sh` brings up on the manager host:
| Container | Image | Host ports | Role |
|---|---|---|---|
| `ongrid` | `ongrid:<version>` | 9100 (metrics) | The manager. Go binary; HTTP API on :8080 proxied by nginx. |
| `ongrid-nginx` | `ongrid-web:<version>` | 443, 80 | TLS terminator + SPA + reverse proxy. Serves `/api/*`, `/grafana/*`, `/install.sh`, `/edge/*`. |
| `ongrid-mysql` | `mysql:8.0` | 3306 | All operational state (edges, alerts, users, audit log, channel configs). |
| `ongrid-frontier` | `singchia/frontier:1.2.5` | 40012 | Geminio broker. Edges dial 40012; manager dials 40011 over the compose net. |
| `ongrid-prometheus` | `prom/prometheus:v2.54.0` | (none) | TSDB. Receives `remote_write` from manager, queries from the `query_promql` skill. |
| `ongrid-loki` | `grafana/loki:3.4.0` | (none) | Logs backend. Push at `/loki/api/v1/push` via nginx auth-gate. |
| `ongrid-tempo` | `grafana/tempo:2.5.0` | (none) | Traces backend. OTLP push at `/v1/traces` via nginx auth-gate. |
| `ongrid-grafana` | `grafana/grafana-oss:11.1.4` | 3000 | Dashboards. Embedded as iframes under `/grafana/` in the SPA. |
| `ongrid-qdrant` | `qdrant/qdrant` | (none) | Vector store for the knowledge base. |
| `ongrid-searxng` | `searxng/searxng` | (none) | Self-hosted meta-search for the `web_search` skill. |
All stateful services bind-mount to host paths under /var/lib/ongrid/, not docker named volumes — operators can back up and inspect files without docker gymnastics. Override the data root with ONGRID_DATA_DIR and the log root with ONGRID_LOG_DIR.
Manager — bounded contexts
The manager binary is a single Go process. Internally it's split into bounded contexts (each owns its own DB schema, its own HTTP routes, its own background workers). Roughly:
┌─────────┐ ┌─────────┐ ┌─────────────┐ ┌──────────────┐
│ identity│ │  edge   │ │    alert    │ │   incident   │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────────┐ │ │ ┌──────────┐ │
│ │users│ │ │ │edges│ │ │ │rules    │ │ │ │incidents │ │
│ │roles│ │ │ │tnls │ │ │ │events   │ │ │ │timeline  │ │
│ │JWT  │ │ │ │keys │ │ │ │evaluator│ │ │ │investig. │ │
│ └─────┘ │ │ └─────┘ │ │ └─────────┘ │ │ └──────────┘ │
└─────────┘ └─────────┘ └─────────────┘ └──────────────┘
┌─────────┐ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐
│ agent   │ │ knowledge│ │ channels│ │ audit    │ │ skills  │
│ kernel  │ │ vault+   │ │ slack/  │ │ events / │ │ registry│
│ + sub-  │ │ embed+   │ │ telegram│ │ chains   │ │ marketp.│
│ agents  │ │ search   │ │ /lark/dt│ │          │ │         │
└─────────┘ └──────────┘ └─────────┘ └──────────┘ └─────────┘
Inter-BC traffic is over Go function calls (biz layer); there's no RPC inside the binary. The arch lint (`make arch-lint`) enforces the dependency direction (cmd → service → biz → data, no cycles).
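The dependency-direction rule can be illustrated with a small layer check. This is only a sketch of the idea, not the project's `make arch-lint` implementation, which is not described here. The names `layerRank`, `importAllowed`, and `violations` are hypothetical, and the sketch assumes that same-layer and layer-skipping imports are permitted as long as nothing points back up the stack.

```go
package main

import "fmt"

// layerRank orders the layers named in the text. A package may import
// packages in its own layer or in a layer further down the stack.
var layerRank = map[string]int{"cmd": 0, "service": 1, "biz": 2, "data": 3}

// importAllowed reports whether a package in layer from may import a
// package in layer to. Unknown layers are rejected.
func importAllowed(from, to string) bool {
	f, okF := layerRank[from]
	t, okT := layerRank[to]
	return okF && okT && f <= t
}

// violations returns the edges, formatted as "from -> to", that point
// back up the stack or mention an unknown layer.
func violations(edges [][2]string) []string {
	var bad []string
	for _, e := range edges {
		if !importAllowed(e[0], e[1]) {
			bad = append(bad, e[0]+" -> "+e[1])
		}
	}
	return bad
}

func main() {
	edges := [][2]string{
		{"cmd", "service"},
		{"service", "biz"},
		{"biz", "biz"}, // one bounded context calling another
		{"biz", "data"},
		{"data", "biz"}, // upward import: rejected
	}
	fmt.Println(violations(edges)) // [data -> biz]
}
```

A layer-level check like this rules out cycles between layers but not between two packages in the same layer; a real linter would also need a cycle check on the package graph.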
Edge — plugin runtime
ongrid-edge is one Go binary that supervises a small fleet of subprocesses. Each subprocess is an off-the-shelf collector reskinned as a "plugin":
ongrid-edge (PID 1)
├─ geminio runtime              ← control plane to frontier
├─ plugin: logs
│   └─ promtail subprocess
│        config: /etc/ongrid-edge/promtail.yaml
│        data plane: https://<manager>/loki/api/v1/push
├─ plugin: traces
│   └─ otelcol-contrib subprocess
│        config: /etc/ongrid-edge/otelcol.yaml
│        data plane: https://<manager>/v1/traces
├─ plugin: hostmetrics
│   └─ node_exporter subprocess
│        /metrics scraped by Prometheus inside the manager
└─ plugin: procmetrics
    └─ process_exporter subprocess
ADR-015 is the rationale: every collector is best-of-breed in its own ecosystem; reinventing them as Go libs is hopeless. So the edge owns the runtime contract (config delivery, healthcheck, log capture, upgrade) and shells out for the actual data work.
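The "own the runtime contract, shell out for the data work" split can be sketched with `os/exec`. This minimal sketch covers only two parts of the contract, log capture and restart-on-exit; config delivery, healthchecks, and upgrades are left out. The `plugin` type, `runOnce`, `supervise`, and `demo` are hypothetical names, and the demo deliberately points at a missing binary so the restart loop can be seen without any collector installed.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// plugin describes one supervised collector subprocess.
type plugin struct {
	Name string   // logical plugin name, e.g. "logs"
	Bin  string   // path to the collector executable
	Args []string // arguments, typically pointing at the plugin config
}

// runOnce starts the plugin and blocks until it exits or ctx is cancelled.
// stdout and stderr are both captured into logFile.
func runOnce(ctx context.Context, p plugin, logFile *os.File) error {
	cmd := exec.CommandContext(ctx, p.Bin, p.Args...)
	cmd.Stdout = logFile
	cmd.Stderr = logFile
	return cmd.Run()
}

// supervise restarts the plugin after every exit until ctx is cancelled or
// maxRuns is reached, and returns the number of runs it started.
func supervise(ctx context.Context, p plugin, logFile *os.File, maxRuns int, pause time.Duration) int {
	runs := 0
	for runs < maxRuns && ctx.Err() == nil {
		runs++
		err := runOnce(ctx, p, logFile)
		fmt.Fprintf(logFile, "[%s] run %d ended: %v\n", p.Name, runs, err)
		select {
		case <-ctx.Done():
		case <-time.After(pause):
		}
	}
	return runs
}

// demo supervises a binary that does not exist, so every run fails at start
// and the loop retries until it reaches maxRuns.
func demo() (int, error) {
	logFile, err := os.CreateTemp("", "plugin-*.log")
	if err != nil {
		return 0, err
	}
	defer os.Remove(logFile.Name())
	defer logFile.Close()

	p := plugin{
		Name: "logs",
		Bin:  "/nonexistent/promtail",
		Args: []string{"-config.file=/etc/ongrid-edge/promtail.yaml"},
	}
	return supervise(context.Background(), p, logFile, 3, 10*time.Millisecond), nil
}

func main() {
	runs, err := demo()
	fmt.Println("runs:", runs, "err:", err)
}
```

A production supervisor would normally restart without a run limit and back off between attempts; the `maxRuns` bound exists here only so the example terminates.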
Upgrades — stage-then-swap
ADR-024 governs whole-bundle upgrades. The flow:
- Operator drops `edge-bundle-<arch>-<ver>.tar.gz` + `.sha256` into `/opt/ongrid/edge/` on the manager (the release tarball's `install.sh` and `upgrade.sh` both do this).
- Operator triggers "upgrade all edges" from the UI. Manager sends `MethodFetchPackage` over the tunnel.
- Edge downloads the bundle, verifies the sha256, stages files into `/var/lib/ongrid-edge/.upgrade/incoming/`.
- Edge writes a marker, exits clean. systemd restarts it.
- On restart, `apply-pending-upgrade.sh` runs as root (via `ExecStartPre` with `+`), verifies every file's sha, backs up `<dest>` to `<dest>.previous`, and atomically `mv`s the new file into place.
- If the new agent doesn't write a `healthy_marker` before the next restart, `apply-pending-upgrade.sh` rolls back each `.previous` automatically.
This is why an edge upgrade is just "restart the unit" — no fragile in-process rewire.
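The verify, back up, swap, and roll back steps can be sketched on a single file. The real work is done by the `apply-pending-upgrade.sh` shell script, so this Go version is only an illustration of the same sequence under stated assumptions: `sha256Hex`, `swap`, `rollback`, and `demo` are hypothetical names, the backup mode is hardcoded, and the healthy-marker check is reduced to a comment.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// sha256Hex returns the hex-encoded SHA-256 of the file at path.
func sha256Hex(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// swap verifies staged against wantHex, copies dest to dest+".previous",
// then renames staged onto dest. Rename is atomic on POSIX systems when
// both paths are on the same filesystem.
func swap(staged, dest, wantHex string) error {
	got, err := sha256Hex(staged)
	if err != nil {
		return err
	}
	if got != wantHex {
		return fmt.Errorf("sha256 mismatch for %s", staged)
	}
	if old, err := os.ReadFile(dest); err == nil {
		// Sketch only: a real script would preserve the original mode.
		if err := os.WriteFile(dest+".previous", old, 0o755); err != nil {
			return err
		}
	} else if !os.IsNotExist(err) {
		return err
	}
	return os.Rename(staged, dest)
}

// rollback restores dest from its ".previous" backup.
func rollback(dest string) error {
	return os.Rename(dest+".previous", dest)
}

// demo walks one file through swap and rollback in a temp dir and returns
// the contents of dest after the swap and after the rollback.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "upgrade-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	dest := filepath.Join(dir, "ongrid-edge")
	staged := filepath.Join(dir, "ongrid-edge.incoming")
	if err := os.WriteFile(dest, []byte("v1"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(staged, []byte("v2"), 0o755); err != nil {
		return "", "", err
	}
	want, err := sha256Hex(staged)
	if err != nil {
		return "", "", err
	}
	if err := swap(staged, dest, want); err != nil {
		return "", "", err
	}
	afterSwap, err := os.ReadFile(dest)
	if err != nil {
		return "", "", err
	}
	// Pretend the new agent never wrote its healthy marker.
	if err := rollback(dest); err != nil {
		return "", "", err
	}
	afterRollback, err := os.ReadFile(dest)
	if err != nil {
		return "", "", err
	}
	return string(afterSwap), string(afterRollback), nil
}

func main() {
	a, b, err := demo()
	fmt.Println(a, b, err) // v2 v1 <nil>
}
```

The backup is taken as a copy, not a rename, so `dest` exists at every moment; the only instant at which its content changes is the final rename.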
Where things live on disk
On the manager:
| Path | What |
|---|---|
| `/opt/ongrid/` | Compose file, configs, certs, `.env`, edge artifacts. |
| `/opt/ongrid/.env` | Secrets + tunables (mode 0600). |
| `/opt/ongrid/certs/` | `tls.crt`, `tls.key` for nginx. Replace for prod. |
| `/opt/ongrid/edge/` | Edge upgrade bundles + per-arch loose binaries. |
| `/var/lib/ongrid/` | Bind-mount root for stateful containers. |
| `/var/log/ongrid/` | Manager and nginx log files. |
On the edge:
| Path | What |
|---|---|
| `/usr/local/bin/ongrid-edge` | The agent binary. |
| `/usr/local/lib/ongrid-edge/` | Plugin binaries + `apply-pending-upgrade.sh`. |
| `/etc/ongrid-edge/` | `ongrid-edge.env` (access/secret keys), plugin configs. |
| `/var/lib/ongrid-edge/.upgrade/` | Staged upgrade incoming + markers. |
| `/var/log/ongrid-edge/` | Plugin stdout/stderr capture. Journal owns agent logs. |
| `/etc/systemd/system/ongrid-edge.service` | Systemd unit. |
Where to read more
- Concepts: glossary of the nouns.
- Server install: the docker-compose path.
- Edge install: curl-pipe + verification.
- Upgrade: `apply-pending-upgrade.sh`, bundle invariants, rollback.
- Telemetry data plane (Reference): exact endpoints, auth, and limits.