Telemetry data plane
Ongrid edges talk to the manager over two physical paths:
- Control plane: the geminio tunnel (ADR-001 / ADR-007). Used for edge lifecycle, RPC, heartbeats, alert events, and metric push (push_prom_samples).
- Telemetry data plane: direct outbound HTTPS POST from the edge to the manager's public ingest endpoints. Used for logs (Loki API) and traces (OTLP).
This page explains the split, the authentication model, and the migration trigger for metrics.
This page is the concise reference. The full decision record is ADR-014 in the source tree.
The split
| Plane | Channel | Carries |
|---|---|---|
| Control plane | geminio tunnel (existing ADR-001) | edge lifecycle, RPCs, heartbeats, alert events, metric push (for now) |
| Telemetry data plane | edge → manager public HTTPS, independent outbound connections | logs (ADR-012), traces (ADR-013) |
Both planes are outbound from the edge — the agent dials the manager. No inbound ports on the edge.
Why split?
The geminio tunnel was designed as a control-RPC bus. It multiplexes low-latency RPCs over one persistent connection. Metric push was a "free ride" added in ADR-009 because metric volume is small (a few KB/s per edge) and the tunnel had spare capacity.
Logs and traces are not small.
| Signal | Per-edge steady state | Per-edge peak |
|---|---|---|
| metric | a few KB/s | ~10 KB/s |
| log | tens of KB/s | 1–10 MB/s |
| trace | depends on sample rate | comparable to log |
100 edges × 1 MB/s = 100 MB/s sustained ingress at the manager. The tunnel was not designed for that. Forcing logs and traces over it hits two problems:
- Manager CPU melts. Tunnel frame encoding/decoding and forwarding to downstream stores both happen in the manager's Go process. Letting nginx proxy straight to the downstream store is several times cheaper.
- Head-of-line (HOL) blocking. High-throughput byte streams contend with control RPCs on the mux. Operators see jitter on the order of seconds when asking "what's wrong with edge 42?".
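The fleet-level figure above is a straight multiplication, so a small sketch makes it easy to plug in other fleet sizes. The function name is hypothetical, and decimal units (1 MB = 1000 KB) are assumed.

```python
def fleet_ingress_mb_per_s(edges: int, per_edge_kb_per_s: float) -> float:
    """Aggregate manager ingress in MB/s, assuming 1 MB = 1000 KB."""
    return edges * per_edge_kb_per_s / 1000.0


# 100 edges, each at the low end of the log peak (1 MB/s = 1000 KB/s).
print(fleet_ingress_mb_per_s(100, 1000))  # 100.0
# The same fleet pushing only metrics at the ~10 KB/s peak.
print(fleet_ingress_mb_per_s(100, 10))    # 1.0
```

The two results differ by a factor of 100, which is the gap between what the tunnel was sized for and what logs can produce at peak.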
Tunnel is control plane. Data plane is data plane. Mixing them was an expedient under NAT constraints; the split makes the boundary explicit.
Architecture
┌──────────────────────────────┐
┌──────────────┐ │ manager │
│ ongrid-edge │ │ │
│ │ │ nginx :443 ──► /api/* ───► ongrid (manager)
│ ┌─────────┐ │ geminio tunnel │ │
│ │ agent │ ├────► :40012 ──────►│ frontier (geminio broker) ──┤
│ └─────────┘ │ (control plane) │ │
│ │ │ nginx :443 ──► /loki/* ───► loki ← data plane
│ ┌─────────┐ │ │ ──► /v1/traces ─► tempo ← data plane
│ │promtail │ ├────► :443 ────────►│ ──► /api/v1/write─► prom (today)
│ └─────────┘ │ (data plane) │ │
│ │ │ edgeauth verifies the token │
│ ┌─────────┐ │ │ via /internal/auth/ │
│ │otelcol │ ├────► :443 ────────►│ dataplane-verify before │
│ └─────────┘ │ (data plane) │ proxying. │
└──────────────┘ └──────────────────────────────┘
The agent uses one TLS connection per plugin (promtail, otelcol-contrib). All of them are outbound from the edge to ONGRID_PUBLIC_URL.
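To make the data-plane request concrete, here is a rough sketch of what a single log push might look like on the wire. In practice promtail does the sending; this only illustrates the shape. The helper name is hypothetical, and it assumes Loki's standard push path (/loki/api/v1/push) is exposed unchanged under the manager's /loki/* route and that the Bearer token described under Authentication is already in hand.

```python
import json
import urllib.request


def build_loki_push(public_url: str, token: str, labels: dict, lines: list) -> urllib.request.Request:
    """Build (but do not send) a Loki push request for the data plane.

    `lines` is a list of (unix_nanoseconds, text) pairs.
    """
    body = {
        "streams": [{
            "stream": labels,
            # Loki expects each value as [timestamp-as-string, log line].
            "values": [[str(ts_ns), text] for ts_ns, text in lines],
        }]
    }
    return urllib.request.Request(
        url=public_url.rstrip("/") + "/loki/api/v1/push",  # assumed route
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_loki_push(
    "https://manager.example.com",  # stand-in for ONGRID_PUBLIC_URL
    "example-token",
    {"edge": "42"},
    [(1700000000000000000, "hello")],
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the POST; the sketch stops short of that so it can be read without a running manager.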
Authentication
One trust root, two paths.
The edge's access-key/secret-key pair authenticates the tunnel via geminio's session credentials. The same credential pair is exchanged for a Bearer token used on every data-plane HTTPS POST. nginx's auth_request directive calls back into the manager at /internal/auth/dataplane-verify to validate the token; on 200, the request is proxied to Loki or Tempo. On 401/403, the edge sees an HTTP error and backs off.
This means:
- Rotating the edge's secret_key invalidates both planes simultaneously.
- There is no second secrets store, no per-plugin credential, no separate ACL.
- The manager owns auth — Loki and Tempo never see edge identity directly.
The Bearer token is short-lived. The agent refreshes it transparently over the tunnel; an edge whose tunnel is down for an extended period will see data-plane POSTs start to 401 once the token expires, which forces a tunnel reconnect.
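The properties above (short-lived token, one trust root, rotation invalidating both planes) can be illustrated with a toy verifier. This is a sketch under stated assumptions, not the manager's actual token format, which this page does not describe: the token layout, the HMAC construction, the 300-second TTL, and both function names are hypothetical. The return value mimics the status code that nginx's auth_request subrequest would receive.

```python
import hashlib
import hmac


def issue_token(edge_id: str, secret_key: bytes, now: float, ttl_s: int = 300) -> str:
    """Hypothetical token: '<edge_id>.<expiry>.<hmac>' signed with the edge's secret key."""
    expires = int(now) + ttl_s
    payload = f"{edge_id}.{expires}"
    sig = hmac.new(secret_key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_token(token: str, secrets: dict[str, bytes], now: float) -> int:
    """Return the HTTP status an auth_request subrequest would see (200 or 401)."""
    try:
        edge_id, expires_s, sig = token.rsplit(".", 2)
        expires = int(expires_s)
    except ValueError:
        return 401  # malformed token
    key = secrets.get(edge_id)
    if key is None:
        return 401  # unknown edge
    expected = hmac.new(key, f"{edge_id}.{expires}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return 401  # signature mismatch, e.g. secret_key was rotated
    if now >= expires:
        return 401  # expired; the edge must refresh over the tunnel
    return 200


token = issue_token("edge-42", b"old-secret", now=1000.0)
print(verify_token(token, {"edge-42": b"old-secret"}, now=1100.0))  # 200
print(verify_token(token, {"edge-42": b"new-secret"}, now=1100.0))  # 401
```

Because the verifier looks up the edge's current secret on every call, rotating it rejects outstanding tokens immediately, which mirrors the "both planes at once" behaviour described above.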
NAT compatibility
Outbound HTTPS to the manager's public port is the same network class as the outbound tunnel — both are originated from the edge to a single destination port range (443 for data, 40012 for tunnel). Both pass through ordinary corporate egress firewalls. No special carve-outs, no inbound rules.
For NAT-only edges that can only open one outbound connection, geminio has a raw-stream fallback (zstd-compressed log payloads, bounded buffer, drop-old on overflow). This is an escape hatch, not the default — we have not had to ship it to a customer yet.
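The "bounded buffer, drop-old on overflow" policy can be sketched as follows. This is an illustration of the policy only: the class name and byte-budget accounting are assumptions, and compression is left out because the fallback's actual framing is not described here.

```python
from collections import deque


class DropOldBuffer:
    """Byte-bounded FIFO that evicts the oldest payloads when over budget."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._items = deque()
        self._size = 0
        self.dropped = 0  # count of payloads discarded so far

    def push(self, payload: bytes) -> None:
        if len(payload) > self.max_bytes:
            # A payload larger than the whole budget can never fit.
            self.dropped += 1
            return
        self._items.append(payload)
        self._size += len(payload)
        # Evict from the front (oldest first) until back under budget.
        while self._size > self.max_bytes:
            old = self._items.popleft()
            self._size -= len(old)
            self.dropped += 1

    def pop(self):
        """Return the oldest surviving payload, or None if empty."""
        if not self._items:
            return None
        item = self._items.popleft()
        self._size -= len(item)
        return item


buf = DropOldBuffer(max_bytes=10)
for chunk in (b"aaaa", b"bbbb", b"cccc"):
    buf.push(chunk)
print(buf.pop(), buf.dropped)  # b'bbbb' 1
```

Dropping the oldest data favours recent logs, which is usually what an operator wants when diagnosing a live incident over a constrained link.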
What about metrics?
push_prom_samples is still on the tunnel today.
| Why we kept it on the tunnel |
|---|
| Metric volume per edge is far below saturation. |
| The existing path works; migration cost outweighs current benefit. |
| push_prom_samples is exercised in every release; touching it is risky. |
We will move metrics to the data plane (Prometheus remote_write directly to https://<manager>/api/v1/write) when any one of the following holds:
- Single manager's tunnel CPU sustained > 60%.
- Single edge's metric push rate sustained > 100 KB/s (almost always means runaway cardinality — fix that first; if it persists after, migrate).
- Control RPC P95 latency degrades > 500ms under metric-stream pressure.
The Prometheus remote_write client is already in node_exporter and the edge's metric plugin can be re-pointed via env. The trigger above just sets the priority.
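The three triggers combine with a logical OR, which can be written as a single predicate. The sketch below is hypothetical: the type and field names are invented for illustration, and the inputs are assumed to be already averaged over whatever window counts as "sustained".

```python
from dataclasses import dataclass


@dataclass
class TunnelHealth:
    manager_tunnel_cpu_pct: float    # sustained tunnel CPU on one manager, in percent
    max_edge_metric_kb_per_s: float  # highest sustained per-edge metric push rate
    control_rpc_p95_ms: float        # control RPC P95 under metric-stream pressure


def migration_triggers(h: TunnelHealth) -> list[str]:
    """Return the names of the triggers that fired; any non-empty result means migrate."""
    fired = []
    if h.manager_tunnel_cpu_pct > 60:
        fired.append("tunnel_cpu")
    if h.max_edge_metric_kb_per_s > 100:
        # Usually runaway cardinality: fix that first, migrate only if it persists.
        fired.append("edge_metric_rate")
    if h.control_rpc_p95_ms > 500:
        fired.append("control_rpc_p95")
    return fired


print(migration_triggers(TunnelHealth(35.0, 8.0, 120.0)))   # []
print(migration_triggers(TunnelHealth(72.0, 8.0, 650.0)))   # ['tunnel_cpu', 'control_rpc_p95']
```

Returning the list of fired triggers rather than a bare boolean keeps the reason visible, which matters for the metric-rate case where the first response is a cardinality fix rather than a migration.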
Edge implementation
On disk, the agent ships every telemetry sender through internal/edgeagent/dataplane/. That package centralises:
- Token reuse from the tunnel session credential.
- Dual-destination routing (tunnel for control, HTTPS for data).
- Retry + exponential backoff with jitter, capped at 1 minute.
- Local bounded queue (in-memory; the agent does not spool to disk).
Every plugin (promtail, otelcol-contrib, future ones) consumes this package — there is no per-plugin retry / auth logic. That keeps ongrid-edge a single binary plus four subprocesses; no per-plugin sidecar, no per-plugin systemd unit.
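As an illustration of the retry policy listed above, here is a minimal sketch of exponential backoff with jitter, capped at 60 seconds. It is not the package's actual code: the 1-second base, the "full jitter" variant (a uniform draw between 0 and the current ceiling), and the function name are all assumptions.

```python
import random


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0, rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based), in seconds.

    The ceiling doubles each attempt up to `cap_s`; the result is a random
    fraction of that ceiling so that many edges do not retry in lockstep.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap_s, base_s * (2 ** min(attempt, 32)))
    return ceiling * rng()


# With the jitter pinned to its maximum, the ceilings are visible directly.
print([backoff_delay(n, rng=lambda: 1.0) for n in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The `rng` parameter exists only so the behaviour can be tested deterministically; callers would normally leave it at the default.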
What you tune
| Variable | Where | Effect |
|---|---|---|
| ONGRID_PUBLIC_URL | manager | The URL handed to edges as the data-plane root. Required for data plane to work. |
| ONGRID_LOG_QUERY_URL | manager | Read path: manager → Loki for the Logs page. Independent of edge push. |
| ONGRID_TRACE_QUERY_URL | manager | Read path: manager → Tempo for the Traces page. |
| ONGRID_EDGE_PLUGIN_BIN_DIR | edge | Where the plugin binaries live (promtail, otelcol-contrib). |
| ONGRID_EDGE_PLUGIN_WORK_DIR | edge | Per-plugin runtime dirs (configs, PID, queue spool). |
ONGRID_PUBLIC_URL is the single most important production setting on the manager side. Empty disables the data plane entirely — edges connect over the tunnel, the agent boots, but logs and traces never ship.
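The "empty disables the data plane" behaviour can be pictured as a small derivation step. This is a hypothetical sketch, not the manager's code: the function name is invented, the traces path follows the /v1/traces route in the diagram, and the logs path assumes Loki's standard push endpoint sits under /loki/*.

```python
from typing import Optional


def dataplane_endpoints(public_url: str) -> Optional[dict]:
    """Derive edge push URLs from the data-plane root, or None when unset."""
    root = public_url.strip().rstrip("/")
    if not root:
        # Empty ONGRID_PUBLIC_URL: the tunnel still works, nothing is shipped.
        return None
    return {
        "logs": root + "/loki/api/v1/push",  # assumed Loki push path
        "traces": root + "/v1/traces",
    }


print(dataplane_endpoints(""))
# None
print(dataplane_endpoints("https://manager.example.com/"))
# {'logs': 'https://manager.example.com/loki/api/v1/push', 'traces': 'https://manager.example.com/v1/traces'}
```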
See also
- Architecture — full system diagram.
- Capabilities → Logs — operator tour of the logs plane.
- Capabilities → Traces — operator tour of the traces plane.
- Environment variables — the knobs that wire all of this.