Monitoring
Ongrid ships a working monitoring pipeline out of the box and is designed to plug into your existing one if you already run one.
The data plane

    edge:
      hostmetrics, procmetrics, node-exporter ─┐
                                               │ remote_write (HTTPS direct)
                                               ▼
                                          ┌─────────┐
                                          │  Prom   │  (default: bundled)
                                          └────┬────┘
                                               │ query_range / instant
                                               ▼
                                   ┌──────────────────────┐
                                   │ manager:             │
                                   │  - alert evaluator   │
                                   │  - query_promql      │
                                   │  - /api/grafana      │
                                   └──────────────────────┘

The default is the bundled Prometheus, running in the docker-compose stack at prometheus:9090. Edges push directly to it over HTTPS: no scrape, no node discovery. This is ADR-014's data-plane / control-plane split: the geminio tunnel carries control, telemetry goes direct.
Why direct, not via the tunnel
At ingest volumes of ≥ 5k series/sec on a 50-host install, multiplexing remote_write through the geminio control tunnel choked. Direct remote_write removes the manager from the hot path and lets Prom's own write-ahead log handle ingestion buffering. The trade-off is one more HTTPS endpoint to expose; see Server install for the nginx reverse-proxy snippet.
Built-in vs external Prom
Both modes are first-class.
Built-in (default)
docker compose up brings up prometheus:v2.55 with --web.enable-remote-write-receiver. Storage is a named volume, ongrid_prom_data. Retention defaults to 15d; tune it via PROMETHEUS_RETENTION in .env.
The manager talks to it at http://prometheus:9090 via the ONGRID_PROM_QUERY_URL environment variable. No further configuration is needed.
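
As a rough illustration of what a query against ONGRID_PROM_QUERY_URL looks like on the wire, the sketch below issues an instant query through the standard Prometheus HTTP API (/api/v1/query) and decodes the vector result. It is not the manager's actual client: promBaseURL, instantQuery and demo are hypothetical names, and the fake server in demo stands in for a real Prometheus.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"strconv"
	"time"
)

// promBaseURL reads ONGRID_PROM_QUERY_URL and falls back to the bundled default.
func promBaseURL() string {
	if v := os.Getenv("ONGRID_PROM_QUERY_URL"); v != "" {
		return v
	}
	return "http://prometheus:9090"
}

// sample is one entry of an instant-vector result.
type sample struct {
	Metric map[string]string `json:"metric"`
	Value  [2]interface{}    `json:"value"` // [unix seconds, "value as string"]
}

type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string   `json:"resultType"`
		Result     []sample `json:"result"`
	} `json:"data"`
}

// instantQuery calls GET {base}/api/v1/query. It assumes the expression
// yields an instant vector; other result types are not handled here.
func instantQuery(ctx context.Context, base, expr string, at time.Time) ([]sample, error) {
	q := url.Values{}
	q.Set("query", expr)
	q.Set("time", strconv.FormatInt(at.Unix(), 10))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, base+"/api/v1/query?"+q.Encode(), nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out queryResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if out.Status != "success" {
		return nil, fmt.Errorf("prometheus returned status %q", out.Status)
	}
	return out.Data.Result, nil
}

// demo runs instantQuery against a canned response instead of a live Prom.
func demo() (int, string, error) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"edge-prod-04"},"value":[1700000000,"1"]}]}}`)
	}))
	defer fake.Close()
	got, err := instantQuery(context.Background(), fake.URL, "up == 1", time.Now())
	if err != nil {
		return 0, "", err
	}
	return len(got), got[0].Metric["instance"], nil
}

func main() {
	fmt.Println("configured endpoint:", promBaseURL())
	n, instance, err := demo()
	fmt.Println(n, instance, err)
}
```

Because the request uses only the standard query API, the same call shape should work against any endpoint that implements it, which is what makes swapping the base URL sufficient.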
External
Point ONGRID_PROM_QUERY_URL at your own Prom / VictoriaMetrics / Thanos query endpoint. Edges still push via remote_write; point them at your ingest URL by setting the per-edge remote_write_url in the Edge plugin config (internal/manager/biz/edge/plugin_config.go).
The cloud-bundled Prom can stay running as a "self-observability" Prom that only scrapes the manager itself (ADR-026 self-obs metrics live there). Alerts that need both halves can be defined twice with different match_scope_types.
The query path
Two consumers, one client.
Alert evaluator
PipelineEvaluator.evaluatePromQuery runs every enabled metric_raw rule's Expr on a tick. PromQL's own comparison operators (up == 0, cpu_pct > 90) are the predicate: Prom drops non-matching series from the response, so the evaluator fires one incident per returned vector entry and reaps incidents whose series no longer appear on the next tick's recovery sweep.
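
The fire-then-reap behaviour can be modelled in a few lines. The sketch below is a toy, not the evaluator itself: labelKey is a hypothetical stand-in for labelSetKey (whose real implementation is not shown here), tick is an invented helper, and the in-memory map replaces whatever incident store the manager actually uses.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelKey builds a stable string from a label set by sorting label names,
// so the same series always maps to the same dedupe key.
func labelKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+labels[n])
	}
	return strings.Join(parts, ",")
}

// tick takes the series Prom returned for one rule and the set of keys that
// are currently firing. Returned series fire; firing keys that were not
// returned this tick are treated as recovered and removed.
func tick(ruleKey string, returned []map[string]string, firing map[string]bool) (fired, recovered []string) {
	seen := make(map[string]bool, len(returned))
	for _, labels := range returned {
		key := fmt.Sprintf("pipeline:%s:%s", ruleKey, labelKey(labels))
		seen[key] = true
		if !firing[key] {
			firing[key] = true
			fired = append(fired, key)
		}
	}
	for key := range firing {
		if !seen[key] {
			delete(firing, key)
			recovered = append(recovered, key)
		}
	}
	sort.Strings(fired)
	sort.Strings(recovered)
	return fired, recovered
}

func main() {
	firing := map[string]bool{}
	// First tick: two hosts match the (hypothetical) host_down rule.
	fired, _ := tick("host_down", []map[string]string{{"instance": "a"}, {"instance": "b"}}, firing)
	fmt.Println("fired:", fired)
	// Second tick: only host a still matches, so host b recovers.
	_, recovered := tick("host_down", []map[string]string{{"instance": "a"}}, firing)
	fmt.Println("recovered:", recovered)
}
```

Sorting label names before joining matters: Go map iteration order is random, so an unsorted key would make the same series look like a new incident on every tick.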

    // pipeline.go:269
    res, err := e.prom.Query(ctx, rule.Expr, now)
    // ...
    for _, ent := range entries {
        dedupeKey := fmt.Sprintf("pipeline:%s:%s", rule.RuleKey, labelSetKey(ent.Metric))
        // ... RecordFiring + notify
    }

The query_promql tool
The LLM gets a BaseTool called query_promql that takes an instant or range query and returns the JSON vector / matrix. The investigator persona uses it as its primary metric probe; the coordinator chat uses it whenever you ask "what's the cpu on edge-prod-04 right now?"
Schema lives in query_promql_basetool.go; the underlying engine is internal/pkg/promquery.
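
To make the instant-versus-range distinction concrete, here is a minimal sketch of how tool arguments could map onto the two standard Prometheus HTTP API endpoints (/api/v1/query and /api/v1/query_range). The queryArgs struct and buildRequest function are assumptions for illustration; the real schema is the one in query_promql_basetool.go and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// queryArgs is a hypothetical argument shape. A zero Start means "instant".
type queryArgs struct {
	Expr  string
	Start time.Time
	End   time.Time
	Step  time.Duration
}

// buildRequest returns the API path and query string for the given arguments.
func buildRequest(a queryArgs, now time.Time) (string, error) {
	if a.Expr == "" {
		return "", errors.New("expr is required")
	}
	q := url.Values{}
	q.Set("query", a.Expr)
	if a.Start.IsZero() {
		// Instant query: evaluate the expression at a single timestamp.
		q.Set("time", strconv.FormatInt(now.Unix(), 10))
		return "/api/v1/query?" + q.Encode(), nil
	}
	if !a.End.After(a.Start) || a.Step <= 0 {
		return "", errors.New("range query needs end > start and a positive step")
	}
	// Range query: evaluate at every step between start and end.
	q.Set("start", strconv.FormatInt(a.Start.Unix(), 10))
	q.Set("end", strconv.FormatInt(a.End.Unix(), 10))
	q.Set("step", strconv.FormatFloat(a.Step.Seconds(), 'f', -1, 64))
	return "/api/v1/query_range?" + q.Encode(), nil
}

// demo builds one instant and one five-minute range request at a fixed time.
func demo() (string, string, error) {
	now := time.Unix(1700000000, 0)
	instant, err := buildRequest(queryArgs{Expr: "up"}, now)
	if err != nil {
		return "", "", err
	}
	ranged, err := buildRequest(queryArgs{
		Expr:  "up",
		Start: now.Add(-5 * time.Minute),
		End:   now,
		Step:  time.Minute,
	}, now)
	return instant, ranged, err
}

func main() {
	instant, ranged, err := demo()
	fmt.Println(instant)
	fmt.Println(ranged)
	fmt.Println(err)
}
```

An instant query returns a vector (one value per series) and a range query returns a matrix (a list of values per series), which matches the JSON vector / matrix the tool hands back to the LLM.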
Embedded Grafana
The compose ships Grafana at grafana:3000. The manager proxies it under /api/grafana/* (auth checked at the proxy) so the SPA's MonitorPanel.tsx can embed iframes with one click — no separate Grafana login.
Out of the box, the manager mirrors internal monitor panel definitions into Grafana via biz/grafana/Service on a sync tick. Operators edit panels in Grafana (the rich editor) and the manager picks up the changes. The MonitorEditor page in the SPA is a thin read-then-redirect that opens the Grafana panel editor pre-filled with the relevant series.
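
The /api/grafana/* proxy with an auth check in front can be sketched with the standard library's reverse proxy. This is an illustration under assumptions, not the manager's code: the X-Session header is a hypothetical placeholder for whatever session mechanism the manager really uses, and both servers in demo are in-process fakes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// requireSession rejects requests that lack the (hypothetical) X-Session header.
func requireSession(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Session") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// grafanaProxy checks auth first, strips the /api/grafana prefix, then
// forwards the remaining path to the upstream Grafana.
func grafanaProxy(upstream *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return requireSession(http.StripPrefix("/api/grafana", proxy))
}

// demo returns the status for an anonymous request, the status for an
// authenticated one, and the path the fake Grafana saw.
func demo() (int, int, string, error) {
	grafana := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.URL.Path) // echo the path so the rewrite is visible
	}))
	defer grafana.Close()
	target, err := url.Parse(grafana.URL)
	if err != nil {
		return 0, 0, "", err
	}
	manager := httptest.NewServer(grafanaProxy(target))
	defer manager.Close()

	anon, err := http.Get(manager.URL + "/api/grafana/d/abc")
	if err != nil {
		return 0, 0, "", err
	}
	anon.Body.Close()

	req, err := http.NewRequest(http.MethodGet, manager.URL+"/api/grafana/d/abc", nil)
	if err != nil {
		return 0, 0, "", err
	}
	req.Header.Set("X-Session", "demo")
	authed, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, 0, "", err
	}
	defer authed.Body.Close()
	body, err := io.ReadAll(authed.Body)
	return anon.StatusCode, authed.StatusCode, string(body), err
}

func main() {
	fmt.Println(demo())
}
```

Putting the check in the proxy rather than in Grafana is what lets the SPA embed panels without a second login: the browser only ever talks to the manager's origin.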
Grafana credentials
The bundled Grafana ships with the default credentials admin / admin. Change the password before exposing it. See First-boot checklist.
Self-observability
ADR-026 wired up /metrics on the manager and six baseline alerts (LLM token spike, alert evaluator stall, tunnel disconnect storm, audit-log lag, investigator backlog, RCA latency p99). These are seeded on first boot into the rules and dashboard tables and re-asserted on every upgrade.
The bundled Prom scrapes the manager's /metrics. External Prom mode needs you to add a scrape target at manager:9100 (default ONGRID_METRICS_ADDR).
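
For readers wiring up an external scrape target, it may help to see what a scrape of /metrics returns. The sketch below serves one counter in the Prometheus text exposition format using only the standard library. Everything in it is illustrative: the metric name demo_evaluator_ticks_total is made up, the ":9100" fallback is an assumed form of the ONGRID_METRICS_ADDR default, and the manager's real endpoint exposes its own ADR-026 metrics rather than this one.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
	"sync/atomic"
)

// ticks is a hypothetical counter of completed evaluator ticks.
var ticks int64

// metricsAddr reads ONGRID_METRICS_ADDR; the ":9100" fallback is an assumption.
func metricsAddr() string {
	if v := os.Getenv("ONGRID_METRICS_ADDR"); v != "" {
		return v
	}
	return ":9100"
}

// metricsHandler writes a single counter in the text exposition format:
// a HELP line, a TYPE line, then "name value".
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP demo_evaluator_ticks_total Evaluator ticks completed (hypothetical metric).")
	fmt.Fprintln(w, "# TYPE demo_evaluator_ticks_total counter")
	fmt.Fprintf(w, "demo_evaluator_ticks_total %d\n", atomic.LoadInt64(&ticks))
}

// hasLine reports whether body contains line as a complete line.
func hasLine(body, line string) bool {
	for _, l := range strings.Split(body, "\n") {
		if l == line {
			return true
		}
	}
	return false
}

// demo sets the counter, scrapes an in-process server, and returns the body.
func demo() (string, error) {
	atomic.StoreInt64(&ticks, 3)
	srv := httptest.NewServer(http.HandlerFunc(metricsHandler))
	defer srv.Close()
	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	fmt.Println("a real server would listen on", metricsAddr())
	body, err := demo()
	fmt.Print(body)
	fmt.Println(err)
}
```

A counter that stops increasing is the kind of signal a stall alert can key on, since a rate over it drops to zero while the process itself still reports as up.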
See also
- Alerts: the rule kinds that query Prom.
- Traces: the spanmetrics generator that feeds traces_spanmetrics_* series back into the same Prom.
- Logs: the parallel pipeline for Loki.
- Environment variables: every ONGRID_PROM_* and PROMETHEUS_* knob.