Skip to content

Monitoring

Ongrid ships a working monitoring pipeline out of the box and is designed to bend toward your existing one when you have one.

The data plane

text
edge:
  hostmetrics, procmetrics, node-exporter  ─┐
                                            │  remote_write (HTTPS direct)

                                       ┌─────────┐
                                       │  Prom   │  (default: bundled)
                                       └────┬────┘
                                            │ query_range / instant

                                  ┌──────────────────────┐
                                  │  manager:            │
                                  │   - alert evaluator  │
                                  │   - query_promql     │
                                  │   - /api/grafana     │
                                  └──────────────────────┘

The default is bundled Prometheus running in the docker-compose at prometheus:9090. Edges push directly to it over HTTPS — no scrape, no node discovery. This is ADR-014's data-plane / control-plane split: the geminio tunnel carries control, telemetry goes direct.

Why direct, not via the tunnel

At cardinality ≥ 5 k series/sec on a 50-host install, multiplexing remote_write through the geminio control tunnel choked. Direct remote_write removes the manager from the hot path and lets Prom's own write-ahead-log handle ingestion buffering. The trade-off is one more HTTPS endpoint to expose; see Server install for the nginx reverse-proxy snippet.

Built-in vs external Prom

Both modes are first-class.

Built-in (default)

docker compose up brings prometheus:v2.55 with --web.enable-remote-write-receiver. Storage is a named volume ongrid_prom_data. Retention defaults to 15d — tune via PROMETHEUS_RETENTION in .env.

The manager talks to it at http://prometheus:9090 via the ONGRID_PROM_QUERY_URL env. No further configuration needed.

External

Point ONGRID_PROM_QUERY_URL at your own Prom / VictoriaMetrics / Thanos query endpoint. Edges still remote_write — point them at your ingest URL by setting the per-edge remote_write_url in the Edge plugin config (internal/manager/biz/edge/plugin_config.go).

The cloud-bundled Prom can stay running as a "self-observability" Prom that only scrapes the manager itself (ADR-026 self-obs metrics live there). Alerts that need both halves can be defined twice with different match_scope_types.

The query path

Two consumers, one client.

Alert evaluator

PipelineEvaluator.evaluatePromQuery runs every enabled metric_raw rule's Expr on a tick. PromQL's own comparison operators (up == 0, cpu_pct > 90) ARE the predicate — Prom drops non-matching series from the response, so the evaluator just fires one incident per returned vector entry and reaps the rest on the next tick's recovery sweep.

go
// pipeline.go:269
res, err := e.prom.Query(ctx, rule.Expr, now)
// ...
for _, ent := range entries {
    dedupeKey := fmt.Sprintf("pipeline:%s:%s", rule.RuleKey, labelSetKey(ent.Metric))
    // ... RecordFiring + notify
}

The query_promql tool

The LLM gets a BaseTool called query_promql that takes an instant or range query and returns the JSON vector / matrix. The investigator persona uses it as its primary metric probe; the coordinator chat uses it whenever you ask "what's the cpu on edge-prod-04 right now?"

Schema lives in query_promql_basetool.go; the underlying engine is internal/pkg/promquery.

Embedded Grafana

The compose ships Grafana at grafana:3000. The manager proxies it under /api/grafana/* (auth checked at the proxy) so the SPA's MonitorPanel.tsx can embed iframes with one click — no separate Grafana login.

Out of box, the manager mirrors internal monitor panel definitions into Grafana via biz/grafana/Service on a sync tick. Operators edit panels in Grafana (the rich editor), the manager picks them up. The MonitorEditor page in the SPA is a thin read-then-redirect that pops the Grafana panel editor pre-filled with the relevant series.

Grafana credentials

The bundled Grafana ships with admin / admin. Change before exposing. See First-boot checklist.

Self-observability

ADR-026 wired /metrics on the manager + 6 baseline alerts (LLM token spike, alert evaluator stall, tunnel disconnect storm, audit-log lag, investigator backlog, RCA latency p99). These are seeded on first boot into the rules + dashboard tables and re-asserted on every upgrade.

The bundled Prom scrapes the manager's /metrics. External Prom mode needs you to add a scrape target at manager:9100 (default ONGRID_METRICS_ADDR).

See also

  • Alerts — the rule kinds that query Prom.
  • Traces — the spanmetrics generator that feeds traces_spanmetrics_* series back into the same Prom.
  • Logs — the parallel pipeline for Loki.
  • Environment variables — every ONGRID_PROM_* and PROMETHEUS_* knob.