Logs

Logs are the Loki half of the L1 stack. They follow the same data-plane shape as metrics: edges push directly to Loki over HTTPS, and the manager only queries.

The data plane

text
edge:
  ongrid-edge agent
    └─ plugin: promtail (subprocess)
        ├─ tails /var/log/syslog, /var/log/messages, journald
        └─ remote write HTTPS → loki:3100

                                     ▼ /loki/api/v1/query_range
                              ┌───────────────┐
                              │  manager:     │
                              │   - alert     │
                              │   - LLM tools │
                              └───────────────┘

ADR-015 plugin runtime

promtail runs as a subprocess under the edge agent's plugin runtime. The agent supervises its lifecycle: it restarts promtail on crash, drains it on shutdown, and routes its output to the same systemd journal as the agent's own logs. From the operator's point of view it is one systemd service, not two.

The earlier alternative ("vendor promtail as a statically linked Go library") was rejected because every Loki client upgrade would force an edge agent re-release. Running promtail as a subprocess decouples the two release cycles.

Configuration

Per-edge promtail config is rendered server-side by biz/edge/plugin_config.go and pushed down the control tunnel on plugin start. It consults system_settings.loki.url so an admin URL change propagates without rebuilding the agent.

Default tail set:

| Source | Loki labels |
| --- | --- |
| /var/log/syslog, /var/log/messages | job=node-syslog, host=<edge-name> |
| systemd journald (all units) | job=systemd-journal, host=<edge-name>, unit=<svc> |

Custom tails are added via the SPA's /edges/<id>/logs/sources page — they write into the same system_settings.loki.* namespace.

The data plane split

ADR-014: telemetry goes direct, control goes through the tunnel. For logs this means:

  • ✅ promtail HTTPS POST → loki:3100/loki/api/v1/push (direct).
  • ✅ Loki HTTPS GET ← manager /loki/api/v1/query_range (direct).
  • ❌ Logs do NOT travel through the geminio control tunnel.

Why: the tunnel is an in-process multiplexer; piping log bursts through it starved the control RPCs (RCA tool calls, WebSSH I/O) when ingest spiked. Splitting the data plane out fixes the noisy-neighbour problem and lets nginx own the public surface (TLS, rate limiting, auth) for both halves uniformly.

Alert kinds

Two log-driven rule kinds — both Phase-B, both in evaluators_phaseB.go.

log_match

Fires when count_over_time(<stream_selector> |~ <line_filter> [window]) <op> threshold returns at least one matrix entry.

json
{
  "kind": "log_match",
  "scope_type": "global",
  "conditions_json": {
    "stream_selector": "{job=\"systemd-journal\",unit=\"nginx.service\"}",
    "line_filter": "(?i)5\\d{2}",
    "window": "5m",
    "operator": ">=",
    "threshold": 50
  }
}

line_filter is optional — when empty, the rule counts every line in the stream. The query is built per-tick by compileLogMatchRule in rules.go:733.

log_volume

Same engine as log_match in v1 (current-window count vs absolute threshold). The spec's original "ratio vs previous window" semantics is parked — it needs two LogQL queries + Go-side division; the absolute form already covers the common "logs spiked past N" use case.

The spec columns are ratio_op / ratio_threshold (kept for forward compatibility); the compiler maps these to operator / threshold.

Tools

search_logs / query_logql

The LLM-facing log search. Two registration paths point to one executor:

  • query_logql — raw LogQL passthrough. Used by the investigator worker when the persona knows exactly what stream to query.
  • search_logs — friendlier wrapper exposed in the chat UI's Quick Actions, takes a free-text query + a service name.

Schema + executor in query_logql_basetool.go. Underlying client: internal/pkg/logquery.

Both honour the global ONGRID_LOG_QUERY_URL env (default http://loki:3100) and inherit auth from the same nginx layer the edges use to push.

host_tail_file

Skills-side log probe that runs on the edge, not on the manager. Use this when you need the raw file (e.g. /var/log/ongrid-edge.log, which is not in journald) instead of the Loki-ingested view. Its scope is ScopeHost, so it requires an edge_id. See Skills for the dispatch model.

See also