Alerts

Ongrid's alerting subsystem is a single tick loop that walks every enabled rule row, asks the appropriate backend (Prom for metrics + trace spanmetrics, Loki for logs) whether the predicate matches, and records firings into the incidents table.

There is no separate Alertmanager and no separate rules file. Rules live in MySQL, the evaluator polls them on a 30s cache refresh, and notifications fan out through the channel registry.

The 14 rule kinds

Rules are stored with a kind column. The compiler dispatches on it.

The compiler is in rules.go and the evaluators in evaluators_phaseA.go + evaluators_phaseB.go.

The 8+6 split is HLD-004's Phase-A (metrics) / Phase-B (logs + traces), landed 2026-05-08.

Metric kinds (Phase A)

Kind	What it does	Spec fields
`metric_raw`	PromQL expression IS the predicate. Fires per returned vector entry.	`expr`
`metric_anomaly`	Z-score or MAD over a rolling baseline window.	`metric`, `method`, `baseline_window`, `baseline_step`, `deviation`, `for_seconds`
`metric_forecast`	`predict_linear(metric[fit_window], predict_seconds) <op> threshold`.	`metric`, `fit_window`, `predict_seconds`, `operator`, `threshold`
`metric_burn_rate`	Google SRE multi-window multi-burn over an SLO. ALL windows must trigger.	`sli`, `slo`, `burns[].window`, `burns[].multiplier`
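
The "ALL windows must trigger" condition for `metric_burn_rate` can be sketched as below. This is a minimal illustration, not the Ongrid evaluator: the `Burn` type, the `allWindowsBurning` helper, the choice to pass the SLO as a fraction (0.999), and the strict `>` comparison are all assumptions made for the example.

```go
package main

import "fmt"

// Burn is one window of a multi-window burn-rate rule (hypothetical shape,
// mirroring the burns[].window and burns[].multiplier spec fields).
type Burn struct {
	Window     string  // e.g. "1h"
	Multiplier float64 // burn-rate threshold for this window
}

// burnRate converts an observed error ratio into a burn rate: how many times
// faster than "exactly on budget" the error budget is being consumed.
func burnRate(errorRatio, slo float64) float64 {
	budget := 1 - slo
	if budget <= 0 {
		return 0 // a 100% SLO has no budget; this sketch treats it as "never fires"
	}
	return errorRatio / budget
}

// allWindowsBurning reports whether every configured window exceeds its
// multiplier. errorRatios maps a window name to its observed error ratio.
func allWindowsBurning(burns []Burn, errorRatios map[string]float64, slo float64) bool {
	if len(burns) == 0 {
		return false
	}
	for _, b := range burns {
		ratio, ok := errorRatios[b.Window]
		if !ok || burnRate(ratio, slo) <= b.Multiplier {
			return false // one quiet (or missing) window is enough to stay silent
		}
	}
	return true
}

func main() {
	burns := []Burn{{Window: "1h", Multiplier: 14.4}, {Window: "5m", Multiplier: 14.4}}
	fmt.Println(allWindowsBurning(burns, map[string]float64{"1h": 0.02, "5m": 0.03}, 0.999))
	fmt.Println(allWindowsBurning(burns, map[string]float64{"1h": 0.02, "5m": 0.001}, 0.999))
}
```

Requiring every window to agree is what keeps a short spike (fast window only) or an already-recovered outage (slow window only) from paging.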

The legacy prom_query kind was renamed to metric_raw. The legacy metric_threshold form is now a UI-only entry that compiles to metric_raw at save time — there is no separate evaluator for it.

// internal/manager/biz/alert/rules.go:36
type MetricRawRule struct {
    ID         uint64
    RuleKey    string
    Name       string
    Severity   string
    ScopeType  string // host / global / monitoring_pipeline
    RunbookURL string
    Labels     map[string]string
    Expr       string // canonical predicate, e.g. `up == 0`
}
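
To illustrate how a UI-only threshold form could compile down to the `Expr` of a `metric_raw` rule at save time, here is a minimal sketch. `ThresholdForm`, `compileThreshold`, and the operator whitelist are hypothetical names invented for this example; the actual save-time compiler in rules.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// ThresholdForm is a hypothetical shape for the UI-only metric_threshold entry.
type ThresholdForm struct {
	Metric    string // PromQL selector, e.g. `up`
	Operator  string // one of > >= < <= == !=
	Threshold float64
}

// allowedOps is the set of PromQL comparison operators this sketch accepts.
var allowedOps = map[string]bool{
	">": true, ">=": true, "<": true, "<=": true, "==": true, "!=": true,
}

// compileThreshold renders the form into a PromQL predicate suitable for
// MetricRawRule.Expr, so no separate evaluator is needed.
func compileThreshold(f ThresholdForm) (string, error) {
	if f.Metric == "" {
		return "", errors.New("metric is required")
	}
	if !allowedOps[f.Operator] {
		return "", fmt.Errorf("unsupported operator %q", f.Operator)
	}
	threshold := strconv.FormatFloat(f.Threshold, 'f', -1, 64)
	return f.Metric + " " + f.Operator + " " + threshold, nil
}

func main() {
	expr, err := compileThreshold(ThresholdForm{Metric: "up", Operator: "==", Threshold: 0})
	fmt.Println(expr, err)
}
```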

Log + trace kinds (Phase B)

Kind	What it does	Backend
`log_match`	`count_over_time(<stream> |~ <filter> [window]) <op> threshold` against Loki. Per label-set firing.	Loki
`log_volume`	Same shape as `log_match`, current-window count vs absolute threshold.	Loki
`trace_latency`	`histogram_quantile(q, sum by(le)(rate(traces_spanmetrics_latency_bucket[w]))) > threshold_ms`.	Prom (spanmetrics)
`trace_error_rate`	`100 * (sum rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}) / sum rate(...)) > pct`.	Prom (spanmetrics)
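
As a rough sketch of how two of the expression shapes in this table might be rendered from rule spec fields, consider the helpers below. The function names, parameter lists, and the use of `strconv.Quote` to quote the log filter are assumptions for illustration only; the real compiler's signatures are not shown in this document.

```go
package main

import (
	"fmt"
	"strconv"
)

// num renders a float without trailing zeros, e.g. 0.95 -> "0.95", 500 -> "500".
func num(v float64) string { return strconv.FormatFloat(v, 'f', -1, 64) }

// logMatchExpr renders the log_match shape for Loki. stream is a LogQL stream
// selector; filter is a regular expression, quoted here as a string literal.
func logMatchExpr(stream, filter, window, op string, threshold float64) string {
	return fmt.Sprintf("count_over_time(%s |~ %s [%s]) %s %s",
		stream, strconv.Quote(filter), window, op, num(threshold))
}

// traceLatencyExpr renders the trace_latency shape, which runs against
// Prometheus spanmetrics series.
func traceLatencyExpr(q float64, window string, thresholdMs float64) string {
	return fmt.Sprintf(
		"histogram_quantile(%s, sum by(le)(rate(traces_spanmetrics_latency_bucket[%s]))) > %s",
		num(q), window, num(thresholdMs))
}

func main() {
	fmt.Println(logMatchExpr(`{app="api"}`, "timeout", "5m", ">", 10))
	fmt.Println(traceLatencyExpr(0.95, "5m", 500))
}
```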

Trace kinds query Prometheus, not Tempo. The spanmetrics generator scrapes Tempo and writes traces_spanmetrics_* series back into Prom — querying Prom keeps the alert evaluator on one query engine and reuses all the operator filtering / threshold logic.

Scope types

Every rule has scope_type ∈ {host, global, monitoring_pipeline}. The per-kind default is defined by defaultScopeForKind in rules.go.

host: the incident must carry a device_id. The evaluator parses the device_id label from the Prom result labels; validateFiring rejects host-scoped firings without one.
global: service-level alerts (trace_*, log_*) that don't pin to a single host.
monitoring_pipeline: meta-alerts about Ongrid itself (scrape_down, prom_ingest_fail, ...).
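
The host-scope requirement can be pictured with a small check like the one below. It is a hypothetical stand-in for validateFiring, not its actual code: the `Firing` struct, its fields, and the `validateHostFiring` name are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Firing is a hypothetical, simplified view of one evaluator result.
type Firing struct {
	RuleKey   string
	ScopeType string            // host / global / monitoring_pipeline
	Labels    map[string]string // labels from the Prom result
	DeviceID  string            // filled from the device_id label for host scope
}

// validateHostFiring fills DeviceID from the device_id label and rejects
// host-scoped firings that have none. Other scopes pass through untouched.
func validateHostFiring(f *Firing) error {
	if f.ScopeType != "host" {
		return nil
	}
	if f.DeviceID == "" {
		f.DeviceID = f.Labels["device_id"]
	}
	if f.DeviceID == "" {
		return errors.New("host-scoped firing has no device_id label")
	}
	return nil
}

func main() {
	ok := Firing{RuleKey: "cpu_hot", ScopeType: "host", Labels: map[string]string{"device_id": "dev-7"}}
	fmt.Println(validateHostFiring(&ok), ok.DeviceID)

	bad := Firing{RuleKey: "cpu_hot", ScopeType: "host", Labels: map[string]string{"job": "node"}}
	fmt.Println(validateHostFiring(&bad))
}
```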

The evaluator tick

PipelineEvaluator.evaluate runs every Interval (default 5 min, configurable via PipelineEvaluatorOpts.Interval).

func (e *PipelineEvaluator) evaluate(ctx context.Context) {
    now := e.now()
    if e.edges != nil {
        e.refreshDeviceStalenessGauge(ctx, now)
    }
    if e.prom != nil {
        e.evaluatePromQuery(ctx, now)
        e.evaluateMetricAnomaly(ctx, now)
        e.evaluateMetricForecast(ctx, now)
        e.evaluateMetricBurnRate(ctx, now)
        e.evaluateTraceLatency(ctx, now)
        e.evaluateTraceErrorRate(ctx, now)
    }
    if e.logq != nil {
        e.evaluateLogMatch(ctx, now)
        e.evaluateLogVolume(ctx, now)
    }
}

A nil backend silently skips the corresponding kinds — Loki down doesn't break metric alerts.

Dedup + recovery

The evaluator tracks firingSnapshot[ruleKey] = set<dedupeKey> across ticks. If a key was present on the previous tick but is absent on this one, PromQL's comparison filter has dropped the series, which means the predicate has cleared, so SystemResolveIncident fires with the reason "prom condition cleared". This is how alarms recover without a separate "resolve" evaluator.
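
The recovery step is essentially a set difference between two ticks. A minimal sketch, assuming the per-rule snapshot is a `map[string]struct{}` of dedupe keys (the `clearedKeys` helper is a hypothetical name, not taken from the codebase):

```go
package main

import (
	"fmt"
	"sort"
)

// clearedKeys returns the dedupe keys that fired on the previous tick but are
// absent from the current one. Each of them is a candidate for resolution.
func clearedKeys(prev, curr map[string]struct{}) []string {
	var cleared []string
	for k := range prev {
		if _, still := curr[k]; !still {
			cleared = append(cleared, k)
		}
	}
	sort.Strings(cleared) // deterministic order; map iteration is random
	return cleared
}

func main() {
	prev := map[string]struct{}{
		"pipeline:disk_full:device_id=a": {},
		"pipeline:disk_full:device_id=b": {},
	}
	curr := map[string]struct{}{
		"pipeline:disk_full:device_id=b": {},
	}
	for _, k := range clearedKeys(prev, curr) {
		fmt.Println("resolve:", k) // the evaluator would call SystemResolveIncident here
	}
}
```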

Dedupe key shape: pipeline:<rule_key>:<sorted-label-set>. labelSetKey strips the provenance labels (__name__, ongrid_source) so that the same alarm reported by both the embedded and the cloud collector deduplicates to one incident, not two.
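
A sketch of how such a key could be built follows. The `k=v` pairs joined by commas are an assumed serialization chosen for the example; the document only states that the label set is sorted and that provenance labels are stripped. `dedupeKey` is a hypothetical wrapper, and this `labelSetKey` is an illustrative reimplementation rather than the real one.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// provenanceLabels are dropped before the key is built so that two collectors
// reporting the same alarm produce the same key.
var provenanceLabels = map[string]bool{"__name__": true, "ongrid_source": true}

// labelSetKey serializes labels as sorted "k=v" pairs joined by commas
// (the separator choice is an assumption of this sketch).
func labelSetKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if !provenanceLabels[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+labels[k])
	}
	return strings.Join(pairs, ",")
}

// dedupeKey assembles pipeline:<rule_key>:<sorted-label-set>.
func dedupeKey(ruleKey string, labels map[string]string) string {
	return "pipeline:" + ruleKey + ":" + labelSetKey(labels)
}

func main() {
	embedded := map[string]string{"__name__": "up", "ongrid_source": "embedded", "job": "node", "device_id": "d1"}
	cloud := map[string]string{"__name__": "up", "ongrid_source": "cloud", "job": "node", "device_id": "d1"}
	fmt.Println(dedupeKey("scrape_down", embedded))
	fmt.Println(dedupeKey("scrape_down", embedded) == dedupeKey("scrape_down", cloud))
}
```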

Channel fan-out

When an incident fires, the Notifier.MaybeNotify path consults the ChannelResolver:

Per-rule pinning — if rule.notify_channel_ids_json is non-empty, only those channel ids match (and only the enabled ones).
Otherwise, every enabled notification_channels row is filtered by match_severity_min and match_scope_types.
If nothing matches, the resolver falls back to a synthetic channel list seeded from DefaultChannels so notifications never disappear.

See router.go.
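
The three resolution steps above could look roughly like this. Everything here is an assumption made for illustration: the `Channel` struct, the severity ordering (info < warning < critical), treating an empty scope list as "any scope", and applying the default fallback whenever the result is empty, including when all pinned channels are disabled. Consult router.go for the actual behavior.

```go
package main

import "fmt"

// Channel is a hypothetical, simplified notification_channels row.
type Channel struct {
	ID          int64
	Enabled     bool
	MinSeverity string   // match_severity_min
	ScopeTypes  []string // match_scope_types; empty means any scope (assumption)
}

// severityRank is an assumed ordering; unknown severities rank as 0.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

func scopeMatches(allowed []string, scope string) bool {
	if len(allowed) == 0 {
		return true
	}
	for _, s := range allowed {
		if s == scope {
			return true
		}
	}
	return false
}

// resolveChannels applies pinning first, then severity/scope filtering, and
// falls back to defaults when nothing matched.
func resolveChannels(pinned []int64, all, defaults []Channel, severity, scope string) []Channel {
	var out []Channel
	if len(pinned) > 0 {
		want := make(map[int64]bool, len(pinned))
		for _, id := range pinned {
			want[id] = true
		}
		for _, c := range all {
			if c.Enabled && want[c.ID] {
				out = append(out, c)
			}
		}
	} else {
		for _, c := range all {
			if c.Enabled &&
				severityRank[severity] >= severityRank[c.MinSeverity] &&
				scopeMatches(c.ScopeTypes, scope) {
				out = append(out, c)
			}
		}
	}
	if len(out) == 0 {
		return defaults // never drop a notification silently
	}
	return out
}

func main() {
	all := []Channel{
		{ID: 1, Enabled: true, MinSeverity: "warning"},
		{ID: 2, Enabled: true, MinSeverity: "critical", ScopeTypes: []string{"host"}},
	}
	defaults := []Channel{{ID: 99, Enabled: true}}
	fmt.Println(resolveChannels(nil, all, defaults, "warning", "global"))
	fmt.Println(resolveChannels(nil, all, defaults, "info", "global"))
}
```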

Inhibition

Two built-in inhibition rules (inhibit.go) cover the noisy default cases:

edge_offline:edge_X inhibits any host:X:* — when an edge is unreachable, every host-scoped alarm on it is suppressed.
pipeline:prom_ingest_fail inhibits pipeline:scrape_down:* — when Prometheus itself can't ingest, every "target down" alarm is noise.

A future inhibition_rules table extends this to admin-defined groups.
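
The two built-in rules amount to prefix checks against the set of currently active keys. The sketch below reads the key patterns literally (an active `edge_offline:edge_<X>` suppresses any `host:<X>:...`); the `inhibited` function and the `active` map are hypothetical names, and the real matching in inhibit.go may work differently.

```go
package main

import (
	"fmt"
	"strings"
)

// inhibited reports whether target should be suppressed given the set of
// currently active alarm keys.
func inhibited(target string, active map[string]bool) bool {
	// Rule 2: Prometheus cannot ingest, so every "target down" alarm is noise.
	if strings.HasPrefix(target, "pipeline:scrape_down:") && active["pipeline:prom_ingest_fail"] {
		return true
	}
	// Rule 1: the edge is unreachable, so host-scoped alarms on it are suppressed.
	if strings.HasPrefix(target, "host:") {
		rest := strings.TrimPrefix(target, "host:")
		if i := strings.Index(rest, ":"); i > 0 {
			if active["edge_offline:edge_"+rest[:i]] {
				return true
			}
		}
	}
	return false
}

func main() {
	active := map[string]bool{"edge_offline:edge_42": true}
	fmt.Println(inhibited("host:42:cpu_hot", active))
	fmt.Println(inhibited("host:7:cpu_hot", active))
}
```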

Cooldown + dampening

NotifyOpts.Cooldown (default 10 minutes) bounds re-notification on the same dedupe_key. The dampening filter sits inside Usecase.MaybeNotify so the channel resolver and inhibitor still run on every firing — only the actual Notifier.Send is skipped.
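
A per-key cooldown of this kind can be sketched in a few lines. The `dampener` type and its methods are hypothetical names for illustration; only the 10-minute default and the "skip the send, keep everything else" behavior come from the text above. The sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// dampener remembers when each dedupe key was last notified.
type dampener struct {
	cooldown time.Duration
	lastSent map[string]time.Time
}

func newDampener(cooldown time.Duration) *dampener {
	return &dampener{cooldown: cooldown, lastSent: make(map[string]time.Time)}
}

// shouldSend returns false while key is still inside its cooldown window.
// When it returns true it also records now as the latest send time.
func (d *dampener) shouldSend(key string, now time.Time) bool {
	if last, seen := d.lastSent[key]; seen && now.Sub(last) < d.cooldown {
		return false
	}
	d.lastSent[key] = now
	return true
}

func main() {
	d := newDampener(10 * time.Minute)
	start := time.Date(2026, 5, 8, 12, 0, 0, 0, time.UTC)
	key := "pipeline:scrape_down:job=node"

	fmt.Println(d.shouldSend(key, start))                     // first firing
	fmt.Println(d.shouldSend(key, start.Add(5*time.Minute)))  // inside cooldown
	fmt.Println(d.shouldSend(key, start.Add(10*time.Minute))) // cooldown elapsed
}
```

Because the check happens only at the send step, a caller would still run channel resolution and inhibition on every firing and consult `shouldSend` last.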
