Alerts

Ongrid's alerting subsystem is a single tick loop that walks every enabled rule row, asks the appropriate backend (Prom for metrics + trace spanmetrics, Loki for logs) whether the predicate matches, and records firings into the incidents table.

There is no separate Alertmanager, no separate rules file. Rules live in MySQL, the evaluator polls them on a 30s cache refresh, and notifications fan out through the channel registry.

The 14 rule kinds

Rules are stored with a kind column. The compiler dispatches on it.

The compiler is in rules.go and the evaluators in evaluators_phaseA.go + evaluators_phaseB.go.

The 8+6 split is HLD-004's Phase-A (metrics) / Phase-B (logs + traces), landed 2026-05-08.

Metric kinds (Phase A)

| Kind | What it does | Spec fields |
| --- | --- | --- |
| metric_raw | PromQL expression IS the predicate. Fires per returned vector entry. | expr |
| metric_anomaly | Z-score or MAD over a rolling baseline window. | metric, method, baseline_window, baseline_step, deviation, for_seconds |
| metric_forecast | predict_linear(metric[fit_window], predict_seconds) <op> threshold. | metric, fit_window, predict_seconds, operator, threshold |
| metric_burn_rate | Google SRE multi-window multi-burn over an SLO. ALL windows must trigger. | sli, slo, burns[].window, burns[].multiplier |
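
The "ALL windows must trigger" rule for metric_burn_rate can be sketched as below. This is a minimal illustration under stated assumptions, not Ongrid's evaluator: BurnWindow, burnRateFires, and the errorRatio map are hypothetical names, and the observed error ratios are passed in directly instead of being queried from Prometheus.

```go
package main

import "fmt"

// BurnWindow mirrors one burns[] entry: a lookback window and a burn multiplier.
// Hypothetical type, for illustration only.
type BurnWindow struct {
	Window     string  // e.g. "1h"
	Multiplier float64 // e.g. 14.4
}

// burnRateFires reports whether every window is burning error budget faster
// than its multiplier allows. slo is the target success ratio (0.999 = 99.9%);
// errorRatio maps a window to the error ratio observed over that window.
func burnRateFires(slo float64, burns []BurnWindow, errorRatio map[string]float64) bool {
	if len(burns) == 0 {
		return false
	}
	budget := 1 - slo
	for _, b := range burns {
		observed, ok := errorRatio[b.Window]
		if !ok || observed <= b.Multiplier*budget {
			return false // one quiet (or missing) window suppresses the alert
		}
	}
	return true
}

func main() {
	burns := []BurnWindow{{"1h", 14.4}, {"5m", 14.4}}
	// Both windows exceed 14.4x the 0.1% budget.
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.03}))
	// The short window has recovered, so the rule stays quiet.
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.001}))
}
```

Requiring every window to agree is what keeps a brief spike (short window only) or a long-resolved outage (long window only) from paging.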

The legacy prom_query kind was renamed to metric_raw. The legacy metric_threshold form is now a UI-only entry that compiles to metric_raw at save time — there is no separate evaluator for it.
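
As a rough sketch of what "compiles to metric_raw at save time" could look like, the snippet below renders a threshold form into a single PromQL predicate string. ThresholdForm, allowedOps, and compileThreshold are hypothetical names chosen for illustration; the real form fields and validation are not shown in this page.

```go
package main

import (
	"fmt"
	"strconv"
)

// ThresholdForm is a hypothetical shape for the UI-only metric_threshold entry.
type ThresholdForm struct {
	Metric    string  // metric name or selector, e.g. "node_load1"
	Operator  string  // comparison operator
	Threshold float64 // right-hand side of the comparison
}

// allowedOps lists the PromQL comparison operators this sketch accepts.
var allowedOps = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true, "!=": true}

// compileThreshold turns the form into an expression suitable for a
// metric_raw rule's expr field.
func compileThreshold(f ThresholdForm) (string, error) {
	if f.Metric == "" {
		return "", fmt.Errorf("metric is required")
	}
	if !allowedOps[f.Operator] {
		return "", fmt.Errorf("unsupported operator %q", f.Operator)
	}
	threshold := strconv.FormatFloat(f.Threshold, 'g', -1, 64)
	return f.Metric + " " + f.Operator + " " + threshold, nil
}

func main() {
	expr, err := compileThreshold(ThresholdForm{Metric: "node_load1", Operator: ">", Threshold: 4})
	fmt.Println(expr, err)
}
```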

go
// internal/manager/biz/alert/rules.go:36
type MetricRawRule struct {
    ID         uint64
    RuleKey    string
    Name       string
    Severity   string
    ScopeType  string // host / global / monitoring_pipeline
    RunbookURL string
    Labels     map[string]string
    Expr       string // canonical predicate, e.g. `up == 0`
}

Log + trace kinds (Phase B)

| Kind | What it does | Backend |
| --- | --- | --- |
| log_match | count_over_time(<stream> \|~ <filter> [window]) <op> threshold against Loki. Per label-set firing. | Loki |
| log_volume | Same shape as log_match, current-window count vs absolute threshold. | Loki |
| trace_latency | histogram_quantile(q, sum by(le)(rate(traces_spanmetrics_latency_bucket[w]))) > threshold_ms. | Prom (spanmetrics) |
| trace_error_rate | 100 * (sum rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}) / sum rate(...)) > pct. | Prom (spanmetrics) |

Trace kinds query Prometheus, not Tempo. The spanmetrics generator scrapes Tempo and writes traces_spanmetrics_* series back into Prom — querying Prom keeps the alert evaluator on one query engine and reuses all the operator filtering / threshold logic.
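
Because trace kinds are plain PromQL underneath, compiling one amounts to string templating. The sketch below renders the trace_latency predicate from the table; traceLatencyExpr and its parameter names are hypothetical, and any per-service label filtering the real compiler may add is left out.

```go
package main

import "fmt"

// traceLatencyExpr renders the trace_latency predicate as a PromQL string.
// quantile is in [0, 1], window is a PromQL duration such as "5m", and
// thresholdMs is the latency limit in milliseconds.
func traceLatencyExpr(quantile float64, window string, thresholdMs float64) string {
	return fmt.Sprintf(
		"histogram_quantile(%g, sum by(le)(rate(traces_spanmetrics_latency_bucket[%s]))) > %g",
		quantile, window, thresholdMs,
	)
}

func main() {
	fmt.Println(traceLatencyExpr(0.99, "5m", 500))
}
```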

Scope types

Every rule has scope_type ∈ {host, global, monitoring_pipeline}. The per-kind default is defined by defaultScopeForKind in rules.go.

  • host: the incident must carry a device_id. The evaluator parses the device_id label from the Prom result labels; validateFiring rejects host-scoped firings without one.
  • global: service-level alerts (trace_*, log_*) that don't pin to a single host.
  • monitoring_pipeline: meta-alerts about Ongrid itself (scrape_down, prom_ingest_fail, ...).
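
The host-scope requirement can be illustrated with a small check. This is a sketch only: Firing and validateFiringSketch are hypothetical stand-ins, and the real validateFiring may enforce more than the device_id rule shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Firing is a hypothetical, trimmed-down view of one evaluator result.
type Firing struct {
	ScopeType string            // host / global / monitoring_pipeline
	Labels    map[string]string // labels returned by the backend query
}

var errMissingDevice = errors.New("host-scoped firing has no device_id label")

// validateFiringSketch rejects host-scoped firings that cannot be tied to a
// device. Other scope types pass through unchanged.
func validateFiringSketch(f Firing) error {
	if f.ScopeType == "host" && f.Labels["device_id"] == "" {
		return errMissingDevice
	}
	return nil
}

func main() {
	ok := Firing{ScopeType: "host", Labels: map[string]string{"device_id": "42"}}
	bad := Firing{ScopeType: "host", Labels: map[string]string{"job": "node"}}
	fmt.Println(validateFiringSketch(ok))
	fmt.Println(validateFiringSketch(bad))
}
```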

The evaluator tick

PipelineEvaluator.evaluate runs every Interval (default 5 min, configurable via PipelineEvaluatorOpts.Interval).

go
func (e *PipelineEvaluator) evaluate(ctx context.Context) {
    now := e.now()
    if e.edges != nil {
        e.refreshDeviceStalenessGauge(ctx, now)
    }
    if e.prom != nil {
        e.evaluatePromQuery(ctx, now)
        e.evaluateMetricAnomaly(ctx, now)
        e.evaluateMetricForecast(ctx, now)
        e.evaluateMetricBurnRate(ctx, now)
        e.evaluateTraceLatency(ctx, now)
        e.evaluateTraceErrorRate(ctx, now)
    }
    if e.logq != nil {
        e.evaluateLogMatch(ctx, now)
        e.evaluateLogVolume(ctx, now)
    }
}

A nil backend silently skips the corresponding kinds — Loki down doesn't break metric alerts.

Dedup + recovery

The evaluator tracks firingSnapshot[ruleKey] = set<dedupeKey> across ticks. If a key was present on the last tick but is absent on this one, PromQL's comparison filter has dropped the series, which means the predicate cleared, so SystemResolveIncident fires with "prom condition cleared". This is how alarms recover without a separate "resolve" evaluator.
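
The recovery step is a set difference between two ticks. The sketch below shows that idea in isolation; clearedKeys is a hypothetical helper, and the real evaluator keeps one such set per rule key.

```go
package main

import (
	"fmt"
	"sort"
)

// clearedKeys returns the dedupe keys that fired on the previous tick but are
// absent from the current one. Each returned key is a candidate for resolution.
func clearedKeys(prev, curr map[string]struct{}) []string {
	var cleared []string
	for k := range prev {
		if _, still := curr[k]; !still {
			cleared = append(cleared, k)
		}
	}
	sort.Strings(cleared) // deterministic order for logging and tests
	return cleared
}

func main() {
	prev := map[string]struct{}{"host-a": {}, "host-b": {}, "host-c": {}}
	curr := map[string]struct{}{"host-b": {}}
	for _, k := range clearedKeys(prev, curr) {
		fmt.Println("resolve:", k)
	}
}
```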

Dedupe key shape: pipeline:<rule_key>:<sorted-label-set>. Provenance labels (__name__, ongrid_source) are stripped so that the same alarm reported by both the embedded and the cloud collector deduplicates to one incident, not two (see labelSetKey).
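
A key of this shape could be built as follows. The function name dedupeKey and the k=v,k=v serialization of the label set are assumptions made for the example; only the overall pipeline:<rule_key>:<sorted-label-set> shape and the two stripped labels come from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dedupeKey builds a stable key from a rule key and a label set. Provenance
// labels are dropped and the remaining pairs are sorted, so label order and
// collector origin do not change the key.
func dedupeKey(ruleKey string, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		if k == "__name__" || k == "ongrid_source" {
			continue
		}
		pairs = append(pairs, k+"="+v) // serialization format is an assumption
	}
	sort.Strings(pairs)
	return "pipeline:" + ruleKey + ":" + strings.Join(pairs, ",")
}

func main() {
	embedded := map[string]string{"__name__": "up", "job": "node", "device_id": "42", "ongrid_source": "embedded"}
	cloud := map[string]string{"__name__": "up", "job": "node", "device_id": "42", "ongrid_source": "cloud"}
	fmt.Println(dedupeKey("scrape_down", embedded) == dedupeKey("scrape_down", cloud))
}
```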

Channel fan-out

When an incident fires, the Notifier.MaybeNotify path consults the ChannelResolver:

  1. Per-rule pinning — if rule.notify_channel_ids_json is non-empty, only those channel ids match (and only the enabled ones).
  2. Otherwise, every enabled notification_channels row is filtered by match_severity_min and match_scope_types.
  3. If nothing matches, the resolver falls back to a synthetic channel list seeded from DefaultChannels so notifications never disappear.
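
The three steps above can be sketched as one function. Everything here is illustrative: Channel, resolveChannels, and scopeMatches are hypothetical names, severity is modelled as an integer rank, an empty scope list is assumed to match any scope, and the fallback is assumed to apply when a pinned list matches nothing as well.

```go
package main

import "fmt"

// Channel is a hypothetical, trimmed-down notification_channels row.
type Channel struct {
	ID          int
	Enabled     bool
	MinSeverity int      // match_severity_min as a rank; higher is more severe
	ScopeTypes  []string // match_scope_types; empty means any scope (assumption)
}

// scopeMatches reports whether scope is allowed by the channel's scope list.
func scopeMatches(allowed []string, scope string) bool {
	if len(allowed) == 0 {
		return true
	}
	for _, s := range allowed {
		if s == scope {
			return true
		}
	}
	return false
}

// resolveChannels applies per-rule pinning, then severity and scope
// filtering, then the default-channel fallback.
func resolveChannels(pinned []int, all, defaults []Channel, severity int, scope string) []Channel {
	var out []Channel
	if len(pinned) > 0 {
		want := make(map[int]bool, len(pinned))
		for _, id := range pinned {
			want[id] = true
		}
		for _, c := range all {
			if c.Enabled && want[c.ID] {
				out = append(out, c)
			}
		}
	} else {
		for _, c := range all {
			if c.Enabled && severity >= c.MinSeverity && scopeMatches(c.ScopeTypes, scope) {
				out = append(out, c)
			}
		}
	}
	if len(out) == 0 {
		return defaults // never drop a notification silently
	}
	return out
}

func main() {
	all := []Channel{{ID: 1, Enabled: true, MinSeverity: 2, ScopeTypes: []string{"host"}}, {ID: 2, Enabled: true}}
	defaults := []Channel{{ID: 99, Enabled: true}}
	fmt.Println(len(resolveChannels(nil, all, defaults, 1, "host")))
}
```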

See router.go.

Inhibition

Two built-in inhibition rules (inhibit.go) cover the noisy default cases:

  • edge_offline:edge_X inhibits any host:X:* — when an edge is unreachable, every host-scoped alarm on it is suppressed.
  • pipeline:prom_ingest_fail inhibits pipeline:scrape_down:* — when Prometheus itself can't ingest, every "target down" alarm is noise.

A future inhibition_rules table extends this to admin-defined groups.
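
A minimal sketch of the two built-in rules, written as prefix checks over dedupe keys. The inhibited function and the active map are hypothetical, and the key layout is taken literally from the patterns above (edge_offline:edge_X, host:X:*, pipeline:scrape_down:*); inhibit.go may structure this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// inhibited reports whether the alarm identified by key should be suppressed,
// given the set of currently active alarm keys.
func inhibited(key string, active map[string]bool) bool {
	// Rule 1: Prometheus cannot ingest, so every "target down" alarm is noise.
	if strings.HasPrefix(key, "pipeline:scrape_down:") && active["pipeline:prom_ingest_fail"] {
		return true
	}
	// Rule 2: the edge is offline, so host-scoped alarms on it are suppressed.
	if strings.HasPrefix(key, "host:") {
		rest := strings.TrimPrefix(key, "host:")
		if i := strings.Index(rest, ":"); i > 0 {
			return active["edge_offline:edge_"+rest[:i]]
		}
	}
	return false
}

func main() {
	active := map[string]bool{"edge_offline:edge_7": true}
	fmt.Println(inhibited("host:7:cpu_high", active))
	fmt.Println(inhibited("host:8:cpu_high", active))
}
```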

Cooldown + dampening

NotifyOpts.Cooldown (default 10 minutes) bounds re-notification on the same dedupe_key. The dampening filter sits inside Usecase.MaybeNotify so the channel resolver and inhibitor still run on every firing — only the actual Notifier.Send is skipped.
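
The cooldown gate itself can be pictured as a per-key timestamp check. This is a sketch under assumptions: dampener, newDampener, and allow are hypothetical names, state is held in memory, and a firing exactly one cooldown after the last send is allowed through.

```go
package main

import (
	"fmt"
	"time"
)

// defaultCooldown matches the documented default of 10 minutes.
const defaultCooldown = 10 * time.Minute

// dampener remembers when each dedupe key was last notified.
type dampener struct {
	cooldown time.Duration
	lastSent map[string]time.Time
}

// newDampener builds a dampener; a non-positive cooldown selects the default.
func newDampener(cooldown time.Duration) *dampener {
	if cooldown <= 0 {
		cooldown = defaultCooldown
	}
	return &dampener{cooldown: cooldown, lastSent: make(map[string]time.Time)}
}

// allow reports whether a notification for key may be sent at now. A send is
// recorded only when it is allowed, so suppressed firings do not extend the
// cooldown.
func (d *dampener) allow(key string, now time.Time) bool {
	if last, ok := d.lastSent[key]; ok && now.Sub(last) < d.cooldown {
		return false
	}
	d.lastSent[key] = now
	return true
}

func main() {
	d := newDampener(0)
	start := time.Now()
	key := "pipeline:scrape_down:job=node"
	fmt.Println(d.allow(key, start))                     // first firing
	fmt.Println(d.allow(key, start.Add(3*time.Minute)))  // inside the cooldown
	fmt.Println(d.allow(key, start.Add(11*time.Minute))) // cooldown elapsed
}
```

Keeping this check at the very end of the notify path matches the description above: resolution and inhibition still run on every firing, and only the send is skipped.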

See also

  • RCA — what happens when an incident fires.
  • Logs — Loki + the log_match / log_volume evaluators.
  • Traces — Tempo + the trace_latency / trace_error_rate evaluators.
  • Channels overview — how Slack / Telegram / Lark / DingTalk / WeCom + webhook channels are configured.
  • Alert rule schema — the wire format of the rule row.