Alerts
Ongrid's alerting subsystem is a single tick loop that walks every enabled rule row, asks the appropriate backend (Prom for metrics + trace spanmetrics, Loki for logs) whether the predicate matches, and records firings into the incidents table.
There is no separate Alertmanager, no separate rules file. Rules live in MySQL, the evaluator polls them on a 30s cache refresh, and notifications fan out through the channel registry.
The 14 rule kinds
Rules are stored with a kind column. The compiler dispatches on it.
The compiler is in rules.go and the evaluators in evaluators_phaseA.go + evaluators_phaseB.go.
The 8+6 split is HLD-004's Phase-A (metrics) / Phase-B (logs + traces), landed 2026-05-08.
Metric kinds (Phase A)
| Kind | What it does | Spec fields |
|---|---|---|
| `metric_raw` | PromQL expression IS the predicate. Fires per returned vector entry. | `expr` |
| `metric_anomaly` | Z-score or MAD over a rolling baseline window. | `metric`, `method`, `baseline_window`, `baseline_step`, `deviation`, `for_seconds` |
| `metric_forecast` | `predict_linear(metric[fit_window], predict_seconds) <op> threshold`. | `metric`, `fit_window`, `predict_seconds`, `operator`, `threshold` |
| `metric_burn_rate` | Google SRE multi-window multi-burn over an SLO. ALL windows must trigger. | `sli`, `slo`, `burns[].window`, `burns[].multiplier` |
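The "ALL windows must trigger" rule for `metric_burn_rate` can be sketched as follows. This is a minimal illustration, not the evaluator's code: the `Burn` type, the `burnRateFires` function, and the strict `>` comparison are assumptions, and the per-window error ratio is passed in rather than queried from Prometheus. It uses the usual SRE definition in which a window triggers when its error ratio exceeds `multiplier * (1 - slo)`.

```go
package main

import "fmt"

// Burn is a hypothetical mirror of one burns[] entry in the rule spec.
type Burn struct {
	Window     string  // e.g. "1h"
	Multiplier float64 // burn-rate multiplier, e.g. 14.4
}

// burnRateFires reports whether every window consumes the error budget
// faster than its multiplier allows. errorRatio maps a window to the observed
// fraction of failed requests over that window (an input in this sketch).
func burnRateFires(slo float64, burns []Burn, errorRatio map[string]float64) bool {
	budget := 1 - slo // allowed error fraction
	if budget <= 0 || len(burns) == 0 {
		return false
	}
	for _, b := range burns {
		ratio, ok := errorRatio[b.Window]
		if !ok || ratio <= b.Multiplier*budget {
			return false // a missing or calm window vetoes the firing
		}
	}
	return true
}

func main() {
	burns := []Burn{{Window: "1h", Multiplier: 14.4}, {Window: "5m", Multiplier: 14.4}}
	// Both windows burn fast: fires.
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.03}))
	// The short window has recovered: does not fire.
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.001}))
}
```

Requiring the short window as well as the long one is what lets the alarm clear quickly once the error rate drops.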
The legacy prom_query kind was renamed to metric_raw. The legacy metric_threshold form is now a UI-only entry that compiles to metric_raw at save time — there is no separate evaluator for it.
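As a rough illustration of the save-time compilation, a threshold form could be rendered into a `metric_raw` expression like this. The function name `thresholdToRawExpr`, its parameters, and the accepted operator list are hypothetical; the real UI form may carry more fields (label matchers, `for` durations) that this sketch ignores.

```go
package main

import (
	"fmt"
	"strconv"
)

// thresholdToRawExpr renders a UI threshold form as the PromQL predicate that
// a metric_raw rule would store. Name and signature are hypothetical.
func thresholdToRawExpr(metric, op string, threshold float64) (string, error) {
	if metric == "" {
		return "", fmt.Errorf("metric must not be empty")
	}
	switch op {
	case ">", ">=", "<", "<=", "==", "!=":
		// PromQL comparison operators; anything else is rejected.
	default:
		return "", fmt.Errorf("unsupported operator %q", op)
	}
	// 'g' with precision -1 prints the shortest exact representation.
	value := strconv.FormatFloat(threshold, 'g', -1, 64)
	return metric + " " + op + " " + value, nil
}

func main() {
	expr, err := thresholdToRawExpr("up", "==", 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expr) // prints: up == 0
}
```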
```go
// internal/manager/biz/alert/rules.go:36
type MetricRawRule struct {
	ID         uint64
	RuleKey    string
	Name       string
	Severity   string
	ScopeType  string // host / global / monitoring_pipeline
	RunbookURL string
	Labels     map[string]string
	Expr       string // canonical predicate, e.g. `up == 0`
}
```
Log + trace kinds (Phase B)
| Kind | What it does | Backend |
|---|---|---|
| `log_match` | `count_over_time(<stream> \|~ <filter> [window]) <op> threshold` against Loki. Per label-set firing. | Loki |
| `log_volume` | Same shape as `log_match`, current-window count vs absolute threshold. | Loki |
| `trace_latency` | `histogram_quantile(q, sum by(le)(rate(traces_spanmetrics_latency_bucket[w]))) > threshold_ms`. | Prom (spanmetrics) |
| `trace_error_rate` | `100 * (sum rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}) / sum rate(...)) > pct`. | Prom (spanmetrics) |
Trace kinds query Prometheus, not Tempo. The spanmetrics generator scrapes Tempo and writes traces_spanmetrics_* series back into Prom — querying Prom keeps the alert evaluator on one query engine and reuses all the operator filtering / threshold logic.
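Because a trace rule is just PromQL over spanmetrics series, compiling one amounts to string assembly. The sketch below builds the `trace_latency` predicate from the table; `traceLatencyExpr` and its parameters are hypothetical names, and it performs no unit conversion or label filtering, which a real compiler would likely need.

```go
package main

import (
	"fmt"
	"strconv"
)

// traceLatencyExpr assembles the trace_latency predicate in the shape given
// in the table. Function name and parameters are hypothetical.
func traceLatencyExpr(quantile float64, window string, thresholdMs float64) string {
	q := strconv.FormatFloat(quantile, 'g', -1, 64)
	t := strconv.FormatFloat(thresholdMs, 'g', -1, 64)
	return fmt.Sprintf(
		"histogram_quantile(%s, sum by(le)(rate(traces_spanmetrics_latency_bucket[%s]))) > %s",
		q, window, t)
}

func main() {
	// p99 over a 5 minute window against a 500 ms threshold.
	fmt.Println(traceLatencyExpr(0.99, "5m", 500))
}
```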
Scope types
Every rule has scope_type ∈ {host, global, monitoring_pipeline}. The per-kind default is defined by defaultScopeForKind in rules.go.
- `host`: the incident must carry a `device_id`. The evaluator parses the `device_id` label from Prom result labels; `validateFiring` rejects host-scoped firings without one.
- `global`: service-level alerts (`trace_*`, `log_*`) that don't pin to a single host.
- `monitoring_pipeline`: meta-alerts about Ongrid itself (`scrape_down`, `prom_ingest_fail`, ...).
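The host-scope check can be pictured as below. Only the name `validateFiring` and the `device_id` requirement come from the text above; the `Firing` type, the function signature, and the error values are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Firing is a hypothetical, trimmed view of one evaluator result.
type Firing struct {
	ScopeType string
	Labels    map[string]string // labels from the Prom or Loki result
}

var errMissingDevice = errors.New("host-scoped firing has no device_id label")

// validateFiring rejects host-scoped firings that cannot be pinned to a
// device. The signature is assumed; only the rule itself is from the docs.
func validateFiring(f Firing) error {
	switch f.ScopeType {
	case "host":
		if f.Labels["device_id"] == "" {
			return errMissingDevice
		}
		return nil
	case "global", "monitoring_pipeline":
		return nil
	default:
		return fmt.Errorf("unknown scope_type %q", f.ScopeType)
	}
}

func main() {
	ok := Firing{ScopeType: "host", Labels: map[string]string{"device_id": "42"}}
	bad := Firing{ScopeType: "host", Labels: map[string]string{"job": "node"}}
	fmt.Println(validateFiring(ok))
	fmt.Println(validateFiring(bad))
}
```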
The evaluator tick
PipelineEvaluator.evaluate runs every Interval (default 5 min, configurable via PipelineEvaluatorOpts.Interval).
```go
func (e *PipelineEvaluator) evaluate(ctx context.Context) {
	now := e.now()
	if e.edges != nil {
		e.refreshDeviceStalenessGauge(ctx, now)
	}
	if e.prom != nil {
		e.evaluatePromQuery(ctx, now)
		e.evaluateMetricAnomaly(ctx, now)
		e.evaluateMetricForecast(ctx, now)
		e.evaluateMetricBurnRate(ctx, now)
		e.evaluateTraceLatency(ctx, now)
		e.evaluateTraceErrorRate(ctx, now)
	}
	if e.logq != nil {
		e.evaluateLogMatch(ctx, now)
		e.evaluateLogVolume(ctx, now)
	}
}
```
A nil backend silently skips the corresponding kinds, so Loki being down doesn't break metric alerts.
Dedup + recovery
The evaluator tracks firingSnapshot[ruleKey] = set<dedupeKey> across ticks. A key present last tick but absent this tick → PromQL's comparison filter dropped the series → predicate cleared → SystemResolveIncident fires with "prom condition cleared". This is how alarms recover without a separate "resolve" evaluator.
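The recovery step is a set difference between two ticks. A minimal sketch, assuming the snapshot for one rule is a `map[string]struct{}` of dedupe keys (the function name `resolvedKeys` and the example keys are illustrative only):

```go
package main

import (
	"fmt"
	"sort"
)

// resolvedKeys returns the dedupe keys that fired on the previous tick but
// are absent from the current one, sorted for stable output.
func resolvedKeys(prev, curr map[string]struct{}) []string {
	var gone []string
	for k := range prev {
		if _, still := curr[k]; !still {
			gone = append(gone, k)
		}
	}
	sort.Strings(gone)
	return gone
}

func main() {
	prev := map[string]struct{}{
		"pipeline:cpu_high:device_id=a": {},
		"pipeline:cpu_high:device_id=b": {},
	}
	curr := map[string]struct{}{
		"pipeline:cpu_high:device_id=b": {},
	}
	// Each returned key would be passed to SystemResolveIncident.
	for _, k := range resolvedKeys(prev, curr) {
		fmt.Println("resolve", k, "(prom condition cleared)")
	}
}
```

After the diff, the current set replaces the stored snapshot so the next tick compares against it.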
Dedupe key shape: pipeline:<rule_key>:<sorted-label-set> — provenance labels (__name__, ongrid_source) are stripped so the same alarm reported by both the embedded and the cloud collector deduplicates to one incident, not two (labelSetKey).
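The key construction might look like the following. The `pipeline:<rule_key>:` prefix and the two stripped labels come from the text above; the `k=v,k=v` encoding of the label set and the name `dedupeKey` are assumptions, since the actual `labelSetKey` format is not shown here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// provenanceLabels are dropped before the key is built, so the same alarm
// seen through different collectors maps to one key.
var provenanceLabels = map[string]bool{"__name__": true, "ongrid_source": true}

// dedupeKey builds pipeline:<rule_key>:<sorted-label-set>. The k=v,k=v
// encoding of the label set is an assumption of this sketch.
func dedupeKey(ruleKey string, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		if provenanceLabels[k] {
			continue
		}
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs) // map iteration order is random; sort for determinism
	return "pipeline:" + ruleKey + ":" + strings.Join(pairs, ",")
}

func main() {
	embedded := map[string]string{"__name__": "up", "ongrid_source": "embedded", "device_id": "42", "job": "node"}
	cloud := map[string]string{"__name__": "up", "ongrid_source": "cloud", "job": "node", "device_id": "42"}
	fmt.Println(dedupeKey("scrape_down", embedded))
	fmt.Println(dedupeKey("scrape_down", embedded) == dedupeKey("scrape_down", cloud))
}
```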
Channel fan-out
When an incident fires, the Notifier.MaybeNotify path consults the ChannelResolver:
- Per-rule pinning: if `rule.notify_channel_ids_json` is non-empty, only those channel ids match (and only the enabled ones).
- Otherwise, every enabled `notification_channels` row is filtered by `match_severity_min` and `match_scope_types`.
- If nothing matches, the resolver falls back to a synthetic channel list seeded from `DefaultChannels` so notifications never disappear.
See router.go.
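The three resolution tiers can be sketched as one function. Everything here is illustrative rather than a copy of router.go: the `Channel` and `Incident` types, the numeric severity rank, treating an empty scope list as "any scope", and applying the default fallback even when pinned channels are all disabled are assumptions.

```go
package main

import "fmt"

// Channel is a hypothetical, trimmed view of a notification_channels row.
type Channel struct {
	ID               int
	Enabled          bool
	MatchSeverityMin int      // lowest severity rank the channel accepts
	MatchScopeTypes  []string // empty is treated as "any scope" (assumption)
}

// Incident carries only the fields the resolver needs in this sketch.
type Incident struct {
	SeverityRank int // higher means more severe (assumption)
	ScopeType    string
	PinnedIDs    []int // decoded from rule.notify_channel_ids_json
}

func scopeMatches(c Channel, scope string) bool {
	if len(c.MatchScopeTypes) == 0 {
		return true
	}
	for _, s := range c.MatchScopeTypes {
		if s == scope {
			return true
		}
	}
	return false
}

// resolveChannels applies pinning first, then severity/scope filtering, and
// falls back to defaults when nothing matched.
func resolveChannels(inc Incident, all, defaults []Channel) []Channel {
	pinned := make(map[int]bool, len(inc.PinnedIDs))
	for _, id := range inc.PinnedIDs {
		pinned[id] = true
	}
	var out []Channel
	for _, c := range all {
		if !c.Enabled {
			continue
		}
		if len(pinned) > 0 {
			if pinned[c.ID] {
				out = append(out, c)
			}
			continue
		}
		if inc.SeverityRank >= c.MatchSeverityMin && scopeMatches(c, inc.ScopeType) {
			out = append(out, c)
		}
	}
	if len(out) == 0 {
		return defaults
	}
	return out
}

func main() {
	all := []Channel{
		{ID: 1, Enabled: true, MatchSeverityMin: 2},
		{ID: 2, Enabled: true, MatchScopeTypes: []string{"host"}},
		{ID: 3, Enabled: false},
	}
	defaults := []Channel{{ID: 99, Enabled: true}}
	for _, c := range resolveChannels(Incident{SeverityRank: 1, ScopeType: "host"}, all, defaults) {
		fmt.Println("notify channel", c.ID)
	}
}
```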
Inhibition
Two built-in inhibition rules (inhibit.go), covering the noisy default cases:
- `edge_offline:edge_X` inhibits any `host:X:*`: when an edge is unreachable, every host-scoped alarm on it is suppressed.
- `pipeline:prom_ingest_fail` inhibits `pipeline:scrape_down:*`: when Prometheus itself can't ingest, every "target down" alarm is noise.
A future inhibition_rules table extends this to admin-defined groups.
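Expressed over dedupe keys, the two built-in rules reduce to prefix checks. This sketch is not taken from inhibit.go: the function name `inhibited`, the representation of currently firing keys as a `map[string]bool`, and the exact-match lookup of the inhibitor keys are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// inhibited reports whether the candidate dedupe key is suppressed by one of
// the currently firing keys. Exact-match lookup of inhibitor keys is assumed.
func inhibited(candidate string, firing map[string]bool) bool {
	// An ingest failure silences every scrape_down alarm.
	if strings.HasPrefix(candidate, "pipeline:scrape_down:") {
		return firing["pipeline:prom_ingest_fail"]
	}
	// host:<X>:<rest> is silenced while edge_offline:edge_<X> is firing.
	parts := strings.SplitN(candidate, ":", 3)
	if len(parts) == 3 && parts[0] == "host" {
		return firing["edge_offline:edge_"+parts[1]]
	}
	return false
}

func main() {
	firing := map[string]bool{"edge_offline:edge_7": true}
	fmt.Println(inhibited("host:7:cpu_high", firing)) // edge 7 is offline
	fmt.Println(inhibited("host:8:cpu_high", firing)) // edge 8 is not
}
```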
Cooldown + dampening
NotifyOpts.Cooldown (default 10 minutes) bounds re-notification on the same dedupe_key. The dampening filter sits inside Usecase.MaybeNotify so the channel resolver and inhibitor still run on every firing — only the actual Notifier.Send is skipped.
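A cooldown gate keyed by dedupe key is enough to get this behaviour. The sketch below is an assumption-laden illustration: `cooldownGate`, `allow`, and `replayMinutes` are hypothetical names, the gate is not safe for concurrent use, and it treats a firing exactly at the cooldown boundary as allowed.

```go
package main

import (
	"fmt"
	"time"
)

// cooldownGate remembers when each dedupe key was last notified.
type cooldownGate struct {
	cooldown time.Duration
	lastSent map[string]time.Time
}

func newCooldownGate(d time.Duration) *cooldownGate {
	return &cooldownGate{cooldown: d, lastSent: make(map[string]time.Time)}
}

// allow reports whether a notification for key may be sent at now, and
// records the send time when it may.
func (g *cooldownGate) allow(key string, now time.Time) bool {
	if last, ok := g.lastSent[key]; ok && now.Sub(last) < g.cooldown {
		return false // still inside the cooldown window: skip the send
	}
	g.lastSent[key] = now
	return true
}

// replayMinutes feeds firings for one key at the given minute offsets and
// returns the send decision for each. It exists only to demo the gate.
func replayMinutes(cooldownMin int, key string, minutes ...int) []bool {
	g := newCooldownGate(time.Duration(cooldownMin) * time.Minute)
	base := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	out := make([]bool, 0, len(minutes))
	for _, m := range minutes {
		out = append(out, g.allow(key, base.Add(time.Duration(m)*time.Minute)))
	}
	return out
}

func main() {
	// Firings at 0, 5, 10 and 12 minutes with a 10 minute cooldown.
	fmt.Println(replayMinutes(10, "pipeline:scrape_down:job=node", 0, 5, 10, 12))
	// prints: [true false true false]
}
```

Keeping the gate as the last step matches the text above: resolution and inhibition still run on every firing, and only the send is suppressed.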
See also
- RCA: what happens when an incident fires.
- Logs: Loki + the `log_match` / `log_volume` evaluators.
- Traces: Tempo + the `trace_latency` / `trace_error_rate` evaluators.
- Channels overview: how Slack / Telegram / Lark / DingTalk / WeCom + webhook channels are configured.
- Alert rule schema: the wire format of the rule row.