
Alert rule schema

Alert rules are stored in the alert_rules table and submitted via POST /v1/alert-rules. This page documents the wire format. Source of truth: internal/manager/model/alert/model.go.

Wire shape

```json
{
  "rule_key": "host_cpu_high",
  "kind": "metric_raw",
  "name": "Host CPU pegged",
  "source_type": "ongrid_builtin",
  "scope_type": "host",
  "join_mode": "all",
  "severity": "warning",
  "enabled": true,
  "conditions": [
    { "expr": "node_cpu_usage_percent > 90" }
  ],
  "labels":      { "team": "sre", "service": "host" },
  "annotations": { "summary": "CPU on {{$labels.device_id}} above 90%" },
  "runbook_url": "https://wiki.example.com/runbooks/host-cpu",
  "notify_channel_ids": [12, 17],
  "notify_window_seconds": 600,
  "notify_min_fires": 3
}
```

Field reference

Identity

| Field | Type | Required | Description |
|---|---|---|---|
| rule_key | string | yes | Stable lower_snake identifier used in dedupe keys and incident.rule. Unique. |
| name | string | yes | Display name. |
| enabled | bool | no (default true) | Disabled rules are skipped by the evaluator and hidden from "active" filters. |

Source / scope

| Field | Type | Required | Description |
|---|---|---|---|
| source_type | enum | yes | ongrid_builtin, prometheus_external. Built-in rules carry canonical rule_key values; external rules originate from an imported Prometheus alerting rule file. |
| scope_type | enum | yes | host, global, monitoring_pipeline. Determines what dimension the evaluator groups by: host produces one incident per device_id; global produces one incident system-wide; monitoring_pipeline is for internal pipeline-health rules. |
| join_mode | enum | yes | all (every condition must match), any (at least one condition must match). Only relevant when conditions has more than one element. |
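The join_mode semantics can be sketched as a small combinator over per-condition results. The function name `joinMatches` and the treatment of an empty result list (no match) are assumptions for illustration, not the evaluator's actual code.

```go
package main

import "fmt"

// joinMatches combines per-condition match results under join_mode.
// Treating an empty slice as "no match" is an assumption of this sketch.
func joinMatches(mode string, results []bool) (bool, error) {
	if len(results) == 0 {
		return false, nil
	}
	switch mode {
	case "all":
		// Every condition must match.
		for _, ok := range results {
			if !ok {
				return false, nil
			}
		}
		return true, nil
	case "any":
		// At least one condition must match.
		for _, ok := range results {
			if ok {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown join_mode %q", mode)
	}
}

func main() {
	results := []bool{true, false}
	all, _ := joinMatches("all", results)
	any, _ := joinMatches("any", results)
	fmt.Println(all, any) // prints "false true"
}
```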

Kind

kind discriminates how the evaluator interprets conditions. Each kind drives a different sub-evaluator.

| Kind | Status | What it does | Conditions shape |
|---|---|---|---|
| metric_threshold | UI-only input | Friendly form. The biz layer rewrites it to metric_raw at save time; you will never see this kind on disk. | `[{ "metric": "cpu_pct", "operator": ">=", "threshold": 90, "window": "5m", "for": "2m", "aggregator": "avg" }]` |
| metric_raw | live | Arbitrary PromQL. Ticker-driven. | `[{ "expr": "rate(http_500[5m]) > 0.1" }]` |
| metric_anomaly | live | Deviation from a rolling baseline (z-score). Ticker-driven via the PromQuerier. | `[{ "metric": "node_cpu_usage_percent", "window": "1h", "z_threshold": 3.0 }]` |
| metric_forecast | live | Linear extrapolation (predict_linear) crossing a static threshold within a future window. | `[{ "metric": "node_filesystem_avail_bytes", "window": "1h", "forecast_for": "24h", "below": 1073741824 }]` |
| metric_burn_rate | live | SLO error-budget multi-window multi-burn-rate (Google SRE Workbook). | `[{ "good": "sum(rate(http_2xx[1h]))", "total": "sum(rate(http_total[1h]))", "slo": 0.999, "long": "1h", "short": "5m" }]` |
| log_match | live (Phase-B) | LogQL pattern that fires when it matches. | `[{ "expr": "{device_id=\"{{.device_id}}\"} \|= \"panic\"" }]` |
| log_volume | live (Phase-B) | LogQL stream rate above threshold. | `[{ "expr": "sum(rate({app=\"foo\"}[5m])) > 100" }]` |
| trace_latency | live (Phase-B) | TraceQL p95 / p99 above threshold for a service. | `[{ "service": "payments", "percentile": 95, "above_ms": 800, "window": "5m" }]` |
| trace_error_rate | live (Phase-B) | TraceQL error span fraction above threshold. | `[{ "service": "payments", "above_percent": 1.0, "window": "5m" }]` |

The full Go enum lives in internal/manager/model/alert/model.go:

```go
const (
    RuleKindMetricThreshold = "metric_threshold" // UI-only input
    RuleKindMetricAnomaly   = "metric_anomaly"
    RuleKindMetricForecast  = "metric_forecast"
    RuleKindMetricBurnRate  = "metric_burn_rate"
    RuleKindMetricRaw       = "metric_raw"
    RuleKindLogMatch        = "log_match"
    RuleKindLogVolume       = "log_volume"
    RuleKindTraceLatency    = "trace_latency"
    RuleKindTraceErrorRate  = "trace_error_rate"
)
```

Legacy kinds (edge_offline, prom_query, ingest_health, edge_absence, health_ingest, event_internal) are silently aliased to metric_raw on save.
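The aliasing described above can be pictured as a lookup applied before a rule is persisted. This is a sketch only; the `normalizeKind` helper and the `legacyKinds` set are hypothetical names, and the real save path may differ.

```go
package main

import "fmt"

// legacyKinds lists the legacy kind strings named on this page.
// The variable itself is hypothetical.
var legacyKinds = map[string]bool{
	"edge_offline":   true,
	"prom_query":     true,
	"ingest_health":  true,
	"edge_absence":   true,
	"health_ingest":  true,
	"event_internal": true,
}

// normalizeKind maps any legacy kind to metric_raw and returns
// every other kind unchanged.
func normalizeKind(kind string) string {
	if legacyKinds[kind] {
		return "metric_raw"
	}
	return kind
}

func main() {
	fmt.Println(normalizeKind("prom_query")) // prints "metric_raw"
	fmt.Println(normalizeKind("log_match"))  // prints "log_match"
}
```

Because the alias is silent, a client that submits a legacy kind should expect to read back metric_raw.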

Conditions

conditions is an array. Each element's shape depends on kind (see the table above). join_mode decides whether every element must match (all) or at least one (any).

For metric_threshold (the UI form), each condition is a RuleCondition:

```go
type RuleCondition struct {
    Metric     string  `json:"metric"`               // e.g. "cpu_pct"
    Operator   string  `json:"operator"`             // ">", ">=", "<", "<=", "=="
    Threshold  float64 `json:"threshold"`            // numeric trigger
    Window     string  `json:"window,omitempty"`     // e.g. "5m"
    For        string  `json:"for,omitempty"`        // sustain duration
    Aggregator string  `json:"aggregator,omitempty"` // avg / max / min
}
```

The biz layer compiles these into a metric_raw expr at save time, using the canonical closed set of host metrics (node_cpu_usage_percent, node_memory_used_percent, node_filesystem_used_percent, node_load1, ...).
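To make the compile step concrete, here is one way such a rewrite could look. Everything specific in it is an assumption: the `friendlyToCanonical` mapping (including cpu_pct to node_cpu_usage_percent), the default aggregator, and the `<agg>_over_time(metric[window]) op threshold` output form are illustrative guesses, not the biz layer's actual output. The `for` field is left out because a sustain duration is not part of a PromQL expression.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuleCondition repeats the fields of the UI form (JSON tags omitted).
type RuleCondition struct {
	Metric     string
	Operator   string
	Threshold  float64
	Window     string
	For        string
	Aggregator string
}

// friendlyToCanonical is a hypothetical mapping from UI metric names
// to canonical host metrics.
var friendlyToCanonical = map[string]string{
	"cpu_pct": "node_cpu_usage_percent",
}

var allowedOperators = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true}
var allowedAggregators = map[string]bool{"avg": true, "max": true, "min": true}

// compileThreshold turns one RuleCondition into a PromQL comparison.
func compileThreshold(c RuleCondition) (string, error) {
	metric, ok := friendlyToCanonical[c.Metric]
	if !ok {
		return "", fmt.Errorf("metric %q is not in the closed set", c.Metric)
	}
	if !allowedOperators[c.Operator] {
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	series := metric
	if c.Window != "" {
		agg := c.Aggregator
		if agg == "" {
			agg = "avg" // assumed default
		}
		if !allowedAggregators[agg] {
			return "", fmt.Errorf("unsupported aggregator %q", agg)
		}
		// avg_over_time, max_over_time and min_over_time are PromQL functions.
		series = fmt.Sprintf("%s_over_time(%s[%s])", agg, metric, c.Window)
	}
	threshold := strconv.FormatFloat(c.Threshold, 'g', -1, 64)
	return fmt.Sprintf("%s %s %s", series, c.Operator, threshold), nil
}

func main() {
	expr, err := compileThreshold(RuleCondition{
		Metric: "cpu_pct", Operator: ">=", Threshold: 90, Window: "5m", Aggregator: "avg",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(expr) // prints "avg_over_time(node_cpu_usage_percent[5m]) >= 90"
}
```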

Severity

| Value | Treatment |
|---|---|
| info | Recorded. Notifications are gated by each channel's match_severity_min, and most channels set a floor above info. |
| warning | Default severity. |
| critical | Always notified unless silenced. |

A channel whose match_severity_min is warning accepts warning and critical; one set to critical accepts only critical. An empty match_severity_min accepts any severity.
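The severity floor amounts to an ordered comparison. The sketch below assumes the ordering info < warning < critical and uses a hypothetical `channelAccepts` helper; rejecting unknown severity strings is also an assumption.

```go
package main

import "fmt"

// severityRank orders the three severities from lowest to highest.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

// channelAccepts reports whether an incident of the given severity
// passes a channel's match_severity_min. An empty floor accepts anything.
func channelAccepts(minSeverity, severity string) bool {
	if minSeverity == "" {
		return true
	}
	floor, okFloor := severityRank[minSeverity]
	rank, okRank := severityRank[severity]
	return okFloor && okRank && rank >= floor
}

func main() {
	fmt.Println(channelAccepts("warning", "critical")) // prints "true"
	fmt.Println(channelAccepts("warning", "info"))     // prints "false"
	fmt.Println(channelAccepts("", "info"))            // prints "true"
}
```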

Labels & annotations

Free-form key/value maps stored as JSON.

  • labels are appended to the incident's labels at fire time and used for grouping / dedupe. Common: service, team, env.
  • annotations are templated at fire time using Go template syntax with the incident snapshot — for example summary: "CPU on {{$labels.device_id}} above 90%".
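Annotation templating can be sketched with Go's text/template package. In Go templates `$labels` must be a declared variable, so this example binds it by prepending `{{$labels := .Labels}}` to the annotation text. That binding, the `renderAnnotation` helper, and the shape of the snapshot data are assumptions; the page does not say how the evaluator exposes `$labels`.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderAnnotation expands one annotation value against a label set.
// Prepending the $labels declaration is an assumption of this sketch.
func renderAnnotation(text string, labels map[string]string) (string, error) {
	tmpl, err := template.New("annotation").Parse("{{$labels := .Labels}}" + text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	data := struct{ Labels map[string]string }{Labels: labels}
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	summary, err := renderAnnotation(
		"CPU on {{$labels.device_id}} above 90%",
		map[string]string{"device_id": "dev-1", "team": "sre"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(summary) // prints "CPU on dev-1 above 90%"
}
```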

Runbook

runbook_url is shown verbatim in the incident detail and on the chat surface alongside the AI investigation report. Use it to link to your internal runbooks / playbooks.

Notification dampening

| Field | Type | Default | Description |
|---|---|---|---|
| notify_window_seconds | int | 0 | Rolling window for dampening. 0 disables. |
| notify_min_fires | int | 0 | Minimum firings inside the window before notification is sent. 0 disables. |
| notify_channel_ids | int[] | empty | Pin notifications to specific channel IDs (subject to each channel's own enabled / severity / scope filters). Empty = global router. |

A rule that fires fewer than notify_min_fires times inside the trailing notify_window_seconds does not notify. It writes a repeat_suppressed event to the timeline instead, so you can see that dampening took effect. When both fields are zero, dampening is off and every firing notifies, subject to the cooldown, silence, and inhibition gates.

A mixed setting (one field zero, the other > 0) is rejected at the biz layer with invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0.

Incident lifecycle

The fields above describe a rule (the trigger definition). When a rule fires it produces an incident (alert_incidents table) with these statuses:

```text
open ─┬─> acknowledged ─> resolved
      ├─> silenced ─> resolved
      └─> resolved
```

Incidents auto-resolve when the underlying condition clears for one full evaluator cycle. The state machine is documented in internal/manager/biz/alert/.
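The status diagram can be encoded as a transition table. This sketch follows the diagram literally, so it does not model reopening a resolved incident even though a reopened event type exists; the `canTransition` helper and the table are hypothetical, not the state machine in internal/manager/biz/alert/.

```go
package main

import "fmt"

// transitions lists the status changes drawn in the diagram above.
var transitions = map[string][]string{
	"open":         {"acknowledged", "silenced", "resolved"},
	"acknowledged": {"resolved"},
	"silenced":     {"resolved"},
	"resolved":     {},
}

// canTransition reports whether the diagram allows moving from one
// status to another.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("open", "silenced"))         // prints "true"
	fmt.Println(canTransition("acknowledged", "silenced")) // prints "false"
}
```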

Event types

Every state transition records an alert_events row. Stable event-type strings:

| Event type | When |
|---|---|
| firing | rule first fires (incident created or reopened) |
| repeat_suppressed | firing inside cooldown / dampening window |
| acknowledged | user clicks Ack |
| silenced | user silences for N hours |
| resolved | condition cleared or user clicked Resolve |
| reopened | resolved incident fired again before deletion |
| note | user-added comment |
| notification_sent | one channel delivery succeeded |
| notification_failed | one channel delivery failed |
| inhibited | suppressed by a higher-severity active incident |
| ai_initial_diagnosis | proactive AI investigator's first take, written when an incident first fires |

Channel types

POST /v1/notification-channels accepts:

| Type | channel_type value | Source |
|---|---|---|
| Webhook | webhook | env or UI |
| Slack | slack | env or UI |
| Larksuite / Feishu | feishu | env or UI |
| DingTalk | dingtalk | env or UI |
| WeCom (企业微信) | wecom | UI only |
| Telegram | telegram | UI only |

The legacy log channel type was removed in 2026-05; alert_events itself now serves as the delivery audit.

See also