Alert rule schema

Alert rules are stored in the `alert_rules` table and submitted to `POST /v1/alert-rules`. This page documents the wire format. Source of truth: `internal/manager/model/alert/model.go`.

Wire shape

{
  "rule_key": "host_cpu_high",
  "kind": "metric_raw",
  "name": "Host CPU pegged",
  "source_type": "ongrid_builtin",
  "scope_type": "host",
  "join_mode": "all",
  "severity": "warning",
  "enabled": true,
  "conditions": [
    { "expr": "node_cpu_usage_percent > 90" }
  ],
  "labels":      { "team": "sre", "service": "host" },
  "annotations": { "summary": "CPU on {{$labels.device_id}} above 90%" },
  "runbook_url": "https://wiki.example.com/runbooks/host-cpu",
  "notify_channel_ids": [12, 17],
  "notify_window_seconds": 600,
  "notify_min_fires": 3
}
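
As a rough illustration of consuming this payload, the sketch below decodes it into a Go struct. The `AlertRule` type, its field names, and the `decodeRule` and `isEnabled` helpers are hypothetical and are not taken from `model.go`; only the JSON keys and the `enabled` default come from this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AlertRule mirrors the wire shape shown above. The Go names are hypothetical;
// only the JSON keys are taken from this page.
type AlertRule struct {
	RuleKey             string            `json:"rule_key"`
	Kind                string            `json:"kind"`
	Name                string            `json:"name"`
	SourceType          string            `json:"source_type"`
	ScopeType           string            `json:"scope_type"`
	JoinMode            string            `json:"join_mode"`
	Severity            string            `json:"severity"`
	Enabled             *bool             `json:"enabled,omitempty"` // nil means the field was omitted
	Conditions          []json.RawMessage `json:"conditions"`        // element shape depends on Kind
	Labels              map[string]string `json:"labels,omitempty"`
	Annotations         map[string]string `json:"annotations,omitempty"`
	RunbookURL          string            `json:"runbook_url,omitempty"`
	NotifyChannelIDs    []int64           `json:"notify_channel_ids,omitempty"`
	NotifyWindowSeconds int               `json:"notify_window_seconds,omitempty"`
	NotifyMinFires      int               `json:"notify_min_fires,omitempty"`
}

// decodeRule parses a JSON payload into an AlertRule.
func decodeRule(payload []byte) (AlertRule, error) {
	var r AlertRule
	if err := json.Unmarshal(payload, &r); err != nil {
		return AlertRule{}, fmt.Errorf("decode alert rule: %w", err)
	}
	return r, nil
}

// isEnabled applies the documented default: an omitted "enabled" means true.
func (r AlertRule) isEnabled() bool { return r.Enabled == nil || *r.Enabled }

func main() {
	payload := []byte(`{
		"rule_key": "host_cpu_high",
		"kind": "metric_raw",
		"conditions": [{ "expr": "node_cpu_usage_percent > 90" }],
		"notify_channel_ids": [12, 17]
	}`)
	r, err := decodeRule(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.RuleKey, r.Kind, len(r.Conditions), r.isEnabled())
}
```

Keeping `conditions` as raw JSON defers decoding until `kind` is known, which suits a field whose element shape varies per kind.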

Field reference

Identity

Field	Type	Required	Description
`rule_key`	string	yes	Stable `lower_snake` identifier used in dedupe keys and `incident.rule`. Unique.
`name`	string	yes	Display name.
`enabled`	bool	no (default `true`)	Disabled rules are skipped by the evaluator and hidden from "active" filters.

Source / scope

Field	Type	Required	Description
`source_type`	enum	yes	`ongrid_builtin`, `prometheus_external`. Built-in rules carry canonical `rule_key` values; external rules originate from an imported Prometheus alerting rule file.
`scope_type`	enum	yes	`host`, `global`, `monitoring_pipeline`. Determines what dimension the evaluator groups by — `host` produces one incident per `device_id`; `global` produces one incident system-wide; `monitoring_pipeline` is for internal pipeline-health rules.
`join_mode`	enum	yes	`all` (every condition must match) or `any` (at least one condition must match). Only relevant when `conditions` has more than one element.
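
The `join_mode` semantics can be sketched as a small combinator over per-condition results. The function name `joinResults` is hypothetical, and treating an empty result list as "no match" is an assumption of this sketch rather than documented behavior.

```go
package main

import "fmt"

// joinResults combines per-condition match results according to join_mode.
// Assumption: a rule with no condition results never matches.
func joinResults(mode string, matched []bool) (bool, error) {
	if len(matched) == 0 {
		return false, nil
	}
	switch mode {
	case "all":
		for _, m := range matched {
			if !m {
				return false, nil
			}
		}
		return true, nil
	case "any":
		for _, m := range matched {
			if m {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown join_mode %q", mode)
	}
}

func main() {
	results := []bool{true, false}
	all, _ := joinResults("all", results)
	any, _ := joinResults("any", results)
	fmt.Println(all, any)
}
```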

Kind

`kind` determines how the evaluator interprets `conditions`. Each kind drives a different sub-evaluator.

Kind	Status	What it does	Conditions shape
`metric_threshold`	UI-only input	Friendly form. The biz layer rewrites it to `metric_raw` at save time; you will never see this kind on disk.	`[{ "metric": "cpu_pct", "operator": ">=", "threshold": 90, "window": "5m", "for": "2m", "aggregator": "avg" }]`
`metric_raw`	live	Arbitrary PromQL. Ticker-driven.	`[{ "expr": "rate(http_500[5m]) > 0.1" }]`
`metric_anomaly`	live	Deviation from a rolling baseline (z-score). Ticker-driven via the PromQuerier.	`[{ "metric": "node_cpu_usage_percent", "window": "1h", "z_threshold": 3.0 }]`
`metric_forecast`	live	Linear extrapolation (`predict_linear`) crossing a static threshold within a future window.	`[{ "metric": "node_filesystem_avail_bytes", "window": "1h", "forecast_for": "24h", "below": 1073741824 }]`
`metric_burn_rate`	live	SLO error-budget multi-window multi-burn-rate (Google SRE Workbook).	`[{ "good": "sum(rate(http_2xx[1h]))", "total": "sum(rate(http_total[1h]))", "slo": 0.999, "long": "1h", "short": "5m" }]`
`log_match`	live (Phase-B)	LogQL pattern that fires when it matches.	`[{ "expr": "{device_id=\"{{.device_id}}\"} |= \"panic\"" }]`
`log_volume`	live (Phase-B)	LogQL stream rate above threshold.	`[{ "expr": "sum(rate({app=\"foo\"}[5m])) > 100" }]`
`trace_latency`	live (Phase-B)	TraceQL p95 / p99 above threshold for a service.	`[{ "service": "payments", "percentile": 95, "above_ms": 800, "window": "5m" }]`
`trace_error_rate`	live (Phase-B)	TraceQL error span fraction above threshold.	`[{ "service": "payments", "above_percent": 1.0, "window": "5m" }]`
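
For `metric_burn_rate`, the arithmetic behind a multi-window check can be sketched as follows. This uses the standard definition of burn rate (observed error rate divided by the error budget `1 - slo`). The function names and the `factor` threshold are assumptions of this sketch, since the condition shape above does not carry a threshold. It also assumes `0 < slo < 1`.

```go
package main

import "fmt"

// burnRate returns how fast the error budget is being consumed:
// (1 - good/total) / (1 - slo). A value of 1 means the budget would be
// exactly exhausted over the SLO period. Assumes 0 < slo < 1.
func burnRate(good, total, slo float64) float64 {
	if total == 0 {
		return 0 // no traffic, nothing burned
	}
	errorRate := 1 - good/total
	budget := 1 - slo
	return errorRate / budget
}

// burning reports whether both the long and the short window exceed the
// factor. Requiring both is the multi-window idea: the long window shows the
// burn is significant, the short window shows it is still happening.
func burning(longRate, shortRate, factor float64) bool {
	return longRate >= factor && shortRate >= factor
}

func main() {
	// Hypothetical request counts for the long (1h) and short (5m) windows.
	long := burnRate(9850, 10000, 0.999)
	short := burnRate(820, 835, 0.999)
	// 14.4 is the factor the SRE Workbook pairs with 1h / 5m windows.
	fmt.Printf("long=%.2f short=%.2f burning=%v\n", long, short, burning(long, short, 14.4))
}
```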

The full set of Go constants lives in `internal/manager/model/alert/model.go`:

const (
    RuleKindMetricThreshold = "metric_threshold" // UI-only input
    RuleKindMetricAnomaly   = "metric_anomaly"
    RuleKindMetricForecast  = "metric_forecast"
    RuleKindMetricBurnRate  = "metric_burn_rate"
    RuleKindMetricRaw       = "metric_raw"
    RuleKindLogMatch        = "log_match"
    RuleKindLogVolume       = "log_volume"
    RuleKindTraceLatency    = "trace_latency"
    RuleKindTraceErrorRate  = "trace_error_rate"
)

Legacy kinds (edge_offline, prom_query, ingest_health, edge_absence, health_ingest, event_internal) are silently aliased to metric_raw on save.
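
A minimal sketch of the aliasing step, assuming it is a plain lookup on the kind string. The `normalizeKind` name and the map are hypothetical. The sketch covers only the kind itself and does not attempt to rewrite legacy conditions or the `metric_threshold` form.

```go
package main

import "fmt"

// legacyKinds lists the legacy kind strings named on this page.
var legacyKinds = map[string]bool{
	"edge_offline":   true,
	"prom_query":     true,
	"ingest_health":  true,
	"edge_absence":   true,
	"health_ingest":  true,
	"event_internal": true,
}

// normalizeKind maps a legacy kind to metric_raw and leaves every other
// kind untouched.
func normalizeKind(kind string) string {
	if legacyKinds[kind] {
		return "metric_raw"
	}
	return kind
}

func main() {
	for _, k := range []string{"prom_query", "log_match"} {
		fmt.Printf("%s -> %s\n", k, normalizeKind(k))
	}
}
```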

Conditions

`conditions` is an array. Each element's shape depends on `kind` (see the table above). `join_mode` decides whether every element must match (`all`) or at least one (`any`).

For metric_threshold (the UI form), each condition is a RuleCondition:

type RuleCondition struct {
    Metric     string  `json:"metric"`               // e.g. "cpu_pct"
    Operator   string  `json:"operator"`             // ">", ">=", "<", "<=", "=="
    Threshold  float64 `json:"threshold"`            // numeric trigger
    Window     string  `json:"window,omitempty"`     // e.g. "5m"
    For        string  `json:"for,omitempty"`        // sustain duration
    Aggregator string  `json:"aggregator,omitempty"` // avg / max / min
}

The biz layer compiles those into a metric_raw expr at save time using the canonical closed-set host metrics (node_cpu_usage_percent, node_memory_used_percent, node_filesystem_used_percent, node_load1, ...).
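
To make the compile step concrete, here is one way a `RuleCondition` could be turned into a PromQL expression. This is a sketch under stated assumptions and not the biz layer's actual output. The `cpu_pct` to `node_cpu_usage_percent` alias, the use of `<aggregator>_over_time` for the window, the `avg` fallback, and the `compileCondition` name are all hypothetical. The `For` field is ignored here because a sustain duration is not part of a bare expression.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuleCondition repeats the struct documented above.
type RuleCondition struct {
	Metric     string
	Operator   string
	Threshold  float64
	Window     string
	For        string
	Aggregator string
}

// metricAliases is a hypothetical mapping from UI names to canonical metrics.
var metricAliases = map[string]string{"cpu_pct": "node_cpu_usage_percent"}

var validOperators = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true}
var validAggregators = map[string]bool{"avg": true, "max": true, "min": true}

// compileCondition renders one condition as a PromQL comparison.
func compileCondition(c RuleCondition) (string, error) {
	metric, ok := metricAliases[c.Metric]
	if !ok {
		return "", fmt.Errorf("unknown metric %q", c.Metric)
	}
	if !validOperators[c.Operator] {
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	selector := metric
	if c.Window != "" {
		agg := c.Aggregator
		if agg == "" {
			agg = "avg" // assumed fallback, not documented
		}
		if !validAggregators[agg] {
			return "", fmt.Errorf("unsupported aggregator %q", agg)
		}
		// avg_over_time, max_over_time and min_over_time are PromQL functions.
		selector = fmt.Sprintf("%s_over_time(%s[%s])", agg, metric, c.Window)
	}
	threshold := strconv.FormatFloat(c.Threshold, 'f', -1, 64)
	return fmt.Sprintf("%s %s %s", selector, c.Operator, threshold), nil
}

func main() {
	expr, err := compileCondition(RuleCondition{
		Metric: "cpu_pct", Operator: ">=", Threshold: 90, Window: "5m", Aggregator: "avg",
	})
	fmt.Println(expr, err)
}
```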

Severity

Value	Treatment
`info`	recorded; notifications are gated by each channel's `match_severity_min` (most channels set a floor above `info` and skip it)
`warning`	default
`critical`	always notified unless silenced

A channel with `match_severity_min` set to `warning` accepts `warning` and `critical`; set to `critical`, it accepts only `critical`. An empty value matches any severity.
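
The severity floor amounts to an ordered comparison. A minimal sketch, with a hypothetical `channelAccepts` helper and the assumption that unknown severity strings are rejected:

```go
package main

import "fmt"

// severityRank orders the three documented severities.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

// channelAccepts reports whether an incident severity passes a channel's
// match_severity_min. An empty floor matches any severity.
func channelAccepts(floor, severity string) bool {
	if floor == "" {
		return true
	}
	floorRank, okFloor := severityRank[floor]
	sevRank, okSev := severityRank[severity]
	if !okFloor || !okSev {
		return false // assumption: unknown values never match
	}
	return sevRank >= floorRank
}

func main() {
	fmt.Println(channelAccepts("warning", "critical")) // accepted
	fmt.Println(channelAccepts("critical", "warning")) // rejected
	fmt.Println(channelAccepts("", "info"))            // accepted, empty floor
}
```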

Labels & annotations

Free-form key/value maps stored as JSON.

`labels` are appended to the incident's labels at fire time and used for grouping / dedupe. Common keys: `service`, `team`, `env`.
`annotations` are templated at fire time using Go template syntax with the incident snapshot, for example `summary: "CPU on {{$labels.device_id}} above 90%"`.
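
In Go's `text/template`, `$labels` is a template variable, so it has to be bound before an annotation can reference it. The sketch below shows one way to do that by prepending a variable declaration to the annotation text. The `renderAnnotation` helper and the shape of the data passed to the template are assumptions; how the incident snapshot is actually exposed is not documented on this page.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderAnnotation renders one annotation value against a label set.
// It binds $labels up front so that {{$labels.device_id}} resolves.
func renderAnnotation(text string, labels map[string]string) (string, error) {
	tmpl, err := template.New("annotation").
		Option("missingkey=zero"). // missing labels render as empty strings
		Parse("{{$labels := .Labels}}" + text)
	if err != nil {
		return "", fmt.Errorf("parse annotation: %w", err)
	}
	var out strings.Builder
	data := struct{ Labels map[string]string }{Labels: labels}
	if err := tmpl.Execute(&out, data); err != nil {
		return "", fmt.Errorf("render annotation: %w", err)
	}
	return out.String(), nil
}

func main() {
	summary, err := renderAnnotation(
		"CPU on {{$labels.device_id}} above 90%",
		map[string]string{"device_id": "edge-01"},
	)
	fmt.Println(summary, err)
}
```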

Runbook

runbook_url is shown verbatim in the incident detail and on the chat surface alongside the AI investigation report. Use it to link to your internal runbooks / playbooks.

Notification dampening

Field	Type	Default	Description
`notify_window_seconds`	int	`0`	Rolling window for dampening. `0` disables.
`notify_min_fires`	int	`0`	Minimum firings inside the window before notification is sent. `0` disables.
`notify_channel_ids`	int[]	empty	Pin notifications to specific channel IDs (subject to each channel's own enabled / severity / scope filters). Empty = global router.

A rule that fires fewer than `notify_min_fires` times within the trailing `notify_window_seconds` writes a `repeat_suppressed` event to the timeline, so the dampening is visible, but sends no notification. When both fields are zero, dampening is off and every firing notifies, subject to the cooldown, silence, and inhibition gates.

A mixed configuration (one field zero, the other greater than zero) is rejected at the biz layer with `invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0`.
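
The validation rule and the dampening decision can be sketched as below. The function names are hypothetical, timestamps are plain Unix seconds to keep the example small, and two details are assumptions: the current firing is counted among `firedAt`, and the window boundary is inclusive. Cooldown, silence, and inhibition are separate gates and are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// validateDampening enforces "both zero or both > 0".
func validateDampening(windowSeconds, minFires int) error {
	if windowSeconds < 0 || minFires < 0 || (windowSeconds == 0) != (minFires == 0) {
		return errors.New("invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0")
	}
	return nil
}

// shouldNotify reports whether enough firings fall inside the trailing
// window. firedAt and now are Unix seconds; firedAt is assumed to include
// the current firing.
func shouldNotify(firedAt []int64, now int64, windowSeconds, minFires int) bool {
	if windowSeconds == 0 && minFires == 0 {
		return true // dampening off
	}
	cutoff := now - int64(windowSeconds)
	count := 0
	for _, t := range firedAt {
		if t >= cutoff && t <= now {
			count++
		}
	}
	return count >= minFires
}

func main() {
	now := int64(1_700_000_000)
	firings := []int64{now - 900, now - 300, now - 60, now}
	fmt.Println(validateDampening(600, 3))            // valid pair
	fmt.Println(shouldNotify(firings, now, 600, 3))   // three firings in the last 600s
	fmt.Println(validateDampening(600, 0) != nil)     // mixed pair is rejected
}
```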

Incident lifecycle

The fields above describe a rule (the trigger definition). When a rule fires it produces an incident (alert_incidents table) with these statuses:

open ─┬─> acknowledged ─> resolved
      ├─> silenced ─> resolved
      └─> resolved

Incidents auto-resolve when the underlying condition clears for one full evaluator cycle. The state machine is documented in internal/manager/biz/alert/.
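
The diagram above can be encoded as a transition table. This is a sketch of the diagram only and not the state machine in `internal/manager/biz/alert/`. The `canTransition` name is hypothetical, and reopening a resolved incident is deliberately left out because the diagram does not show that edge.

```go
package main

import "fmt"

// transitions lists the edges drawn in the lifecycle diagram.
var transitions = map[string][]string{
	"open":         {"acknowledged", "silenced", "resolved"},
	"acknowledged": {"resolved"},
	"silenced":     {"resolved"},
	"resolved":     {},
}

// canTransition reports whether the diagram contains an edge from -> to.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("open", "silenced"))
	fmt.Println(canTransition("acknowledged", "silenced"))
}
```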

Event types

Every state transition records an alert_events row. Stable event-type strings:

Event type	When
`firing`	rule first fires (incident created or reopened)
`repeat_suppressed`	firing inside cooldown / dampening window
`acknowledged`	user clicks Ack
`silenced`	user silences for N hours
`resolved`	condition cleared or user clicked Resolve
`reopened`	resolved incident fired again before deletion
`note`	user-added comment
`notification_sent`	one channel delivery succeeded
`notification_failed`	one channel delivery failed
`inhibited`	suppressed by a higher-severity active incident
`ai_initial_diagnosis`	proactive AI investigator's first take, written when an incident first fires

Channel types

POST /v1/notification-channels accepts:

Type	`channel_type` value	Source
Webhook	`webhook`	env or UI
Slack	`slack`	env or UI
Larksuite / Feishu	`feishu`	env or UI
DingTalk	`dingtalk`	env or UI
WeCom (企业微信)	`wecom`	UI only
Telegram	`telegram`	UI only

The legacy `log` channel type was removed in 2026-05; `alert_events` itself now serves as the delivery audit trail.
