Alert rule schema
Alert rules are stored in the alert_rules table and submitted to POST /v1/alert-rules. This page documents the wire format. Source of truth: internal/manager/model/alert/model.go.
Wire shape
{
"rule_key": "host_cpu_high",
"kind": "metric_raw",
"name": "Host CPU pegged",
"source_type": "ongrid_builtin",
"scope_type": "host",
"join_mode": "all",
"severity": "warning",
"enabled": true,
"conditions": [
{ "expr": "node_cpu_usage_percent > 90" }
],
"labels": { "team": "sre", "service": "host" },
"annotations": { "summary": "CPU on {{$labels.device_id}} above 90%" },
"runbook_url": "https://wiki.example.com/runbooks/host-cpu",
"notify_channel_ids": [12, 17],
"notify_window_seconds": 600,
"notify_min_fires": 3
}
Field reference
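For clients written in Go, the wire shape above could be mirrored by a struct along these lines. This is only a sketch: `AlertRule` and `decodeRule` are hypothetical names, the field types are inferred from the sample payload, and the authoritative definition remains internal/manager/model/alert/model.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AlertRule is a hypothetical mirror of the wire shape. The real type lives in
// internal/manager/model/alert/model.go and may differ.
type AlertRule struct {
	RuleKey             string            `json:"rule_key"`
	Kind                string            `json:"kind"`
	Name                string            `json:"name"`
	SourceType          string            `json:"source_type"`
	ScopeType           string            `json:"scope_type"`
	JoinMode            string            `json:"join_mode"`
	Severity            string            `json:"severity"`
	Enabled             *bool             `json:"enabled,omitempty"` // nil: fall back to the default (true)
	Conditions          []json.RawMessage `json:"conditions"`        // element shape depends on Kind
	Labels              map[string]string `json:"labels,omitempty"`
	Annotations         map[string]string `json:"annotations,omitempty"`
	RunbookURL          string            `json:"runbook_url,omitempty"`
	NotifyChannelIDs    []int64           `json:"notify_channel_ids,omitempty"`
	NotifyWindowSeconds int               `json:"notify_window_seconds,omitempty"`
	NotifyMinFires      int               `json:"notify_min_fires,omitempty"`
}

// decodeRule parses a wire-format payload into the hypothetical struct.
func decodeRule(payload []byte) (AlertRule, error) {
	var r AlertRule
	err := json.Unmarshal(payload, &r)
	return r, err
}

func main() {
	payload := []byte(`{"rule_key":"host_cpu_high","kind":"metric_raw","name":"Host CPU pegged",
		"source_type":"ongrid_builtin","scope_type":"host","join_mode":"all","severity":"warning",
		"conditions":[{"expr":"node_cpu_usage_percent > 90"}],"notify_channel_ids":[12,17],
		"notify_window_seconds":600,"notify_min_fires":3}`)
	r, err := decodeRule(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.RuleKey, r.Kind, len(r.Conditions), r.NotifyChannelIDs)
}
```

Keeping `conditions` as raw JSON defers decoding until `kind` is known, since each kind uses a different element shape.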
Identity
| Field | Type | Required | Description |
|---|---|---|---|
| rule_key | string | yes | Stable lower_snake identifier used in dedupe keys and incident.rule. Unique. |
| name | string | yes | Display name. |
| enabled | bool | no (default true) | Disabled rules are skipped by the evaluator and hidden from "active" filters. |
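A client may want to check rule_key locally before submitting. The sketch below assumes "lower_snake" means lowercase ASCII letters and digits separated by single underscores; the exact pattern the server enforces is not stated on this page, and `isLowerSnake` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// lowerSnake is an ASSUMED pattern for "lower_snake": it starts with a letter,
// and allows single underscores between lowercase alphanumeric runs.
var lowerSnake = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// isLowerSnake reports whether key matches the assumed pattern.
func isLowerSnake(key string) bool {
	return lowerSnake.MatchString(key)
}

func main() {
	for _, k := range []string{"host_cpu_high", "HostCpuHigh", "host-cpu-high", "host__cpu"} {
		fmt.Printf("%-14s %v\n", k, isLowerSnake(k))
	}
}
```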
Source / scope
| Field | Type | Required | Description |
|---|---|---|---|
| source_type | enum | yes | ongrid_builtin, prometheus_external. Built-in rules carry canonical rule_key values; external rules originate from an imported Prometheus alerting rule file. |
| scope_type | enum | yes | host, global, monitoring_pipeline. Determines what dimension the evaluator groups by: host produces one incident per device_id; global produces one incident system-wide; monitoring_pipeline is for internal pipeline-health rules. |
| join_mode | enum | yes | all (every condition must match), any (any condition matches). Only relevant when conditions has more than one element. |
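The join_mode semantics can be illustrated with a small combiner over per-condition results. `joinMatches` is a hypothetical helper, not the evaluator's actual code, and treating an empty condition list as "no match" is an assumption of this sketch.

```go
package main

import "fmt"

// joinMatches combines per-condition match results according to join_mode.
// ASSUMPTION: an empty result list never matches.
func joinMatches(joinMode string, results []bool) (bool, error) {
	if len(results) == 0 {
		return false, nil
	}
	switch joinMode {
	case "all":
		for _, ok := range results {
			if !ok {
				return false, nil
			}
		}
		return true, nil
	case "any":
		for _, ok := range results {
			if ok {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown join_mode %q", joinMode)
	}
}

func main() {
	results := []bool{true, false}
	all, _ := joinMatches("all", results)
	anyMatch, _ := joinMatches("any", results)
	fmt.Println(all, anyMatch) // prints "false true"
}
```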
Kind
kind discriminates how the evaluator interprets conditions. Each kind drives a different sub-evaluator.
| Kind | Status | What it does | Conditions shape |
|---|---|---|---|
| metric_threshold | UI-only input | Friendly form. The biz layer rewrites it to metric_raw at save time; you will never see this kind on disk. | [{ "metric": "cpu_pct", "operator": ">=", "threshold": 90, "window": "5m", "for": "2m", "aggregator": "avg" }] |
| metric_raw | live | Arbitrary PromQL. Ticker-driven. | [{ "expr": "rate(http_500[5m]) > 0.1" }] |
| metric_anomaly | live | Deviation from a rolling baseline (z-score). Ticker-driven via the PromQuerier. | [{ "metric": "node_cpu_usage_percent", "window": "1h", "z_threshold": 3.0 }] |
| metric_forecast | live | Linear extrapolation (predict_linear) crossing a static threshold within a future window. | [{ "metric": "node_filesystem_avail_bytes", "window": "1h", "forecast_for": "24h", "below": 1073741824 }] |
| metric_burn_rate | live | SLO error-budget multi-window multi-burn-rate (Google SRE Workbook). | [{ "good": "sum(rate(http_2xx[1h]))", "total": "sum(rate(http_total[1h]))", "slo": 0.999, "long": "1h", "short": "5m" }] |
| log_match | live (Phase-B) | LogQL pattern that, when it hits, fires. | [{ "expr": "{device_id=\"{{.device_id}}\"} \|= \"panic\"" }] |
| log_volume | live (Phase-B) | LogQL stream rate above threshold. | [{ "expr": "sum(rate({app=\"foo\"}[5m])) > 100" }] |
| trace_latency | live (Phase-B) | TraceQL p95 / p99 above threshold for a service. | [{ "service": "payments", "percentile": 95, "above_ms": 800, "window": "5m" }] |
| trace_error_rate | live (Phase-B) | TraceQL error span fraction above threshold. | [{ "service": "payments", "above_percent": 1.0, "window": "5m" }] |
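To make the metric_burn_rate row concrete, here is the arithmetic behind a multi-window burn-rate check as described in the Google SRE Workbook: the burn rate is the observed error ratio divided by the error budget (1 - slo), and the alert condition holds only when both the long and the short window exceed a factor. The factor 14.4 is the Workbook's value for a 1h/5m pair on a 30-day SLO; how this evaluator chooses its factor is not documented here, so the factor and the helper names `burnRate` and `burning` are assumptions.

```go
package main

import "fmt"

// burnRate returns how many times faster than "exactly on budget" the error
// budget is being consumed. good and total are request rates over one window.
func burnRate(good, total, slo float64) float64 {
	if total == 0 {
		return 0 // no traffic: nothing is burning
	}
	errorRatio := 1 - good/total
	return errorRatio / (1 - slo)
}

// burning applies the multi-window rule: both windows must exceed the factor.
func burning(longRate, shortRate, factor float64) bool {
	return longRate > factor && shortRate > factor
}

func main() {
	const slo = 0.999
	const factor = 14.4 // ASSUMED: SRE Workbook value for 1h/5m windows

	long := burnRate(9800, 10000, slo) // 2% errors over the long window
	short := burnRate(970, 1000, slo)  // 3% errors over the short window
	fmt.Printf("long=%.1f short=%.1f burning=%v\n", long, short, burning(long, short, factor))
}
```

Requiring the short window as well lets the condition clear quickly once the error rate recovers, instead of staying true until the long window drains.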
The full Go enum lives in model/alert/model.go:
const (
RuleKindMetricThreshold = "metric_threshold" // UI-only input
RuleKindMetricAnomaly = "metric_anomaly"
RuleKindMetricForecast = "metric_forecast"
RuleKindMetricBurnRate = "metric_burn_rate"
RuleKindMetricRaw = "metric_raw"
RuleKindLogMatch = "log_match"
RuleKindLogVolume = "log_volume"
RuleKindTraceLatency = "trace_latency"
RuleKindTraceErrorRate = "trace_error_rate"
)
Legacy kinds (edge_offline, prom_query, ingest_health, edge_absence, health_ingest, event_internal) are silently aliased to metric_raw on save.
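The aliasing of legacy kinds amounts to a lookup at save time. The sketch below uses a hypothetical `normalizeKind` helper; the real logic lives in the biz layer and may be structured differently. It deliberately leaves metric_threshold alone, because that rewrite also has to compile the conditions.

```go
package main

import "fmt"

// legacyKinds lists the kinds this page says are aliased to metric_raw on save.
var legacyKinds = map[string]bool{
	"edge_offline":   true,
	"prom_query":     true,
	"ingest_health":  true,
	"edge_absence":   true,
	"health_ingest":  true,
	"event_internal": true,
}

// normalizeKind (hypothetical) maps a legacy kind to metric_raw and returns
// every other kind unchanged.
func normalizeKind(kind string) string {
	if legacyKinds[kind] {
		return "metric_raw"
	}
	return kind
}

func main() {
	for _, k := range []string{"prom_query", "edge_offline", "log_match"} {
		fmt.Printf("%s -> %s\n", k, normalizeKind(k))
	}
}
```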
Conditions
conditions is an array. Each element's shape depends on kind — see the table above. join_mode decides whether all elements must match (all) or any (any).
For metric_threshold (the UI form), each condition is a RuleCondition:
type RuleCondition struct {
Metric string `json:"metric"` // e.g. "cpu_pct"
Operator string `json:"operator"` // ">", ">=", "<", "<=", "=="
Threshold float64 `json:"threshold"` // numeric trigger
Window string `json:"window,omitempty"` // e.g. "5m"
For string `json:"for,omitempty"` // sustain duration
Aggregator string `json:"aggregator,omitempty"` // avg / max / min
}
The biz layer compiles those into a metric_raw expr at save time using the canonical closed-set host metrics (node_cpu_usage_percent, node_memory_used_percent, node_filesystem_used_percent, node_load1, ...).
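One plausible shape for that compilation step is sketched below. Everything beyond the RuleCondition fields is an assumption: the mapping from cpu_pct to node_cpu_usage_percent, the use of PromQL's `<agg>_over_time` functions for the window, and the `compileCondition` name are illustrative, and the `for` field is not handled because it cannot be expressed inside a bare expression.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuleCondition mirrors the struct documented above.
type RuleCondition struct {
	Metric     string  `json:"metric"`
	Operator   string  `json:"operator"`
	Threshold  float64 `json:"threshold"`
	Window     string  `json:"window,omitempty"`
	For        string  `json:"for,omitempty"`
	Aggregator string  `json:"aggregator,omitempty"`
}

// canonicalMetric is an ASSUMED friendly-name mapping; only cpu_pct is shown.
var canonicalMetric = map[string]string{"cpu_pct": "node_cpu_usage_percent"}

var allowedOps = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true}
var allowedAggs = map[string]bool{"avg": true, "max": true, "min": true}

// compileCondition (hypothetical) renders one condition as a PromQL expression.
func compileCondition(c RuleCondition) (string, error) {
	metric, ok := canonicalMetric[c.Metric]
	if !ok {
		return "", fmt.Errorf("unknown metric %q", c.Metric)
	}
	if !allowedOps[c.Operator] {
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	lhs := metric
	if c.Window != "" {
		agg := c.Aggregator
		if agg == "" {
			agg = "avg" // ASSUMED default aggregator
		}
		if !allowedAggs[agg] {
			return "", fmt.Errorf("unsupported aggregator %q", agg)
		}
		lhs = fmt.Sprintf("%s_over_time(%s[%s])", agg, metric, c.Window)
	}
	threshold := strconv.FormatFloat(c.Threshold, 'g', -1, 64)
	return fmt.Sprintf("%s %s %s", lhs, c.Operator, threshold), nil
}

func main() {
	expr, err := compileCondition(RuleCondition{
		Metric: "cpu_pct", Operator: ">=", Threshold: 90, Window: "5m", Aggregator: "avg",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(expr) // prints "avg_over_time(node_cpu_usage_percent[5m]) >= 90"
}
```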
Severity
| Value | Treatment |
|---|---|
| info | recorded; notifications gated by channel's match_severity_min (most channels skip this floor) |
| warning | default |
| critical | always notified unless silenced |
A channel's match_severity_min set to warning accepts warning + critical; critical accepts only critical. Empty matches any.
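The floor comparison can be expressed as an ordering over the three severities. `passesFloor` is a hypothetical helper written from the description above; rejecting unknown severity strings is an assumption of this sketch.

```go
package main

import "fmt"

// severityRank orders severities from least to most severe.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

// passesFloor reports whether an incident severity clears a channel's
// match_severity_min. An empty floor matches any known severity.
// ASSUMPTION: unknown severity or floor values never match.
func passesFloor(matchSeverityMin, severity string) bool {
	sev, ok := severityRank[severity]
	if !ok {
		return false
	}
	if matchSeverityMin == "" {
		return true
	}
	floor, ok := severityRank[matchSeverityMin]
	if !ok {
		return false
	}
	return sev >= floor
}

func main() {
	fmt.Println(passesFloor("warning", "info"))     // false
	fmt.Println(passesFloor("warning", "critical")) // true
	fmt.Println(passesFloor("", "info"))            // true
}
```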
Labels & annotations
Free-form key/value maps stored as JSON.
- labels are appended to the incident's labels at fire time and used for grouping / dedupe. Common: service, team, env.
- annotations are templated at fire time using Go template syntax with the incident snapshot, for example summary: "CPU on {{$labels.device_id}} above 90%".
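Annotation templating of this kind can be reproduced with Go's text/template. In Go templates `$labels` is a variable, so it has to be declared before use; the sketch below assumes it is bound by prepending a declaration to the annotation text. The `snapshot` type and `renderAnnotation` function are hypothetical, and the real incident snapshot likely carries more fields.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// snapshot is a hypothetical stand-in for the incident snapshot.
type snapshot struct {
	Labels map[string]string
}

// renderAnnotation executes one annotation template against the snapshot.
// ASSUMPTION: $labels is bound by prepending a variable declaration.
func renderAnnotation(text string, snap snapshot) (string, error) {
	t, err := template.New("annotation").Parse("{{$labels := .Labels}}" + text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, snap); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	snap := snapshot{Labels: map[string]string{"device_id": "edge-42"}}
	out, err := renderAnnotation("CPU on {{$labels.device_id}} above 90%", snap)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "CPU on edge-42 above 90%"
}
```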
Runbook
runbook_url is shown verbatim in the incident detail and on the chat surface alongside the AI investigation report. Use it to link to your internal runbooks / playbooks.
Notification dampening
| Field | Type | Default | Description |
|---|---|---|---|
| notify_window_seconds | int | 0 | Rolling window for dampening. 0 disables. |
| notify_min_fires | int | 0 | Minimum firings inside the window before notification is sent. 0 disables. |
| notify_channel_ids | int[] | empty | Pin notifications to specific channel IDs (subject to each channel's own enabled / severity / scope filters). Empty = global router. |
A rule that fires fewer than notify_min_fires times inside the trailing notify_window_seconds writes a repeat_suppressed event to the timeline (so you can see the dampening took effect) but does not notify. When both fields are zero, dampening is off and every firing notifies, subject to the cooldown, silence, and inhibition gates.
Mixed (one zero, one >0) is rejected at the biz layer with invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0.
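The validation rule and the dampening decision can be sketched as two small functions. `validateDampening` and `shouldNotify` are hypothetical names; timestamps are plain Unix seconds to keep the example self-contained, and counting the current firing and excluding the exact window boundary are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

var errDampening = errors.New(
	"invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0")

// validateDampening accepts only "both zero" or "both positive".
func validateDampening(windowSeconds, minFires int) error {
	bothZero := windowSeconds == 0 && minFires == 0
	bothPositive := windowSeconds > 0 && minFires > 0
	if !bothZero && !bothPositive {
		return errDampening
	}
	return nil
}

// shouldNotify reports whether enough firings fall inside the trailing window.
// fires holds Unix-second timestamps and is ASSUMED to include the current firing.
func shouldNotify(fires []int64, now int64, windowSeconds, minFires int) bool {
	if windowSeconds == 0 && minFires == 0 {
		return true // dampening off; other gates still apply elsewhere
	}
	cutoff := now - int64(windowSeconds)
	count := 0
	for _, t := range fires {
		if t > cutoff && t <= now {
			count++
		}
	}
	return count >= minFires
}

func main() {
	fmt.Println(validateDampening(600, 0) != nil)                         // true: mixed is rejected
	fmt.Println(shouldNotify([]int64{100, 500, 900, 1000}, 1000, 600, 3)) // true: 3 firings in window
	fmt.Println(shouldNotify([]int64{900, 1000}, 1000, 600, 3))           // false: only 2 firings
}
```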
Incident lifecycle
The fields above describe a rule (the trigger definition). When a rule fires it produces an incident (alert_incidents table) with these statuses:
open ─┬─> acknowledged ─> resolved
      ├─> silenced ─> resolved
      └─> resolved
Incidents auto-resolve when the underlying condition clears for one full evaluator cycle. The state machine is documented in internal/manager/biz/alert/.
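The diagram above translates into a transition table. This sketch encodes only the arrows drawn in the diagram; the reopened event suggests that a resolved incident can return to open, but that edge is not drawn, so it is left out here. `canTransition` is a hypothetical helper, not the code in internal/manager/biz/alert/.

```go
package main

import "fmt"

// transitions lists the status changes drawn in the lifecycle diagram.
var transitions = map[string][]string{
	"open":         {"acknowledged", "silenced", "resolved"},
	"acknowledged": {"resolved"},
	"silenced":     {"resolved"},
	"resolved":     {}, // reopening is not modeled in this sketch
}

// canTransition reports whether the diagram allows moving from one status to another.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("open", "silenced"))         // true
	fmt.Println(canTransition("acknowledged", "silenced")) // false
	fmt.Println(canTransition("resolved", "open"))         // false
}
```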
Event types
Every state transition records an alert_events row. Stable event-type strings:
| Event type | When |
|---|---|
| firing | rule first fires (incident created or reopened) |
| repeat_suppressed | firing inside cooldown / dampening window |
| acknowledged | user clicks Ack |
| silenced | user silences for N hours |
| resolved | condition cleared or user clicked Resolve |
| reopened | resolved incident fired again before deletion |
| note | user-added comment |
| notification_sent | one channel delivery succeeded |
| notification_failed | one channel delivery failed |
| inhibited | suppressed by a higher-severity active incident |
| ai_initial_diagnosis | proactive AI investigator's first take, written when an incident first fires |
Channel types
POST /v1/notification-channels accepts:
| Type | channel_type value | Source |
|---|---|---|
| Webhook | webhook | env or UI |
| Slack | slack | env or UI |
| Larksuite / Feishu | feishu | env or UI |
| DingTalk | dingtalk | env or UI |
| WeCom (企业微信) | wecom | UI only |
| Telegram | telegram | UI only |
The legacy log channel type was removed in 2026-05 — alert_events itself is the delivery audit.
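A client that builds channel payloads could check channel_type against the table above before calling the API. The lookup below is a sketch with hypothetical names (`channelTypes`, `checkChannelType`); the boolean records whether the table lists the type as configurable via env in addition to the UI.

```go
package main

import "fmt"

// channelTypes maps each accepted channel_type to whether the table lists
// it as configurable via env as well as the UI.
var channelTypes = map[string]bool{
	"webhook":  true,
	"slack":    true,
	"feishu":   true,
	"dingtalk": true,
	"wecom":    false, // UI only
	"telegram": false, // UI only
}

// checkChannelType validates a channel_type and reports env configurability.
func checkChannelType(t string) (envConfigurable bool, err error) {
	env, ok := channelTypes[t]
	if !ok {
		return false, fmt.Errorf("unsupported channel_type %q", t)
	}
	return env, nil
}

func main() {
	for _, t := range []string{"slack", "telegram", "log"} {
		env, err := checkChannelType(t)
		fmt.Println(t, env, err)
	}
}
```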
See also
- REST API — endpoints for these objects.
- Capabilities → Alerts — operator-facing tour.
- Channels — wiring outbound delivery to chat surfaces.