Alert rule schema

Alert rules are stored in the alert_rules table and submitted via POST /v1/alert-rules. This page documents the wire format. Source of truth: internal/manager/model/alert/model.go.

Wire shape

json

{
  "rule_key": "host_cpu_high",
  "kind": "metric_raw",
  "name": "Host CPU pegged",
  "source_type": "ongrid_builtin",
  "scope_type": "host",
  "join_mode": "all",
  "severity": "warning",
  "enabled": true,
  "conditions": [
    { "expr": "node_cpu_usage_percent > 90" }
  ],
  "labels":      { "team": "sre", "service": "host" },
  "annotations": { "summary": "CPU on {{$labels.device_id}} above 90%" },
  "runbook_url": "https://wiki.example.com/runbooks/host-cpu",
  "notify_channel_ids": [12, 17],
  "notify_window_seconds": 600,
  "notify_min_fires": 3
}
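
As a rough illustration of the payload above, the following Go sketch models the rule as a struct and round-trips it through JSON. The type names (AlertRule, Condition) and helper functions are assumptions made for this example and are not copied from model.go; only the JSON field names come from the wire shape shown above, and only the metric_raw condition shape is modeled.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Condition models only the metric_raw shape (a single "expr").
type Condition struct {
	Expr string `json:"expr"`
}

// AlertRule mirrors the JSON payload. Go names are illustrative (hypothetical).
type AlertRule struct {
	RuleKey             string            `json:"rule_key"`
	Kind                string            `json:"kind"`
	Name                string            `json:"name"`
	SourceType          string            `json:"source_type"`
	ScopeType           string            `json:"scope_type"`
	JoinMode            string            `json:"join_mode"`
	Severity            string            `json:"severity"`
	Enabled             bool              `json:"enabled"`
	Conditions          []Condition       `json:"conditions"`
	Labels              map[string]string `json:"labels,omitempty"`
	Annotations         map[string]string `json:"annotations,omitempty"`
	RunbookURL          string            `json:"runbook_url,omitempty"`
	NotifyChannelIDs    []int64           `json:"notify_channel_ids,omitempty"`
	NotifyWindowSeconds int               `json:"notify_window_seconds"`
	NotifyMinFires      int               `json:"notify_min_fires"`
}

// exampleRule builds the same rule as the JSON sample.
func exampleRule() AlertRule {
	return AlertRule{
		RuleKey:             "host_cpu_high",
		Kind:                "metric_raw",
		Name:                "Host CPU pegged",
		SourceType:          "ongrid_builtin",
		ScopeType:           "host",
		JoinMode:            "all",
		Severity:            "warning",
		Enabled:             true,
		Conditions:          []Condition{{Expr: "node_cpu_usage_percent > 90"}},
		Labels:              map[string]string{"team": "sre", "service": "host"},
		Annotations:         map[string]string{"summary": "CPU on {{$labels.device_id}} above 90%"},
		RunbookURL:          "https://wiki.example.com/runbooks/host-cpu",
		NotifyChannelIDs:    []int64{12, 17},
		NotifyWindowSeconds: 600,
		NotifyMinFires:      3,
	}
}

// encodeRule serializes a rule; HTML escaping is disabled so ">" stays readable.
func encodeRule(r AlertRule) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeRule parses a JSON body back into an AlertRule.
func decodeRule(b []byte) (AlertRule, error) {
	var r AlertRule
	err := json.Unmarshal(b, &r)
	return r, err
}

func main() {
	body, err := encodeRule(exampleRule())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body)) // this body is what a client would POST to /v1/alert-rules
}
```

Note that a plain bool cannot distinguish "omitted" from false, so a client built this way always sends enabled explicitly rather than relying on the server default.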

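Field reference

Identity

Field	Type	Required	Description
`rule_key`	string	Yes	Stable `lower_snake` identifier, used in the dedupe key and in `incident.rule`. Must be unique.
`name`	string	Yes	Display name.
`enabled`	bool	No (default `true`)	Disabled rules are skipped by the evaluator and hidden from the "active" filter.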

source / scope

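Field	Type	Required	Description
`source_type`	enum	Yes	`ongrid_builtin` or `prometheus_external`. Built-in rules carry a canonical `rule_key`; external rules come from imported Prometheus alerting rule files.
`scope_type`	enum	Yes	`host`, `global`, or `monitoring_pipeline`. Determines the dimension the evaluator groups by: `host` yields one incident per `device_id`; `global` yields a single incident system-wide; `monitoring_pipeline` is for internal pipeline health rules.
`join_mode`	enum	Yes	`all` (every condition must match) or `any` (at least one must match). Only relevant when `conditions` has more than one element.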
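
The join_mode semantics can be sketched as a small reducer over per-condition results. This is an illustration only: the function name joinMatches is hypothetical, and treating an empty conditions list as "does not fire" is an assumption, since this page does not specify that case.

```go
package main

import "fmt"

// joinMatches combines per-condition match results according to join_mode.
// Assumption: an empty result list never fires.
func joinMatches(joinMode string, results []bool) (bool, error) {
	if len(results) == 0 {
		return false, nil
	}
	switch joinMode {
	case "all":
		for _, ok := range results {
			if !ok {
				return false, nil
			}
		}
		return true, nil
	case "any":
		for _, ok := range results {
			if ok {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown join_mode %q", joinMode)
	}
}

func main() {
	results := []bool{true, false}
	all, _ := joinMatches("all", results)
	any, _ := joinMatches("any", results)
	fmt.Println(all, any)
}
```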

kind

kind determines how the evaluator interprets conditions. Each kind drives a different sub-evaluator.

Kind	Status	What it does	conditions shape
`metric_threshold`	UI input only	Friendly form. The biz layer rewrites it to `metric_raw` on save; you never see this kind on disk.	`[{ "metric": "cpu_pct", "operator": ">=", "threshold": 90, "window": "5m", "for": "2m", "aggregator": "avg" }]`
`metric_raw`	Live	Arbitrary PromQL. Ticker-driven.	`[{ "expr": "rate(http_500[5m]) > 0.1" }]`
`metric_anomaly`	Live	Deviation from a rolling baseline (z-score). Ticker-driven via PromQuerier.	`[{ "metric": "node_cpu_usage_percent", "window": "1h", "z_threshold": 3.0 }]`
`metric_forecast`	Live	Linear extrapolation (`predict_linear`) crosses a static threshold within a future window.	`[{ "metric": "node_filesystem_avail_bytes", "window": "1h", "forecast_for": "24h", "below": 1073741824 }]`
`metric_burn_rate`	Live	SLO error-budget multi-window, multi-burn-rate alerting (Google SRE Workbook).	`[{ "good": "sum(rate(http_2xx[1h]))", "total": "sum(rate(http_total[1h]))", "slo": 0.999, "long": "1h", "short": "5m" }]`
`log_match`	Live (Phase-B)	LogQL pattern; fires on any match.	`[{ "expr": "{device_id=\"{{.device_id}}\"} |= \"panic\"" }]`
`log_volume`	Live (Phase-B)	LogQL stream rate exceeds a threshold.	`[{ "expr": "sum(rate({app=\"foo\"}[5m])) > 100" }]`
`trace_latency`	Live (Phase-B)	TraceQL p95 / p99 exceeds a threshold (per service).	`[{ "service": "payments", "percentile": 95, "above_ms": 800, "window": "5m" }]`
`trace_error_rate`	Live (Phase-B)	Ratio of TraceQL error spans exceeds a threshold.	`[{ "service": "payments", "above_percent": 1.0, "window": "5m" }]`
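
To make the z-score idea behind metric_anomaly concrete, here is a minimal sketch of the arithmetic. It is not the evaluator's implementation: the function names are hypothetical, and the use of population standard deviation, a two-sided test, and a ">=" comparison against z_threshold are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations latest sits from the baseline mean.
// ok is false when the baseline is too short or has zero variance.
func zScore(baseline []float64, latest float64) (z float64, ok bool) {
	if len(baseline) < 2 {
		return 0, false
	}
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	var sq float64
	for _, v := range baseline {
		d := v - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(baseline))) // population std dev (assumption)
	if std == 0 {
		return 0, false
	}
	return (latest - mean) / std, true
}

// isAnomalous applies a two-sided threshold (assumption) to the z-score.
func isAnomalous(baseline []float64, latest, zThreshold float64) bool {
	z, ok := zScore(baseline, latest)
	return ok && math.Abs(z) >= zThreshold
}

func main() {
	baseline := []float64{8, 12, 8, 12} // mean 10, std dev 2
	z, _ := zScore(baseline, 16)
	fmt.Println(z, isAnomalous(baseline, 16, 3.0))
}
```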

The full Go enum is in model/alert/model.go:

const (
    RuleKindMetricThreshold = "metric_threshold" // UI-only input
    RuleKindMetricAnomaly   = "metric_anomaly"
    RuleKindMetricForecast  = "metric_forecast"
    RuleKindMetricBurnRate  = "metric_burn_rate"
    RuleKindMetricRaw       = "metric_raw"
    RuleKindLogMatch        = "log_match"
    RuleKindLogVolume       = "log_volume"
    RuleKindTraceLatency    = "trace_latency"
    RuleKindTraceErrorRate  = "trace_error_rate"
)

Legacy kinds (edge_offline, prom_query, ingest_health, edge_absence, health_ingest, event_internal) are silently aliased to metric_raw on save.
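
The save-time aliasing can be pictured as a lookup on the kind string. The sketch below is an assumption about shape rather than the biz layer's code: storedKind is a hypothetical name, and it only maps the kind value; any rewriting of the accompanying conditions is out of scope here.

```go
package main

import "fmt"

// legacyKinds lists the legacy kinds that are aliased to metric_raw.
var legacyKinds = map[string]bool{
	"edge_offline":   true,
	"prom_query":     true,
	"ingest_health":  true,
	"edge_absence":   true,
	"health_ingest":  true,
	"event_internal": true,
}

// storedKind returns the kind that would end up on disk for a submitted kind.
// metric_threshold is included because it is also rewritten to metric_raw on save.
func storedKind(submitted string) string {
	if legacyKinds[submitted] || submitted == "metric_threshold" {
		return "metric_raw"
	}
	return submitted
}

func main() {
	for _, k := range []string{"prom_query", "metric_threshold", "log_match"} {
		fmt.Printf("%s -> %s\n", k, storedKind(k))
	}
}
```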

conditions

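conditions is an array. The shape of each element depends on kind (see the table above). join_mode determines whether every element must match (all) or any one suffices (any).

For metric_threshold (the UI form), each condition is a RuleCondition: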

type RuleCondition struct {
    Metric     string  `json:"metric"`            // e.g. "cpu_pct"
    Operator   string  `json:"operator"`          // ">", ">=", "<", "<=", "=="
    Threshold  float64 `json:"threshold"`         // numeric trigger
    Window     string  `json:"window,omitempty"`  // e.g. "5m"
    For        string  `json:"for,omitempty"`     // sustain duration
    Aggregator string  `json:"aggregator,omitempty"` // avg / max / min
}

On save, the biz layer compiles these into a metric_raw expr using the canonical closed set of host metrics (node_cpu_usage_percent, node_memory_used_percent, node_filesystem_used_percent, node_load1, ...).
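
One plausible shape for that compilation step is sketched below. Everything beyond the RuleCondition fields is an assumption: the alias table (cpu_pct to node_cpu_usage_percent), the choice of PromQL *_over_time functions for the aggregator, and the function name compileCondition are illustrative, and the real biz layer may generate different expressions.

```go
package main

import (
	"fmt"
	"strconv"
)

type RuleCondition struct {
	Metric     string  `json:"metric"`
	Operator   string  `json:"operator"`
	Threshold  float64 `json:"threshold"`
	Window     string  `json:"window,omitempty"`
	For        string  `json:"for,omitempty"`
	Aggregator string  `json:"aggregator,omitempty"`
}

// canonicalMetric is a hypothetical alias table from UI names to host metrics.
var canonicalMetric = map[string]string{
	"cpu_pct": "node_cpu_usage_percent", // assumed mapping
}

var validOps = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true}

// aggFuncs maps the aggregator to a PromQL range function (assumed choice).
var aggFuncs = map[string]string{
	"avg": "avg_over_time",
	"max": "max_over_time",
	"min": "min_over_time",
}

// compileCondition turns one UI condition into a PromQL comparison.
// The For field is not part of the expression and is ignored here.
func compileCondition(c RuleCondition) (string, error) {
	metric, ok := canonicalMetric[c.Metric]
	if !ok {
		return "", fmt.Errorf("metric %q is not in the closed set", c.Metric)
	}
	if !validOps[c.Operator] {
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	series := metric
	if c.Window != "" {
		agg := c.Aggregator
		if agg == "" {
			agg = "avg" // assumed default
		}
		fn, ok := aggFuncs[agg]
		if !ok {
			return "", fmt.Errorf("unsupported aggregator %q", agg)
		}
		series = fmt.Sprintf("%s(%s[%s])", fn, metric, c.Window)
	}
	threshold := strconv.FormatFloat(c.Threshold, 'g', -1, 64)
	return fmt.Sprintf("%s %s %s", series, c.Operator, threshold), nil
}

func main() {
	expr, err := compileCondition(RuleCondition{
		Metric: "cpu_pct", Operator: ">=", Threshold: 90,
		Window: "5m", For: "2m", Aggregator: "avg",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(expr)
}
```

Rejecting metrics outside the table reflects the "closed set" wording above: an unknown name fails at save time instead of producing an expression that silently never matches.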

severity

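Value	Handling
`info`	Recorded; notification is gated by the channel's `match_severity_min` (most channels skip this level)
`warning`	Default
`critical`	Always notifies unless silenced

A channel with match_severity_min set to warning accepts warning and critical; set to critical, it accepts only critical. An empty value matches any severity.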
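
The severity floor amounts to an ordered comparison. The sketch below assumes the ordering info < warning < critical and uses a hypothetical function name, channelAccepts; how the real router treats unknown severity strings is not stated on this page, so rejecting them is an assumption.

```go
package main

import "fmt"

// severityRank orders severities from least to most severe.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

// channelAccepts reports whether a channel with the given match_severity_min
// would accept an incident of the given severity. Empty floor matches any.
func channelAccepts(matchSeverityMin, severity string) bool {
	if matchSeverityMin == "" {
		return true
	}
	floor, ok := severityRank[matchSeverityMin]
	if !ok {
		return false // assumption: unknown floor rejects
	}
	sev, ok := severityRank[severity]
	if !ok {
		return false // assumption: unknown severity rejects
	}
	return sev >= floor
}

func main() {
	for _, sev := range []string{"info", "warning", "critical"} {
		fmt.Println(sev, channelAccepts("warning", sev))
	}
}
```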

labels & annotations

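Free-form key/value maps, stored as JSON.

labels are appended to the incident's labels at fire time and are used for grouping / dedupe. Common keys: service, team, env.
annotations are rendered at fire time with Go template syntax against the incident snapshot, for example summary: "CPU on {{$labels.device_id}} above 90%".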
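
The annotation rendering can be approximated with the standard text/template package. In this sketch the $labels variable is bound by prepending a definition to the user's template, which is how Prometheus handles the same syntax; whether this product does it the same way is an assumption, and the snapshot type and renderAnnotation name are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// snapshot is a stand-in for the incident snapshot; only labels are modeled.
type snapshot struct {
	Labels map[string]string
}

// renderAnnotation executes an annotation template with $labels bound to the
// incident labels. The prefix emits no output of its own.
func renderAnnotation(tmpl string, labels map[string]string) (string, error) {
	t, err := template.New("annotation").Parse("{{$labels := .Labels}}" + tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, snapshot{Labels: labels}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderAnnotation(
		"CPU on {{$labels.device_id}} above 90%",
		map[string]string{"device_id": "edge-01"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "CPU on edge-01 above 90%"
}
```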

runbook

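runbook_url is shown verbatim in the incident detail and next to the AI investigation report on the chat surface. Use it to link to your internal runbook / playbook.

Notification noise suppression

Field	Type	Default	Description
`notify_window_seconds`	int	`0`	Rolling window for noise suppression. `0` disables it.
`notify_min_fires`	int	`0`	Minimum number of fires within the window before a notification is sent. `0` disables it.
`notify_channel_ids`	int[]	empty	Pins notifications to specific channel IDs (still subject to each channel's own enabled / severity / scope filters). Empty = global router.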

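A rule that fires fewer than notify_min_fires times within the trailing notify_window_seconds window writes a repeat_suppressed event to the timeline (so you can see suppression at work) but sends no notification. With both set to 0, suppression is off and every fire notifies, subject to the cooldown, silence, and inhibition gates.

A mixed setting (one 0, the other > 0) is rejected by the biz layer: invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0.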
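
The validation rule and the suppression decision can be sketched together. This is an illustration under assumptions: function names are hypothetical, timestamps are plain Unix seconds to keep the example dependency-free, the current fire is assumed to be included in the list, and the window boundary is treated as exclusive at the old end. Cooldown, silence, and inhibition gates are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// validateNotifyThrottle enforces "both zero or both > 0".
func validateNotifyThrottle(windowSeconds, minFires int) error {
	bothZero := windowSeconds == 0 && minFires == 0
	bothPositive := windowSeconds > 0 && minFires > 0
	if !bothZero && !bothPositive {
		return errors.New("invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0")
	}
	return nil
}

// shouldNotify reports whether enough fires landed in the trailing window.
// fires and now are Unix seconds; fires is assumed to include the current fire.
func shouldNotify(fires []int64, now int64, windowSeconds, minFires int) bool {
	if windowSeconds == 0 && minFires == 0 {
		return true // suppression off
	}
	cutoff := now - int64(windowSeconds)
	count := 0
	for _, t := range fires {
		if t > cutoff && t <= now {
			count++
		}
	}
	return count >= minFires
}

func main() {
	fmt.Println(validateNotifyThrottle(600, 0))
	fmt.Println(shouldNotify([]int64{100, 500, 700, 1000}, 1000, 600, 3))
	fmt.Println(shouldNotify([]int64{100, 900, 1000}, 1000, 600, 3))
}
```

In the second call three fires (500, 700, 1000) fall inside the 600-second window, so a notification would go out; in the third only two do, which corresponds to the repeat_suppressed case.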

Incident lifecycle

The fields above describe the rule (the trigger definition). When a rule fires it produces an incident (alert_incidents table) with these states:

text

open ─┬─> acknowledged ─> resolved
      ├─> silenced ─> resolved
      └─> resolved

An incident auto-resolves once the underlying condition has been clear for one full evaluator cycle. The state machine is documented in internal/manager/biz/alert/.
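
The diagram above can be encoded as a transition table. The sketch below covers only the arrows drawn in the diagram; reopening a resolved incident appears in the event types but not in the diagram, so it is deliberately left out. The names transitions and canTransition are hypothetical and not taken from internal/manager/biz/alert/.

```go
package main

import "fmt"

// transitions lists the allowed next states for each incident state,
// exactly as drawn in the lifecycle diagram.
var transitions = map[string][]string{
	"open":         {"acknowledged", "silenced", "resolved"},
	"acknowledged": {"resolved"},
	"silenced":     {"resolved"},
	"resolved":     {},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("open", "silenced"))
	fmt.Println(canTransition("acknowledged", "silenced"))
}
```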

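Event types

Every state transition records one row in alert_events. The stable event type strings:

Event type	When
`firing`	Rule fires for the first time (incident created or reopened)
`repeat_suppressed`	Fired within the cooldown / noise-suppression window
`acknowledged`	User clicked Ack
`silenced`	User silenced the incident for N hours
`resolved`	Condition cleared or user clicked Resolve
`reopened`	A resolved incident fires again before it is deleted
`note`	User-added comment
`notification_sent`	Delivery to a channel succeeded
`notification_failed`	Delivery to a channel failed
`inhibited`	Suppressed by an active incident of higher severity
`ai_initial_diagnosis`	First take from the proactive AI investigator, written when the incident first fires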

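Channel types

POST /v1/notification-channels accepts:

Type	`channel_type` value	Source
Webhook	`webhook`	env or UI
Slack	`slack`	env or UI
Larksuite / Feishu	`feishu`	env or UI
DingTalk	`dingtalk`	env or UI
WeCom	`wecom`	UI only
Telegram	`telegram`	UI only

The legacy log channel type was removed in 2026-05; alert_events itself serves as the delivery audit.

See also

REST API: endpoints for these objects.
Capabilities → Alerts: operator-oriented guide.
Channels: wiring outbound delivery to chat surfaces.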
