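Alerting

Ongrid's alerting subsystem is a single tick loop: it scans every enabled rule row, asks the matching backend (Prometheus for metrics and trace spanmetrics, Loki for logs) whether the predicate holds, and records each firing in the incidents table.

There is no separate Alertmanager and no separate rule file. Rules live in MySQL, the evaluator polls them through a cache that refreshes every 30s, and notifications fan out through the channel registry.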

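14 rule kinds

Each rule's kind is stored in the kind column, and the compiler dispatches on it.

The compiler lives in rules.go; the evaluators live in evaluators_phaseA.go and evaluators_phaseB.go.

The 8+6 split follows HLD-004's Phase A (metrics) / Phase B (logs + traces), which landed on 2026-05-08.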

Metric kinds (Phase A)

Kind	What it does	spec fields
`metric_raw`	The PromQL expression is the predicate. Fires once per entry in the returned vector.	`expr`
`metric_anomaly`	Runs a z-score or MAD test against a rolling baseline window.	`metric`, `method`, `baseline_window`, `baseline_step`, `deviation`, `for_seconds`
`metric_forecast`	`predict_linear(metric[fit_window], predict_seconds) <op> threshold`.	`metric`, `fit_window`, `predict_seconds`, `operator`, `threshold`
`metric_burn_rate`	SLO-based multi-window, multi-burn-rate alerting in the Google SRE style. Every window must fire.	`sli`, `slo`, `burns[].window`, `burns[].multiplier`
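
The "every window must fire" condition of `metric_burn_rate` can be illustrated with a minimal sketch. It assumes the usual Google SRE formulation, in which a window fires when its observed error ratio exceeds `multiplier * (1 - slo)`. The `Burn` struct, the `observed` map, and `burnRateFires` are hypothetical names for illustration, not the evaluator's actual types.

```go
package main

import "fmt"

// Burn mirrors one burns[] entry from the rule spec (hypothetical shape).
type Burn struct {
	Window     string  // e.g. "1h"
	Multiplier float64 // burn-rate multiple of the error budget
}

// burnRateFires reports whether every configured window burns faster than
// its multiplier allows. observed maps a window to its measured error ratio
// in [0, 1]; how that ratio is derived from the sli is out of scope here.
func burnRateFires(slo float64, burns []Burn, observed map[string]float64) bool {
	if len(burns) == 0 {
		return false
	}
	budget := 1 - slo // allowed error ratio
	for _, b := range burns {
		ratio, ok := observed[b.Window]
		if !ok || ratio <= b.Multiplier*budget {
			return false // one quiet window is enough to suppress the alert
		}
	}
	return true
}

func main() {
	burns := []Burn{{Window: "1h", Multiplier: 14.4}, {Window: "5m", Multiplier: 14.4}}
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.03}))
	fmt.Println(burnRateFires(0.999, burns, map[string]float64{"1h": 0.02, "5m": 0.001}))
}
```

The first call prints `true` because both windows exceed 14.4 times the 0.1% budget; the second prints `false` because the short window has already recovered.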

The old prom_query kind was renamed to metric_raw. The old metric_threshold form is now only a UI entry point that compiles to metric_raw on save; it has no evaluator of its own.

// internal/manager/biz/alert/rules.go:36
type MetricRawRule struct {
    ID         uint64
    RuleKey    string
    Name       string
    Severity   string
    ScopeType  string // host / global / monitoring_pipeline
    RunbookURL string
    Labels     map[string]string
    Expr       string // canonical predicate, e.g. `up == 0`
}
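
As a rough illustration of how a metric_threshold form could be compiled into the `Expr` of a metric_raw rule on save, here is a minimal sketch. `ThresholdForm` and `compileThreshold` are hypothetical; the real form fields and the compiler in rules.go may look different.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// ThresholdForm is an assumed shape for the metric_threshold UI form.
type ThresholdForm struct {
	Metric    string  // PromQL selector, e.g. `node_load1`
	Operator  string  // one of the PromQL comparison operators
	Threshold float64 // right-hand side of the comparison
}

// compileThreshold turns the form into a canonical PromQL predicate that
// could be stored as the Expr of a metric_raw rule.
func compileThreshold(f ThresholdForm) (string, error) {
	if f.Metric == "" {
		return "", errors.New("metric is required")
	}
	switch f.Operator {
	case ">", ">=", "<", "<=", "==", "!=":
	default:
		return "", fmt.Errorf("unsupported operator %q", f.Operator)
	}
	threshold := strconv.FormatFloat(f.Threshold, 'f', -1, 64)
	return f.Metric + " " + f.Operator + " " + threshold, nil
}

func main() {
	expr, err := compileThreshold(ThresholdForm{Metric: "node_load1", Operator: ">", Threshold: 8})
	fmt.Println(expr, err) // node_load1 > 8 <nil>
}
```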

Log + trace kinds (Phase B)

Kind	What it does	Backend
`log_match`	Runs `count_over_time(<stream> |~ <filter> [window]) <op> threshold` against Loki. Fires once per label set.	Loki
`log_volume`	Same shape as `log_match`; compares the current-window count against an absolute threshold.	Loki
`trace_latency`	`histogram_quantile(q, sum by(le)(rate(traces_spanmetrics_latency_bucket[w]))) > threshold_ms`.	Prom (spanmetrics)
`trace_error_rate`	`100 * (sum rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}) / sum rate(...)) > pct`.	Prom (spanmetrics)

Trace kinds query Prometheus, not Tempo. The spanmetrics generator scrapes Tempo and writes the traces_spanmetrics_* series back to Prom. Querying Prom means the alert evaluator uses a single query engine and reuses all of the operator filtering and threshold logic.
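
To make the "trace alerts are just PromQL" point concrete, the following sketch renders the trace_latency template from the table into a query string. `traceLatencyQuery` and its parameter names are assumptions for illustration; the spec field names of the real rule are not shown in this page.

```go
package main

import (
	"fmt"
	"strconv"
)

// num formats a float without trailing zeros, e.g. 0.99 or 500.
func num(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// traceLatencyQuery fills in the trace_latency template: quantile q over
// window w of the spanmetrics latency histogram, compared to thresholdMs.
func traceLatencyQuery(q float64, w string, thresholdMs float64) string {
	return fmt.Sprintf(
		"histogram_quantile(%s, sum by(le)(rate(traces_spanmetrics_latency_bucket[%s]))) > %s",
		num(q), w, num(thresholdMs),
	)
}

func main() {
	// The resulting string is plain PromQL, so it can go through the same
	// query path as any metric rule.
	fmt.Println(traceLatencyQuery(0.99, "5m", 500))
}
```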

Scope types

Every rule has a scope_type ∈ {host, global, monitoring_pipeline}. The default for each kind is defined in defaultScopeForKind in rules.go.

host: the incident must carry a device_id. The evaluator extracts device_id from the Prom result labels; validateFiring rejects any host-scope firing without one.
global: service-level alerts (trace_, log_) that are not tied to a specific host.
monitoring_pipeline: meta-alerts about Ongrid itself (scrape_down, prom_ingest_fail, …).
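
The host-scope rule above can be sketched as a small validation step. This is only an approximation of what validateFiring is described as doing; the `firing` struct, the error value, and the function name `validateFiringSketch` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// firing is an assumed, minimal view of one evaluator result.
type firing struct {
	ScopeType string            // host / global / monitoring_pipeline
	Labels    map[string]string // labels from the Prom result
}

var errMissingDeviceID = errors.New("host-scoped firing has no device_id label")

// validateFiringSketch rejects host-scoped firings that cannot be tied to a
// device. Other scope types pass through unchanged.
func validateFiringSketch(f firing) error {
	if f.ScopeType == "host" && f.Labels["device_id"] == "" {
		return errMissingDeviceID
	}
	return nil
}

func main() {
	ok := firing{ScopeType: "host", Labels: map[string]string{"device_id": "42"}}
	bad := firing{ScopeType: "host", Labels: map[string]string{"job": "node"}}
	fmt.Println(validateFiringSketch(ok))
	fmt.Println(validateFiringSketch(bad))
}
```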

Evaluator tick

PipelineEvaluator.evaluate runs once per Interval (default 5 minutes, configurable via PipelineEvaluatorOpts.Interval).

func (e *PipelineEvaluator) evaluate(ctx context.Context) {
    now := e.now()
    if e.edges != nil {
        e.refreshDeviceStalenessGauge(ctx, now)
    }
    if e.prom != nil {
        e.evaluatePromQuery(ctx, now)
        e.evaluateMetricAnomaly(ctx, now)
        e.evaluateMetricForecast(ctx, now)
        e.evaluateMetricBurnRate(ctx, now)
        e.evaluateTraceLatency(ctx, now)
        e.evaluateTraceErrorRate(ctx, now)
    }
    if e.logq != nil {
        e.evaluateLogMatch(ctx, now)
        e.evaluateLogVolume(ctx, now)
    }
}

A nil backend silently skips the corresponding kinds, so a Loki outage does not break metric alerting.

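Dedup + recovery

The evaluator maintains firingSnapshot[ruleKey] = set<dedupeKey> across ticks. When a key was present in the previous tick but is absent from the current one, PromQL's comparison filter has dropped the series, so the predicate has cleared and SystemResolveIncident fires with "prom condition cleared". This is how alerts auto-resolve; no separate "resolve" evaluator is needed.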
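
The recovery step amounts to a set difference between two ticks. The sketch below shows that difference for one rule; `clearedKeys` is a hypothetical helper, and the real snapshot type may not be a plain map.

```go
package main

import (
	"fmt"
	"sort"
)

// clearedKeys returns the dedupe keys that fired in the previous tick but
// are absent from the current one. Each returned key would be resolved.
func clearedKeys(prev, curr map[string]struct{}) []string {
	var cleared []string
	for key := range prev {
		if _, stillFiring := curr[key]; !stillFiring {
			cleared = append(cleared, key)
		}
	}
	sort.Strings(cleared) // deterministic order for logging and tests
	return cleared
}

func main() {
	prev := map[string]struct{}{
		"pipeline:scrape_down:job=node": {},
		"pipeline:scrape_down:job=api":  {},
	}
	curr := map[string]struct{}{
		"pipeline:scrape_down:job=api": {},
	}
	fmt.Println(clearedKeys(prev, curr)) // [pipeline:scrape_down:job=node]
}
```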

Dedupe key shape: pipeline:<rule_key>:<sorted-label-set>. Provenance labels (__name__, ongrid_source) are stripped (see labelSetKey), so the same alert reported by the embedded collector and the cloud collector dedupes into one incident instead of two.
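
A minimal sketch of building such a key follows. The `k=v` pairs joined by commas are an assumed encoding of the sorted label set; the actual output of labelSetKey is not documented here, and `dedupeKeySketch` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// provenance lists the labels described as stripped before keying.
var provenance = map[string]bool{"__name__": true, "ongrid_source": true}

// dedupeKeySketch builds pipeline:<rule_key>:<sorted-label-set>, dropping
// provenance labels so that two collectors produce the same key.
func dedupeKeySketch(ruleKey string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		if !provenance[name] {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, name+"="+labels[name])
	}
	return "pipeline:" + ruleKey + ":" + strings.Join(pairs, ",")
}

func main() {
	embedded := map[string]string{"__name__": "up", "job": "node", "instance": "a:9100", "ongrid_source": "embedded"}
	cloud := map[string]string{"__name__": "up", "job": "node", "instance": "a:9100", "ongrid_source": "cloud"}
	fmt.Println(dedupeKeySketch("scrape_down", embedded))
	fmt.Println(dedupeKeySketch("scrape_down", embedded) == dedupeKeySketch("scrape_down", cloud))
}
```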

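Channel dispatch

When an incident fires, the Notifier.MaybeNotify path asks the ChannelResolver:

Pinned by rule: if rule.notify_channel_ids_json is non-empty, only those channel ids match (and only the enabled ones).
Otherwise, every enabled notification_channels row is filtered by match_severity_min and match_scope_types.
If nothing matches, the resolver falls back to a synthetic channel list seeded from DefaultChannels, so notifications never silently vanish.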
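
The three resolution steps above can be sketched as one function. Several details are assumptions made for this sketch: the `Channel` struct, the severity names and their ordering in `severityRank`, and the choice to apply the DefaultChannels fallback after the pinned branch as well as the filtered one.

```go
package main

import "fmt"

// Channel is a trimmed-down, hypothetical view of a notification_channels row.
type Channel struct {
	ID          uint64
	Enabled     bool
	MinSeverity string   // stands in for match_severity_min
	ScopeTypes  []string // stands in for match_scope_types
}

// severityRank is an assumed ordering; real severity names may differ.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

func hasScope(scopes []string, scope string) bool {
	for _, s := range scopes {
		if s == scope {
			return true
		}
	}
	return false
}

// resolveChannels picks pinned channels if any are pinned, otherwise filters
// by severity and scope, and falls back to defaults when nothing matches.
func resolveChannels(pinned []uint64, all, defaults []Channel, severity, scope string) []Channel {
	var out []Channel
	pinnedSet := make(map[uint64]bool, len(pinned))
	for _, id := range pinned {
		pinnedSet[id] = true
	}
	for _, c := range all {
		if !c.Enabled {
			continue
		}
		if len(pinned) > 0 {
			if pinnedSet[c.ID] {
				out = append(out, c)
			}
			continue
		}
		if severityRank[severity] >= severityRank[c.MinSeverity] && hasScope(c.ScopeTypes, scope) {
			out = append(out, c)
		}
	}
	if len(out) == 0 {
		return defaults
	}
	return out
}

func main() {
	all := []Channel{
		{ID: 1, Enabled: true, MinSeverity: "warning", ScopeTypes: []string{"host"}},
		{ID: 2, Enabled: true, MinSeverity: "critical", ScopeTypes: []string{"global"}},
	}
	defaults := []Channel{{ID: 0, Enabled: true}}
	fmt.Println(resolveChannels(nil, all, defaults, "warning", "host"))
	fmt.Println(resolveChannels(nil, all, defaults, "info", "global"))
}
```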

See router.go.

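Inhibition

Two built-in inhibition rules (inhibit.go) cover the noisiest default scenarios:

edge_offline:edge_X inhibits any host:X:*. When an edge is unreachable, every host-scope alert on it is suppressed.
pipeline:prom_ingest_fail inhibits pipeline:scrape_down:*. When Prometheus itself cannot ingest data, every "target down" alert is noise.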
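
The two built-in rules can be expressed as prefix checks over the currently active keys. This is a sketch under assumptions: `isInhibited` is a hypothetical helper, the key strings follow the patterns quoted above, and an active source key is matched either exactly or with a trailing `:<suffix>`.

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether key equals base or is base followed by ":<suffix>".
func matches(key, base string) bool {
	return key == base || strings.HasPrefix(key, base+":")
}

func anyActive(active []string, base string) bool {
	for _, key := range active {
		if matches(key, base) {
			return true
		}
	}
	return false
}

// isInhibited applies the two built-in rules to a target key.
func isInhibited(target string, active []string) bool {
	// Rule 2: prom_ingest_fail silences every scrape_down alert.
	if strings.HasPrefix(target, "pipeline:scrape_down:") {
		return anyActive(active, "pipeline:prom_ingest_fail")
	}
	// Rule 1: an offline edge X silences every host:X:* alert.
	if strings.HasPrefix(target, "host:") {
		rest := strings.TrimPrefix(target, "host:")
		if i := strings.Index(rest, ":"); i > 0 {
			return anyActive(active, "edge_offline:edge_"+rest[:i])
		}
	}
	return false
}

func main() {
	active := []string{"edge_offline:edge_42", "pipeline:prom_ingest_fail"}
	fmt.Println(isInhibited("host:42:cpu_high", active))
	fmt.Println(isInhibited("host:7:cpu_high", active))
	fmt.Println(isInhibited("pipeline:scrape_down:job=node", active))
}
```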

A future inhibition_rules table will extend this to admin-defined groups.

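Cooldown + noise suppression

NotifyOpts.Cooldown (default 10 minutes) limits re-notification for the same dedupe_key. The noise-suppression filter sits inside Usecase.MaybeNotify, so the channel resolver and the inhibitor still run on every firing; only the actual Notifier.Send is skipped.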
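
A per-key cooldown of this kind can be sketched as a small gate that remembers the last send time. `cooldownGate`, `shouldSend`, and `demoStart` are hypothetical names; how the real implementation stores the last-notified timestamp is not described in this page.

```go
package main

import (
	"fmt"
	"time"
)

// cooldownGate remembers when each dedupe key was last notified.
type cooldownGate struct {
	window   time.Duration
	lastSent map[string]time.Time
}

func newCooldownGate(window time.Duration) *cooldownGate {
	return &cooldownGate{window: window, lastSent: make(map[string]time.Time)}
}

// shouldSend reports whether a notification for key may go out at now.
// A send is recorded only when it is allowed, so suppressed firings do not
// extend the cooldown.
func (g *cooldownGate) shouldSend(key string, now time.Time) bool {
	if last, ok := g.lastSent[key]; ok && now.Sub(last) < g.window {
		return false
	}
	g.lastSent[key] = now
	return true
}

// demoStart is an arbitrary fixed instant used to keep the example deterministic.
var demoStart = time.Date(2026, 5, 8, 12, 0, 0, 0, time.UTC)

func main() {
	gate := newCooldownGate(10 * time.Minute)
	key := "pipeline:scrape_down:job=node"
	fmt.Println(gate.shouldSend(key, demoStart))                     // true
	fmt.Println(gate.shouldSend(key, demoStart.Add(5*time.Minute)))  // false
	fmt.Println(gate.shouldSend(key, demoStart.Add(11*time.Minute))) // true
}
```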

See also

RCA: what happens after an incident fires.
Logs: Loki + the log_match / log_volume evaluators.
Traces: Tempo + the trace_latency / trace_error_rate evaluators.
Channels overview: how to configure Slack / Telegram / Feishu / DingTalk / WeCom and raw webhook channels.
Alert rule schema: the wire format of a rule row.
