Alert rule schema

Alert rules are stored in the alert_rules table and submitted via POST /v1/alert-rules. This page describes the wire format. Source of truth: internal/manager/model/alert/model.go.

Wire shape

json

{
  "rule_key": "host_cpu_high",
  "kind": "metric_raw",
  "name": "Host CPU pegged",
  "source_type": "ongrid_builtin",
  "scope_type": "host",
  "join_mode": "all",
  "severity": "warning",
  "enabled": true,
  "conditions": [
    { "expr": "node_cpu_usage_percent > 90" }
  ],
  "labels":      { "team": "sre", "service": "host" },
  "annotations": { "summary": "CPU on {{$labels.device_id}} above 90%" },
  "runbook_url": "https://wiki.example.com/runbooks/host-cpu",
  "notify_channel_ids": [12, 17],
  "notify_window_seconds": 600,
  "notify_min_fires": 3
}
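For clients written in Go, the document above could be produced from a typed struct. The following is a minimal sketch, not the server's model: the type names `AlertRule` and `Condition` and the helpers `exampleRule` and `encodeRule` are hypothetical, and only the JSON keys are taken from the example above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Condition covers only the metric_raw shape: {"expr": "..."}.
type Condition struct {
	Expr string `json:"expr"`
}

// AlertRule is a hypothetical client-side mirror of the wire shape.
// The authoritative definition is internal/manager/model/alert/model.go.
type AlertRule struct {
	RuleKey             string            `json:"rule_key"`
	Kind                string            `json:"kind"`
	Name                string            `json:"name"`
	SourceType          string            `json:"source_type"`
	ScopeType           string            `json:"scope_type"`
	JoinMode            string            `json:"join_mode"`
	Severity            string            `json:"severity"`
	Enabled             bool              `json:"enabled"`
	Conditions          []Condition       `json:"conditions"`
	Labels              map[string]string `json:"labels,omitempty"`
	Annotations         map[string]string `json:"annotations,omitempty"`
	RunbookURL          string            `json:"runbook_url,omitempty"`
	NotifyChannelIDs    []int64           `json:"notify_channel_ids,omitempty"`
	NotifyWindowSeconds int               `json:"notify_window_seconds"`
	NotifyMinFires      int               `json:"notify_min_fires"`
}

// exampleRule rebuilds the JSON document shown above.
func exampleRule() AlertRule {
	return AlertRule{
		RuleKey:             "host_cpu_high",
		Kind:                "metric_raw",
		Name:                "Host CPU pegged",
		SourceType:          "ongrid_builtin",
		ScopeType:           "host",
		JoinMode:            "all",
		Severity:            "warning",
		Enabled:             true,
		Conditions:          []Condition{{Expr: "node_cpu_usage_percent > 90"}},
		Labels:              map[string]string{"team": "sre", "service": "host"},
		Annotations:         map[string]string{"summary": "CPU on {{$labels.device_id}} above 90%"},
		RunbookURL:          "https://wiki.example.com/runbooks/host-cpu",
		NotifyChannelIDs:    []int64{12, 17},
		NotifyWindowSeconds: 600,
		NotifyMinFires:      3,
	}
}

// encodeRule marshals without HTML escaping so ">" in expr stays readable.
func encodeRule(r AlertRule) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	enc.SetIndent("", "  ")
	if err := enc.Encode(r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := encodeRule(exampleRule())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body)) // candidate request body for POST /v1/alert-rules
}
```

The encoder disables HTML escaping because the default `json.Marshal` would emit `>` as `\u003e`, which is still valid JSON but harder to read in logs.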

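Field reference

Identity

Field	Type	Required	Description
`rule_key`	string	yes	Stable `lower_snake` identifier used as the dedupe key and in `incident.rule`. Must be unique.
`name`	string	yes	Display name.
`enabled`	bool	no (default `true`)	Disabled rules are skipped by the evaluator and hidden from the "active" filter.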

Source / scope

Field	Type	Required	Description
`source_type`	enum	yes	`ongrid_builtin`, `prometheus_external`. Built-in rules carry canonical `rule_key` values; external rules come from imported Prometheus alerting rule files.
`scope_type`	enum	yes	`host`, `global`, `monitoring_pipeline`. Determines the dimension the evaluator groups by: `host` yields one incident per `device_id`, `global` yields one incident for the whole system, and `monitoring_pipeline` is for internal pipeline health rules.
`join_mode`	enum	yes	`all` (every condition must match) or `any` (at least one must match). Only meaningful when `conditions` has two or more elements.
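The `join_mode` semantics can be sketched as a fold over per-condition results. This is an illustration only: the function name `joinMatches` is hypothetical, and treating an empty result list as "no match" is an assumption the page does not state.

```go
package main

import "fmt"

// joinMatches combines per-condition match results according to join_mode.
func joinMatches(mode string, results []bool) (bool, error) {
	if len(results) == 0 {
		return false, nil // assumption: a rule without conditions never matches
	}
	switch mode {
	case "all":
		for _, matched := range results {
			if !matched {
				return false, nil
			}
		}
		return true, nil
	case "any":
		for _, matched := range results {
			if matched {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown join_mode %q", mode)
	}
}

func main() {
	results := []bool{true, false}
	all, _ := joinMatches("all", results)
	anyMatch, _ := joinMatches("any", results)
	fmt.Println("all:", all, "any:", anyMatch)
}
```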

Kind

kind tells the evaluator how to interpret conditions. Each kind drives a different sub-evaluator.

Kind	Status	What it does	conditions shape
`metric_threshold`	UI-only input	Friendly form. The biz layer rewrites it to `metric_raw` at save time, so you will not see this kind on disk.	`[{ "metric": "cpu_pct", "operator": ">=", "threshold": 90, "window": "5m", "for": "2m", "aggregator": "avg" }]`
`metric_raw`	live	Arbitrary PromQL. Ticker-driven.	`[{ "expr": "rate(http_500[5m]) > 0.1" }]`
`metric_anomaly`	live	Deviation from a rolling baseline (z-score). Ticker-driven via PromQuerier.	`[{ "metric": "node_cpu_usage_percent", "window": "1h", "z_threshold": 3.0 }]`
`metric_forecast`	live	Whether a linear extrapolation (`predict_linear`) crosses a static threshold within the future window.	`[{ "metric": "node_filesystem_avail_bytes", "window": "1h", "forecast_for": "24h", "below": 1073741824 }]`
`metric_burn_rate`	live	SLO error-budget multi-window, multi-burn-rate alerting (Google SRE Workbook).	`[{ "good": "sum(rate(http_2xx[1h]))", "total": "sum(rate(http_total[1h]))", "slo": 0.999, "long": "1h", "short": "5m" }]`
`log_match`	live (Phase-B)	LogQL pattern that fires on a hit.	`[{ "expr": "{device_id=\"{{.device_id}}\"} |= \"panic\"" }]`
`log_volume`	live (Phase-B)	LogQL stream rate above a threshold.	`[{ "expr": "sum(rate({app=\"foo\"}[5m])) > 100" }]`
`trace_latency`	live (Phase-B)	TraceQL p95 / p99 for a service above a threshold.	`[{ "service": "payments", "percentile": 95, "above_ms": 800, "window": "5m" }]`
`trace_error_rate`	live (Phase-B)	TraceQL error-span ratio above a threshold.	`[{ "service": "payments", "above_percent": 1.0, "window": "5m" }]`
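To make the `metric_burn_rate` row more concrete, here is a rough sketch of the arithmetic behind multi-window burn-rate alerting. It is not the evaluator's code: the function names are hypothetical, and the 14.4 factor is the SRE Workbook's value for a 1h / 5m window pair, assumed here because the conditions shape above does not carry a factor.

```go
package main

import "fmt"

// burnRate returns how fast the error budget is being consumed:
// (1 - good/total) / (1 - slo). A value of 1 means the budget would be
// spent exactly at the end of the SLO period.
func burnRate(good, total, slo float64) float64 {
	if total <= 0 || slo >= 1 {
		return 0 // assumption: no traffic or no budget is treated as "not burning"
	}
	return (1 - good/total) / (1 - slo)
}

// firesBurnRate requires both the long and the short window to exceed the
// factor, so a brief spike that has already ended does not fire.
func firesBurnRate(longBurn, shortBurn, factor float64) bool {
	return longBurn > factor && shortBurn > factor
}

func main() {
	const slo = 0.999
	const factor = 14.4 // assumed: SRE Workbook factor for the 1h / 5m pair

	longBurn := burnRate(9800, 10000, slo) // good and total over the "long" window
	shortBurn := burnRate(490, 500, slo)   // good and total over the "short" window
	fmt.Printf("long=%.1f short=%.1f fires=%v\n",
		longBurn, shortBurn, firesBurnRate(longBurn, shortBurn, factor))
}
```

Requiring both windows is what keeps the alert from staying active long after a short incident has recovered.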

The full Go enum lives in model/alert/model.go:

const (
    RuleKindMetricThreshold = "metric_threshold" // UI-only input
    RuleKindMetricAnomaly   = "metric_anomaly"
    RuleKindMetricForecast  = "metric_forecast"
    RuleKindMetricBurnRate  = "metric_burn_rate"
    RuleKindMetricRaw       = "metric_raw"
    RuleKindLogMatch        = "log_match"
    RuleKindLogVolume       = "log_volume"
    RuleKindTraceLatency    = "trace_latency"
    RuleKindTraceErrorRate  = "trace_error_rate"
)

Legacy kinds (edge_offline, prom_query, ingest_health, edge_absence, health_ingest, event_internal) are silently aliased to metric_raw on save.
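The aliasing of legacy kinds can be pictured as a lookup applied before a rule is persisted. This is a sketch under that assumption; `normalizeKind` and `legacyKinds` are hypothetical names, and only the list of legacy kinds comes from this page.

```go
package main

import "fmt"

// legacyKinds lists the kinds this page says are aliased to metric_raw.
var legacyKinds = map[string]struct{}{
	"edge_offline":   {},
	"prom_query":     {},
	"ingest_health":  {},
	"edge_absence":   {},
	"health_ingest":  {},
	"event_internal": {},
}

// normalizeKind maps a legacy kind to metric_raw and leaves others untouched.
func normalizeKind(kind string) string {
	if _, legacy := legacyKinds[kind]; legacy {
		return "metric_raw"
	}
	return kind
}

func main() {
	for _, kind := range []string{"prom_query", "log_match"} {
		fmt.Printf("%s -> %s\n", kind, normalizeKind(kind))
	}
}
```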

Conditions

conditions is an array. The shape of each element depends on kind (see the table above). join_mode decides whether every element must match (all) or a single match is enough (any).

For metric_threshold (the UI form), each condition is a RuleCondition:

type RuleCondition struct {
    Metric     string  `json:"metric"`            // e.g. "cpu_pct"
    Operator   string  `json:"operator"`          // ">", ">=", "<", "<=", "=="
    Threshold  float64 `json:"threshold"`         // numeric trigger
    Window     string  `json:"window,omitempty"`  // e.g. "5m"
    For        string  `json:"for,omitempty"`     // sustain duration
    Aggregator string  `json:"aggregator,omitempty"` // avg / max / min
}

At save time, the biz layer compiles these into a metric_raw expr using the canonical closed set of host metrics (node_cpu_usage_percent, node_memory_used_percent, node_filesystem_used_percent, node_load1, ...).
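One plausible form of that compilation step is sketched below. Everything beyond the `RuleCondition` fields is an assumption: the function name `compileThreshold`, the alias from `cpu_pct` to `node_cpu_usage_percent`, the use of PromQL `<aggregator>_over_time` for `window`, and the choice to leave `for` out of the expression. The real biz layer may compile differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuleCondition repeats the UI-form condition from this page.
type RuleCondition struct {
	Metric     string  `json:"metric"`
	Operator   string  `json:"operator"`
	Threshold  float64 `json:"threshold"`
	Window     string  `json:"window,omitempty"`
	For        string  `json:"for,omitempty"`
	Aggregator string  `json:"aggregator,omitempty"`
}

// metricAliases maps UI metric names to canonical host metrics.
// Hypothetical: the page does not document this mapping.
var metricAliases = map[string]string{
	"cpu_pct": "node_cpu_usage_percent",
}

var validOperators = map[string]bool{">": true, ">=": true, "<": true, "<=": true, "==": true}
var validAggregators = map[string]bool{"avg": true, "max": true, "min": true}

// compileThreshold turns one UI condition into a PromQL comparison.
// The For field is ignored here; a sustain duration is not part of the expr.
func compileThreshold(c RuleCondition) (string, error) {
	metric, ok := metricAliases[c.Metric]
	if !ok {
		return "", fmt.Errorf("unknown metric %q", c.Metric)
	}
	if !validOperators[c.Operator] {
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	lhs := metric
	if c.Window != "" {
		agg := c.Aggregator
		if agg == "" {
			agg = "avg" // assumption: default aggregator
		}
		if !validAggregators[agg] {
			return "", fmt.Errorf("unsupported aggregator %q", agg)
		}
		lhs = fmt.Sprintf("%s_over_time(%s[%s])", agg, metric, c.Window)
	}
	threshold := strconv.FormatFloat(c.Threshold, 'f', -1, 64)
	return lhs + " " + c.Operator + " " + threshold, nil
}

func main() {
	expr, err := compileThreshold(RuleCondition{
		Metric: "cpu_pct", Operator: ">=", Threshold: 90,
		Window: "5m", For: "2m", Aggregator: "avg",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(expr)
}
```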

Severity

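Value	Handling
`info`	Recorded; notification is gated by the channel's `match_severity_min` (most channels skip this lowest level)
`warning`	Default
`critical`	Always notifies unless silenced

A channel whose match_severity_min is warning receives warning and critical; one set to critical receives only critical. If it is empty, the channel receives everything.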
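The gating rule can be expressed as a rank comparison. A minimal sketch, assuming the ordering info < warning < critical implied by the table; `channelAccepts` and `severityRank` are hypothetical names, and rejecting unknown severity strings is an assumption.

```go
package main

import "fmt"

// severityRank orders the three documented severities.
var severityRank = map[string]int{"info": 0, "warning": 1, "critical": 2}

// channelAccepts reports whether a channel with the given
// match_severity_min would receive an incident of the given severity.
func channelAccepts(matchSeverityMin, severity string) bool {
	rank, known := severityRank[severity]
	if !known {
		return false // assumption: unknown severities are never delivered
	}
	if matchSeverityMin == "" {
		return true // empty floor: the channel receives everything
	}
	floor, known := severityRank[matchSeverityMin]
	if !known {
		return false // assumption: a misconfigured floor delivers nothing
	}
	return rank >= floor
}

func main() {
	for _, sev := range []string{"info", "warning", "critical"} {
		fmt.Printf("min=warning sev=%s accepted=%v\n", sev, channelAccepts("warning", sev))
	}
}
```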

Labels & annotations

Free-form key/value maps stored as JSON.

labels are added to the incident's labels at fire time and used for grouping / dedupe. Common examples: service, team, env.
annotations are templated at fire time with Go template syntax against the incident snapshot, e.g. summary: "CPU on {{$labels.device_id}} above 90%".
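The annotation example relies on a `$labels` template variable. In Go's text/template such a variable has to be defined before use, so one way to support it (the approach Prometheus takes for its own rule templates) is to prepend a definition to the user's text. The sketch below assumes that approach; the `snapshot` type and `renderAnnotation` helper are hypothetical, and the real incident snapshot likely exposes more than labels.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// snapshot is a hypothetical stand-in for the incident snapshot.
type snapshot struct {
	Labels map[string]string
}

// renderAnnotation expands one annotation value. The preamble binds
// $labels to the snapshot's labels and produces no output itself.
func renderAnnotation(text string, labels map[string]string) (string, error) {
	const preamble = `{{$labels := .Labels}}`
	tmpl, err := template.New("annotation").Parse(preamble + text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, snapshot{Labels: labels}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderAnnotation(
		"CPU on {{$labels.device_id}} above 90%",
		map[string]string{"device_id": "edge-01"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```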

Runbook

runbook_url is shown verbatim next to the AI investigation report in the incident detail view and on chat surfaces. Use it to link to an internal runbook / playbook.

Notification dampening

Field	Type	Default	Description
`notify_window_seconds`	int	`0`	Rolling window for dampening. `0` disables it.
`notify_min_fires`	int	`0`	Minimum number of fires required within the window before a notification is sent. `0` disables it.
`notify_channel_ids`	int[]	empty	Pins notifications to specific channel IDs (each channel's own enabled / severity / scope filters still apply). If empty, the global router is used.

A rule that has fired fewer than notify_min_fires times within the last notify_window_seconds records a repeat_suppressed event on the timeline (so you can see that dampening was applied) but sends no notification. When both are 0, dampening is off and every fire notifies, subject to the cooldown + silence + inhibition gates.

A mixed setting (one is 0, the other is >0) is rejected by the biz layer with invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0.
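The validation and the dampening decision described above can be sketched as two small functions. The names `validateDampening` and `shouldNotify` are hypothetical, fire times are plain Unix seconds to keep the example small, and counting the current fire as part of the window is an assumption. Cooldown, silence and inhibition gates are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

var errMixedDampening = errors.New(
	"invalid_argument: notify_window_seconds and notify_min_fires must both be zero or both > 0")

// validateDampening accepts only "both zero" or "both positive".
func validateDampening(windowSeconds, minFires int) error {
	bothZero := windowSeconds == 0 && minFires == 0
	bothPositive := windowSeconds > 0 && minFires > 0
	if bothZero || bothPositive {
		return nil
	}
	return errMixedDampening
}

// shouldNotify reports whether a fire at nowUnix passes dampening.
// fireUnix holds the rule's fire times, including the current one.
// A false result corresponds to a repeat_suppressed event.
func shouldNotify(fireUnix []int64, nowUnix int64, windowSeconds, minFires int) bool {
	if windowSeconds == 0 && minFires == 0 {
		return true // dampening off
	}
	cutoff := nowUnix - int64(windowSeconds)
	count := 0
	for _, t := range fireUnix {
		if t >= cutoff && t <= nowUnix {
			count++
		}
	}
	return count >= minFires
}

func main() {
	fmt.Println(validateDampening(600, 0)) // mixed setting: rejected

	fires := []int64{1000, 1200, 1500}
	fmt.Println(shouldNotify(fires, 1500, 600, 3)) // three fires inside the window
	fmt.Println(shouldNotify(fires, 1700, 600, 3)) // only two remain inside the window
}
```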

Incident lifecycle

The fields above describe the rule (the trigger definition). When a rule fires, it creates an incident (alert_incidents table) with the following states:

text

open ─┬─> acknowledged ─> resolved
      ├─> silenced ─> resolved
      └─> resolved

An incident auto-resolves once the underlying condition has cleared for one evaluator cycle. The state machine is documented in internal/manager/biz/alert/.
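The diagram above can be encoded as a transition table. This sketch covers only the arrows drawn in the diagram; `transitions` and `canTransition` are hypothetical names, and reopening a resolved incident is deliberately left out because the diagram does not show it.

```go
package main

import "fmt"

// transitions lists the allowed next states for each incident state,
// taken from the lifecycle diagram.
var transitions = map[string][]string{
	"open":         {"acknowledged", "silenced", "resolved"},
	"acknowledged": {"resolved"},
	"silenced":     {"resolved"},
	"resolved":     {},
}

// canTransition reports whether the diagram has an arrow from -> to.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("open", "silenced"))
	fmt.Println(canTransition("acknowledged", "silenced"))
}
```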

Event types

Every state transition records an alert_events row. The stable event type strings are:

Event type	When
`firing`	Rule first fires (incident created or reopened)
`repeat_suppressed`	Fire within the cooldown / dampening window
`acknowledged`	User clicks Ack
`silenced`	User silences for N hours
`resolved`	Condition clears or user clicks Resolve
`reopened`	A resolved incident fires again before deletion
`note`	User adds a comment
`notification_sent`	One successful channel delivery
`notification_failed`	One failed channel delivery
`inhibited`	Suppressed by an active incident of higher severity
`ai_initial_diagnosis`	First take from the proactive AI investigator, recorded when the incident first fires

Channel types

POST /v1/notification-channels accepts:

Type	`channel_type` value	Source
Webhook	`webhook`	env or UI
Slack	`slack`	env or UI
Larksuite / Feishu	`feishu`	env or UI
DingTalk	`dingtalk`	env or UI
WeCom (企业微信)	`wecom`	UI only
Telegram	`telegram`	UI only

The legacy log channel type was removed in 2026-05; alert_events itself is the delivery audit trail.

See also

REST API: endpoints for these objects.
Capabilities → Alerts: an operator's-eye tour.
Channels: outbound delivery wiring to chat surfaces.
