Specialists

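Ongrid ships with 5 built-in specialist personas. Each one owns a narrow slice of host / cluster diagnostics, carries a focused tool bag, and explicitly refuses tasks that belong to its neighbors. The coordinator dispatches by the shape of the request; each specialist's when_to_use block tells the coordinator's LLM when that specialist is the right fit.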

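| Persona | Owns | Does not own |
| --- | --- | --- |
| `specialist-compute` | CPU, memory, load, processes, OOM, scheduling | Disk, network, service restarts |
| `specialist-disk` | Filesystems, du / find / stat, inodes | Network, processes, application logs |
| `specialist-network` | OVS, netfilter, netns, conntrack, bpf, routing | Filesystems, processes |
| `specialist-ops` | Service start / stop / restart, journalctl, package management | Cluster trends, network kernel internals |
| `specialist-sre` | Four golden signals, SLOs, error budgets, triage | Single-host bash, deep RCA |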

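Why the slices are so narrow

A tighter tool bag means a tighter system prompt, which means each token goes further. A specialist-network worker with 8 tools gives more solid answers about iptables / nft than a coordinator with 60 tools in which iptables just happens to be included. Persona boundaries also let each domain carry its own domain-specific KB hints in the system prompt.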

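The KB-first convention

All 5 specialists start from the same step 0: call query_knowledge once with a natural-language description of the problem. If the top result scores ≥ 0.6, the specialist follows that playbook and ends its final answer with the literal citation (参考 KB: <title>), where "参考 KB" means "reference KB". This is enforced by the persona prompt, not by the runtime, but the prompt is direct enough that strong models follow it consistently.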

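Why a uniform step 0:

The vault holds team-specific playbooks (preferred commands, known pitfalls, escalation paths). For your fleet, they beat the model's general knowledge.
It anchors the worker's first turn to a known-good plan instead of letting it improvise.
The (参考 KB: …) citation makes the provenance auditable.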
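
The convention is enforced by the prompt, so there is no runtime code for it. Still, the decision it describes can be sketched in a few lines. In the sketch below, `KBHit`, `kb_first` and `with_citation` are hypothetical names, and the shape of the `query_knowledge` result (a list of scored hits) is an assumption; only the 0.6 threshold and the citation format come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

KB_SCORE_THRESHOLD = 0.6  # threshold stated in the KB-first convention


@dataclass
class KBHit:
    """Hypothetical shape of one query_knowledge result."""
    title: str
    score: float
    playbook: str


def kb_first(problem: str,
             query_knowledge: Callable[[str], Sequence[KBHit]]) -> Optional[KBHit]:
    """Step 0: one KB lookup; return the playbook to follow, or None."""
    hits = query_knowledge(problem)
    if not hits:
        return None
    top = max(hits, key=lambda h: h.score)
    return top if top.score >= KB_SCORE_THRESHOLD else None


def with_citation(answer: str, hit: Optional[KBHit]) -> str:
    """Append the auditable citation only when a playbook was followed."""
    return f"{answer} (参考 KB: {hit.title})" if hit else answer
```

A hit scoring below the threshold returns `None`, in which case the answer carries no citation and the specialist falls back to its own recipes.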

For what the vault contains by default and how to add your own playbooks, see Knowledge base.

`specialist-compute`: CPU / memory / load / processes

Frontmatter highlights:

```yaml
name: specialist-compute
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - get_host_load
  - get_host_processes
  - get_edge_summary
  - rank_edges
  - find_outlier_edges
  - query_promql
  - host_bash
```

When the coordinator picks it

"node-X CPU is pegged. Who is eating it?"
"Load spiked but CPU is idle. What is it waiting on?"
"Process memory bloat / OOM forensics."
"VM steal time: is it a noisy neighbor?"

Recipes baked into the persona

High load average + low CPU% → look for D-state processes (host_bash "ps -eo stat,pid,cmd | awk '$1 ~ /D/'"). This is most likely IO or network wait, so tell the coordinator to dispatch specialist-disk or specialist-network.
High CPU% → top processes, user vs system time, and the st column of vmstat for steal.
High mem_used_pct → check node_memory_* for cache / buffer / swap, run dmesg | grep -i 'oom-killer' for OOM hits, and report single-process RSS outliers with PID + name.
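
The first recipe hinges on picking uninterruptible (D-state) processes out of `ps` output. As a minimal sketch of what that filter does, assuming the `ps -eo stat,pid,cmd` column order shown above, the same selection can be written in Python. `Proc` and `d_state_processes` are illustrative names, not part of Ongrid.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    stat: str
    pid: int
    cmd: str


def d_state_processes(ps_output: str) -> List[Proc]:
    """Parse `ps -eo stat,pid,cmd` output and keep uninterruptible (D) processes."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # stat, pid, then the full command line
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # header row or malformed line
        stat, pid, cmd = parts
        if stat.startswith("D"):
            procs.append(Proc(stat, int(pid), cmd))
    return procs
```

The command column then hints at the handoff target: block-device or filesystem work points toward specialist-disk, socket or NFS activity toward specialist-network.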

Refuses (and tells the coordinator to re-dispatch)

"Should we restart nginx?" → specialist-ops (restarts go through host_restart_service + reviewer).
"The disk is about to fill up" → specialist-disk.
"Network packet loss" → specialist-network.

`specialist-disk`: filesystems / capacity

Frontmatter highlights:

```yaml
name: specialist-disk
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_find_large_files
  - host_du_summary
  - host_stat_file
  - host_bash
  - query_promql
  - get_host_load
```

4-step recipe

Macro confirmation: get_host_load for disk_used_pct, plus query_promql for node_filesystem_* trends.
Layered drill-down: host_du_summary(paths=["/", "/var", "/opt", "/home", "/tmp"], depth=1).
File location: host_find_large_files(paths=[the largest top-level directory], top_n=20).
Inode check when needed: host_bash "df -i".

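Anti-patterns the persona refuses

One call per path (pass an array of 4-8 paths per call instead; it is much faster).
Running on /proc, /sys or /dev: the sandbox rejects these, and the persona knows not to ask.
Any delete / mv / rm: the persona is read-only.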
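
The first two anti-patterns amount to a small planning step before any tool call: drop pseudo-filesystem paths and send the rest as arrays. A rough sketch under those assumptions follows; `batch_paths` and `is_pseudo_fs` are hypothetical helpers, and the default batch size of 8 simply reflects the upper end of the 4-8 range mentioned above.

```python
from typing import Iterable, List

PSEUDO_FS = ("/proc", "/sys", "/dev")  # rejected by the sandbox per the text above


def is_pseudo_fs(path: str) -> bool:
    """True for a pseudo-filesystem root or anything underneath it."""
    return any(path == root or path.startswith(root + "/") for root in PSEUDO_FS)


def batch_paths(paths: Iterable[str], batch_size: int = 8) -> List[List[str]]:
    """Drop pseudo-filesystem paths, de-duplicate, and group into array-sized batches."""
    seen, kept = set(), []
    for p in paths:
        if p not in seen and not is_pseudo_fs(p):
            seen.add(p)
            kept.append(p)
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
```

Each returned batch would become the `paths` argument of a single host_du_summary or host_find_large_files call.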

`specialist-network`: packets / netns / iptables / OVS

Frontmatter highlights:

```yaml
name: specialist-network
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - host_probe_http
  - host_probe_dns
  - host_probe_tcp
  - host_netns_inspect
  - query_promql
  - get_host_load
```

Recipe

Topology first. ip -j addr show + host_netns_inspect to map out the interface and namespace layout.
Link state. ethtool ethX for link speed and ethtool -i ethX for the driver; ss -tnp for connections.
NAT / firewall. nft list ruleset, iptables -L -n, conntrack -S.
OVS. ovs-vsctl show, ovs-ofctl dump-flows br0.
eBPF. bpftool prog show, bpftool net show.
Connectivity probes. host_probe_tcp, host_probe_http, host_probe_dns.

Every host_bash call goes through the cmdpolicy allowlist. For the cmdpolicy entries covering OVS / nft / conntrack / bpftool / ethtool / ip netns, see Layer-1 network R&D.

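Output discipline

Three lines: symptom (packet loss / high RTT / NAT table full / empty flow table / wrong route), root cause (verdict + key evidence), next step (suggested action: restart a service, change a route, add a rule). No raw ovs-ofctl dumps: the coordinator does not read them.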
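
To make the three-line shape concrete, here is a small sketch of a formatter that always emits exactly three lines. The `NetworkFinding` class and the English labels are illustrative assumptions; the persona prompt, not code, is what actually shapes the output.

```python
from dataclasses import dataclass


@dataclass
class NetworkFinding:
    symptom: str      # e.g. packet loss, high RTT, NAT table full
    root_cause: str   # verdict plus the key piece of evidence
    next_step: str    # suggested action for the coordinator

    def render(self) -> str:
        """Return exactly three lines, collapsing any embedded newlines."""
        def one_line(text: str) -> str:
            return " ".join(text.split())

        return "\n".join([
            f"Symptom: {one_line(self.symptom)}",
            f"Root cause: {one_line(self.root_cause)}",
            f"Next step: {one_line(self.next_step)}",
        ])
```

Collapsing whitespace is what keeps a pasted multi-line dump from leaking into the reply and breaking the three-line contract.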

`specialist-ops`: service operations

This is the specialist that writes. Frontmatter highlights:

```yaml
name: specialist-ops
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - get_host_processes
  - get_host_load
  - host_restart_service   # mutating; gated by ReviewGate
  - query_promql
  - query_logql
  - get_edge_summary
```

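What makes it special

It carries host_restart_service, which has Class: "write". Every call triggers the ReviewGate decorator: the gate spawns a reviewer worker, and the restart only happens if the reviewer approves.
It cannot bypass the gate with host_bash systemctl restart: the edge's cmdpolicy rejects every mutating systemctl subcommand, no matter who issued the command.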
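
The gate behaves like a decorator around the write-class tool. The following is a minimal sketch of that pattern, not Ongrid's ReviewGate implementation: the reviewer is modeled as a plain callable returning approve / reject, whereas the real gate spawns a reviewer worker. `review_gate`, `ReviewRejected` and `demo_reviewer` are hypothetical names.

```python
import functools
from typing import Callable


class ReviewRejected(Exception):
    """Raised when the reviewer does not approve a write-class call."""


def review_gate(reviewer: Callable[[str, dict], bool]):
    """Wrap a write-class tool so it runs only after the reviewer approves."""
    def decorator(tool: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(tool)
        def gated(**kwargs) -> str:
            if not reviewer(tool.__name__, kwargs):
                raise ReviewRejected(f"{tool.__name__} rejected for {kwargs}")
            return tool(**kwargs)
        return gated
    return decorator


# Hypothetical reviewer policy for the demo: approve restarts of nginx only.
def demo_reviewer(tool_name: str, args: dict) -> bool:
    return tool_name == "host_restart_service" and args.get("service") == "nginx"


@review_gate(demo_reviewer)
def host_restart_service(service: str) -> str:
    return f"restarted {service}"  # stand-in for the real restart
```

Because the check lives in the wrapper, the tool body never runs on a rejected call, which mirrors the "only an approve actually restarts" rule.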

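When the coordinator picks it

Service-specific status / logs / restarts (nginx, mysql, in-house processes).
systemd unit status, errors from journalctl -u.
Recent restart counts, OOM correlation.
cron / timer schedule checks.
Broken packages (dpkg / apt / yum).
"Should we restart X to clear the leak?" Here ops makes the recommendation and the reviewer guards the door.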

Refuses

Cluster trends / SLO judgments → specialist-sre.
Network kernel internals → specialist-network.
Deep file-level disk analysis → specialist-disk.

`specialist-sre`: four golden signals / triage

The persona that answers "is the system healthy?" and "which incident matters most?". Frontmatter highlights:

```yaml
name: specialist-sre
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - correlate_incident
  - get_active_incidents
  - get_incident_detail
  - get_edge_summary
  - query_promql
  - query_logql
  - find_outlier_edges
  - rank_edges
  - get_host_load
```

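Working style

Incident list first. get_active_incidents: what is burning right now, and at what severity.
Trends, not single-host metrics. query_promql over 1h / 24h windows, compared against the baseline.
Find outlier hosts. find_outlier_edges / rank_edges rather than hand-written PromQL that computes the IQR.
Speak in the four golden signals. Latency (p50/p95/p99) / error rate / traffic / saturation, each stated as "baseline → current → direction of deviation".
Priority call. P0 (users impacted, still getting worse), P1 (users impacted, stable), P2 (internal / trend), P3 (noise / false positive).
Delegate downward. Suspect disk → tell the coordinator to dispatch specialist-disk. Suspect network → specialist-network. It does not dig deep itself.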
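
The priority ladder reduces to three yes/no judgments. As a sketch only, assuming those judgments have already been made from the signals above, the mapping could look like this; `Priority` and `triage` are illustrative names, and in practice the persona makes this call in prose.

```python
from enum import Enum


class Priority(Enum):
    P0 = "users impacted, still getting worse"
    P1 = "users impacted, stable"
    P2 = "internal / trend"
    P3 = "noise / false positive"


def triage(user_impact: bool, worsening: bool, real_signal: bool) -> Priority:
    """Map three yes/no judgments onto the P0-P3 ladder."""
    if not real_signal:
        return Priority.P3  # noise wins over everything else
    if user_impact:
        return Priority.P0 if worsening else Priority.P1
    return Priority.P2
```

Checking `real_signal` first means a false positive is never escalated, however alarming the raw numbers look.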

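Why SRE does not do RCA

For incidents that need full root-cause attribution, the coordinator picks incident-investigator, not specialist-sre. The SRE persona stops at "this is a real P1 saturation on payments; recommend dispatching incident-investigator with incident_id=1234". The investigator walks the causal chain; SRE decides whether the chain is worth walking.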

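How the coordinator picks

The catalog is rendered into the coordinator's system prompt by buildAgentCatalog. Each persona exposes:

name (the value of subagent_type).
The persona's description.
The first non-empty line of when_to_use.

So the system prompt the LLM sees looks like this:

```text
## Available specialist assistants (subagent_type for AgentTool)

- `specialist-compute` - Compute specialist: CPU / memory / load / process scheduling…
- `specialist-disk` - Filesystem / disk capacity specialist: du / find / stat / inode…
- `specialist-network` - Network specialist: OVS / netfilter / netns…
- `specialist-ops` - Ops / service operations specialist: service status / start, stop, restart…
- `specialist-sre` - SRE / observability specialist: alert response / four golden signals / SLO…
- `incident-investigator` - Alert root-cause diagnosis worker…
```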

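reviewer and default are deliberately excluded: the former belongs to ReviewGate, and the latter is the top-level virtual coordinator persona, so listing it would let the coordinator recursively spawn itself.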
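
The rendering rules described here are simple enough to sketch. The Python below is an illustration, not the actual buildAgentCatalog: the `Persona` fields mirror the three exposed items, while the exact bullet format and the way the when_to_use line is attached are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable

EXCLUDED = {"reviewer", "default"}  # excluded per the text above


@dataclass
class Persona:
    name: str
    description: str
    when_to_use: str


def first_non_empty_line(text: str) -> str:
    """Return the first line that contains non-whitespace, stripped."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_agent_catalog(personas: Iterable[Persona]) -> str:
    """Render one bullet per persona, skipping the excluded ones."""
    lines = []
    for p in personas:
        if p.name in EXCLUDED:
            continue
        hint = first_non_empty_line(p.when_to_use)
        entry = f"- `{p.name}`: {p.description}"
        lines.append(f"{entry} ({hint})" if hint else entry)
    return "\n".join(lines)
```

Filtering by name before rendering is what keeps the coordinator from ever seeing itself as a dispatch target.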

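Cross-specialist handoff patterns

When the answer is "this is not mine to handle":

| Specialist | Common handoff |
| --- | --- |
| compute → disk | D-state processes waiting on IO |
| compute → net | D-state processes blocked on the network (socket, recv) |
| compute → ops | "Process is leaking; recommend restarting the service" |
| disk → ops | "Logs are exploding; recommend tightening logrotate" |
| net → ops | "iptables rule patched the hole; recommend restarting the service to be safe" |
| sre → investigator | "P1 saturation, real, needs RCA" |
| sre → compute | "Saturation alert: drill into CPU/mem" |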

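Handoffs are verbal (the specialist says "recommend dispatching specialist-X because Y" in its reply), not tool calls. The coordinator reads the recommendation and dispatches the next worker. Specialists cannot spawn workers; see Agent overview.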
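
Since the handoff lives in free text, the coordinator's LLM is what interprets it. For auditing transcripts after the fact, though, a rough extractor can be handy. The sketch below assumes replies use either the Chinese wording "建议派 <name> because <reason>" or an English "recommend dispatching <name> because <reason>"; both patterns and the `parse_handoff` helper are assumptions, not an Ongrid API.

```python
import re
from typing import Optional, Tuple

# Both phrasings are assumed; real replies are free text and may differ.
HANDOFF_RE = re.compile(
    r"(?:建议派|recommend dispatching)\s+([a-z]+(?:-[a-z]+)*)"
    r"(?:\s+because\s+(.+))?"
)


def parse_handoff(reply: str) -> Optional[Tuple[str, str]]:
    """Return (target_persona, reason) if the reply recommends a handoff."""
    match = HANDOFF_RE.search(reply)
    if match is None:
        return None
    reason = (match.group(2) or "").strip().rstrip(".")
    return match.group(1), reason
```

A reply with no recommendation yields `None`, which an audit script can count as "answered in place".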

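Tuning specialist personas

Common changes in a fork:

Tool bag: add your own BaseTool (for example query_clickhouse) to the tools: list of the relevant persona. The worker filter trusts the persona allowlist.
max_turns: the default of 15 is generous; tighten it to 10 for cost-sensitive specialists.
Persona body: add your team's preferred commands ("on this fleet, prefer journalctl --since '5m ago' over --since '1h ago'").
when_to_use: tighten or loosen it depending on how often the coordinator picks this persona.

Do not change name: the catalog and AgentTool look up workers by exact name. See Custom agents.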
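
To illustrate what "the worker filter trusts the persona allowlist" implies for a fork, here is a hedged sketch of an allowlist filter over a tool registry. `filter_tools` is a hypothetical helper, and raising on an unknown tool name is an assumption made for the example; the real filter may behave differently.

```python
from typing import Callable, Dict, List


def filter_tools(registry: Dict[str, Callable],
                 allowlist: List[str]) -> Dict[str, Callable]:
    """Expose only the tools named in the persona's tools: list."""
    missing = [name for name in allowlist if name not in registry]
    if missing:
        # Assumed behavior: fail loudly so a typo in tools: is caught early.
        raise ValueError(f"unknown tools in persona allowlist: {missing}")
    return {name: registry[name] for name in allowlist}
```

Under this model, adding query_clickhouse to a persona takes two steps: register the tool, then list its exact name under tools:.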
