Specialists
Ongrid ships five specialist personas. Each one owns a narrow slice of host / cluster diagnostics, has a focused tool bag, and explicitly refuses tasks that belong to a peer. The coordinator dispatches based on the request shape; the specialist's when_to_use block tells the coordinator's LLM when it's the right fit.
| Persona | Owns | Doesn't own |
|---|---|---|
| specialist-compute | CPU, memory, load, processes, OOM, scheduler | Disk, network, service restart |
| specialist-disk | Filesystem, du / find / stat, inodes | Network, processes, business logs |
| specialist-network | OVS, netfilter, netns, conntrack, bpf, routes | Filesystem, processes |
| specialist-ops | Service start/stop/restart, journalctl, packages | Cluster trends, network internals |
| specialist-sre | Golden four signals, SLOs, error budgets, triage | Single-host bash, deep RCA |
Why split fine-grained
Tighter tool bags = tighter system prompt = deeper reasoning per token. A specialist-network worker with 8 tools gives stronger answers about iptables / nft than a coordinator with 60 tools that happens to have iptables among them. The persona walls also let each domain carry domain-specific KB hints in its system prompt.
The KB-first convention
All five specialists begin with the same step 0: call query_knowledge once with a natural-language description of the problem. If the top result scores ≥ 0.6, follow the playbook and end the answer with (参考 KB: <title>), i.e. "Reference KB: <title>". This is enforced by the persona prompt, not the runtime, but the prompts are direct enough that strong models follow them consistently.
Why a uniform step 0:
- The vault carries team-specific playbooks (preferred commands, known traps, escalation paths). They beat the model's general knowledge for your fleet.
- It anchors the worker's first turn around a known-good plan rather than letting it improvise.
- The citation (参考 KB: …) makes the provenance auditable.
See Knowledge base for what the vault contains by default and how to add your own playbooks.
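The step-0 rule can be pictured as a small decision function. This is only an illustrative sketch of the convention the persona prompt describes, not runtime code: the `KBHit` shape and both helper names are hypothetical, and only the 0.6 threshold and the citation suffix come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

KB_SCORE_THRESHOLD = 0.6  # threshold quoted in the convention above


@dataclass
class KBHit:
    """Hypothetical shape of one query_knowledge result."""
    title: str      # playbook title, reused in the citation
    score: float    # relevance score of the hit
    playbook: str   # playbook body the worker would follow


def pick_playbook(hits: List[KBHit]) -> Optional[KBHit]:
    """Return the top-scoring hit if it clears the threshold, else None."""
    if not hits:
        return None
    top = max(hits, key=lambda h: h.score)
    return top if top.score >= KB_SCORE_THRESHOLD else None


def finish_answer(answer: str, hit: Optional[KBHit]) -> str:
    """Append the KB citation only when a playbook was actually followed."""
    if hit is None:
        return answer
    return f"{answer}\n(参考 KB: {hit.title})"
```

A hit below the threshold leaves the answer uncited, which keeps the citation meaningful as a provenance marker.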
specialist-compute — CPU / memory / load / processes
Frontmatter highlights:
name: specialist-compute
permission_mode: read-only
max_turns: 15
tools:
- query_knowledge
- get_host_load
- get_host_processes
- get_edge_summary
- rank_edges
- find_outlier_edges
- query_promql
- host_bash
When the coordinator picks it
- "CPU pegged on node-X, who's eating it?"
- "Load spiking but CPU idle — what's blocked?"
- "Process memory growth / OOM forensics."
- "VM steal time — am I seeing noisy neighbor?"
The recipes baked into the persona
- load avg high + CPU% low → look for D-state processes (host_bash "ps -eo stat,pid,cmd | awk '$1 ~ /D/'"). Likely IO / network wait; tell the coordinator to dispatch specialist-disk or specialist-network.
- CPU% high → top processes, user vs system time, vmstat st column for steal.
- mem_used_pct high → node_memory_* for cached / buffers / swap, dmesg | grep -i 'oom-killer' for OOM hits, single-process RSS outliers with PID + name.
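As a rough sketch of the recipes above: the first helper applies the same filter as the awk expression to `ps -eo stat,pid,cmd` output, and the second maps the three symptom patterns to a next step. The numeric thresholds and both function names are hypothetical; the persona prompt does not state exact cut-offs.

```python
from typing import List

# Hypothetical cut-offs, chosen only to make the sketch concrete.
LOAD_HIGH = 1.0   # load average per core
CPU_HIGH = 80.0   # percent
MEM_HIGH = 90.0   # percent


def find_d_state(ps_output: str) -> List[str]:
    """Keep rows whose first column (STAT) contains 'D', like awk '$1 ~ /D/'."""
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if fields and "D" in fields[0]:
            rows.append(line.strip())
    return rows


def triage(load_per_core: float, cpu_pct: float, mem_used_pct: float) -> str:
    """Map the three symptom patterns from the recipes to a next step."""
    if load_per_core >= LOAD_HIGH and cpu_pct < CPU_HIGH:
        return "check D-state processes; likely IO or network wait"
    if cpu_pct >= CPU_HIGH:
        return "inspect top processes, user vs system time, steal"
    if mem_used_pct >= MEM_HIGH:
        return "inspect node_memory_*, OOM-killer hits, RSS outliers"
    return "no compute-side signal"
```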
Rejects (will tell the coordinator to redispatch)
- "Should we restart nginx?" → specialist-ops (because restart goes through host_restart_service and the reviewer).
- "Disk filling up" → specialist-disk.
- "Network packets dropping" → specialist-network.
specialist-disk — filesystem / capacity
Frontmatter highlights:
name: specialist-disk
permission_mode: read-only
max_turns: 15
tools:
- query_knowledge
- host_find_large_files
- host_du_summary
- host_stat_file
- host_bash
- query_promql
- get_host_load
The 4-step recipe
- Macro confirm: get_host_load for disk_used_pct, plus query_promql for the node_filesystem_* trend.
- Layer drill-down: host_du_summary(paths=["/", "/var", "/opt", "/home", "/tmp"], depth=1).
- File pinpoint: host_find_large_files(paths=[biggest top-level], top_n=20).
- Inode check if needed: host_bash "df -i".
Anti-patterns the persona refuses
- One call per path (pass an array of 4-8 paths per call instead; it's much faster).
- Running on /proc, /sys, or /dev: the sandbox rejects these, and the persona knows not to ask.
- Any delete / mv / rm: the persona is read-only.
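The first two anti-patterns amount to "filter out pseudo-filesystems, then batch". A minimal sketch, assuming the caller holds a flat list of candidate paths; `batch_paths` and `is_pseudo` are hypothetical helpers, not part of the tool API.

```python
from typing import Iterable, List

# Pseudo-filesystems the sandbox rejects, per the list above.
PSEUDO_FS = ("/proc", "/sys", "/dev")


def is_pseudo(path: str) -> bool:
    """True for a pseudo-filesystem root or anything underneath it."""
    return any(path == root or path.startswith(root + "/") for root in PSEUDO_FS)


def batch_paths(paths: Iterable[str], batch_size: int = 8) -> List[List[str]]:
    """Drop rejected paths, then group the rest into arrays for one call each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    kept = [p for p in paths if not is_pseudo(p)]
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
```

Each inner list would then become the paths argument of a single host_du_summary or host_find_large_files call.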
specialist-network — packets / netns / iptables / OVS
Frontmatter highlights:
name: specialist-network
permission_mode: read-only
max_turns: 15
tools:
- query_knowledge
- host_bash
- host_probe_http
- host_probe_dns
- host_probe_tcp
- host_netns_inspect
- query_promql
- get_host_load
The recipes
- Topology first. ip -j addr show + host_netns_inspect to understand interface and namespace layout.
- Link state. ethtool -i ethX (driver + speed); ss -tnp for connections.
- NAT / firewall. nft list ruleset, iptables -L -n, conntrack -S.
- OVS. ovs-vsctl show, ovs-ofctl dump-flows br0.
- eBPF. bpftool prog show, bpftool net show.
- Probes for connectivity. host_probe_tcp, host_probe_http, host_probe_dns.
Each host_bash call carries a cmdpolicy allowlist — see Layer-1 network research for the OVS / nft / conntrack / bpftool / ethtool / ip netns cmdpolicy entries.
Output discipline
Three lines: 现象 (symptom: packet drop / high RTT / NAT table full / empty flow table / wrong route), 根因 (judgment + key evidence), 下一步 (recommended next action: restart service, update route, add rule). No raw ovs-ofctl dumps — the coordinator doesn't read them.
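The three-line discipline could be enforced by a tiny formatter like the one below. This is a sketch under assumptions: the `NetworkFinding` type and the "label: text" layout are hypothetical, and only the three labels come from the text above.

```python
from dataclasses import dataclass


@dataclass
class NetworkFinding:
    """Hypothetical container for the three fields of a network answer."""
    symptom: str      # 现象
    root_cause: str   # 根因
    next_step: str    # 下一步

    def render(self) -> str:
        """Render exactly three lines; multi-line fields (raw dumps) are refused."""
        fields = {
            "现象": self.symptom,
            "根因": self.root_cause,
            "下一步": self.next_step,
        }
        for label, text in fields.items():
            if "\n" in text:
                raise ValueError(f"{label} must be a single line, not a raw dump")
        return "\n".join(f"{label}: {text}" for label, text in fields.items())
```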
specialist-ops — services and operations
This is the mutating specialist. Frontmatter highlights:
name: specialist-ops
permission_mode: read-only
max_turns: 15
tools:
- query_knowledge
- host_bash
- get_host_processes
- get_host_load
- host_restart_service # mutating; gated by ReviewGate
- query_promql
- query_logql
- get_edge_summary
What's special
- It carries host_restart_service, which is Class: "write". Any invocation triggers the ReviewGate decorator: the gate spawns the reviewer worker; only an approved decision actually restarts anything.
- It cannot route around the gate via host_bash systemctl restart: the edge cmdpolicy denies mutating systemctl subcommands regardless of which agent issued the command.
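The gate behaviour described above can be sketched as a wrapper around write-class tools. Everything here is illustrative: the `Tool` type, the `review_gate` function, and the reviewer callback signature are assumptions standing in for the real ReviewGate decorator and reviewer worker.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Args = Dict[str, str]


@dataclass
class Tool:
    """Hypothetical tool record; tool_class mirrors the Class field above."""
    name: str
    tool_class: str              # "read" or "write"
    run: Callable[[Args], str]


def review_gate(tool: Tool, reviewer: Callable[[str, Args], bool]) -> Tool:
    """Wrap write-class tools so they run only after the reviewer approves."""
    if tool.tool_class != "write":
        return tool  # read-only tools pass through untouched

    def gated(args: Args) -> str:
        if not reviewer(tool.name, args):
            return f"denied by reviewer: {tool.name}"
        return tool.run(args)

    return Tool(tool.name, tool.tool_class, gated)
```

In this sketch a denied call never reaches the underlying tool, which matches "only an approved decision actually restarts anything".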
When the coordinator picks it
- Service-specific status / logs / restart (nginx, mysql, our own process).
- systemd unit status, journalctl -u errors.
- Recent restart count, OOM correlations.
- cron / timer schedule check.
- Package broken state (dpkg / apt / yum).
- "Should we restart X to clear the leak?" — ops is the persona that proposes; reviewer is the gate.
Rejects
- Cluster trends / SLO judgment → specialist-sre.
- Network internals → specialist-network.
- Deep file-level disk analysis → specialist-disk.
specialist-sre — golden four signals / triage
The persona for "is the system healthy" / "which incident matters most". Frontmatter highlights:
name: specialist-sre
permission_mode: read-only
max_turns: 15
tools:
- query_knowledge
- correlate_incident
- get_active_incidents
- get_incident_detail
- get_edge_summary
- query_promql
- query_logql
- find_outlier_edges
- rank_edges
- get_host_load
Working style
- Incident-list first. get_active_incidents: what's currently firing and at what severity.
- Trends, not single-host metrics. query_promql over 1h / 24h windows compared to baseline.
- Find the outlier host. find_outlier_edges / rank_edges rather than hand-rolled IQR in PromQL.
- Speak golden-four-signals. Latency (p50/p95/p99) / error rate / traffic / saturation, each one as "baseline → current → deviation direction".
- Priority decisions. P0 (user impact, getting worse), P1 (user impact, stable), P2 (internal / trend), P3 (noise / false positive).
- Delegate down. Suspects disk → tell coordinator to dispatch specialist-disk. Suspects network → specialist-network. Doesn't try to do the deep dive itself.
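The priority ladder and the "baseline → current → deviation direction" phrasing can be written down as two small functions. This is a sketch of the working style only; the function names and the boolean inputs are hypothetical simplifications of a judgment the persona makes from several signals.

```python
def priority(user_impact: bool, worsening: bool, false_positive: bool = False) -> str:
    """Map the P0-P3 definitions from the working style to a label."""
    if false_positive:
        return "P3"                       # noise / false positive
    if user_impact:
        return "P0" if worsening else "P1"
    return "P2"                           # internal / trend


def describe_signal(name: str, baseline: float, current: float, unit: str = "") -> str:
    """Phrase one golden signal as baseline, current value, and direction."""
    if current > baseline:
        direction = "up"
    elif current < baseline:
        direction = "down"
    else:
        direction = "flat"
    return f"{name}: {baseline}{unit} → {current}{unit} ({direction})"
```

For example, `describe_signal("p99 latency", 120, 480, "ms")` yields `p99 latency: 120ms → 480ms (up)`.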
Why SRE doesn't do RCA
For an incident that needs full root-cause attribution, the coordinator picks incident-investigator, not specialist-sre. The SRE persona stops at "this is a real P1 saturation issue on payments; recommend dispatching incident-investigator with incident_id=1234". The investigator walks the causal chain; the SRE judges whether it's worth walking.
How the coordinator picks
The catalog is rendered into the coordinator's system prompt by buildAgentCatalog. For each persona it surfaces:
- name (the subagent_type value).
- The persona's description.
- The first non-empty line of when_to_use.
So an LLM reading the system prompt sees something like:
## 可用的 specialist 助理(AgentTool 的 subagent_type)
- `specialist-compute` — 计算专家——CPU / 内存 / load / 进程调度…
- `specialist-disk` — 文件系统 / 磁盘容量专家——du / find / stat / inode…
- `specialist-network` — 网络问题专家——OVS / netfilter / netns…
- `specialist-ops` — 运维 / 服务运营专家——服务状态 / 启停重启…
- `specialist-sre` — SRE / 可观测性专家——告警响应 / 黄金四信号 / SLO…
- `incident-investigator` — 告警根因诊断 worker…
reviewer and default are deliberately excluded: the first is reserved for the ReviewGate; the second is the virtual top-level coordinator persona, and listing it would let the coordinator recursively spawn itself.
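A minimal sketch of what a catalog renderer along these lines might do, assuming a simple `Persona` record. The real buildAgentCatalog is not shown in this page, so the function below, its line layout, and the way the when_to_use hint is appended are all assumptions; only the three surfaced fields and the two excluded names come from the text.

```python
from dataclasses import dataclass
from typing import Iterable

# Personas that must never appear in the catalog, per the paragraph above.
EXCLUDED = {"reviewer", "default"}


@dataclass
class Persona:
    name: str
    description: str
    when_to_use: str


def first_non_empty_line(text: str) -> str:
    """Return the first line with visible content, stripped, or ''."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_agent_catalog(personas: Iterable[Persona]) -> str:
    """Render one bullet per dispatchable persona."""
    lines = []
    for p in personas:
        if p.name in EXCLUDED:
            continue
        entry = f"- `{p.name}`: {p.description}"
        hint = first_non_empty_line(p.when_to_use)
        if hint:
            entry += f" ({hint})"
        lines.append(entry)
    return "\n".join(lines)
```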
Cross-handoff patterns
When the answer is "this isn't my domain":
| Handoff | Typical trigger |
|---|---|
| compute → disk | D-state processes on IO wait |
| compute → net | D-state processes blocked on network (sockets, recv) |
| compute → ops | "Process is leaking; recommend restarting service" |
| disk → ops | "Logs are blowing up; recommend tighter logrotate" |
| net → ops | "iptables rule fixes the leak; recommend service restart for safety" |
| sre → investigator | "P1 saturation, real, needs RCA" |
| sre → compute | "Saturation alarm — drill CPU/mem" |
The handoff is verbal (the specialist says "建议派 specialist-X because Y", roughly "suggest dispatching specialist-X because Y", in its reply), not a tool call. The coordinator reads the recommendation and dispatches the next worker. Specialists cannot spawn workers; see Agents overview.
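In production the coordinator's LLM simply reads the recommendation. For logging or evaluation, though, it can be handy to pull the suggested worker out of a reply mechanically. The sketch below assumes the reply uses the "建议派 <name>" phrasing quoted above; the regex and the function name are hypothetical.

```python
import re
from typing import Optional

# Matches the phrasing quoted above, followed by a known worker name pattern.
HANDOFF_RE = re.compile(r"建议派\s*(specialist-[a-z]+|incident-investigator)")


def recommended_worker(reply: str) -> Optional[str]:
    """Return the worker a specialist recommends, or None if it recommends none."""
    match = HANDOFF_RE.search(reply)
    return match.group(1) if match else None
```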
Tuning specialist personas
Common edits in a fork:
- Tool bag: add an in-house BaseTool (e.g. query_clickhouse) to the relevant persona's tools: list. The worker filter trusts the persona whitelist.
- max_turns: the default 15 is generous; tighten to 10 for cost-sensitive specialists.
- Persona body: add your team's command preferences ("on this fleet, journalctl --since '5m ago' before --since '1h ago'").
- when_to_use: tighten or widen to suit how often you want the coordinator to pick this persona.
Don't change name — the catalog and AgentTool look workers up by exact name. See Custom agents.
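To make "the worker filter trusts the persona whitelist" concrete, here is a hedged sketch of a whitelist filter over a tool registry. The registry shape, the `select_tools` name, and the choice to fail loudly on unknown names are assumptions; the real filter may behave differently.

```python
from typing import Dict, List


def select_tools(registry: Dict[str, object], whitelist: List[str]) -> Dict[str, object]:
    """Return only the whitelisted tools, in whitelist order.

    Raising on unknown names is a design choice of this sketch: a typo in a
    forked persona's tools: list then surfaces at load time.
    """
    missing = [name for name in whitelist if name not in registry]
    if missing:
        raise KeyError(f"persona lists unknown tools: {missing}")
    return {name: registry[name] for name in whitelist}
```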