Specialists

Ongrid ships five specialist personas. Each one owns a narrow slice of host / cluster diagnostics, has a focused tool bag, and explicitly refuses tasks that belong to a peer. The coordinator dispatches based on the request shape; the specialist's when_to_use block tells the coordinator's LLM when it's the right fit.

Persona	Owns	Doesn't own
`specialist-compute`	CPU, memory, load, processes, OOM, scheduler	Disk, network, service restart
`specialist-disk`	Filesystem, du / find / stat, inodes	Network, processes, business logs
`specialist-network`	OVS, netfilter, netns, conntrack, bpf, routes	Filesystem, processes
`specialist-ops`	Service start/stop/restart, journalctl, packages	Cluster trends, network internals
`specialist-sre`	Golden four signals, SLOs, error budgets, triage	Single-host bash, deep RCA

Why the split is fine-grained

A tighter tool bag means a tighter system prompt, and a tighter system prompt means deeper reasoning per token. A specialist-network worker with 8 tools gives stronger answers about iptables / nft than a coordinator with 60 tools that happens to have iptables among them. The persona walls also let each specialist carry domain-specific KB hints in its system prompt.

The KB-first convention

All five specialists begin with the same step 0: call query_knowledge once with a natural-language description of the problem. If the top result scores ≥ 0.6, follow the playbook and end the answer with (参考 KB: <title>), which reads "Reference KB: <title>". This is enforced by the persona prompt, not the runtime, but the prompts are direct enough that strong models follow the rule consistently.
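The step-0 rule can be sketched in a few lines. This is only an illustration of the decision the persona prompt describes, not the runtime's code: the `KBHit` shape, `pick_playbook`, and `finish_answer` are hypothetical names, and the real query_knowledge result format may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

KB_SCORE_THRESHOLD = 0.6  # threshold named in the persona prompt


@dataclass(frozen=True)
class KBHit:
    """Hypothetical shape of one query_knowledge result."""
    title: str
    score: float
    playbook: str


def pick_playbook(hits: Sequence[KBHit]) -> Optional[KBHit]:
    """Return the top-scoring hit if it clears the threshold, else None."""
    if not hits:
        return None
    top = max(hits, key=lambda h: h.score)
    return top if top.score >= KB_SCORE_THRESHOLD else None


def finish_answer(answer: str, hit: Optional[KBHit]) -> str:
    """Append the KB citation only when a playbook was actually followed."""
    if hit is None:
        return answer
    return f"{answer}\n(参考 KB: {hit.title})"
```

A result scoring exactly 0.6 counts as a match here, following the "≥ 0.6" wording; when nothing clears the threshold the answer is returned without a citation.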

Why a uniform step 0:

The vault carries team-specific playbooks (preferred commands, known traps, escalation paths). They beat the model's general knowledge for your fleet.
It anchors the worker's first turn around a known-good plan rather than letting it improvise.
Citation (参考 KB: …) makes the provenance auditable.

See Knowledge base for what the vault contains by default and how to add your own playbooks.

`specialist-compute` — CPU / memory / load / processes

Frontmatter highlights:

yaml

name: specialist-compute
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - get_host_load
  - get_host_processes
  - get_edge_summary
  - rank_edges
  - find_outlier_edges
  - query_promql
  - host_bash

When the coordinator picks it

"CPU pegged on node-X, who's eating it?"
"Load spiking but CPU idle — what's blocked?"
"Process memory growth / OOM forensics."
"VM steal time — am I seeing a noisy neighbor?"

The recipes baked into the persona

load avg high + CPU% low → look for D-state processes (host_bash "ps -eo stat,pid,cmd | awk '$1 ~ /D/'"). Likely IO / network wait — tell the coordinator to dispatch specialist-disk or specialist-network.
CPU% high → top processes, user vs system time, vmstat st column for steal.
mem_used_pct high → node_memory_* for cached / buffers / swap, dmesg | grep -i 'oom-killer' for OOM hits, single-process RSS outliers with PID + name.
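For readers who want to post-process the D-state check from the first recipe outside the shell, the following is a minimal sketch of the same filter as the awk one-liner. The function name `d_state_processes` is an assumption made for illustration; it only parses text already produced by `ps -eo stat,pid,cmd`.

```python
from typing import List, Tuple


def d_state_processes(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, cmd) for processes whose STAT column contains 'D'.

    Expects the output of `ps -eo stat,pid,cmd`, header line included.
    """
    blocked = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # stat, pid, rest of the command line
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # header row or malformed line
        stat, pid, cmd = parts
        if "D" in stat:  # uninterruptible sleep, usually IO or network wait
            blocked.append((int(pid), cmd.strip()))
    return blocked
```

An empty result with high load average points away from IO wait; a non-empty one is the evidence the persona attaches when it recommends redispatching to specialist-disk or specialist-network.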

Rejects (will tell the coordinator to redispatch)

"Should we restart nginx?" → specialist-ops (because restart goes through host_restart_service and the reviewer).
"Disk filling up" → specialist-disk.
"Network packets dropping" → specialist-network.

`specialist-disk` — filesystem / capacity

Frontmatter highlights:

yaml

name: specialist-disk
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_find_large_files
  - host_du_summary
  - host_stat_file
  - host_bash
  - query_promql
  - get_host_load

The 4-step recipe

Macro confirm: get_host_load for disk_used_pct + query_promql node_filesystem_* trend.
Layer drill-down: host_du_summary(paths=["/", "/var", "/opt", "/home", "/tmp"], depth=1).
File pinpoint: host_find_large_files(paths=[biggest top-level], top_n=20).
Inode check if needed: host_bash "df -i".
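Step 4 returns plain `df -i` text, so a small parser helps turn it into a yes/no answer. The sketch below assumes the standard column layout (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on); the function name `inode_pressure` and the 90% default threshold are illustrative choices, not values taken from the persona.

```python
from typing import List, Tuple


def inode_pressure(df_i_output: str, threshold_pct: int = 90) -> List[Tuple[str, int]]:
    """Return (mount point, IUse%) pairs at or above threshold_pct."""
    flagged = []
    for line in df_i_output.splitlines()[1:]:  # skip the header row
        cols = line.split(None, 5)  # keep mount points with spaces intact
        if len(cols) < 6 or not cols[4].endswith("%"):
            continue  # "-" appears for filesystems without inode counts
        pct = cols[4].rstrip("%")
        if pct.isdigit() and int(pct) >= threshold_pct:
            flagged.append((cols[5].strip(), int(pct)))
    return flagged
```

A mount that shows free space in step 1 but appears in this list is the classic "disk full with bytes to spare" case, which is why the recipe keeps the inode check as a separate step.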

Anti-patterns the persona refuses

One call per path (pass an array of 4-8 paths per call instead; it is much faster).
Scanning /proc, /sys, or /dev: the sandbox rejects these paths, and the persona knows not to ask.
Any delete / mv / rm — read-only.
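The first two anti-patterns amount to a simple pre-flight step before calling host_du_summary or host_find_large_files. The sketch below shows one way to express it; `plan_path_batches`, the prefix tuple, and the batch size of 8 (the upper end of the 4-8 range) are assumptions for illustration and are not taken from the sandbox implementation.

```python
from typing import Iterable, List

FORBIDDEN_PREFIXES = ("/proc", "/sys", "/dev")  # paths the sandbox rejects
MAX_PATHS_PER_CALL = 8  # assumed cap, upper end of the 4-8 range


def plan_path_batches(paths: Iterable[str]) -> List[List[str]]:
    """Drop pseudo-filesystem paths, de-duplicate, and chunk the rest."""
    kept: List[str] = []
    for p in paths:
        norm = p.rstrip("/") or "/"  # "/opt/" -> "/opt", "/" stays "/"
        if any(norm == f or norm.startswith(f + "/") for f in FORBIDDEN_PREFIXES):
            continue
        if norm not in kept:
            kept.append(norm)
    return [kept[i:i + MAX_PATHS_PER_CALL]
            for i in range(0, len(kept), MAX_PATHS_PER_CALL)]
```

Each inner list would become the `paths` argument of a single tool call, so ten candidate directories cost two calls instead of ten.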

`specialist-network` — packets / netns / iptables / OVS

Frontmatter highlights:

yaml

name: specialist-network
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - host_probe_http
  - host_probe_dns
  - host_probe_tcp
  - host_netns_inspect
  - query_promql
  - get_host_load

The recipes

Topology first. ip -j addr show + host_netns_inspect to understand interface and namespace layout.
Link state. ethtool -i ethX for the driver, ethtool ethX for link speed; ss -tnp for connections.
NAT / firewall. nft list ruleset, iptables -L -n, conntrack -S.
OVS. ovs-vsctl show, ovs-ofctl dump-flows br0.
eBPF. bpftool prog show, bpftool net show.
Probes for connectivity. host_probe_tcp, host_probe_http, host_probe_dns.

Each host_bash call carries a cmdpolicy allowlist — see Layer-1 network research for the OVS / nft / conntrack / bpftool / ethtool / ip netns cmdpolicy entries.

Output discipline

Three lines: 现象 (symptom: packet drop / high RTT / NAT table full / empty flow table / wrong route), 根因 (judgment + key evidence), 下一步 (recommended next action: restart service, update route, add rule). No raw ovs-ofctl dumps — the coordinator doesn't read them.
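The three-line contract is easy to enforce with a small value type. This is a sketch only: the class name `NetworkFinding`, its field names, and the `label: text` separator are assumptions; the persona prompt only fixes the three labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkFinding:
    symptom: str      # 现象
    root_cause: str   # 根因
    next_step: str    # 下一步

    def render(self) -> str:
        """Render exactly three lines; embedded newlines are collapsed."""
        def one_line(text: str) -> str:
            return " ".join(text.split())

        return "\n".join([
            f"现象: {one_line(self.symptom)}",
            f"根因: {one_line(self.root_cause)}",
            f"下一步: {one_line(self.next_step)}",
        ])
```

Collapsing whitespace is what keeps a pasted multi-line dump from leaking into the reply: whatever goes into a field comes out as a single line.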

`specialist-ops` — services and operations

This is the mutating specialist. Frontmatter highlights:

yaml

name: specialist-ops
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - get_host_processes
  - get_host_load
  - host_restart_service   # mutating; gated by ReviewGate
  - query_promql
  - query_logql
  - get_edge_summary

What's special

It carries host_restart_service, which is Class: "write". Any invocation triggers the ReviewGate decorator: the gate spawns the reviewer worker; only an approved decision actually restarts anything.
It cannot route around the gate via host_bash systemctl restart — the edge cmdpolicy denies mutating systemctl subcommands regardless of which agent issued the command.
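The gate described above behaves like a decorator around the write-class tool. The following is a minimal sketch of that pattern under stated assumptions: `review_gate`, `ReviewRejected`, and the stand-in reviewer are hypothetical, the real ReviewGate spawns a reviewer worker rather than calling a local function, and the placeholder tool performs no restart.

```python
import functools
from typing import Callable


class ReviewRejected(Exception):
    """Raised when the reviewer does not approve a write-class call."""


def review_gate(reviewer: Callable[[str, dict], bool]):
    """Wrap a write-class tool so it runs only after reviewer approval."""
    def decorator(tool: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(tool)
        def gated(**kwargs) -> str:
            if not reviewer(tool.__name__, kwargs):
                raise ReviewRejected(f"{tool.__name__} rejected: {kwargs}")
            return tool(**kwargs)
        return gated
    return decorator


def allow_only_nginx(tool_name: str, args: dict) -> bool:
    # Stand-in reviewer for the sketch; the real one is a separate worker.
    return args.get("service") == "nginx"


@review_gate(allow_only_nginx)
def host_restart_service(service: str) -> str:
    return f"restarted {service}"  # placeholder, performs no real restart
```

The point of the pattern is that the tool body is unreachable without an approval: the specialist can propose a restart, but only the reviewer's decision lets the call through.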

When the coordinator picks it

Service-specific status / logs / restart (nginx, mysql, our own process).
systemd unit status, journalctl -u errors.
Recent restart count, OOM correlations.
cron / timer schedule check.
Package broken state (dpkg / apt / yum).
"Should we restart X to clear the leak?" — ops is the persona that proposes; reviewer is the gate.

Rejects

Cluster trends / SLO judgment → specialist-sre.
Network internals → specialist-network.
Deep file-level disk analysis → specialist-disk.

`specialist-sre` — golden four signals / triage

The persona for "is the system healthy" / "which incident matters most". Frontmatter highlights:

yaml

name: specialist-sre
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - correlate_incident
  - get_active_incidents
  - get_incident_detail
  - get_edge_summary
  - query_promql
  - query_logql
  - find_outlier_edges
  - rank_edges
  - get_host_load

Working style

Incident-list first. get_active_incidents — what's currently firing and at what severity.
Trends, not single-host metrics. query_promql over 1h / 24h windows compared to baseline.
Find the outlier host. find_outlier_edges / rank_edges rather than hand-rolled IQR in PromQL.
Speak golden-four-signals. latency (p50/p95/p99) / error rate / traffic / saturation — each one as "baseline → current → deviation direction".
Priority decisions. P0 (user impact, getting worse), P1 (user impact, stable), P2 (internal / trend), P3 (noise / false positive).
Delegate down. Suspects disk → tell coordinator to dispatch specialist-disk. Suspects network → specialist-network. Doesn't try to do the deep dive itself.
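The priority ladder and the "baseline → current → deviation direction" phrasing can be written down as two small helpers. This is a sketch of the reasoning, not code from the persona: `classify`, `signal_line`, and the three boolean inputs are assumptions about how the judgments might be encoded.

```python
from enum import Enum


class Priority(Enum):
    P0 = "user impact, getting worse"
    P1 = "user impact, stable"
    P2 = "internal / trend"
    P3 = "noise / false positive"


def classify(user_impact: bool, worsening: bool, real_signal: bool) -> Priority:
    """Map three yes/no judgments onto the P0-P3 ladder."""
    if not real_signal:
        return Priority.P3
    if user_impact:
        return Priority.P0 if worsening else Priority.P1
    return Priority.P2


def signal_line(name: str, baseline: float, current: float, unit: str = "") -> str:
    """Format one golden signal as baseline -> current plus direction."""
    if current > baseline:
        direction = "up"
    elif current < baseline:
        direction = "down"
    else:
        direction = "flat"
    return f"{name}: {baseline:g}{unit} -> {current:g}{unit} ({direction})"
```

For example, `signal_line("p99 latency", 120, 480, "ms")` yields `p99 latency: 120ms -> 480ms (up)`, which is the shape of statement the persona is asked to produce for each of the four signals.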

Why SRE doesn't do RCA

For an incident that needs full root-cause attribution, the coordinator picks incident-investigator, not specialist-sre. The SRE persona stops at "this is a real P1 saturation issue on payments; recommend dispatching incident-investigator with incident_id=1234". The investigator walks the causal chain; the SRE judges whether it's worth walking.

How the coordinator picks

The catalog is rendered into the coordinator's system prompt by buildAgentCatalog. For each persona it surfaces:

name (the subagent_type value).
The persona's description.
The first non-empty line of when_to_use.

So an LLM reading the system prompt sees something like:

text

## 可用的 specialist 助理（AgentTool 的 subagent_type）

- `specialist-compute` — 计算专家——CPU / 内存 / load / 进程调度…
- `specialist-disk` — 文件系统 / 磁盘容量专家——du / find / stat / inode…
- `specialist-network` — 网络问题专家——OVS / netfilter / netns…
- `specialist-ops` — 运维 / 服务运营专家——服务状态 / 启停重启…
- `specialist-sre` — SRE / 可观测性专家——告警响应 / 黄金四信号 / SLO…
- `incident-investigator` — 告警根因诊断 worker…

reviewer and default are deliberately excluded — the first is reserved for the ReviewGate; the second is the virtual top-level coordinator persona and listing it would let the coordinator recursively spawn itself.
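The catalog rules described in this section (three fields per persona, two names excluded) can be sketched as follows. This is not the source of buildAgentCatalog: the `Persona` type, `render_catalog`, and the exact bullet layout, including how the when_to_use hint is attached, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable

EXCLUDED = {"reviewer", "default"}  # never offered as subagent_type values


@dataclass(frozen=True)
class Persona:
    name: str
    description: str
    when_to_use: str


def first_non_empty_line(text: str) -> str:
    """Return the first line with visible content, stripped, or ''."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def render_catalog(personas: Iterable[Persona]) -> str:
    """Render one bullet per dispatchable persona."""
    lines = []
    for p in personas:
        if p.name in EXCLUDED:
            continue  # reserved for the ReviewGate / the coordinator itself
        entry = f"- `{p.name}`: {p.description}"
        hint = first_non_empty_line(p.when_to_use)
        if hint:
            entry += f" ({hint})"
        lines.append(entry)
    return "\n".join(lines)
```

Because only the first non-empty line of when_to_use is surfaced, that line carries most of the routing weight; the rest of the block is not part of the catalog entry.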

Cross-handoff patterns

When the answer is "this isn't my domain":

Handoff	Typical reason
compute → disk	D-state processes on IO wait
compute → net	D-state processes blocked on network (sockets, recv)
compute → ops	"Process is leaking; recommend restarting service"
disk → ops	"Logs are blowing up; recommend tighter logrotate"
net → ops	"iptables rule fixes the leak; recommend service restart for safety"
sre → investigator	"P1 saturation, real, needs RCA"
sre → compute	"Saturation alarm — drill CPU/mem"

The handoff is verbal, not a tool call: the specialist says "建议派 specialist-X because Y" ("recommend dispatching specialist-X because Y") in its reply. The coordinator reads the recommendation and dispatches the next worker. Specialists cannot spawn workers — see Agents overview.

Tuning specialist personas

Common edits in a fork:

Tool bag: add an in-house BaseTool (e.g. query_clickhouse) to the relevant persona's tools: list. The worker filter trusts the persona whitelist.
max_turns: the default 15 is generous; tighten to 10 for cost-sensitive specialists.
Persona body: add your team's command preferences ("on this fleet, journalctl --since '5m ago' before --since '1h ago'").
when_to_use: tighten or widen to suit how often you want the coordinator to pick this persona.

Don't change name — the catalog and AgentTool look workers up by exact name. See Custom agents.
