
Specialists

Ongrid ships five specialist personas. Each one owns a narrow slice of host / cluster diagnostics, has a focused tool bag, and explicitly refuses tasks that belong to a peer. The coordinator dispatches based on the request shape; the specialist's when_to_use block tells the coordinator's LLM when it's the right fit.

| Persona | Owns | Doesn't own |
| --- | --- | --- |
| specialist-compute | CPU, memory, load, processes, OOM, scheduler | Disk, network, service restart |
| specialist-disk | Filesystem, du / find / stat, inodes | Network, processes, business logs |
| specialist-network | OVS, netfilter, netns, conntrack, bpf, routes | Filesystem, processes |
| specialist-ops | Service start/stop/restart, journalctl, packages | Cluster trends, network internals |
| specialist-sre | Golden four signals, SLOs, error budgets, triage | Single-host bash, deep RCA |

Why the split is fine-grained

Tighter tool bags = tighter system prompt = deeper reasoning per token. A specialist-network worker with 8 tools gives stronger answers about iptables / nft than a coordinator with 60 tools that happens to have iptables among them. The persona walls also let each domain carry domain-specific KB hints in its system prompt.

The KB-first convention

All five specialists begin with the same step 0: call query_knowledge once with a natural-language description of the problem. If the top result scores ≥ 0.6, follow the playbook and end the answer with (参考 KB: <title>), which reads "Reference KB: <title>". This is enforced by the persona prompt, not the runtime, but the prompts are direct enough that strong models follow them consistently.

Why a uniform step 0:

  • The vault carries team-specific playbooks (preferred commands, known traps, escalation paths). They beat the model's general knowledge for your fleet.
  • It anchors the worker's first turn around a known-good plan rather than letting it improvise.
  • Citation (参考 KB: …) makes the provenance auditable.
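The step-0 rule is enforced by the prompt rather than by code, but its decision logic can be sketched in a few lines. This is a minimal illustration only: the `KBHit` shape and the assumption that `query_knowledge` returns a scored list are hypothetical, and only the 0.6 threshold and the citation suffix come from the convention itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

KB_SCORE_THRESHOLD = 0.6  # threshold stated in the KB-first convention


@dataclass
class KBHit:
    """Hypothetical shape of one query_knowledge result."""
    title: str
    score: float
    playbook: str


def kb_first(problem: str,
             query_knowledge: Callable[[str], List[KBHit]]) -> Optional[KBHit]:
    """Query the KB once; return the top hit only if it clears the threshold."""
    hits = query_knowledge(problem)  # step 0: exactly one call
    if not hits:
        return None
    top = max(hits, key=lambda h: h.score)
    return top if top.score >= KB_SCORE_THRESHOLD else None


def cite(answer: str, hit: Optional[KBHit]) -> str:
    """Append the citation suffix when a playbook was followed."""
    return f"{answer} (参考 KB: {hit.title})" if hit else answer
```

When no hit clears the threshold, `kb_first` returns `None` and the answer goes out without a citation, which mirrors the "follow the playbook only if the score is high enough" wording.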

See Knowledge base for what the vault contains by default and how to add your own playbooks.

specialist-compute — CPU / memory / load / processes

Frontmatter highlights:

yaml
name: specialist-compute
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - get_host_load
  - get_host_processes
  - get_edge_summary
  - rank_edges
  - find_outlier_edges
  - query_promql
  - host_bash

When the coordinator picks it

  • "CPU pegged on node-X, who's eating it?"
  • "Load spiking but CPU idle — what's blocked?"
  • "Process memory growth / OOM forensics."
  • "VM steal time — am I seeing noisy neighbor?"

The recipes baked into the persona

  • load avg high + CPU% low → look for D-state processes (host_bash "ps -eo stat,pid,cmd | awk '$1 ~ /D/'"). Likely IO / network wait — tell the coordinator to dispatch specialist-disk or specialist-network.
  • CPU% high → top processes, user vs system time, vmstat st column for steal.
  • mem_used_pct high → node_memory_* for cached / buffers / swap, dmesg | grep -i 'oom-killer' for OOM hits, single-process RSS outliers with PID + name.
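The three recipes above amount to a small decision table. The sketch below shows one way to express it; the numeric thresholds and the function name are hypothetical placeholders, since the persona prompt describes the conditions only as "high" and "low".

```python
from typing import List


def compute_next_steps(load_per_core: float, cpu_pct: float, mem_used_pct: float,
                       high_load: float = 1.0, low_cpu: float = 30.0,
                       high_cpu: float = 80.0, high_mem: float = 90.0) -> List[str]:
    """Map coarse host metrics to the recipe steps listed above.

    All thresholds are illustrative assumptions, not values from the persona.
    """
    steps: List[str] = []
    if load_per_core >= high_load and cpu_pct <= low_cpu:
        # High load with an idle CPU usually means tasks blocked in D state.
        steps.append("check D-state processes; likely IO / network wait, "
                     "suggest specialist-disk or specialist-network")
    if cpu_pct >= high_cpu:
        steps.append("list top processes; compare user vs system time; "
                     "check the vmstat st column for steal")
    if mem_used_pct >= high_mem:
        steps.append("break down node_memory_* (cached / buffers / swap); "
                     "grep dmesg for oom-killer; find RSS outliers")
    return steps
```

The branches are independent on purpose: a host can be both CPU-bound and memory-bound, in which case both recipes apply.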

Rejects (will tell the coordinator to redispatch)

  • "Should we restart nginx?" → specialist-ops (because restart goes through host_restart_service and the reviewer).
  • "Disk filling up" → specialist-disk.
  • "Network packets dropping" → specialist-network.

specialist-disk — filesystem / capacity

Frontmatter highlights:

yaml
name: specialist-disk
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_find_large_files
  - host_du_summary
  - host_stat_file
  - host_bash
  - query_promql
  - get_host_load

The 4-step recipe

  1. Macro confirm: get_host_load for disk_used_pct + query_promql node_filesystem_* trend.
  2. Layer drill-down: host_du_summary(paths=["/", "/var", "/opt", "/home", "/tmp"], depth=1).
  3. File pinpoint: host_find_large_files(paths=[biggest top-level], top_n=20).
  4. Inode check if needed: host_bash "df -i".
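The four steps can be read as a fixed sequence of tool calls in which step 3 depends on the output of step 2. The sketch below assumes a hypothetical `call_tool(name, args)` dispatcher and assumes `host_du_summary` returns a mapping of path to bytes; the argument keys `query` and `command` are also guesses, and only the tool names, path list, `depth`, and `top_n` come from the recipe.

```python
from typing import Any, Callable, Dict

TOP_LEVEL = ["/", "/var", "/opt", "/home", "/tmp"]


def run_disk_recipe(call_tool: Callable[[str, Dict[str, Any]], Any],
                    check_inodes: bool = False) -> Dict[str, Any]:
    """Run the macro -> drill-down -> pinpoint -> inode sequence."""
    findings: Dict[str, Any] = {}
    # 1. Macro confirm: current usage plus a trend query.
    findings["load"] = call_tool("get_host_load", {})
    findings["trend"] = call_tool("query_promql",
                                  {"query": "node_filesystem_avail_bytes"})
    # 2. Layer drill-down: one batched call over all top-level paths.
    du = call_tool("host_du_summary", {"paths": TOP_LEVEL, "depth": 1})
    findings["du"] = du
    # "/" contains every other path, so pick the largest non-root entry.
    biggest = max((p for p in du if p != "/"), key=du.get, default="/")
    # 3. File pinpoint inside the biggest directory.
    findings["large_files"] = call_tool("host_find_large_files",
                                        {"paths": [biggest], "top_n": 20})
    # 4. Optional inode check.
    if check_inodes:
        findings["inodes"] = call_tool("host_bash", {"command": "df -i"})
    return findings
```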

Anti-patterns the persona refuses

  • One call per path (pass an array of 4-8 paths per call instead; it's much faster).
  • Running on /proc, /sys, or /dev: the sandbox rejects these paths, and the persona knows not to ask.
  • Any delete / mv / rm — read-only.
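A pre-flight check for the path rule might look like the following. It is a sketch under assumptions: the real sandbox check lives elsewhere and may behave differently, and `check_paths` is a hypothetical helper name.

```python
import posixpath
from typing import List

# Pseudo-filesystems the sandbox is described as rejecting.
BLOCKED_ROOTS = ("/proc", "/sys", "/dev")


def check_paths(paths: List[str]) -> List[str]:
    """Normalize a batch of paths, raising ValueError on blocked ones."""
    cleaned: List[str] = []
    for p in paths:
        norm = posixpath.normpath(p)
        # Match the root itself or anything beneath it, but not e.g. /devel.
        if any(norm == r or norm.startswith(r + "/") for r in BLOCKED_ROOTS):
            raise ValueError(f"pseudo-filesystem path not allowed: {p}")
        cleaned.append(norm)
    return cleaned
```

Taking a list rather than a single path nudges callers toward the batched form that the persona prefers.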

specialist-network — packets / netns / iptables / OVS

Frontmatter highlights:

yaml
name: specialist-network
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - host_probe_http
  - host_probe_dns
  - host_probe_tcp
  - host_netns_inspect
  - query_promql
  - get_host_load

The recipes

  • Topology first. ip -j addr show + host_netns_inspect to understand interface and namespace layout.
  • Link state. ethtool -i ethX for driver info and ethtool ethX for link speed; ss -tnp for connections.
  • NAT / firewall. nft list ruleset, iptables -L -n, conntrack -S.
  • OVS. ovs-vsctl show, ovs-ofctl dump-flows br0.
  • eBPF. bpftool prog show, bpftool net show.
  • Probes for connectivity. host_probe_tcp, host_probe_http, host_probe_dns.

Each host_bash call carries a cmdpolicy allowlist — see Layer-1 network research for the OVS / nft / conntrack / bpftool / ethtool / ip netns cmdpolicy entries.

Output discipline

Three lines: 现象 (symptom: packet drop / high RTT / NAT table full / empty flow table / wrong route), 根因 (judgment + key evidence), 下一步 (recommended next action: restart service, update route, add rule). No raw ovs-ofctl dumps — the coordinator doesn't read them.
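The three-line contract is easy to pin down as a small data structure. In this sketch the `NetworkReport` class and the `label: text` layout are assumptions; only the three labels and their order come from the output discipline above.

```python
from dataclasses import dataclass


@dataclass
class NetworkReport:
    """Hypothetical container for the three-line network answer."""
    symptom: str      # 现象
    root_cause: str   # 根因
    next_step: str    # 下一步

    def render(self) -> str:
        """Render exactly three lines, with no raw command dumps."""
        return "\n".join([
            f"现象: {self.symptom}",
            f"根因: {self.root_cause}",
            f"下一步: {self.next_step}",
        ])
```

For example, `NetworkReport("NAT table full", "conntrack count at limit", "raise the conntrack limit").render()` yields three short lines that the coordinator can consume directly.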

specialist-ops — services and operations

This is the mutating specialist. Frontmatter highlights:

yaml
name: specialist-ops
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - host_bash
  - get_host_processes
  - get_host_load
  - host_restart_service   # mutating; gated by ReviewGate
  - query_promql
  - query_logql
  - get_edge_summary

What's special

  • It carries host_restart_service, which is Class: "write". Any invocation triggers the ReviewGate decorator: the gate spawns the reviewer worker; only an approved decision actually restarts anything.
  • It cannot route around the gate via host_bash systemctl restart — the edge cmdpolicy denies mutating systemctl subcommands regardless of which agent issued the command.
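The gate behaviour described above follows the classic decorator pattern: wrap write-class tools so the underlying call runs only after approval. The sketch below is illustrative and not the project's implementation; `review_gate`, the `reviewer` callable, and the rejection message are hypothetical stand-ins for the ReviewGate decorator and the reviewer worker.

```python
from typing import Any, Callable, Dict

Tool = Callable[[Dict[str, Any]], str]
Reviewer = Callable[[str, Dict[str, Any]], bool]


def review_gate(tool_name: str, tool_class: str,
                reviewer: Reviewer) -> Callable[[Tool], Tool]:
    """Wrap a tool so write-class invocations need reviewer approval."""
    def decorate(tool: Tool) -> Tool:
        if tool_class != "write":
            return tool  # read-only tools pass through ungated

        def gated(args: Dict[str, Any]) -> str:
            # The reviewer sees the tool name and arguments before anything runs.
            if not reviewer(tool_name, args):
                return f"{tool_name}: rejected by reviewer; nothing was changed"
            return tool(args)

        return gated
    return decorate
```

Because the check sits in the wrapper, the mutating function is never entered on a rejection, which is the property the second bullet relies on when it says the gate cannot be routed around.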

When the coordinator picks it

  • Service-specific status / logs / restart (nginx, mysql, our own process).
  • systemd unit status, journalctl -u errors.
  • Recent restart count, OOM correlations.
  • cron / timer schedule check.
  • Package broken state (dpkg / apt / yum).
  • "Should we restart X to clear the leak?" — ops is the persona that proposes; reviewer is the gate.

Rejects

  • Cluster trends / SLO judgment → specialist-sre.
  • Network internals → specialist-network.
  • Deep file-level disk analysis → specialist-disk.

specialist-sre — golden four signals / triage

The persona for "is the system healthy" / "which incident matters most". Frontmatter highlights:

yaml
name: specialist-sre
permission_mode: read-only
max_turns: 15
tools:
  - query_knowledge
  - correlate_incident
  - get_active_incidents
  - get_incident_detail
  - get_edge_summary
  - query_promql
  - query_logql
  - find_outlier_edges
  - rank_edges
  - get_host_load

Working style

  1. Incident-list first. get_active_incidents — what's currently firing and at what severity.
  2. Trends, not single-host metrics. query_promql over 1h / 24h windows compared to baseline.
  3. Find the outlier host. find_outlier_edges / rank_edges rather than hand-rolled IQR in PromQL.
  4. Speak golden-four-signals. latency (p50/p95/p99) / error rate / traffic / saturation — each one as "baseline → current → deviation direction".
  5. Priority decisions. P0 (user impact, getting worse), P1 (user impact, stable), P2 (internal / trend), P3 (noise / false positive).
  6. Delegate down. Suspects disk → tell coordinator to dispatch specialist-disk. Suspects network → specialist-network. Doesn't try to do the deep dive itself.
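Steps 4 and 5 are mechanical enough to sketch. The helper names and the exact output layout below are assumptions; the "baseline → current → deviation direction" shape and the P0 to P3 definitions follow the list above.

```python
def signal_line(name: str, baseline: float, current: float, unit: str = "") -> str:
    """Format one golden signal as baseline -> current -> direction."""
    if current > baseline:
        direction = "up"
    elif current < baseline:
        direction = "down"
    else:
        direction = "flat"
    return f"{name}: {baseline}{unit} → {current}{unit} ({direction})"


def priority(user_impact: bool, worsening: bool, false_positive: bool = False) -> str:
    """Map the triage judgment to a priority label."""
    if false_positive:
        return "P3"  # noise / false positive
    if user_impact:
        return "P0" if worsening else "P1"  # user impact: worsening vs stable
    return "P2"  # internal / trend
```

Whether a rising value is good or bad depends on the signal (traffic up may be fine, latency up is not), so the sketch reports only the direction and leaves the judgment to the persona.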

Why SRE doesn't do RCA

For an incident that needs full root-cause attribution, the coordinator picks incident-investigator, not specialist-sre. The SRE persona stops at "this is a real P1 saturation issue on payments; recommend dispatching incident-investigator with incident_id=1234". The investigator walks the causal chain; the SRE judges whether it's worth walking.

How the coordinator picks

The catalog is rendered into the coordinator's system prompt by buildAgentCatalog. For each persona it surfaces:

  • name (the subagent_type value).
  • The persona's description.
  • The first non-empty line of when_to_use.

So an LLM reading the system prompt sees something like:

text
## 可用的 specialist 助理(AgentTool 的 subagent_type)

- `specialist-compute` — 计算专家——CPU / 内存 / load / 进程调度…
- `specialist-disk` — 文件系统 / 磁盘容量专家——du / find / stat / inode…
- `specialist-network` — 网络问题专家——OVS / netfilter / netns…
- `specialist-ops` — 运维 / 服务运营专家——服务状态 / 启停重启…
- `specialist-sre` — SRE / 可观测性专家——告警响应 / 黄金四信号 / SLO…
- `incident-investigator` — 告警根因诊断 worker…

reviewer and default are deliberately excluded — the first is reserved for the ReviewGate; the second is the virtual top-level coordinator persona and listing it would let the coordinator recursively spawn itself.
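A rough Python rendering of what buildAgentCatalog is described as doing is shown below. It is a sketch, not the project's code: the `Persona` dataclass, the function names, and the line layout are assumptions (the real rendering shown in the sample above uses a different separator), while the three surfaced fields and the two excluded names come from this section.

```python
from dataclasses import dataclass
from typing import List

# Personas that must never appear in the coordinator's catalog.
EXCLUDED = {"reviewer", "default"}


@dataclass
class Persona:
    """Hypothetical view of the frontmatter fields the catalog needs."""
    name: str
    description: str
    when_to_use: str


def first_non_empty_line(text: str) -> str:
    """Return the first line with non-whitespace content, stripped."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_agent_catalog(personas: List[Persona]) -> str:
    """Render one catalog line per dispatchable persona."""
    lines: List[str] = []
    for p in personas:
        if p.name in EXCLUDED:
            continue  # gate-only and top-level personas stay hidden
        entry = f"- `{p.name}`: {p.description}"
        hint = first_non_empty_line(p.when_to_use)
        if hint:
            entry += f" (when: {hint})"
        lines.append(entry)
    return "\n".join(lines)
```

Surfacing only the first non-empty line of when_to_use keeps the catalog short, so that line is worth writing carefully in a fork.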

Cross-handoff patterns

When the answer is "this isn't my domain":

| Specialist | Common handoff |
| --- | --- |
| compute → disk | D-state processes on IO wait |
| compute → net | D-state processes blocked on network (sockets, recv) |
| compute → ops | "Process is leaking; recommend restarting service" |
| disk → ops | "Logs are blowing up; recommend tighter logrotate" |
| net → ops | "iptables rule fixes the leak; recommend service restart for safety" |
| sre → investigator | "P1 saturation, real, needs RCA" |
| sre → compute | "Saturation alarm — drill CPU/mem" |

The handoff is verbal (the specialist says "建议派 specialist-X because Y", that is, "recommend dispatching specialist-X because Y", in its reply), not a tool call. The coordinator reads the recommendation and dispatches the next worker. Specialists cannot spawn workers — see Agents overview.

Tuning specialist personas

Common edits in a fork:

  • Tool bag: add an in-house BaseTool (e.g. query_clickhouse) to the relevant persona's tools: list. The worker filter trusts the persona whitelist.
  • max_turns: the default 15 is generous; tighten to 10 for cost-sensitive specialists.
  • Persona body: add your team's command preferences ("on this fleet, journalctl --since '5m ago' before --since '1h ago'").
  • when_to_use: tighten or widen to suit how often you want the coordinator to pick this persona.
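To make the "worker filter trusts the persona whitelist" point concrete, here is a minimal sketch of such a filter. The registry shape and the decision to report unknown names instead of dropping them silently are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def select_tools(registry: Dict[str, Callable],
                 whitelist: List[str]) -> Tuple[Dict[str, Callable], List[str]]:
    """Return the whitelisted tools plus any listed names the registry lacks."""
    selected = {name: registry[name] for name in whitelist if name in registry}
    missing = [name for name in whitelist if name not in registry]
    return selected, missing
```

Under this model, adding `query_clickhouse` to a persona's tools: list is enough to expose it to that worker, provided the tool is registered; a typo in the name would show up in `missing`.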

Don't change name — the catalog and AgentTool look workers up by exact name. See Custom agents.