
Agents overview

Ongrid is a multi-agent ReAct system. The user always talks to the coordinator — the top-level persona on the chat surface. When a task fits a specialized domain (deep root-cause investigation, a mutating-action review, a network deep dive), the coordinator dispatches to a sub-agent (a "worker") via the AgentTool. Each worker runs the same graph kernel against a filtered tool bag and returns a final answer that the coordinator weaves into its reply.

This page is the map. The per-persona pages explain each agent in detail.

Personas, not chat threads

A persona is an on-disk file describing what an agent does. The canonical layout has the same shape as Claude Code's agent format, with snake_case keys:

```yaml
---
name: specialist-disk
description: Filesystem / disk capacity specialist (du / find / stat / inode / mounts / large files)
when_to_use: |
  When the task is about disk / filesystem health:
    - Disk full / utilization climbing
    - Hunt for large files / large directories
    - inode exhaustion / mount point inspection
tools:
  - query_knowledge
  - host_find_large_files
  - host_du_summary
  - host_stat_file
  - host_bash
  - query_promql
  - get_host_load
permission_mode: read-only
max_turns: 15
---
[markdown body: this becomes the system prompt]
```

The body below the frontmatter is the SystemPrompt. Two persona files with the same name collide and the loader logs a warning. See persona format reference for every field.
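As a rough illustration of the frontmatter/body split described above, the sketch below separates the two parts of a persona file on the `---` delimiter lines. The name `splitPersona` is hypothetical and not taken from the Ongrid loader, and the YAML itself is left unparsed.

```go
package main

import (
	"errors"
	"strings"
)

// splitPersona is a hypothetical helper: it separates the frontmatter text
// from the markdown body of a persona file. It does not parse the YAML.
func splitPersona(src string) (frontmatter, body string, err error) {
	lines := strings.Split(src, "\n")
	// The file must open with a line containing only "---".
	if strings.TrimSpace(lines[0]) != "---" {
		return "", "", errors.New("persona: missing opening --- delimiter")
	}
	// The next "---" line closes the frontmatter; the rest is the body,
	// which the page above says becomes the SystemPrompt.
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			frontmatter = strings.Join(lines[1:i], "\n")
			body = strings.TrimSpace(strings.Join(lines[i+1:], "\n"))
			return frontmatter, body, nil
		}
	}
	return "", "", errors.New("persona: missing closing --- delimiter")
}
```

A real loader would hand the frontmatter string to a YAML decoder and treat a failure as a per-file warning.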

Personas live under ./agents/ in the manager image (/app/agents/ inside the container). Ongrid ships built-in personas in the image and also reads user-authored personas from a mounted directory — see Custom agents.

Coordinator vs worker

| Aspect | Coordinator | Worker |
| --- | --- | --- |
| Who talks to user | Yes (via chat surface or IM) | Never; worker output goes to the coordinator |
| Tool bag | Wide: query_*, AgentTool, redirect stubs | Narrow: the persona's tools: whitelist |
| Persona name | default (or per-org override) | specialist-*, incident-investigator, reviewer |
| Spawned by | Runtime (one per chat session) | AgentTool or review_gate decorator |
| Can spawn | Yes (workers) | No (workers cannot nest by design) |
| Session | Long-lived, persisted | New session per spawn, scoped to one turn |

The coordinator's job is dispatch, triage, and synthesis; deep-dive tools live on workers. The coordinator also carries RedirectStub slots for tools the LLM is known to hallucinate there (host_bash, get_host_load, …). Calling one of them returns a redirect message that nudges the model to re-invoke via AgentTool.
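One way to picture a redirect stub is as a tool that never executes anything and only returns guidance text. The sketch below is an assumption-laden illustration: the `redirectStub` type, its fields, the message wording, and the mapping of host_bash to specialist-disk are all hypothetical.

```go
package main

import "fmt"

// redirectStub is a hypothetical stand-in for a tool the coordinator does
// not really own. It exists only to steer the model toward AgentTool.
type redirectStub struct {
	ToolName     string // tool the model tried to call, e.g. "host_bash"
	SubagentType string // persona assumed to own that tool
}

// Call performs no work; it returns a message for the model to read.
func (s redirectStub) Call() string {
	return fmt.Sprintf(
		"%s is not available on the coordinator; dispatch via AgentTool with subagent_type=%q and a full prompt.",
		s.ToolName, s.SubagentType)
}
```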

AgentTool

The coordinator's dispatch primitive. Wire name AgentTool (PascalCase to align with Claude Code's tool catalog). The LLM-visible schema:

```json
{
  "type": "object",
  "properties": {
    "description":   {"type": "string"},
    "subagent_type": {"type": "string"},
    "prompt":        {"type": "string"}
  },
  "required": ["description", "subagent_type", "prompt"]
}
```
  • description — 1-line task summary the SPA tile uses.
  • subagent_type — persona name. The catalog injected into the coordinator's system prompt lists valid values (the persona registry, minus reviewer and default).
  • prompt — full task brief. The worker cannot see the coordinator's context — pack every needed detail (incident_id, device_id, exact wording).
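The three required fields map naturally onto a Go struct. The sketch below decodes and validates an AgentTool call; the type and function names (`agentToolInput`, `parseAgentToolInput`) are hypothetical and only the JSON keys come from the schema above.

```go
package main

import (
	"encoding/json"
	"errors"
)

// agentToolInput mirrors the LLM-visible schema; the Go names are assumed.
type agentToolInput struct {
	Description  string `json:"description"`
	SubagentType string `json:"subagent_type"`
	Prompt       string `json:"prompt"`
}

// parseAgentToolInput decodes the raw arguments and enforces the three
// required fields, since encoding/json does not check "required" itself.
func parseAgentToolInput(raw []byte) (agentToolInput, error) {
	var in agentToolInput
	if err := json.Unmarshal(raw, &in); err != nil {
		return in, err
	}
	switch {
	case in.Description == "":
		return in, errors.New("AgentTool: description is required")
	case in.SubagentType == "":
		return in, errors.New("AgentTool: subagent_type is required")
	case in.Prompt == "":
		return in, errors.New("AgentTool: prompt is required")
	}
	return in, nil
}
```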

The call is synchronous from the coordinator's POV. It blocks until the worker reaches completed / failed / killed. An earlier revision had an async background: true flag; weak models took it, answered the user with the pending task_id, and never followed up. The flag is gone today (see agent_tool.go for the post-mortem comment). The only async path is the reviewer — and that's driven by the decorator, not the LLM.

Dedupe is built in

A 120s LRU cache keyed by sha256(subagent_type + canonical(prompt)) short-circuits the second of two identical AgentTool calls. The LLM sees the prior result with an explicit "you already dispatched this" hint. This cuts coordinator loop blowouts (E2E eval D1 saw 122 tool calls in 240s without dedupe).
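The dedupe idea can be sketched as a small cache keyed by a SHA-256 digest. This is a simplified stand-in under stated assumptions: it uses a plain TTL map rather than a true LRU, canonicalization is assumed to mean collapsing whitespace, a NUL separator is added between the two key parts, and every name in it is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
	"sync"
	"time"
)

type dedupeEntry struct {
	result   string
	storedAt time.Time
}

// dedupeCache is a simplified stand-in: a TTL map without LRU eviction.
type dedupeCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]dedupeEntry
}

func newDedupeCache() *dedupeCache {
	return &dedupeCache{ttl: 120 * time.Second, entries: make(map[string]dedupeEntry)}
}

// dedupeKey hashes the persona name plus a canonical prompt. Collapsing
// whitespace is an assumed canonical form; the separator avoids ambiguity.
func dedupeKey(subagentType, prompt string) string {
	canonical := strings.Join(strings.Fields(prompt), " ")
	sum := sha256.Sum256([]byte(subagentType + "\x00" + canonical))
	return hex.EncodeToString(sum[:])
}

// Lookup returns the prior result with a hint when a fresh entry exists.
func (c *dedupeCache) Lookup(subagentType, prompt string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := dedupeKey(subagentType, prompt)
	e, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if time.Since(e.storedAt) >= c.ttl {
		delete(c.entries, key) // expired: treat as a miss
		return "", false
	}
	return "you already dispatched this; prior result: " + e.result, true
}

// Store records a finished worker result for later duplicate calls.
func (c *dedupeCache) Store(subagentType, prompt, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[dedupeKey(subagentType, prompt)] = dedupeEntry{result: result, storedAt: time.Now()}
}
```

A caller would try Lookup before spawning a worker and call Store once the worker reaches a terminal state.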

How a system prompt is composed

ComposeSystemPrompt assembles the prompt the LLM receives, in order:

  1. basePrompt — the runtime's universal preamble. May be empty.
  2. agentProfile.SystemPrompt — the persona's markdown body. Coordinator may carry one (most do); workers always do.
  3. agentProfile.CriticalReminder — wrapped in <critical-reminder>...</critical-reminder>. This is the persona-level constant; the graph layer also re-injects it per-turn as a <system-reminder> so it survives long-session attention drift.
  4. For each active skill: a [能力: <name>] header (能力 means "capability") followed by the skill's PromptBody.
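The ordering above can be sketched as a simple concatenation. The function below is an illustration only: the struct shapes, the blank-line separator, and the `catalog` parameter (standing in for the coordinator-only agent catalog, empty for workers) are assumptions, not the actual ComposeSystemPrompt signature.

```go
package main

import "strings"

// skill and agentProfile are assumed shapes for illustration.
type skill struct {
	Name       string
	PromptBody string
}

type agentProfile struct {
	SystemPrompt     string
	CriticalReminder string
}

// composeSystemPrompt joins the non-empty parts in the documented order.
func composeSystemPrompt(basePrompt string, profile agentProfile, catalog string, skills []skill) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(basePrompt)           // 1. universal preamble, may be empty
	add(profile.SystemPrompt) // 2. persona markdown body
	if r := strings.TrimSpace(profile.CriticalReminder); r != "" {
		add("<critical-reminder>\n" + r + "\n</critical-reminder>") // 3.
	}
	add(catalog) // coordinator only: placed before the skill list
	for _, s := range skills {
		add("[能力: " + s.Name + "]\n" + s.PromptBody) // 4. one block per skill
	}
	return strings.Join(parts, "\n\n")
}
```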

Coordinator gets an extra block before the skill list: the agent catalog (buildAgentCatalog) — a markdown bullet list of "available specialists" with the description and the first line of when_to_use. The catalog deliberately excludes reviewer (only the ReviewGate decorator may spawn it) and default (the virtual top-level persona — listing it as a spawnable sub-agent would let the coordinator recursively spawn itself).
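A catalog builder along these lines might look like the sketch below. The `persona` struct, the exact bullet format, and the name-sorted order are assumptions; only the two exclusions and the use of the first when_to_use line come from the description above.

```go
package main

import (
	"sort"
	"strings"
)

// persona holds just the fields the catalog needs (assumed shape).
type persona struct {
	Name        string
	Description string
	WhenToUse   string
}

// buildAgentCatalog renders one bullet per spawnable persona, skipping
// reviewer (decorator-only) and default (the coordinator itself).
func buildAgentCatalog(personas []persona) string {
	sorted := append([]persona(nil), personas...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	for _, p := range sorted {
		if p.Name == "reviewer" || p.Name == "default" {
			continue
		}
		// Keep only the first line of when_to_use, without a trailing colon.
		first, _, _ := strings.Cut(strings.TrimSpace(p.WhenToUse), "\n")
		first = strings.TrimSuffix(strings.TrimSpace(first), ":")
		b.WriteString("- " + p.Name + ": " + p.Description)
		if first != "" {
			b.WriteString(" (" + first + ")")
		}
		b.WriteString("\n")
	}
	return b.String()
}
```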

The agent registry

The AgentRegistry holds parsed personas in memory. It is loaded once at startup; Reload() swaps the whole set under a single sync.RWMutex, so an in-flight coordinator turn never observes a half-loaded registry.

| Method | When |
| --- | --- |
| Load(root) | Startup. Walks <root>/**/*.md recursively. |
| Reload(root, ...) | Marketplace install/uninstall. Atomic swap. |
| ByName(name) | Lookup at spawn time. Returns (nil, false) on miss. |
| Replace(ag) | User-agent edit. Upsert in place (live registry, no restart). |
| Remove(name) | User-agent delete. |

Per-file parse errors land as warnings rather than aborting the walk (same policy as the skill registry). Non-existent agent roots are not an error — you get an empty registry, not a startup failure.
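The swap-under-lock pattern can be sketched as follows. This is a minimal in-memory illustration: the `agent` struct is an assumed shape, file walking and parsing are omitted, and Reload here takes already-parsed agents instead of a root directory.

```go
package main

import "sync"

// agent is an assumed minimal shape for a parsed persona.
type agent struct {
	Name         string
	SystemPrompt string
}

type agentRegistry struct {
	mu     sync.RWMutex
	byName map[string]*agent
}

// ByName returns (nil, false) on a miss, matching the table above.
func (r *agentRegistry) ByName(name string) (*agent, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ag, ok := r.byName[name]
	return ag, ok
}

// Reload builds the new map outside the lock and swaps it in one step,
// so readers see either the old set or the new set, never a mix.
func (r *agentRegistry) Reload(agents []*agent) {
	next := make(map[string]*agent, len(agents))
	for _, ag := range agents {
		next[ag.Name] = ag
	}
	r.mu.Lock()
	r.byName = next
	r.mu.Unlock()
}

// Replace upserts a single agent in the live registry.
func (r *agentRegistry) Replace(ag *agent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.byName == nil {
		r.byName = make(map[string]*agent)
	}
	r.byName[ag.Name] = ag
}

// Remove deletes an agent; removing a missing name is a no-op.
func (r *agentRegistry) Remove(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.byName, name)
}
```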

Reviewer and the review gate

The reviewer is special: the coordinator cannot spawn it. It's gated by the ReviewGate decorator, which intercepts any tool whose Class is "write" or "destructive". The decorator:

  1. Builds a proposal payload (action, target, reason, blast_radius, operator).
  2. Spawns the reviewer worker (sync from the inner tool's POV) with its own 60s timeout, independent of the inner tool's 15s.
  3. If the reviewer returns Decision: approve, the call falls through to the inner tool. Otherwise the decorator returns ErrReviewRejected wrapping the reviewer's reason.
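The three steps above can be sketched as a decorator around a tool. Everything except ErrReviewRejected, the tool classes, the proposal field names, and the two timeouts is an assumption: the `tool` and `reviewFunc` shapes, the restart_service tool, and the operator name are hypothetical, and the sketch assumes the 15s limit is applied around the inner call rather than inherited by the reviewer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var ErrReviewRejected = errors.New("review rejected")

type proposal struct {
	Action, Target, Reason, BlastRadius, Operator string
}

type tool struct {
	Name  string
	Class string // "read", "write", or "destructive"
	Run   func(ctx context.Context, target string) (string, error)
}

// reviewFunc stands in for spawning the reviewer worker synchronously.
type reviewFunc func(ctx context.Context, p proposal) (decision, reason string, err error)

// withReviewGate wraps write/destructive tools; other tools pass through.
func withReviewGate(inner tool, review reviewFunc, operator string) tool {
	if inner.Class != "write" && inner.Class != "destructive" {
		return inner
	}
	gated := inner
	gated.Run = func(ctx context.Context, target string) (string, error) {
		// Step 1: build the proposal (Reason and BlastRadius omitted here).
		p := proposal{Action: inner.Name, Target: target, Operator: operator}
		// Step 2: the reviewer gets its own 60s budget.
		reviewCtx, cancelReview := context.WithTimeout(ctx, 60*time.Second)
		defer cancelReview()
		decision, reason, err := review(reviewCtx, p)
		if err != nil {
			return "", fmt.Errorf("reviewer did not finish: %w", err)
		}
		// Step 3: anything other than approve blocks the call.
		if decision != "approve" {
			return "", fmt.Errorf("%w: %s", ErrReviewRejected, reason)
		}
		innerCtx, cancelInner := context.WithTimeout(ctx, 15*time.Second)
		defer cancelInner()
		return inner.Run(innerCtx, target)
	}
	return gated
}

// demoReviewGate wires a stub reviewer around a fake write tool.
func demoReviewGate(decision string) (string, error) {
	inner := tool{Name: "restart_service", Class: "write",
		Run: func(ctx context.Context, target string) (string, error) {
			return "restarted " + target, nil
		}}
	review := func(ctx context.Context, p proposal) (string, string, error) {
		return decision, "outside change window", nil
	}
	return withReviewGate(inner, review, "alice").Run(context.Background(), "nginx")
}
```

Callers can detect a rejection with errors.Is(err, ErrReviewRejected) because the sentinel is wrapped with %w.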

This is why the agent catalog drops reviewer: only the decorator gets to invoke it. Putting it in the catalog would let the coordinator dispatch ad-hoc "reviews" that don't gate anything.

See Reviewer for the full state machine.

Worker lifecycle

Runtime.SpawnWorker state machine:

```text
pending  → running  → completed
                   ↘ failed
                   ↘ killed     (StopWorker called while running)
```
  • A chat_sessions row is created up-front so audit + parent → worker tree queries resolve even while the worker is still running.
  • Background spawns derive from the long-lived runtime context so a finishing HTTP request can't tear down the worker mid-run; sync spawns inherit the caller's ctx.
  • The session row is closed (closed_at set) on every terminal path, including panics — without this, orphan rows accumulate (the test env hit 161 before the fix landed).
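The "closed on every terminal path, including panics" rule maps onto a deferred function with recover. The sketch below models the session row as an in-memory struct; `workerSession`, `runWorker`, and the `errKilled` sentinel (standing in for StopWorker) are hypothetical names, and no database is involved.

```go
package main

import (
	"errors"
	"fmt"
)

type workerState string

const (
	statePending   workerState = "pending"
	stateRunning   workerState = "running"
	stateCompleted workerState = "completed"
	stateFailed    workerState = "failed"
	stateKilled    workerState = "killed"
)

// errKilled is an assumed sentinel for "StopWorker was called".
var errKilled = errors.New("worker killed")

// workerSession stands in for the chat_sessions row.
type workerSession struct {
	State  workerState
	Closed bool // stands in for closed_at being set
	Err    error
}

// runWorker creates the session up-front, runs the body, and closes the
// session on every terminal path, including a panic in the body.
func runWorker(body func() error) (s *workerSession) {
	s = &workerSession{State: statePending}
	defer func() {
		if r := recover(); r != nil {
			s.State = stateFailed
			s.Err = fmt.Errorf("worker panicked: %v", r)
		}
		s.Closed = true // runs for completed, failed, and killed alike
	}()

	s.State = stateRunning
	err := body()
	switch {
	case err == nil:
		s.State = stateCompleted
	case errors.Is(err, errKilled):
		s.State = stateKilled
	default:
		s.State, s.Err = stateFailed, err
	}
	return s
}
```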

Workers cannot spawn workers. SpawnWorker is exposed only on the Runtime, and the disabledForWorker filter strips AgentTool from any worker's tool bag. One coordinator, N parallel workers, no deeper nesting.
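The no-nesting rule can be pictured as a filter applied while building a worker's tool bag. The sketch below models disabledForWorker as a set of tool names, which is an assumption about its shape; `workerToolBag` is a hypothetical helper.

```go
package main

// disabledForWorker lists tools that must never reach a worker's tool bag.
// Modeling it as a set is an assumption made for this sketch.
var disabledForWorker = map[string]bool{"AgentTool": true}

// workerToolBag keeps the persona's whitelisted tools, minus anything
// disabled for workers, preserving the whitelist order.
func workerToolBag(whitelist []string) []string {
	bag := make([]string, 0, len(whitelist))
	for _, name := range whitelist {
		if disabledForWorker[name] {
			continue // a worker can never dispatch another worker
		}
		bag = append(bag, name)
	}
	return bag
}
```

Even if a user-authored persona lists AgentTool in its tools: whitelist, a filter like this would drop it before the worker starts.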

What's next

  • Coordinator — the default persona, its three control tools, and when to dispatch vs answer directly.
  • Incident investigator — root cause to the "patient zero", the 18-tool budget, and the F1 eval.
  • Specialists — compute / disk / network / ops / sre, what each owns, how the coordinator picks.
  • Reviewer — the SOP double-sign gate on mutating tools.
  • Custom agents — author your own persona, hot-reload, debug.