Coordinator

The coordinator is the persona the user actually talks to. Every chat session (web UI, Slack, Telegram, Feishu) is owned by one coordinator instance. It is not a worker — it's the long-lived ReAct loop that drives the conversation, decides when to answer directly, and decides when to dispatch to a specialist via AgentTool.

Identity

| Field | Value |
| --- | --- |
| Persona name | default |
| Spawnable as worker | No (the agent catalog excludes default) |
| Session model | One long-lived chat_sessions row per (user, chat thread) |
| Tool bag | Wide: query_*, AgentTool, SendMessage, TaskStop, redirect stubs |
| Persona body | Ships in the manager image; per-org override via mounted file |

If you want to change the coordinator's behavior fleet-wide, mount a custom default.md over the built-in one (see Custom agents). The AgentRegistry.Replace seam is what powers the user-agent edit UI; an override file uses the same code path at startup.

What it can do directly

The coordinator's tool bag includes all read-only query tools that work at the manager scope:

  • query_promql, query_logql, query_traceql — telemetry. Stubbed on the coordinator (redirect to incident-investigator) for known hallucination cases — see below.
  • query_knowledge — RAG over the vault + uploads.
  • query_incidents, get_incident_detail, query_alert_rules, query_devices — alert / inventory.
  • query_change_events — recent config / rule / device mutations.
  • expand_topology, find_topology_node — service graph.
  • get_active_incidents — current open alerts.
  • BC handler tools — anything an org/user admin would do from the UI.

It deliberately does not carry host-shell or per-host inspection tools. Those live on specialists.

The control-tool trio

Three special tools live only on the coordinator. They control multi-agent orchestration; they don't query anything.

AgentTool

Dispatches a worker. Synchronous. The full schema, dedupe behavior, and "don't delegate trivia" guardrail are documented in the Agents overview. The shape:

```json
{
  "description": "Find which process is OOM-ing on node-01",
  "subagent_type": "specialist-compute",
  "prompt": "On device_id=7 we saw mem_used_pct=98 at 14:02. Find the top RSS processes and check dmesg for oom-killer hits. The user originally asked: 'who's eating memory on node-01?'"
}
```

The system reminder injects the persona catalog so the LLM knows the valid subagent_type values. The reminder is re-injected every turn, so the list of specialists stays in the current context even when attention drifts in a long session.
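As a rough illustration, the payload above could be decoded and checked against the catalog like this. Only the three JSON field names and the rule that default is not spawnable come from this page; the AgentToolArgs type, the validate and parseAgentTool helpers, the non-empty prompt check, and the catalog contents are assumptions made for the sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AgentToolArgs mirrors the three JSON fields shown above.
// The Go type name is hypothetical.
type AgentToolArgs struct {
	Description  string `json:"description"`
	SubagentType string `json:"subagent_type"`
	Prompt       string `json:"prompt"`
}

// validate rejects calls the catalog cannot serve.
// The empty-prompt check is an assumption, not a documented rule.
func validate(a AgentToolArgs, catalog map[string]bool) error {
	if a.SubagentType == "default" {
		return errors.New("default is the coordinator and is not spawnable")
	}
	if !catalog[a.SubagentType] {
		return fmt.Errorf("unknown subagent_type %q", a.SubagentType)
	}
	if a.Prompt == "" {
		return errors.New("prompt must describe a self-contained task")
	}
	return nil
}

// parseAgentTool decodes the raw JSON arguments and validates them.
func parseAgentTool(raw string, catalog map[string]bool) (AgentToolArgs, error) {
	var a AgentToolArgs
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		return a, err
	}
	return a, validate(a, catalog)
}

func main() {
	// Illustrative catalog; the real one is generated from the agent registry.
	catalog := map[string]bool{"specialist-compute": true, "incident-investigator": true}
	raw := `{"description":"Find which process is OOM-ing on node-01","subagent_type":"specialist-compute","prompt":"Find the top RSS processes on device_id=7."}`
	a, err := parseAgentTool(raw, catalog)
	fmt.Println(a.SubagentType, err)
}
```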

SendMessage

Sends an interim message to the user without ending the turn. Used when the coordinator wants to say "I'm dispatching to specialist-network — give me a minute" before the dispatch settles. Workers don't carry this tool — only the coordinator talks to the user.

The IM bridge uses this to update the placeholder message in chat (chat.update on Slack, editMessageText on Telegram, PUT /messages on Feishu). The web UI uses it to stream interim text into the transcript before the final answer.

TaskStop

Politely cancels a running worker. The coordinator emits TaskStop when, e.g., the user changes the subject mid-dispatch and the running worker's answer is no longer relevant.

Internally it calls Runtime.StopWorker, which fires the worker's context.CancelFunc. The worker observes ctx.Done() on its next tool call and sets Status = killed.

When to dispatch vs answer directly

The coordinator's persona prompt encodes the rule:

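"AgentTool is not the default option. If local tools can answer the question in 1-2 steps, answer it yourself. Dispatch only for complex deep dives (5+ steps, cross-host, or cross-signal)." (Translated from the Chinese persona prompt.)

In practice, dispatch only when the task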

  • needs deep iteration (5+ tool calls of focused inspection),
  • needs host-side tools (host_bash, host_probe_*, host_du_summary), or
  • benefits from a domain-specific tool bag the coordinator doesn't carry.

For "what's the load on node-01?" the coordinator answers directly with one get_host_load call. (Strictly, get_host_load is a redirect stub on the coordinator, so that single call becomes a stub-mediated AgentTool(specialist-compute) dispatch; the principle is the same.) For "why is the order service throwing 502s and what should I do about it?" the coordinator dispatches to incident-investigator.
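The three dispatch criteria can be written as a predicate. This is only a restatement of the rule: the real decision is made by the LLM following the persona prompt, and the Task type, its fields, and shouldDispatch are hypothetical names.

```go
package main

import "fmt"

// Task is a hypothetical summary of what a user request needs.
type Task struct {
	EstimatedToolCalls int  // expected number of focused tool calls
	NeedsHostTools     bool // host_bash, host_probe_*, host_du_summary, ...
	NeedsSpecialistBag bool // a domain tool bag the coordinator doesn't carry
}

// shouldDispatch returns true when any one of the three criteria holds.
func shouldDispatch(t Task) bool {
	return t.EstimatedToolCalls >= 5 || t.NeedsHostTools || t.NeedsSpecialistBag
}

func main() {
	fmt.Println(shouldDispatch(Task{EstimatedToolCalls: 1}))                       // short inline lookup
	fmt.Println(shouldDispatch(Task{EstimatedToolCalls: 2, NeedsHostTools: true})) // host-side inspection
}
```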

Redirect stubs (anti-hallucination)

The coordinator's bag carries RedirectStub entries for tool names the LLM is known to hallucinate (host_bash, host_du_summary, host_restart_service, get_host_load, correlate_incident, …). When the LLM picks a stubbed name, the stub returns:

```json
{
  "status": "redirect",
  "hint": "This tool is not available in coordinator scope. Re-invoke via AgentTool to dispatch to specialist-disk.",
  "reason": "Directory usage analysis",
  "suggested_call": "AgentTool(description=\"\", subagent_type=\"specialist-disk\", prompt=\"<self-contained task>\", background=true)",
  "why_stub_exists": "Coordinator's job is dispatch + triage. Deep-dive tools live on specialist workers; calling them inline is the wrong pattern."
}
```

Without these stubs, eino's graph runtime would abort with [NodeRunError] tool X not found in toolsNode indexes and waste the turn. With them, the LLM learns from the result and tries again with the correct AgentTool call on the next iteration. The stub list lives in CoordinatorRedirectStubs; it's pure data — append over time as new hallucinations show up.

Don't shadow the entire specialist bag

Each stub takes a slot in the schema list presented to the LLM. Stub too many tools and the prompt budget bloats; the LLM may also start treating stubs as valid options to consider. Only tool names the LLM has been observed to hallucinate in evals get a stub.

What the user sees

Each turn the coordinator can:

  1. Stream tokens directly — the answer is short and self-contained.
  2. Call read-only query tools inline: query_incidents, expand_topology, query_knowledge.
  3. Dispatch a worker via AgentTool. The SPA renders an "Agent tile" with the persona name and the one-line description while the worker runs. The final worker result lands as a child of the tile.
  4. Send an interim message via SendMessage — usually before a slow dispatch settles ("Looking into it; spawning specialist-network").
  5. Stop a worker via TaskStop — rare, usually only when the user redirects mid-dispatch.

The user never sees the worker's raw transcript. The coordinator synthesizes the worker's Result into its own reply.

Session and audit

  • Coordinator session lives in chat_sessions with agent_id = "default" and parent_session_id = NULL.
  • Worker sessions get parent_session_id pointing at the coordinator's row. The audit UI traverses this tree to render parent → child runs.
  • Every tool call gets an audit_logs row with the LLM's view of the args and the truncated result (see self-observability).
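The parent-to-child traversal over chat_sessions can be sketched as follows. The Session struct, its ID field, and the in-memory slice are assumptions standing in for the real table and audit UI; only agent_id and parent_session_id are named on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// Session holds the columns named above. A nil ParentID stands in for NULL.
// The ID field and the Go type are hypothetical.
type Session struct {
	ID       string
	AgentID  string
	ParentID *string
}

// childrenOf indexes worker sessions by their parent session ID.
func childrenOf(sessions []Session) map[string][]Session {
	m := make(map[string][]Session)
	for _, s := range sessions {
		if s.ParentID != nil {
			m[*s.ParentID] = append(m[*s.ParentID], s)
		}
	}
	return m
}

// render walks the tree depth-first and returns one indented line per session.
func render(root Session, kids map[string][]Session, depth int) []string {
	lines := []string{strings.Repeat("  ", depth) + root.AgentID + " (" + root.ID + ")"}
	for _, c := range kids[root.ID] {
		lines = append(lines, render(c, kids, depth+1)...)
	}
	return lines
}

func main() {
	root := "s1"
	sessions := []Session{
		{ID: "s1", AgentID: "default"}, // coordinator: no parent
		{ID: "s2", AgentID: "specialist-compute", ParentID: &root},
		{ID: "s3", AgentID: "incident-investigator", ParentID: &root},
	}
	for _, line := range render(sessions[0], childrenOf(sessions), 0) {
		fmt.Println(line)
	}
}
```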

Customizing the coordinator

You can override the coordinator's persona file. Things you'd typically want to change:

  • The dispatch guidance (your team has a different "when to delegate" policy).
  • The output language and tone (set default_locale on the channel, or pin it in the persona body).
  • The list of read-only inline tools you want the coordinator to use before dispatching.

Things you should not change:

  • The agent catalog block — it's machine-generated. Editing it has no effect; the next reload regenerates it.
  • The redirect-stub semantics — they're code, not persona.
  • The default name — the runtime looks the coordinator up by name.

See Custom agents for the hot-reload story.