Authoring custom agents
Custom personas extend Ongrid with your own specialists. They live on disk as <name>.md files with YAML frontmatter, exactly like the built-ins — same loader, same registry, same dispatch path. Write one, mount it, and the coordinator can dispatch to it.
This page is the contract.
File layout
A persona is a single Markdown file with YAML frontmatter:
---
name: specialist-clickhouse
description: ClickHouse query performance / partition health / mutation backlog specialist
when_to_use: |
  When the user asks about:
  - Slow ClickHouse query plans / scans / shuffles
  - Partition merges / mutation backlog
  - Replication lag between replicas
  - system.parts / system.mutations inspection
tools:
- query_knowledge
- query_clickhouse_system # custom BaseTool you registered
- query_promql # for clickhouse_* metrics
- host_bash
- get_edge_summary
disallowed_tools:
- host_restart_service
permission_mode: read-only
max_turns: 12
model: anthropic/claude-sonnet-4-7
critical_reminder: |
  You're read-only. Never propose direct ALTER / OPTIMIZE without
  citing the system.mutations evidence first. Always check the
  replication lag before recommending any maintenance command.
---
# specialist-clickhouse
You are Ongrid's ClickHouse specialist.
## Step 0: knowledge base check (mandatory)
Before any inspection, call `query_knowledge` once with a natural-
language description of the question. Hit (score >= 0.6) → follow
the playbook. Cite as `(参考 KB: <title>)` in your final reply.
## Working style
1. Start with `query_clickhouse_system` for system.parts /
system.mutations / system.replication_queue. One call, broad
snapshot.
2. If a specific table is suspect, drill into `system.parts` for
that table with bytes / rows / merge_state.
3. For replication: `system.replication_queue` for failures,
`clickhouse_replica_delay_seconds` PromQL series for trend.
4. For query perf: `system.query_log` with `query_duration_ms`
sort + `read_rows` to find the heavy query.
## Output
- 现状 (1-2 sentences): which table, which metric, what's wrong.
- 证据 (2-3 lines): system.* row excerpts + PromQL value.
- 建议 (1 line): observation only, or "recommend dispatching
specialist-ops to run OPTIMIZE/ALTER under reviewer".
Frontmatter reference
The fields the parser understands (ParseAgentMd):
| Field | Required | Type | Purpose |
|---|---|---|---|
| name | yes | string | Spawn key. Must be unique. snake_case or kebab-case. |
| description | yes | string | Surfaced in the coordinator's agent catalog. |
| when_to_use | yes | string | First line surfaces in catalog. Strict-required because the coordinator can't pick a persona without it. |
| tools | no | []string | Whitelist of BaseTool names. Empty = inherit nothing. |
| disallowed_tools | no | []string | Blacklist. Wins over whitelist; supports wildcards (`*_skill`). |
| permission_mode | no | string | read-only / mutating-with-confirm / dual-sign-required. Today informational; future versions may auto-wire decorators based on this. |
| max_turns | no | int | Hard ReAct loop cap. Default 15. |
| model | no | string | LLM identifier (e.g. anthropic/claude-sonnet-4-7). Falls back to org default. |
| critical_reminder | no | string | Wrapped in `<critical-reminder>...</critical-reminder>` in the system prompt. Also re-injected per-turn by the graph layer. |
| initial_prompt | no | string | Prepended to the worker's first user turn. Rarely used. |
| background | no | bool | true = async spawn (UI doesn't block). Used by reviewer. |
| omit_claude_md | no | bool | Suppress the runtime's base prompt for this persona. |
| metadata | no | map | Free-form. metadata.ongrid.{scope, min_ongrid_version} is read by the registry; everything else is pass-through. |
Unknown fields are preserved into Agent.UnknownFields so future Claude Code persona format additions (effort, isolation, mcp_servers, hooks, …) don't break loading.
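As a rough pre-flight check for the three required fields, the sketch below splits a persona file at its `---` delimiters and scans for top-level keys. It is not `ParseAgentMd`: the function names `splitFrontmatter`, `topLevelKeys` and `missingRequired` are hypothetical, and the key scan is a deliberately naive line scan rather than a YAML parser.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitFrontmatter separates the YAML frontmatter from the Markdown body.
// It expects the file to open with a "---" line and to contain a second
// "---" line that closes the frontmatter.
func splitFrontmatter(doc string) (front, body string, err error) {
	lines := strings.Split(doc, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return "", "", errors.New("missing opening --- line")
	}
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			front = strings.Join(lines[1:i], "\n")
			body = strings.Join(lines[i+1:], "\n")
			return front, body, nil
		}
	}
	return "", "", errors.New("missing closing --- line")
}

// topLevelKeys collects unindented "key:" names. Indented lines, list items
// and comments are skipped. This is a naive scan, not a YAML parser.
func topLevelKeys(front string) map[string]bool {
	keys := map[string]bool{}
	for _, line := range strings.Split(front, "\n") {
		if line == "" || strings.ContainsRune(" \t-#", rune(line[0])) {
			continue
		}
		if name, _, ok := strings.Cut(line, ":"); ok {
			keys[strings.TrimSpace(name)] = true
		}
	}
	return keys
}

// missingRequired lists the required fields absent from the frontmatter.
func missingRequired(front string) []string {
	keys := topLevelKeys(front)
	var missing []string
	for _, field := range []string{"name", "description", "when_to_use"} {
		if !keys[field] {
			missing = append(missing, field)
		}
	}
	return missing
}

func main() {
	doc := "---\nname: specialist-clickhouse\ndescription: ClickHouse specialist\n---\n# specialist-clickhouse\n"
	front, _, err := splitFrontmatter(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("missing required fields:", missingRequired(front))
}
```

Running a check like this before mounting a file catches a missing `when_to_use` early, instead of discovering it as a loader warning.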
tools vs disallowed_tools
Whitelist plus blacklist; the blacklist wins. So:
tools: ["query_*", "host_bash"]        # everything starting with query_, plus bash
disallowed_tools: ["query_devices"]    # but not this one
leaves query_promql, query_logql, query_traceql, query_knowledge, … and host_bash, minus query_devices.
Wildcards: *_skill matches every tool name ending in _skill. This is how the reviewer blocks all skill executions in one line.
The AgentTool is also stripped from any worker's bag automatically — workers cannot spawn workers. You don't need to list it under disallowed_tools.
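The filtering rules above can be sketched in a few lines of Go. This is an illustration, not the runtime's filter: `filterTools` and `matchesAny` are hypothetical names, the registry name `agent_tool` for the AgentTool is an assumption, and the wildcard matching is approximated with the standard library's `path.Match`.

```go
package main

import (
	"fmt"
	"path"
)

// agentToolName is a hypothetical registry name for the AgentTool.
const agentToolName = "agent_tool"

// matchesAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchesAny(name string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// filterTools returns the tools from bag that a worker may call: matched by
// the whitelist, not matched by the blacklist, and never the AgentTool.
// An empty whitelist yields an empty result (inherit nothing).
func filterTools(bag, allow, deny []string) []string {
	var out []string
	for _, name := range bag {
		if name == agentToolName {
			continue // workers cannot spawn workers
		}
		if !matchesAny(name, allow) {
			continue // not whitelisted
		}
		if matchesAny(name, deny) {
			continue // the blacklist wins over the whitelist
		}
		out = append(out, name)
	}
	return out
}

func main() {
	bag := []string{"query_promql", "query_devices", "query_knowledge", "host_bash", "host_restart_service", agentToolName}
	allow := []string{"query_*", "host_bash"}
	deny := []string{"query_devices"}
	fmt.Println(filterTools(bag, allow, deny))
}
```

Because the result is built by walking the bag, a whitelisted name that is not registered never appears in the output, which mirrors the silent-drop behavior described under Debugging.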
Where personas live
The runtime walks two roots:
- The image-baked root: `/app/agents/` inside the manager container. Contains the six shipped personas. Read-only inside the image; survives container restart but not custom code.
- The marketplace root: `/var/lib/ongrid/agents/` (mounted volume). User-authored personas land here via the Settings → Agents UI or via the marketplace install path.
Both are merged into the same AgentRegistry. On a name collision the loader records a warning and keeps the persona that was loaded first. To override a built-in persona, save your version with the same name via the Settings UI; AgentRegistry.Replace upserts in place.
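The first-load-wins merge can be pictured with the sketch below. The `Persona` struct and `mergeRoots` are hypothetical stand-ins, and the order in which the two roots are walked is an assumption (image-baked first); the page only states that the first load is kept and a warning is recorded.

```go
package main

import "fmt"

// Persona is a minimal stand-in for a parsed persona file.
type Persona struct {
	Name   string
	Source string // path the persona was loaded from
}

// mergeRoots loads personas root by root. On a name collision the first
// load is kept and a warning is recorded for the later one.
func mergeRoots(roots ...[]Persona) (map[string]Persona, []string) {
	merged := map[string]Persona{}
	var warnings []string
	for _, root := range roots {
		for _, p := range root {
			if prev, ok := merged[p.Name]; ok {
				warnings = append(warnings, fmt.Sprintf(
					"duplicate persona %q: keeping %s, ignoring %s",
					p.Name, prev.Source, p.Source))
				continue
			}
			merged[p.Name] = p
		}
	}
	return merged, warnings
}

func main() {
	baked := []Persona{{Name: "specialist-disk", Source: "/app/agents/specialist-disk.md"}}
	market := []Persona{
		{Name: "specialist-disk", Source: "/var/lib/ongrid/agents/specialist-disk.md"},
		{Name: "specialist-clickhouse", Source: "/var/lib/ongrid/agents/specialist-clickhouse.md"},
	}
	merged, warnings := mergeRoots(baked, market)
	fmt.Println(len(merged), "personas loaded")
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Under that assumed order, a same-named file dropped into the marketplace root would lose to the built-in at load time, which is why the override path goes through the Settings UI and `Replace` rather than through the file system.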
Where to start
The fastest path is to copy agents/specialist-disk.md into your editor, rename, and adjust the tool bag. The shape carries all the conventions (KB-first, 4-step recipe, output format) that work well with the coordinator.
Hot reload vs restart
| Action | Hot-reloadable? | How |
|---|---|---|
| Edit persona body (system prompt) | Yes | Settings → Agents → Save |
| Change tool whitelist | Yes | Same. Filter is applied per spawn. |
| Change model / max_turns | Yes | Same. New spawns pick up the new values. |
| Add a new persona | Yes | Settings → Agents → New, or drop file + Reload |
| Delete a persona | Yes | Settings → Agents → Delete, or remove file + Reload |
| Override a built-in (same name) | Yes | Replace upserts; coordinator uses new. |
| Change which tools exist in the bag | No | BaseTool registration is binary-side. |
| Add a new BaseTool | No | Requires code change + manager restart. |
| Change default_locale semantics | No | That's runtime code. |
The lock around AgentRegistry is a sync.RWMutex. An in-flight coordinator turn that already fetched a persona pointer keeps using the snapshot; the next coordinator turn sees the new persona.
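A minimal sketch of that snapshot behavior, assuming the registry stores persona pointers in a map guarded by the `sync.RWMutex`. The `Registry` type and its methods here are hypothetical stand-ins, not the actual AgentRegistry code.

```go
package main

import (
	"fmt"
	"sync"
)

// Persona holds only the fields this sketch needs.
type Persona struct {
	Name     string
	MaxTurns int
}

// Registry is a hypothetical stand-in for AgentRegistry.
type Registry struct {
	mu       sync.RWMutex
	personas map[string]*Persona
}

func NewRegistry() *Registry {
	return &Registry{personas: map[string]*Persona{}}
}

// Get returns the current persona pointer under a read lock.
func (r *Registry) Get(name string) (*Persona, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.personas[name]
	return p, ok
}

// Replace upserts by swapping in a new pointer. The previous value is never
// mutated, so callers holding the old pointer keep a stable snapshot.
func (r *Registry) Replace(p Persona) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.personas[p.Name] = &p
}

func main() {
	reg := NewRegistry()
	reg.Replace(Persona{Name: "specialist-disk", MaxTurns: 15})

	inFlight, _ := reg.Get("specialist-disk") // fetched by a running turn
	reg.Replace(Persona{Name: "specialist-disk", MaxTurns: 20})
	next, _ := reg.Get("specialist-disk") // fetched by the next turn

	fmt.Println("in-flight turn sees max_turns:", inFlight.MaxTurns)
	fmt.Println("next turn sees max_turns:", next.MaxTurns)
}
```

Swapping the pointer rather than editing the struct in place is what keeps an in-flight turn consistent without holding the lock for the duration of the turn.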
Debugging
"Coordinator never dispatches to my persona"
- Check the agent catalog in the coordinator's system prompt (the manager logs the rendered prompt at startup with `--log-level=debug`). Your persona should appear with its `description` and the first line of `when_to_use`.
- If the catalog is missing it: the loader recorded a warning. Check `AgentRegistry.Warnings()` via the API (`GET /api/v1/agents/warnings`) or look for `chatruntime: parse <path>` lines in the manager log.
- If the catalog has it but the LLM doesn't pick it: tighten `when_to_use`. Lead with a concrete trigger pattern; the LLM is prompted to read the first line as the matching hint.
"Worker spawns but immediately fails"
Common causes:
- A whitelisted tool isn't in the bag. The runtime filters and silently drops anything not present; the worker can't call what isn't there. Check `GET /api/v1/skills` for the active bag.
- The model identifier is wrong. The chat model resolver maps `anthropic/<x>` to `default_provider` if not configured. Set `default_provider` to `anthropic` in Settings → LLM or pin a concrete provider+model in the persona.
- `max_turns` is too low. A worker that runs out of turns before producing a final assistant message returns as `failed`. Bump to 15+ for any non-trivial persona.
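Since dropped tools produce no error, it can help to compare the whitelist against the active bag before debugging further. The sketch below assumes you have already extracted the tool names from `GET /api/v1/skills` into a slice; the response shape is not described on this page, so the sketch takes plain slices. `droppedTools` is a hypothetical helper, and wildcards are approximated with `path.Match`.

```go
package main

import (
	"fmt"
	"path"
)

// droppedTools returns the whitelist entries that match nothing in the
// active bag. These are the entries the runtime would silently drop.
func droppedTools(allow, bag []string) []string {
	var dropped []string
	for _, pattern := range allow {
		found := false
		for _, name := range bag {
			if ok, err := path.Match(pattern, name); err == nil && ok {
				found = true
				break
			}
		}
		if !found {
			dropped = append(dropped, pattern)
		}
	}
	return dropped
}

func main() {
	allow := []string{"query_knowledge", "query_clickhouse_system", "query_promql", "host_bash"}
	bag := []string{"query_knowledge", "query_promql", "host_bash"} // names taken from the active bag
	for _, name := range droppedTools(allow, bag) {
		fmt.Println("whitelisted but not registered:", name)
	}
}
```

In this example the custom `query_clickhouse_system` tool was never registered, so the worker would start without it and fail on its first working-style step.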
"Worker returns OK but the output is garbage"
The persona body is your system prompt. Tighten:
- Start with Step 0: a single forced KB call. Anchors the worker.
- Specify the output format verbatim in the body. The coordinator parses on this format.
- Use `critical_reminder` for hard constraints (read-only, no PII, output language). It's wrapped in `<critical-reminder>` AND re-injected per-turn, so the LLM sees it on every iteration.
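To make the wrapping concrete, here is a small sketch of how a reminder could be attached to the persona body. Only the `<critical-reminder>` tag comes from this page; the function names, the placement after the body, and the exact whitespace are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapReminder wraps the critical_reminder text in the tag named on this page.
func wrapReminder(reminder string) string {
	return "<critical-reminder>\n" + strings.TrimSpace(reminder) + "\n</critical-reminder>"
}

// buildSystemPrompt appends the wrapped reminder to the persona body.
// Placement after the body is an assumption made for this sketch.
func buildSystemPrompt(body, reminder string) string {
	if strings.TrimSpace(reminder) == "" {
		return body // nothing to wrap
	}
	return body + "\n\n" + wrapReminder(reminder)
}

func main() {
	body := "You are Ongrid's ClickHouse specialist."
	reminder := "You're read-only. Always check the replication lag first.\n"
	fmt.Println(buildSystemPrompt(body, reminder))
}
```

Because the reminder is also re-injected on every turn, it is best kept to a few short sentences; long reminders cost tokens on each iteration.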
Testing your persona
Two integration points:
From the chat surface
Open /chat, ask a question that matches your persona's when_to_use. Watch the SPA — if the coordinator dispatches, an "Agent tile" appears with your persona's name + the AgentTool description. Click for the worker's transcript.
From the API
curl -X POST http://localhost:8080/api/v1/chat \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<the question that should trigger your persona>"}'
The streaming response surfaces:
- `text` deltas: the coordinator's prose.
- `agent_tile` envelopes: every AgentTool dispatch.
- `task_notification` envelopes: worker completion.
If your persona is dispatched, the matching agent_tile.persona will be your name.
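For an automated check, a script can scan the stream for `agent_tile` envelopes. The wire format is not specified on this page, so the sketch below rests on loud assumptions: one JSON object per line, a `type` field that discriminates envelopes, and a top-level `persona` field on `agent_tile`. Adjust the struct to whatever your deployment actually emits.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// envelope models only the fields this sketch reads.
// Both field names are assumptions about the wire format.
type envelope struct {
	Type    string `json:"type"`
	Persona string `json:"persona"`
}

// dispatchedPersonas reads newline-delimited JSON envelopes and returns the
// persona of every agent_tile envelope, in stream order.
func dispatchedPersonas(stream io.Reader) ([]string, error) {
	var names []string
	sc := bufio.NewScanner(stream)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // skip blank separator lines
		}
		var env envelope
		if err := json.Unmarshal([]byte(line), &env); err != nil {
			return names, fmt.Errorf("decode envelope: %w", err)
		}
		if env.Type == "agent_tile" {
			names = append(names, env.Persona)
		}
	}
	return names, sc.Err()
}

// personasInTranscript is a convenience wrapper for a captured response body.
func personasInTranscript(transcript string) ([]string, error) {
	return dispatchedPersonas(strings.NewReader(transcript))
}

func main() {
	captured := `{"type":"text","delta":"Checking ClickHouse..."}
{"type":"agent_tile","persona":"specialist-clickhouse"}
{"type":"task_notification"}`
	names, err := personasInTranscript(captured)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("dispatched personas:", names)
}
```

In a real test, `dispatchedPersonas` would be given the HTTP response body of the chat request instead of a captured string.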
When to NOT write a custom persona
- The task is a 1-tool answer. Don't wrap "query my custom Prometheus" in a persona. Register a custom BaseTool and let the coordinator call it.
- The task is one-off. Personas are for repeated patterns. For a one-shot investigation, just ask the coordinator directly.
- The task needs to call across all 5 specialists. That's exactly what the coordinator is for; don't write a meta-specialist to recreate the coordinator's behavior.
A good rule: write a persona when the same shape of question recurs, the answer requires 5+ tool calls, and the tool bag is narrower than what the coordinator carries.
Sharing personas
- Drop the `.md` file in your `ops` repo. Mount it into the manager container under `/var/lib/ongrid/agents/`. The registry picks it up at startup (or on a Reload call).
- For org-wide rollout, ship through the skill marketplace: the marketplace install bundles personas + skills together and triggers a `Reload` automatically.
Related
- Agents overview — the full picture.
- Persona format — every field, no prose.
- Skill manifest — companion format for shipping personas alongside custom skills.