Authoring custom agents

Custom personas extend Ongrid with your own specialists. They live on disk as <name>.md files with YAML frontmatter, exactly like the built-ins — same loader, same registry, same dispatch path. Write one, mount it, and the coordinator can dispatch to it.

This page is the contract.

File layout

A persona is a single Markdown file with YAML frontmatter:

```markdown
---
name: specialist-clickhouse
description: ClickHouse query performance / partition health / mutation backlog specialist
when_to_use: |
  When the user asks about:
    - ClickHouse query plan / scan / shuffle slow
    - Partition merges / mutation backlog
    - Replication lag between replicas
    - system.parts / system.mutations inspection

tools:
  - query_knowledge
  - query_clickhouse_system   # custom BaseTool you registered
  - query_promql              # for clickhouse_* metrics
  - host_bash
  - get_edge_summary

disallowed_tools:
  - host_restart_service

permission_mode: read-only
max_turns: 12
model: anthropic/claude-sonnet-4-7

critical_reminder: |
  You're read-only. Never propose direct ALTER / OPTIMIZE without
  citing the system.mutations evidence first. Always check the
  replication lag before recommending any maintenance command.
---

# specialist-clickhouse

You are Ongrid's ClickHouse specialist.

## Step 0: knowledge base check (mandatory)

Before any inspection, call `query_knowledge` once with a natural-
language description of the question. Hit (score >= 0.6) → follow
the playbook. Cite as `(参考 KB: <title>)` in your final reply.

## Working style

1. Start with `query_clickhouse_system` for system.parts /
   system.mutations / system.replication_queue. One call, broad
   snapshot.
2. If a specific table is suspect, drill into `system.parts` for
   that table with bytes / rows / merge_state.
3. For replication: `system.replication_queue` for failures,
   `clickhouse_replica_delay_seconds` PromQL series for trend.
4. For query perf: `system.query_log` with `query_duration_ms`
   sort + `read_rows` to find the heavy query.

## Output

- 现状 (1-2 sentences): which table, which metric, what's wrong.
- 证据 (2-3 lines): system.* row excerpts + PromQL value.
- 建议 (1 line): observation only, or "recommend dispatching
  specialist-ops to run OPTIMIZE/ALTER under reviewer".
```

Frontmatter reference

The fields the parser understands (ParseAgentMd):

| Field | Required | Type | Purpose |
| --- | --- | --- | --- |
| `name` | yes | string | Spawn key. Must be unique. snake_case or kebab-case. |
| `description` | yes | string | Surfaced in the coordinator's agent catalog. |
| `when_to_use` | yes | string | First line surfaces in catalog. Strict-required because the coordinator can't pick a persona without it. |
| `tools` | no | []string | Whitelist of BaseTool names. Empty = inherit nothing. |
| `disallowed_tools` | no | []string | Blacklist. Wins over whitelist; supports wildcards (`*_skill`). |
| `permission_mode` | no | string | `read-only` / `mutating-with-confirm` / `dual-sign-required`. Today informational; future versions may auto-wire decorators based on this. |
| `max_turns` | no | int | Hard ReAct loop cap. Default 15. |
| `model` | no | string | LLM identifier (e.g. `anthropic/claude-sonnet-4-7`). Falls back to org default. |
| `critical_reminder` | no | string | Wrapped in `<critical-reminder>...</critical-reminder>` in the system prompt. Also re-injected per-turn by the graph layer. |
| `initial_prompt` | no | string | Prepended to the worker's first user turn. Rarely used. |
| `background` | no | bool | `true` = async spawn (UI doesn't block). Used by `reviewer`. |
| `omit_claude_md` | no | bool | Suppress the runtime's base prompt for this persona. |
| `metadata` | no | map | Free-form. `metadata.ongrid.{scope, min_ongrid_version}` is read by the registry; everything else is pass-through. |

Unknown fields are preserved into Agent.UnknownFields so future Claude Code persona format additions (effort, isolation, mcp_servers, hooks, …) don't break loading.
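
The field rules above can be sketched in a few lines of Python. This is an illustration only, not the actual ParseAgentMd code: the names `split_frontmatter`, `check_fields`, and `Persona` are hypothetical, and the sketch assumes the YAML block has already been parsed into a dict by whatever YAML library you use.

```python
from dataclasses import dataclass, field

# Field names come from the reference table; everything else is hypothetical.
REQUIRED = ("name", "description", "when_to_use")
OPTIONAL = (
    "tools", "disallowed_tools", "permission_mode", "max_turns", "model",
    "critical_reminder", "initial_prompt", "background", "omit_claude_md",
    "metadata",
)


@dataclass
class Persona:
    fields: dict
    unknown_fields: dict = field(default_factory=dict)


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split '---', YAML, '---', body into (yaml_text, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("missing closing '---'")


def check_fields(parsed: dict) -> Persona:
    """Enforce the three required fields and set unknown keys aside."""
    missing = [k for k in REQUIRED if not str(parsed.get(k) or "").strip()]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    known = {k: v for k, v in parsed.items() if k in REQUIRED + OPTIONAL}
    known.setdefault("max_turns", 15)  # documented default
    unknown = {k: v for k, v in parsed.items() if k not in REQUIRED + OPTIONAL}
    return Persona(known, unknown)
```

Running a draft persona through a check like this before mounting it catches a missing `when_to_use` early, which is the one omission the table calls out as strict-required.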

`tools` vs `disallowed_tools`

Both lists apply, and the blacklist wins. So:

```yaml
tools: ["query_*", "host_bash"]    # everything starting with query_, plus host_bash
disallowed_tools: ["query_devices"] # but not this one
```

leaves query_promql, query_logql, query_traceql, query_knowledge, … and host_bash, minus query_devices.

Wildcards: *_skill matches every tool name ending in _skill. This is how the reviewer blocks all skill executions in one line.

The AgentTool is also stripped from any worker's bag automatically — workers cannot spawn workers. You don't need to list it under disallowed_tools.
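
As a rough model of the filtering described in this section, the sketch below resolves a worker's bag from the two lists. It is not the runtime's implementation: `effective_tools` is a hypothetical name, the registered name `"AgentTool"` for the spawn tool is an assumption, and Python's `fnmatchcase` also honors `?` and `[...]`, which the runtime may not.

```python
from fnmatch import fnmatchcase


def effective_tools(bag, tools, disallowed_tools, spawn_tool="AgentTool"):
    """Return the tools a worker ends up with, in bag order.

    bag              -- every registered tool name (assumed input)
    tools            -- whitelist patterns; an empty list inherits nothing
    disallowed_tools -- blacklist patterns; these win over the whitelist
    spawn_tool       -- assumed name of the tool that spawns workers
    """
    def matches(name, patterns):
        return any(fnmatchcase(name, p) for p in patterns)

    return [
        name for name in bag
        if name != spawn_tool                    # workers cannot spawn workers
        and matches(name, tools)                 # whitelisted ...
        and not matches(name, disallowed_tools)  # ... and not blacklisted
    ]


bag = ["query_promql", "query_logql", "query_devices", "query_knowledge",
       "host_bash", "host_restart_service", "AgentTool"]
print(effective_tools(bag, ["query_*", "host_bash"], ["query_devices"]))
# ['query_promql', 'query_logql', 'query_knowledge', 'host_bash']
```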

Where personas live

The runtime walks two roots:

The image-baked root: /app/agents/ inside the manager container. Contains the six shipped personas. It is read-only inside the image, so it survives container restarts but cannot hold custom personas.
The marketplace root: /var/lib/ongrid/agents/ (mounted volume). User-authored personas land here via the Settings → Agents UI or via the marketplace install path.

Both are merged into the same AgentRegistry. On a name collision the loader records a warning and keeps the first load. To override a built-in persona, save your version with the same name via the Settings UI — AgentRegistry.Replace upserts in place.
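
The difference between loading and replacing can be modeled as below. This is a simplified sketch, not the real AgentRegistry: the class `RegistrySketch` and its method names are hypothetical, and loading the image-baked root before the marketplace root is an assumption made only for the example.

```python
class RegistrySketch:
    """Toy model: load() keeps the first entry, replace() upserts."""

    def __init__(self):
        self._personas = {}
        self._warnings = []

    def load(self, name, persona, source):
        """Register a persona found on disk; first load wins on collision."""
        if name in self._personas:
            self._warnings.append(
                f"{source}: duplicate persona {name!r}, keeping first load"
            )
            return False
        self._personas[name] = persona
        return True

    def replace(self, name, persona):
        """Upsert in place, as a save from the Settings UI is described to do."""
        self._personas[name] = persona

    def get(self, name):
        return self._personas.get(name)

    def warnings(self):
        return list(self._warnings)


reg = RegistrySketch()
# Assumed order: image-baked root first, marketplace root second.
reg.load("specialist-disk", {"origin": "built-in"}, "/app/agents/specialist-disk.md")
reg.load("specialist-disk", {"origin": "custom"}, "/var/lib/ongrid/agents/specialist-disk.md")
print(reg.get("specialist-disk"))  # {'origin': 'built-in'}
reg.replace("specialist-disk", {"origin": "custom"})
print(reg.get("specialist-disk"))  # {'origin': 'custom'}
```

In this model, dropping a same-named file into a root only produces a warning, while an explicit replace takes effect, which matches the advice to override built-ins through the Settings UI.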

Where to start

The fastest path is to copy agents/specialist-disk.md into your editor, rename, and adjust the tool bag. The shape carries all the conventions (KB-first, 4-step recipe, output format) that work well with the coordinator.

Hot reload vs restart

| Action | Hot-reloadable? | How |
| --- | --- | --- |
| Edit persona body (system prompt) | Yes | Settings → Agents → Save |
| Change tool whitelist | Yes | Same. Filter is applied per spawn. |
| Change `model` / `max_turns` | Yes | Same. New spawns pick up the new values. |
| Add a new persona | Yes | Settings → Agents → New, or drop file + Reload |
| Delete a persona | Yes | Settings → Agents → Delete, or remove file + Reload |
| Override a built-in (same `name`) | Yes | `Replace` upserts; coordinator uses new. |
| Change which tools exist in the bag | No | BaseTool registration is binary-side. |
| Add a new BaseTool | No | Requires code change + manager restart. |
| Change `default_locale` semantics | No | That's runtime code. |

The lock around AgentRegistry is a sync.RWMutex. An in-flight coordinator turn that already fetched a persona pointer keeps using the snapshot; the next coordinator turn sees the new persona.
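
The snapshot behavior follows from swapping references rather than mutating personas in place. A minimal Python illustration is below; `PersonaStore` is a hypothetical name, and a plain `threading.Lock` stands in for the read-write mutex because Python's standard library does not ship one.

```python
import threading


class PersonaStore:
    """Toy store: readers get a reference, writers swap the reference."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for a read-write mutex
        self._personas = {}

    def get(self, name):
        with self._lock:
            return self._personas.get(name)

    def replace(self, name, persona):
        with self._lock:
            # Swap the entry; the previous object is never modified.
            self._personas[name] = persona


store = PersonaStore()
store.replace("specialist-disk", {"max_turns": 12})

in_flight = store.get("specialist-disk")   # a turn that already fetched it
store.replace("specialist-disk", {"max_turns": 20})
next_turn = store.get("specialist-disk")   # the following turn

print(in_flight["max_turns"], next_turn["max_turns"])  # 12 20
```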

Debugging

"Coordinator never dispatches to my persona"

Check the agent catalog in the coordinator's system prompt (the manager logs the rendered prompt at startup with --log-level=debug). Your persona should appear with its description and the first line of when_to_use.
If the catalog is missing it: the loader recorded a warning. Check AgentRegistry.Warnings() via the API (GET /api/v1/agents/warnings) or look for chatruntime: parse <path> lines in the manager log.
If the catalog has it but the LLM doesn't pick it: tighten when_to_use. Lead with a concrete trigger pattern; the LLM is prompted to read the first line as the matching hint.

"Worker spawns but immediately fails"

Common causes:

A whitelisted tool isn't in the bag. The runtime filters and silently drops anything not present; the worker can't call what isn't there. Check GET /api/v1/skills for the active bag.
The model identifier is wrong. If the anthropic provider is not configured, the chat model resolver maps anthropic/<x> to default_provider. Either set default_provider to anthropic in Settings → LLM, or pin a concrete provider+model in the persona.
max_turns is too low. A worker that runs out of turns before producing a final assistant message returns as failed. Bump to 15+ for any non-trivial persona.
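
Because missing tools are dropped silently, it can help to diff the persona's whitelist against the active bag yourself. The helper below is a hypothetical sketch: it takes the bag as a plain list of names, since the response shape of GET /api/v1/skills is not described on this page, and it assumes whitelist entries may be glob patterns.

```python
from fnmatch import fnmatchcase


def dropped_whitelist_entries(tools, bag):
    """Return whitelist entries that match no tool in the active bag."""
    return [
        pattern for pattern in tools
        if not any(fnmatchcase(name, pattern) for name in bag)
    ]


persona_tools = ["query_knowledge", "query_clickhouse_system", "host_bash"]
active_bag = ["query_knowledge", "query_promql", "host_bash"]
print(dropped_whitelist_entries(persona_tools, active_bag))
# ['query_clickhouse_system']
```

An entry in the result usually means the custom BaseTool was never registered, or its name in the persona has a typo.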

"Worker returns OK but the output is garbage"

The persona body is your system prompt. Tighten:

Start with Step 0: a single forced KB call. Anchors the worker.
Specify the output format verbatim in the body. The coordinator parses on this format.
Use critical_reminder for hard constraints (read-only, no PII, output language). It's wrapped in <critical-reminder> AND re-injected per-turn — the LLM sees it on every iteration.

Testing your persona

Two integration points:

From the chat surface

Open /chat, ask a question that matches your persona's when_to_use. Watch the SPA — if the coordinator dispatches, an "Agent tile" appears with your persona's name + the AgentTool description. Click for the worker's transcript.

From the API

```bash
curl -X POST http://localhost:8080/api/v1/chat \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<the question that should trigger your persona>"}'
```

The streaming response surfaces:

text deltas — the coordinator's prose.
agent_tile envelopes — every AgentTool dispatch.
task_notification envelopes — worker completion.

If your persona is dispatched, the matching agent_tile.persona will be your name.
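
To automate that check, something like the following could scan a captured stream for dispatches. The wire format is not specified on this page, so the sketch makes loud assumptions: one JSON object per line (optionally with a server-sent-events `data:` prefix), a `type` key naming the envelope, and a top-level `persona` key on agent_tile envelopes. Adjust it to whatever the stream actually carries.

```python
import json


def dispatched_personas(lines):
    """Collect persona names from agent_tile envelopes in a captured stream.

    Assumed format: one JSON object per line, e.g.
    {"type": "agent_tile", "persona": "specialist-clickhouse"}
    """
    personas = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):          # tolerate an SSE-style prefix
            line = line[len("data:"):].strip()
        if not line:
            continue
        try:
            envelope = json.loads(line)
        except json.JSONDecodeError:
            continue                          # skip keep-alives and non-JSON
        if isinstance(envelope, dict) and envelope.get("type") == "agent_tile":
            personas.append(envelope.get("persona"))
    return personas


captured = [
    'data: {"type": "text", "delta": "Checking ClickHouse..."}',
    'data: {"type": "agent_tile", "persona": "specialist-clickhouse"}',
    '',
    'data: {"type": "task_notification", "status": "ok"}',
]
print(dispatched_personas(captured))  # ['specialist-clickhouse']
```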

When to NOT write a custom persona

The task is a 1-tool answer. Don't wrap "query my custom Prometheus" in a persona. Register a custom BaseTool and let the coordinator call it.
The task is one-off. Personas are for repeated patterns. For a one-shot investigation, just ask the coordinator directly.
The task needs to call across all 5 specialists. That's exactly what the coordinator is for; don't write a meta-specialist to recreate the coordinator's behavior.

A good rule: write a persona when the same shape of question recurs, the answer requires 5+ tool calls, and the tool bag is narrower than what the coordinator carries.

Sharing personas

Drop the .md file in your ops repo. Mount it into the manager container under /var/lib/ongrid/agents/. The registry picks it up at startup (or on a Reload call).
For org-wide rollout, ship through the skill marketplace — the marketplace install bundles personas + skills together and triggers a Reload automatically.

Related

Agents overview — the full picture.
Persona format — every field, no prose.
Skill manifest — companion format for shipping personas alongside custom skills.
