Skip to content

Authoring custom agents

Custom personas extend Ongrid with your own specialists. They live on disk as <name>.md files with YAML frontmatter, exactly like the built-ins — same loader, same registry, same dispatch path. Write one, mount it, and the coordinator can dispatch to it.

This page is the contract.

File layout

A persona is a single Markdown file with YAML frontmatter:

markdown
---
name: specialist-clickhouse
description: ClickHouse 查询性能 / 分区健康 / mutation backlog 专家
when_to_use: |
  When the user asks about:
    - ClickHouse query plan / scan / shuffle slow
    - Partition merges / mutation backlog
    - Replication lag between replicas
    - System.parts / system.mutations inspection

tools:
  - query_knowledge
  - query_clickhouse_system   # custom BaseTool you registered
  - query_promql              # for clickhouse_* metrics
  - host_bash
  - get_edge_summary

disallowed_tools:
  - host_restart_service

permission_mode: read-only
max_turns: 12
model: anthropic/claude-sonnet-4-7

critical_reminder: |
  You're read-only. Never propose direct ALTER / OPTIMIZE without
  citing the system.mutations evidence first. Always check the
  replication lag before recommending any maintenance command.
---

# specialist-clickhouse

You are Ongrid's ClickHouse specialist.

## Step 0: knowledge base check (mandatory)

Before any inspection, call `query_knowledge` once with a natural-
language description of the question. Hit (score >= 0.6) → follow
the playbook. Cite as `(参考 KB: <title>)` in your final reply.

## Working style

1. Start with `query_clickhouse_system` for system.parts /
   system.mutations / system.replication_queue. One call, broad
   snapshot.
2. If a specific table is suspect, drill into `system.parts` for
   that table with bytes / rows / merge_state.
3. For replication: `system.replication_queue` for failures,
   `clickhouse_replica_delay_seconds` PromQL series for trend.
4. For query perf: `system.query_log` with `query_duration_ms`
   sort + `read_rows` to find the heavy query.

## Output

- 现状 (1-2 sentences): which table, which metric, what's wrong.
- 证据 (2-3 lines): system.* row excerpts + PromQL value.
- 建议 (1 line): observation only, or "recommend dispatching
  specialist-ops to run OPTIMIZE/ALTER under reviewer".

Frontmatter reference

The fields the parser understands (ParseAgentMd):

FieldRequiredTypePurpose
nameyesstringSpawn key. Must be unique. snake_case or kebab-case.
descriptionyesstringSurfaced in the coordinator's agent catalog.
when_to_useyesstringFirst line surfaces in catalog. Strict-required because the coordinator can't pick a persona without it.
toolsno[]stringWhitelist of BaseTool names. Empty = inherit nothing.
disallowed_toolsno[]stringBlacklist. Wins over whitelist; supports wildcards (*_skill).
permission_modenostringread-only / mutating-with-confirm / dual-sign-required. Today informational; future versions may auto-wire decorators based on this.
max_turnsnointHard ReAct loop cap. Default 15.
modelnostringLLM identifier (e.g. anthropic/claude-sonnet-4-7). Falls back to org default.
critical_remindernostringWrapped in <critical-reminder>...</critical-reminder> in the system prompt. Also re-injected per-turn by the graph layer.
initial_promptnostringPrepended to the worker's first user turn. Rarely used.
backgroundnobooltrue = async spawn (UI doesn't block). Used by reviewer.
omit_claude_mdnoboolSuppress the runtime's base prompt for this persona.
metadatanomapFree-form. metadata.ongrid.{scope, min_ongrid_version} is read by the registry; everything else is pass-through.

Unknown fields are preserved into Agent.UnknownFields so future Claude Code persona format additions (effort, isolation, mcp_servers, hooks, …) don't break loading.

tools vs disallowed_tools

Whitelist + blacklist, black wins. So:

yaml
tools: ["query_*", "host_bash"]    # everything starting with query_, plus bash
disallowed_tools: ["query_devices"] # but not this one

leaves query_promql, query_logql, query_traceql, query_knowledge, … and host_bash, minus query_devices.

Wildcards: *_skill matches every tool name ending in _skill. This is how the reviewer blocks all skill executions in one line.

The AgentTool is also stripped from any worker's bag automatically — workers cannot spawn workers. You don't need to list it under disallowed_tools.

Where personas live

The runtime walks two roots:

  1. The image-baked root/app/agents/ inside the manager container. Contains the six shipped personas. Read-only inside the image; survives container restart but not custom code.
  2. The marketplace root/var/lib/ongrid/agents/ (mounted volume). User-authored personas land here via the Settings → Agents UI or via the marketplace install path.

Both are merged into the same AgentRegistry. On a name collision the loader records a warning and keeps the first load. To override a built-in persona, save your version with the same name via the Settings UI — AgentRegistry.Replace upserts in place.

Where to start

The fastest path is to copy agents/specialist-disk.md into your editor, rename, and adjust the tool bag. The shape carries all the conventions (KB-first, 4-step recipe, output format) that work well with the coordinator.

Hot reload vs restart

ActionHot-reloadable?How
Edit persona body (system prompt)YesSettings → Agents → Save
Change tool whitelistYesSame. Filter is applied per spawn.
Change model / max_turnsYesSame. New spawns pick up the new values.
Add a new personaYesSettings → Agents → New, or drop file + Reload
Delete a personaYesSettings → Agents → Delete, or remove file + Reload
Override a built-in (same name)YesReplace upserts; coordinator uses new.
Change which tools exist in the bagNoBaseTool registration is binary-side.
Add a new BaseToolNoRequires code change + manager restart.
Change default_locale semanticsNoThat's runtime code.

The lock around AgentRegistry is a sync.RWMutex. An in-flight coordinator turn that already fetched a persona pointer keeps using the snapshot; the next coordinator turn sees the new persona.

Debugging

"Coordinator never dispatches to my persona"

  1. Check the agent catalog in the coordinator's system prompt (the manager logs the rendered prompt at startup with --log-level=debug). Your persona should appear with its description and the first line of when_to_use.
  2. If the catalog is missing it: the loader recorded a warning. Check AgentRegistry.Warnings() via the API (GET /api/v1/agents/warnings) or look for chatruntime: parse <path> lines in the manager log.
  3. If the catalog has it but the LLM doesn't pick it: tighten when_to_use. Lead with a concrete trigger pattern; the LLM is prompted to read the first line as the matching hint.

"Worker spawns but immediately fails"

Common causes:

  • A whitelisted tool isn't in the bag. The runtime filters and silently drops anything not present; the worker can't call what isn't there. Check GET /api/v1/skills for the active bag.
  • The model identifier is wrong. The chat model resolver maps anthropic/<x> to default_provider if not configured. Set default_provider to anthropic in Settings → LLM or pin a concrete provider+model in the persona.
  • max_turns is too low. A worker that runs out of turns before producing a final assistant message returns as failed. Bump to 15+ for any non-trivial persona.

"Worker returns OK but the output is garbage"

The persona body is your system prompt. Tighten:

  • Start with Step 0: a single forced KB call. Anchors the worker.
  • Specify the output format verbatim in the body. The coordinator parses on this format.
  • Use critical_reminder for hard constraints (read-only, no PII, output language). It's wrapped in <critical-reminder> AND re-injected per-turn — the LLM sees it on every iteration.

Testing your persona

Two integration points:

From the chat surface

Open /chat, ask a question that matches your persona's when_to_use. Watch the SPA — if the coordinator dispatches, an "Agent tile" appears with your persona's name + the AgentTool description. Click for the worker's transcript.

From the API

bash
curl -X POST http://localhost:8080/api/v1/chat \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<the question that should trigger your persona>"}'

The streaming response surfaces:

  • text deltas — the coordinator's prose.
  • agent_tile envelopes — every AgentTool dispatch.
  • task_notification envelopes — worker completion.

If your persona is dispatched, the matching agent_tile.persona will be your name.

When to NOT write a custom persona

  • The task is a 1-tool answer. Don't wrap "query my custom Prometheus" in a persona. Register a custom BaseTool and let the coordinator call it.
  • The task is one-off. Personas are for repeated patterns. For a one-shot investigation, just ask the coordinator directly.
  • The task needs to call across all 5 specialists. That's exactly what the coordinator is for; don't write a meta-specialist to recreate the coordinator's behavior.

A good rule: write a persona when the same shape of question recurs, the answer requires 5+ tool calls, and the tool bag is narrower than what the coordinator carries.

Sharing personas

  • Drop the .md file in your ops repo. Mount it into the manager container under /var/lib/ongrid/agents/. The registry picks it up at startup (or on a Reload call).
  • For org-wide rollout, ship through the skill marketplace — the marketplace install bundles personas + skills together and triggers a Reload automatically.