编写自定义 agent

自定义 persona 让你用自己的 specialist 扩展 Ongrid。它们以 <name>.md 文件加 YAML frontmatter 的形式存在磁盘上，跟内置 persona 长得一模一样 —— 同一个加载器、同一个 registry、同一条派发链路。写一个、挂载它，coordinator 就能派发到它。

这一页是契约。

文件结构

一个 persona 就是一个带 YAML frontmatter 的 Markdown 文件：

markdown

---
name: specialist-clickhouse
description: ClickHouse 查询性能 / 分区健康 / mutation backlog 专家
when_to_use: |
  When the user asks about:
    - ClickHouse query plan / scan / shuffle slow
    - Partition merges / mutation backlog
    - Replication lag between replicas
    - System.parts / system.mutations inspection

tools:
  - query_knowledge
  - query_clickhouse_system   # custom BaseTool you registered
  - query_promql              # for clickhouse_* metrics
  - host_bash
  - get_edge_summary

disallowed_tools:
  - host_restart_service

permission_mode: read-only
max_turns: 12
model: anthropic/claude-sonnet-4-7

critical_reminder: |
  You're read-only. Never propose direct ALTER / OPTIMIZE without
  citing the system.mutations evidence first. Always check the
  replication lag before recommending any maintenance command.
---

# specialist-clickhouse

You are Ongrid's ClickHouse specialist.

## Step 0: knowledge base check (mandatory)

Before any inspection, call `query_knowledge` once with a natural-
language description of the question. Hit (score >= 0.6) → follow
the playbook. Cite as `(参考 KB: <title>)` in your final reply.

## Working style

1. Start with `query_clickhouse_system` for system.parts /
   system.mutations / system.replication_queue. One call, broad
   snapshot.
2. If a specific table is suspect, drill into `system.parts` for
   that table with bytes / rows / merge_state.
3. For replication: `system.replication_queue` for failures,
   `clickhouse_replica_delay_seconds` PromQL series for trend.
4. For query perf: `system.query_log` with `query_duration_ms`
   sort + `read_rows` to find the heavy query.

## Output

- 现状 (1-2 sentences): which table, which metric, what's wrong.
- 证据 (2-3 lines): system.* row excerpts + PromQL value.
- 建议 (1 line): observation only, or "recommend dispatching
  specialist-ops to run OPTIMIZE/ALTER under reviewer".

Frontmatter 字段参考

解析器认识的字段（ParseAgentMd）：

字段	必填	类型	用途
`name`	是	string	派发 key。必须唯一。snake_case 或 kebab-case。
`description`	是	string	出现在 coordinator 的 agent 目录里。
`when_to_use`	是	string	首行进目录。强制必填，没它 coordinator 选不了 persona。
`tools`	否	[]string	BaseTool 名字白名单。空 = 什么都不继承。
`disallowed_tools`	否	[]string	黑名单。压过白名单；支持通配符（`*_skill`）。
`permission_mode`	否	string	`read-only` / `mutating-with-confirm` / `dual-sign-required`。今天只是声明用；未来版本可能会基于它自动接装饰器。
`max_turns`	否	int	ReAct 循环硬上限。默认 15。
`model`	否	string	LLM 标识（例如 `anthropic/claude-sonnet-4-7`）。空时回退到组织默认。
`critical_reminder`	否	string	在 system prompt 里被 `<critical-reminder>...</critical-reminder>` 包起来。graph 层每轮还会重新注入。
`initial_prompt`	否	string	拼到 worker 第一条 user 消息前。极少用。
`background`	否	bool	`true` = 异步派发（UI 不阻塞）。`reviewer` 用这个。
`omit_claude_md`	否	bool	对这个 persona 抑制掉运行时的基础 prompt。
`metadata`	否	map	自由形态。`metadata.ongrid.{scope, min_ongrid_version}` 被 registry 读取；其余原样透传。

未知字段会被保留到 Agent.UnknownFields，所以未来 Claude Code persona 格式新增的字段（effort、isolation、mcp_servers、hooks……）不会破坏加载。

`tools` 和 `disallowed_tools`

白名单 + 黑名单，黑赢。所以：

yaml

tools: ["query_*", "host_bash"]    # everything starting with query_, plus bash
disallowed_tools: ["query_devices"] # but not this one

最后留下 query_promql、query_logql、query_traceql、 query_knowledge……以及 host_bash，去掉 query_devices。

通配符：*_skill 匹配每一个以 _skill 结尾的 tool 名。这就是 reviewer 一行屏蔽掉所有 skill 执行的办法。

AgentTool 也会自动从每个 worker 的工具包里剥掉 —— worker 不能再派 worker。你不需要把它写到 disallowed_tools 里。

persona 在哪儿

运行时会扫两个根：

镜像内置根 —— manager 容器内的 /app/agents/。装了出厂的 6 个 persona。镜像内只读；容器重启后还在，但加不了自定义代码。
marketplace 根 —— /var/lib/ongrid/agents/（挂载卷）。用户写的 persona 通过 Settings → Agents UI 或 marketplace 安装流程落到这里。

两者会合并进同一个 AgentRegistry。名字撞了，加载器记一条 warning，保留先加载的那个。要覆盖内置 persona，通过 Settings UI 用同样的 name 保存你自己的版本 —— AgentRegistry.Replace 会原地 upsert。

从哪里入手

最快的路径：把 agents/specialist-disk.md 拷到你的编辑器，改名、调工具包。骨架里带齐了所有约定（KB 先查、4 步法、输出格式），跟 coordinator 配合得很好。

热加载 vs 重启

操作	能热加载？	怎么做
改 persona 正文（system prompt）	可以	Settings → Agents → Save
改 tool 白名单	可以	同上。过滤器每次派发时应用。
改 `model` / `max_turns`	可以	同上。新派发会拿到新值。
加新 persona	可以	Settings → Agents → New，或者放文件 + Reload
删 persona	可以	Settings → Agents → Delete，或者删文件 + Reload
覆盖内置（同 `name`）	可以	`Replace` upsert；coordinator 用新的。
改工具包里有哪些 tool	不行	BaseTool 注册在二进制侧。
加新 BaseTool	不行	需要改代码 + manager 重启。
改 `default_locale` 语义	不行	那是运行时代码。

AgentRegistry 的锁是 sync.RWMutex。正在跑的 coordinator turn 如果已经取到了 persona 指针，就继续用那份快照；下一个 coordinator turn 看到新 persona。

调试

"coordinator 从不派发到我的 persona"

看 coordinator system prompt 里的 agent 目录（manager 启动时用 --log-level=debug 会把渲染后的 prompt 打出来）。你的 persona 应该出现，带 description 和 when_to_use 的第一行。
如果目录里没有：加载器记了 warning。通过 API （GET /api/v1/agents/warnings）查 AgentRegistry.Warnings()，或者在 manager 日志里找 chatruntime: parse <path> 这种行。
如果目录里有但 LLM 不选它：把 when_to_use 写紧。开头放一个具体的触发模式；LLM 被提示把第一行当成匹配 hint 读。

"worker 派出来但马上失败"

常见原因：

白名单里的 tool 不在工具包里。 运行时会过滤，把不存在的悄悄丢掉； worker 调不了不存在的东西。查 GET /api/v1/skills 看当前工具包。
模型标识写错了。 chat 模型 resolver 在没配置时会把 anthropic/<x> 映射到 default_provider。在 Settings → LLM 里把 default_provider 设成 anthropic，或者在 persona 里钉死一个具体的 provider+model。
max_turns 太低了。 worker 在写出最终 assistant 消息前用完了 turn，就会返回 failed。任何非平凡的 persona 都至少给 15。

"worker 返回 OK 但输出是垃圾"

persona 正文就是你的 system prompt。写紧：

用 Step 0 开头：强制调一次 KB。把 worker 锚定住。
在正文里逐字指定输出格式。coordinator 按这个格式解析。
用 critical_reminder 写硬约束（read-only、no PII、输出语言）。它会被 <critical-reminder> 包起来，并且每轮都重新注入 —— LLM 每次迭代都能看到。

测试你的 persona

两个接入点：

从 chat 入口

打开 /chat，问一个匹配你 persona when_to_use 的问题。盯着 SPA —— 如果 coordinator 派发了，就会出现一个 "Agent tile"，带你 persona 的 name + AgentTool 的 description。点开看 worker 的对话记录。

从 API

bash

curl -X POST http://localhost:8080/api/v1/chat \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<the question that should trigger your persona>"}'

流式响应里会出现：

text 增量 —— coordinator 的正文。
agent_tile 信封 —— 每次 AgentTool 派发。
task_notification 信封 —— worker 完成。

如果你的 persona 被派发了，匹配的 agent_tile.persona 就是你的 name。

什么时候不写自定义 persona

任务一个 tool 就够了。 别用 persona 包装"查我自己的 Prometheus"。注册一个自定义 BaseTool，让 coordinator 直接调。
任务是一次性的。 persona 是给重复模式准备的。一次性调查直接问 coordinator 就行。
任务需要跨调全部 5 个 specialist。 这正是 coordinator 的活；别写一个元 specialist 来重新实现 coordinator 的行为。

好规则：当同样形态的问题反复出现、答案需要 5+ 次 tool 调用、并且工具包比 coordinator 带的更窄时，写一个 persona。

分享 persona

把 .md 文件放到你的 ops 仓库里。挂到 manager 容器的 /var/lib/ongrid/agents/ 下。registry 启动时（或 Reload 调用时）会捡到。
全组织铺开走 skill marketplace —— marketplace 安装会把 persona + skill 一起打包，并自动触发一次 Reload。

编写自定义 agent ​

文件结构 ​

Frontmatter 字段参考 ​

tools 和 disallowed_tools ​

persona 在哪儿 ​

热加载 vs 重启 ​

调试 ​

"coordinator 从不派发到我的 persona" ​

"worker 派出来但马上失败" ​

"worker 返回 OK 但输出是垃圾" ​

测试你的 persona ​

从 chat 入口 ​

从 API ​

什么时候不写自定义 persona ​

分享 persona ​

相关 ​