Writing custom agents

A custom persona extends Ongrid with your own specialist. It lives on disk as a <name>.md file with YAML frontmatter, exactly like the built-in personas: same loader, same registry, same dispatch path. Write one, mount it, and the coordinator can dispatch to it.

This page is the contract.

File layout

A persona is a single Markdown file with YAML frontmatter:

```markdown
---
name: specialist-clickhouse
description: ClickHouse 查询性能 / 分区健康 / mutation backlog 专家
when_to_use: |
  When the user asks about:
    - ClickHouse query plan / scan / shuffle slow
    - Partition merges / mutation backlog
    - Replication lag between replicas
    - system.parts / system.mutations inspection

tools:
  - query_knowledge
  - query_clickhouse_system   # custom BaseTool you registered
  - query_promql              # for clickhouse_* metrics
  - host_bash
  - get_edge_summary

disallowed_tools:
  - host_restart_service

permission_mode: read-only
max_turns: 12
model: anthropic/claude-sonnet-4-7

critical_reminder: |
  You're read-only. Never propose direct ALTER / OPTIMIZE without
  citing the system.mutations evidence first. Always check the
  replication lag before recommending any maintenance command.
---

# specialist-clickhouse

You are Ongrid's ClickHouse specialist.

## Step 0: knowledge base check (mandatory)

Before any inspection, call `query_knowledge` once with a natural-
language description of the question. Hit (score >= 0.6) → follow
the playbook. Cite as `(参考 KB: <title>)` in your final reply.

## Working style

1. Start with `query_clickhouse_system` for system.parts /
   system.mutations / system.replication_queue. One call, broad
   snapshot.
2. If a specific table is suspect, drill into `system.parts` for
   that table with bytes / rows / merge_state.
3. For replication: `system.replication_queue` for failures,
   `clickhouse_replica_delay_seconds` PromQL series for trend.
4. For query perf: `system.query_log` with `query_duration_ms`
   sort + `read_rows` to find the heavy query.

## Output

- 现状 (1-2 sentences): which table, which metric, what's wrong.
- 证据 (2-3 lines): system.* row excerpts + PromQL value.
- 建议 (1 line): observation only, or "recommend dispatching
  specialist-ops to run OPTIMIZE/ALTER under reviewer".
```

Frontmatter reference

Fields the parser (ParseAgentMd) understands:

Field	Required	Type	Purpose
`name`	Yes	string	Spawn key. Must be unique. snake_case or kebab-case.
`description`	Yes	string	Surfaced in the coordinator's agent catalog.
`when_to_use`	Yes	string	The first line is surfaced in the catalog. Strictly required, because without it the coordinator cannot pick the persona.
`tools`	No	[]string	Whitelist of BaseTool names. Empty = inherits nothing.
`disallowed_tools`	No	[]string	Blacklist. Takes precedence over the whitelist; supports wildcards (`*_skill`).
`permission_mode`	No	string	`read-only` / `mutating-with-confirm` / `dual-sign-required`. Informational today; future versions may auto-wire decorators based on it.
`max_turns`	No	int	Hard cap on the ReAct loop. Default 15.
`model`	No	string	LLM identifier (e.g. `anthropic/claude-sonnet-4-7`). Falls back to the organization default.
`critical_reminder`	No	string	Wrapped as `<critical-reminder>...</critical-reminder>` in the system prompt. The graph layer also re-injects it every turn.
`initial_prompt`	No	string	Prepended to the worker's first user turn. Rarely used.
`background`	No	bool	`true` = spawn asynchronously (does not block the UI). Used by `reviewer`.
`omit_claude_md`	No	bool	Suppresses the runtime's base prompt for this persona.
`metadata`	No	map	Free-form. The registry reads `metadata.ongrid.{scope, min_ongrid_version}`; everything else passes through.

Unknown fields are preserved in Agent.UnknownFields, so future additions to the Claude Code persona format (effort, isolation, mcp_servers, hooks, …) do not break loading.
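To make the known/unknown split concrete, here is a minimal Python sketch. It is not Ongrid's parser: `split_frontmatter`, `KNOWN_FIELDS`, and `REQUIRED_FIELDS` are hypothetical names, and the input is assumed to be frontmatter that some YAML library has already parsed into a dict.

```python
# Hypothetical sketch; the field names come from the table above, everything else is assumed.
KNOWN_FIELDS = {
    "name", "description", "when_to_use", "tools", "disallowed_tools",
    "permission_mode", "max_turns", "model", "critical_reminder",
    "initial_prompt", "background", "omit_claude_md", "metadata",
}
REQUIRED_FIELDS = ("name", "description", "when_to_use")


def split_frontmatter(raw: dict) -> tuple[dict, dict]:
    """Return (known, unknown) fields; reject frontmatter missing a required field."""
    missing = [field for field in REQUIRED_FIELDS if not raw.get(field)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    known = {k: v for k, v in raw.items() if k in KNOWN_FIELDS}
    # Unknown keys are kept rather than rejected, so newer formats still load.
    unknown = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    known.setdefault("max_turns", 15)  # documented default
    return known, unknown


known, unknown = split_frontmatter({
    "name": "specialist-clickhouse",
    "description": "ClickHouse specialist",
    "when_to_use": "ClickHouse query plan / scan / shuffle slow",
    "effort": "high",  # not in the table, so it is preserved as unknown
})
print(known["max_turns"], sorted(unknown))  # 15 ['effort']
```

Keeping unrecognized keys in a separate bucket, instead of failing on them, is what lets a file written for a newer format load on an older runtime.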

`tools` vs `disallowed_tools`

Whitelist plus blacklist, and the blacklist wins. So:

```yaml
tools: ["query_*", "host_bash"]    # everything starting with query_, plus bash
disallowed_tools: ["query_devices"] # but not this one
```

This keeps query_promql, query_logql, query_traceql, query_knowledge, … and host_bash, and excludes query_devices.

Wildcards: *_skill matches every tool name ending in _skill. This is how reviewer blocks all skill execution in one line.

AgentTool is also removed automatically from every worker's tool bag: workers cannot spawn workers. You do not need to list it under disallowed_tools.
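The filtering rules in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the runtime's code: `filter_tools` is a hypothetical helper, and wildcard matching is assumed to behave like shell-style globbing (`fnmatch`), which the source does not confirm.

```python
from fnmatch import fnmatchcase


def filter_tools(bag, tools, disallowed_tools):
    """Return the tools from `bag` that a worker would keep (hypothetical helper)."""
    kept = []
    for name in bag:
        if name == "AgentTool":
            continue  # always stripped: workers cannot spawn workers
        if not any(fnmatchcase(name, pattern) for pattern in tools):
            continue  # not whitelisted; an empty whitelist keeps nothing
        if any(fnmatchcase(name, pattern) for pattern in disallowed_tools):
            continue  # the blacklist wins over the whitelist
        kept.append(name)
    return kept


# A made-up tool bag for illustration.
bag = ["query_promql", "query_logql", "query_devices", "host_bash",
       "host_restart_service", "deploy_skill", "AgentTool"]

print(filter_tools(bag, ["query_*", "host_bash"], ["query_devices"]))
# ['query_promql', 'query_logql', 'host_bash']
print(filter_tools(bag, ["*"], ["*_skill"]))
# ['query_promql', 'query_logql', 'query_devices', 'host_bash', 'host_restart_service']
```

Because the loop iterates over the bag rather than over the whitelist, a whitelisted name that is not registered simply never appears in the result.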

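Persona locations

The runtime walks two roots:

Image-baked root: /app/agents/ inside the manager container. Contains the 6 shipped personas. Read-only inside the image; it survives container restarts, but it is not where custom personas belong.
Marketplace root: /var/lib/ongrid/agents/ (a mounted volume). User-authored personas land here through the Settings → Agents UI or the marketplace install path.

Both are merged into the same AgentRegistry. On a name collision the loader logs a warning and keeps the first one loaded. To override a built-in persona, save your own version under the same name in the Settings UI; AgentRegistry.Replace upserts it in place.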
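The difference between loading and replacing can be sketched as follows. The class below is a toy stand-in, not the real AgentRegistry: method names are hypothetical, personas are plain dicts, and the order in which the two roots are walked is an assumption (the image-baked root first).

```python
class AgentRegistry:
    """Toy registry: first load wins, replace() upserts (hypothetical API)."""

    def __init__(self):
        self._agents = {}
        self._warnings = []

    def load(self, name, persona, source):
        # Loader path: a name collision is logged and the earlier persona stays.
        if name in self._agents:
            self._warnings.append(
                f"duplicate persona {name!r} from {source}; keeping first load"
            )
            return False
        self._agents[name] = persona
        return True

    def replace(self, name, persona):
        # Settings UI path: overwrite in place, or insert if absent.
        self._agents[name] = persona

    def get(self, name):
        return self._agents.get(name)

    def warnings(self):
        return list(self._warnings)


registry = AgentRegistry()
registry.load("specialist-disk", {"max_turns": 15}, source="/app/agents")
registry.load("specialist-disk", {"max_turns": 30}, source="/var/lib/ongrid/agents")
print(registry.get("specialist-disk"), len(registry.warnings()))  # {'max_turns': 15} 1

registry.replace("specialist-disk", {"max_turns": 30})
print(registry.get("specialist-disk"))  # {'max_turns': 30}
```

In this sketch, dropping a same-named file into the second root does not override the built-in; only the explicit replace does.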

Where to start

The fastest path is to copy agents/specialist-disk.md into your editor, rename it, and adjust the tool bag. Its shape carries every convention that works well with the coordinator (KB-first, a 4-step recipe, an output format).

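Hot reload vs restart

Action	Hot-reloadable?	How
Edit the persona body (system prompt)	Yes	Settings → Agents → Save
Change the tool whitelist	Yes	Same. The filter is applied per spawn.
Change `model` / `max_turns`	Yes	Same. New spawns pick up the new values.
Add a new persona	Yes	Settings → Agents → New, or file + Reload
Delete a persona	Yes	Settings → Agents → Delete, or remove the file + Reload
Override a built-in (same `name`)	Yes	`Replace` upsert; the coordinator uses the new one.
Change which tools exist in the bag	No	BaseTool registration lives in the binary.
Add a new BaseTool	No	Requires a code change plus a manager restart.
Change `default_locale` semantics	No	That is runtime code.

The lock around AgentRegistry is a sync.RWMutex. An in-flight coordinator turn that has already fetched a persona pointer keeps using its snapshot; the next coordinator turn sees the new persona.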
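The snapshot behavior can be illustrated with a small Python sketch. It is only an analogy: `threading.Lock` stands in for sync.RWMutex (Python's standard library has no reader/writer lock), and the `Persona` and `Registry` types are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    # Immutable, so a reference handed out earlier can never change underneath its holder.
    name: str
    max_turns: int


class Registry:
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for sync.RWMutex
        self._agents = {}

    def replace(self, persona):
        with self._lock:
            self._agents[persona.name] = persona  # swap the reference, never mutate

    def get(self, name):
        with self._lock:
            return self._agents[name]


registry = Registry()
registry.replace(Persona("specialist-clickhouse", max_turns=12))

in_flight = registry.get("specialist-clickhouse")                  # turn already running
registry.replace(Persona("specialist-clickhouse", max_turns=20))   # hot reload
next_turn = registry.get("specialist-clickhouse")                  # turn that starts afterwards

print(in_flight.max_turns, next_turn.max_turns)  # 12 20
```

The lock only protects the map; isolation for the in-flight turn comes from replacing whole persona objects instead of editing them in place.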

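Debugging

"The coordinator never dispatches to my persona"

Check the agent catalog in the coordinator's system prompt (when started with --log-level=debug, the manager logs the rendered prompt at startup). Your persona should appear with its description and the first line of when_to_use.
If it is not in the catalog, the loader logged a warning. Check AgentRegistry.Warnings() through the API (GET /api/v1/agents/warnings), or look for chatruntime: parse <path> lines in the manager log.
If it is in the catalog but the LLM does not pick it, tighten when_to_use. Lead with concrete trigger patterns; the LLM is prompted to read the first line as a matching hint.

"The worker spawns but fails immediately"

Common causes:

A whitelisted tool is not in the bag. The runtime filters the whitelist and silently drops entries that do not exist, so the worker cannot call the missing tool. Check GET /api/v1/skills for the active bag.
The model identifier is wrong. When nothing is configured for it, the chat model resolver maps anthropic/<x> to default_provider. Set default_provider to anthropic in Settings → LLM, or pin a concrete provider+model in the persona.
max_turns is too low. A worker that runs out of turns before producing its final assistant message returns as failed. Raise it to 15+ for non-trivial personas.

"The worker returns OK but the output is garbage"

The persona body is your system prompt. Tighten it:

Start with Step 0: a single mandatory KB call. It anchors the worker.
State the output format verbatim in the body. The coordinator parses against this format.
Use critical_reminder for hard constraints (read-only, no PII, output language). It is wrapped in <critical-reminder> and re-injected every turn, so the LLM sees it on each iteration.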
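Of the failure causes above, the max_turns one is the easiest to reason about with a toy loop. The sketch below is not Ongrid's ReAct implementation: `run_worker`, the `step` callback, and the status strings are hypothetical, and each loop iteration is assumed to cost one turn.

```python
def run_worker(step, max_turns=15):
    """Drive a worker until it emits a final message or the turn budget runs out."""
    for turn in range(1, max_turns + 1):
        kind, payload = step(turn)
        if kind == "final":
            return {"status": "ok", "turns": turn, "output": payload}
        # Any other kind is treated as a tool call that consumed this turn.
    return {"status": "failed", "turns": max_turns, "output": None}


def scripted_step(turn):
    # Simulated worker: one KB call, four inspection calls, then the final answer.
    if turn >= 6:
        return ("final", "summary of findings")
    return ("tool_call", f"call-{turn}")


print(run_worker(scripted_step, max_turns=4)["status"])   # failed
print(run_worker(scripted_step, max_turns=12)["status"])  # ok
```

A persona that needs five tool calls needs at least six turns in this model, which is why a low cap fails a worker that was otherwise on track.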

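Testing a persona

Two integration points:

From the chat surface

Open /chat and ask a question that matches the persona's when_to_use. Watch the SPA: when the coordinator dispatches, an "Agent tile" appears with the persona's name and the AgentTool description. Click it to see the worker's conversation log.

From the API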

```bash
curl -X POST http://localhost:8080/api/v1/chat \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<the question that should trigger your persona>"}'
```

The streaming response surfaces:

text deltas: the coordinator's prose.
agent_tile envelopes: every AgentTool dispatch.
task_notification envelopes: worker completion.

When your persona is dispatched, the matching agent_tile.persona is your name.
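A test script can check the stream for the expected dispatch. The sketch below makes a strong assumption about the wire format, which this page does not specify: each streamed line is taken to be one JSON object with a `type` field, and agent_tile objects are taken to carry a `persona` field. Adjust the parsing to whatever your deployment actually emits; `dispatched_personas` is a hypothetical helper.

```python
import json


def dispatched_personas(lines):
    """Collect persona names from agent_tile envelopes (assumed one JSON object per line)."""
    personas = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        if event.get("type") == "agent_tile":
            personas.append(event.get("persona"))
    return personas


# Hand-written sample in the assumed format, not captured from a real server.
sample = [
    '{"type": "text", "delta": "Checking ClickHouse..."}',
    '{"type": "agent_tile", "persona": "specialist-clickhouse"}',
    '',
    '{"type": "task_notification", "status": "completed"}',
]

print(dispatched_personas(sample))  # ['specialist-clickhouse']
```

An empty result for a question that should trigger your persona points back to the catalog and when_to_use checks in the debugging section.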

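When not to write a custom persona

The task is a one-tool answer. Do not wrap "my custom Prometheus query" in a persona. Register a custom BaseTool and let the coordinator call it.
The task is a one-off. Personas are for recurring patterns. For a one-off investigation, ask the coordinator directly.
The task needs calls across 5 specialists. That is exactly the coordinator's job; do not write a meta-specialist that re-implements the coordinator's behavior.

A good rule: write a persona when the same shape of question recurs, the answer needs 5+ tool calls, and the tool bag is narrower than what the coordinator carries.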

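Sharing personas

Keep the .md files in your ops repository. Mount them under /var/lib/ongrid/agents/ in the manager container. The registry picks them up at startup (or when Reload is called).
For an organization-wide rollout, ship through the skill marketplace: a marketplace install bundles personas and skills together and triggers Reload automatically.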
