First-boot checklist
Right after install.sh finishes and you've logged in as the bootstrap admin, walk this list. None of it is required to "start seeing things" — the Quickstart works without any of it — but each item closes a gap before you hand the system to a team.
1. Set ONGRID_PUBLIC_URL correctly
Probably the most important item. This URL is what your edges use for the data plane — logs push to <url>/loki/api/v1/push, traces push to <url>/v1/traces.
Check what install.sh filled in:
sudo grep '^ONGRID_PUBLIC_URL=' /opt/ongrid/.env
If you accepted an internal address but your edges live on the public internet, logs and traces will silently fail. The tunnel still works (it has its own port), so the edge looks healthy.
To fix:
# Edit
sudo sed -i 's|^ONGRID_PUBLIC_URL=.*|ONGRID_PUBLIC_URL=https://ops.example.com|' /opt/ongrid/.env
# Restart the affected services
sudo docker compose -f /opt/ongrid/docker-compose.yml --env-file /opt/ongrid/.env up -d ongrid nginx
You don't need to redeploy edges; the agent re-reads the data-plane endpoint from the manager periodically.
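Because a wrong URL fails silently, it may be worth probing the two data-plane paths from an edge host after the restart. The following is a minimal sketch, not part of Ongrid: both function names are hypothetical, and only the two paths come from the text above.

```shell
# Print the two data-plane endpoints derived from a public URL.
ongrid_data_plane_endpoints() {
  base=${1%/}   # drop one trailing slash, if present
  printf '%s\n' "$base/loki/api/v1/push" "$base/v1/traces"
}

# Probe each endpoint and print the HTTP status code curl saw.
# curl reports 000 when it could not get any HTTP response at all.
ongrid_probe_data_plane() {
  ongrid_data_plane_endpoints "$1" | while read -r url; do
    code=$(curl -sS -o /dev/null -m 10 -w '%{http_code}' "$url" 2>/dev/null)
    printf '%s %s\n' "${code:-000}" "$url"
  done
}
```

Run `ongrid_probe_data_plane https://ops.example.com` on an edge. Any real status code suggests the route is reachable from that host; `000` suggests it is not. While the self-signed certificate is still in place, curl would need `-k` to get past certificate verification.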
See ONGRID_PUBLIC_URL.
2. Configure a default LLM provider
Out of the box, no provider is configured. The agent will refuse to think.
- Settings → Models. Pick one of:
  - OpenAI (gpt-5.4 default)
  - Anthropic (claude-opus-4-7 default)
  - Zhipu (glm-4.7 default)
  - DeepSeek (deepseek-v4-flash default)
  - Gemini (gemini-2.5-pro default)
  - Kimi (kimi-k2.6 default)
  - Custom OpenAI-compatible (vLLM, Ollama, OpenRouter, corporate relay…)
- Paste the API key. Optional: override the default model in "Advanced".
- Save. Pre-registration is hot — no restart.
- On the same page set Default provider to the one you just wired.
Don't mix default_provider and per-route models
The "default" drives back-end LLM calls (alert investigation, translate, summarize). The model picker in the chat header is a per-thread override — useful for "try Opus on this one question" but the site default is what runs for cron jobs and incidents.
See the Routing & default and Budget & limits entries under Models in the sidebar.
3. Configure a notification channel
Even if you don't have alert rules yet, wire one channel so future incidents have somewhere to land.
The recommended starter pair:
- Webhook channel to a generic incoming-webhook collector (you can always remove it later) — proves the notification path works.
- One IM channel — Telegram is the easiest because you only need a bot token and a chat ID; Slack/Lark/DingTalk/WeCom take more setup.
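Before pasting the Telegram bot token and chat ID into a channel, you can check the pair directly against the Telegram Bot API. This is a sketch that sits outside Ongrid: the function names and the message text are hypothetical, while `sendMessage` with `chat_id` and `text` is the standard Bot API call.

```shell
# Build the Bot API sendMessage URL for a bot token.
telegram_send_url() {
  printf 'https://api.telegram.org/bot%s/sendMessage\n' "$1"
}

# Send a test message to a chat. On success Telegram answers with
# a JSON body that contains "ok":true.
telegram_send_test() {
  token=$1
  chat_id=$2
  curl -sS -m 10 "$(telegram_send_url "$token")" \
    --data-urlencode "chat_id=$chat_id" \
    --data-urlencode "text=Ongrid channel test"
}
```

Usage: `telegram_send_test "$BOT_TOKEN" "$CHAT_ID"`. If the message arrives in the chat, the same two values should work when entered in the channel form.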
See channels overview.
4. Set the manager timezone
Time stamps in incident timelines and alert events follow the manager's timezone. Default is UTC inside the container; for a UI that matches your team:
# Set TZ in compose env. Edit /opt/ongrid/docker-compose.yml or drop a
# /opt/ongrid/docker-compose.override.yml with:
services:
  ongrid:
    environment:
      TZ: Asia/Shanghai
# If you used the override file, add a second flag after the first:
#   -f /opt/ongrid/docker-compose.override.yml
# Compose only auto-loads an override when no -f is given.
sudo docker compose -f /opt/ongrid/docker-compose.yml --env-file /opt/ongrid/.env up -d ongrid
For the AI output locale (whether the LLM answers in English or Chinese), set ONGRID_DEFAULT_LOCALE. The default is en; valid values match your UI translations (en, zh-CN, ja, …). Channels can override per-channel; manual UI requests follow Accept-Language.
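The text does not say where ONGRID_DEFAULT_LOCALE is set. Assuming it is read from /opt/ongrid/.env like ONGRID_PUBLIC_URL, a small helper along these lines could update or add the key. The name `set_env_var` is hypothetical; it avoids sed so that values containing `|` or `&` need no escaping.

```shell
# Set KEY=VALUE in an env file, replacing any existing assignment of KEY.
# The new assignment is written as the last line of the file.
set_env_var() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp) || return 1
  # Keep every line except a current assignment of this key.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Run as root, this would be `set_env_var /opt/ongrid/.env ONGRID_DEFAULT_LOCALE zh-CN`, followed by the same `up -d ongrid` command used for the timezone change.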
5. Decide Prometheus retention
The compose file defaults to 90 days of retention with a 20 GB size cap. To change:
# /opt/ongrid/docker-compose.override.yml
services:
  prometheus:
    command:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=10GB
      - --web.enable-remote-write-receiver
      - --web.enable-lifecycle
      - --web.external-url=/prometheus/
      - --web.route-prefix=/prometheus/
      - --config.file=/etc/prometheus/prometheus.yml
# Pass the override explicitly; Compose only auto-loads it when no -f is given.
sudo docker compose -f /opt/ongrid/docker-compose.yml -f /opt/ongrid/docker-compose.override.yml --env-file /opt/ongrid/.env up -d prometheus
For Loki retention, edit /opt/ongrid/loki-config.yaml and restart the loki container. For Tempo, edit /opt/ongrid/tempo-config.yaml.
6. Decide whether to keep built-in Loki / Tempo
If you already run managed log/trace backends (Grafana Cloud, Honeycomb, Splunk, your own VictoriaLogs / VictoriaTraces…), you can:
- Keep the embedded ones as the data sink the agent queries through (cheapest path, no extra infra).
- Swap them out by pointing the agent at the managed URL via ONGRID_LOG_QUERY_URL / ONGRID_TRACE_QUERY_URL, and reconfiguring each edge's promtail / otelcol to push there directly.
For a hybrid setup (edges push to both), drop a custom promtail.yaml / otelcol.yaml in /etc/ongrid-edge/ on each edge and the agent will pick it up.
See logs capability and traces capability.
7. Replace the self-signed TLS cert
The self-signed cert is fine for trial use. For production, replace the files in /opt/ongrid/certs/ with your own:
sudo cp fullchain.pem /opt/ongrid/certs/tls.crt
sudo cp privkey.pem /opt/ongrid/certs/tls.key
sudo chmod 600 /opt/ongrid/certs/tls.key
sudo chmod 644 /opt/ongrid/certs/tls.crt
sudo docker compose -f /opt/ongrid/docker-compose.yml restart nginx
install.sh and upgrade.sh never overwrite operator certs.
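A mismatched certificate and key will typically stop nginx from coming back up, so it can help to confirm the pair belongs together before the restart. This is a sketch using the standard openssl CLI; the function name `tls_pair_matches` is hypothetical.

```shell
# Succeed only if the certificate and the private key carry the same
# public key. Fails if either file is unreadable or openssl is missing.
tls_pair_matches() {
  crt_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
  [ -n "$crt_pub" ] && [ "$crt_pub" = "$key_pub" ]
}
```

Usage, run as root because the key is mode 600: `tls_pair_matches /opt/ongrid/certs/tls.crt /opt/ongrid/certs/tls.key && echo "cert and key match"`. With a fullchain file, openssl reads the first certificate in it, which is normally the leaf.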
8. Back up /var/lib/ongrid and /opt/ongrid/.env
Everything stateful lives under those two paths:
- /opt/ongrid/.env: secrets (JWT, MySQL, admin password, embed keys).
- /var/lib/ongrid/mysql/: all operational state. Anything you can't lose lives here: edges, alert rules, incidents, channel configs, audit log, custom skills, knowledge metadata.
- /var/lib/ongrid/qdrant/: vector embeddings (rebuildable from source docs, but expensive).
- /var/lib/ongrid/prometheus/, loki/, tempo/: telemetry; back up only if you need long retention.
A simple cron + rsync (or restic) of those two roots gets you disaster recovery. Restore = stop the stack, replace the dirs, start the stack.
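The cron + rsync approach could look roughly like the sketch below. The function name, the destination layout (`opt-ongrid`, `var-lib-ongrid`), and the DRY_RUN switch are all assumptions made for illustration; only the two source paths come from the list above.

```shell
# Mirror the two stateful roots into a destination directory with rsync.
# With DRY_RUN=1 the commands are printed instead of executed.
ongrid_backup() {
  dest=${1%/}   # drop one trailing slash, if present
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
  run mkdir -p "$dest/opt-ongrid" "$dest/var-lib-ongrid" || return 1
  run rsync -a /opt/ongrid/.env "$dest/opt-ongrid/.env" || return 1
  run rsync -a --delete /var/lib/ongrid/ "$dest/var-lib-ongrid/"
}
```

Saved in a script that ends with a call such as `ongrid_backup /mnt/backup/ongrid` (a hypothetical destination), this could run from root's crontab, for example nightly. A copy taken while MySQL is writing may not be consistent, so stopping the stack for the duration of the copy, as the restore step does, is the safer variant. Adding `--exclude` patterns for the telemetry directories would keep the backup small if long retention is not needed.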
9. Set up a real admin account
The bootstrap admin's email is whatever you (or install.sh) put in ONGRID_ADMIN_EMAIL. For a real team:
- Settings → Identity → Users → Invite user for each real operator.
- Assign each a role: admin, user, or viewer (ADR-022 RBAC).
  - admin: full control.
  - user: can chat with the agent, view incidents, mute alerts. Toolbag filtered to ClassSafe.
  - viewer: read-only chat (no write skills), read-only incidents.
- Demote the bootstrap admin if you want, or just stop using it.
10. Trigger the first incident (smoke test)
Force one of the built-in rules to fire. Simplest: stop one of your edges.
sudo systemctl stop ongrid-edge
Within ONGRID_ALERT_EDGE_OFFLINE_THRESHOLD (default 90s) plus the evaluator interval (default 5m), the edge_offline rule fires. You should then see:
- Alerts — new event.
- Incidents — new incident grouping that event.
- Channels — your wired channel receives a card.
If you set up an IM channel, reply to the bot with "investigate this". The incident investigator runs end-to-end and posts back a report.
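How long to wait before concluding the smoke test failed follows from the two defaults quoted above. A small sketch of that arithmetic, with a hypothetical function name:

```shell
# Upper bound, in seconds, on the delay before edge_offline fires:
# the offline threshold plus one full evaluator interval.
edge_offline_max_wait() {
  threshold_s=${1:-90}    # ONGRID_ALERT_EDGE_OFFLINE_THRESHOLD default (90s)
  interval_s=${2:-300}    # evaluator interval default (5m)
  echo $((threshold_s + interval_s))
}
```

With the defaults this prints 390, so the alert should appear within about six and a half minutes of stopping the edge. If either setting has been changed, pass both values in seconds, for example `edge_offline_max_wait 30 60`.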
Restart the edge:
sudo systemctl start ongrid-edge
The incident auto-moves to mitigated when the rule stops firing; you mark it resolved from the UI.
What's next
- Channels overview — wire your real on-call channels.
- Alerts capability — author custom rules beyond the 6 built-ins.
- Upgrade — when v0.7.X+1 ships.
- Reference / env: every ONGRID_* tunable.