First-boot checklist

Right after install.sh finishes and you've logged in as the bootstrap admin, walk this list. None of it is required to "start seeing things" — the Quickstart works without any of it — but each item closes a gap before you hand the system to a team.

1. Set ONGRID_PUBLIC_URL correctly

Probably the most important item. This URL is what your edges use for the data plane — logs push to <url>/loki/api/v1/push, traces push to <url>/v1/traces.

Check what install.sh filled in:

bash
sudo grep '^ONGRID_PUBLIC_URL=' /opt/ongrid/.env

If you accepted an internal address but your edges live on the public internet, logs and traces will silently fail — the tunnel still works (it has its own port), so the edge looks healthy.
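To see exactly which endpoints the edges will be told to push to, the two data-plane URLs can be derived from the env file. This is a minimal sketch: the function name `data_plane_urls` is hypothetical, and it assumes the value is stored unquoted on a single line.

```bash
# Hypothetical helper: print the log and trace push endpoints implied by
# ONGRID_PUBLIC_URL in an env file (defaults to /opt/ongrid/.env).
data_plane_urls() {
  env_file="${1:-/opt/ongrid/.env}"
  # Take the last assignment if the key appears more than once.
  url=$(grep '^ONGRID_PUBLIC_URL=' "$env_file" | tail -n 1 | cut -d= -f2-)
  if [ -z "$url" ]; then
    echo "ONGRID_PUBLIC_URL is not set in $env_file" >&2
    return 1
  fi
  url="${url%/}"   # drop one trailing slash so paths join cleanly
  printf '%s/loki/api/v1/push\n' "$url"
  printf '%s/v1/traces\n' "$url"
}
```

Run it as a user that can read the env file. Then, from an edge host, something like `curl -sS -o /dev/null -w '%{http_code}\n' <printed-url>` should return an HTTP status code if the address is reachable. A timeout or connection error suggests the URL is wrong for that network.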

To fix:

bash
# Set the URL your edges can reach (replace https://ops.example.com with your own)
sudo sed -i 's|^ONGRID_PUBLIC_URL=.*|ONGRID_PUBLIC_URL=https://ops.example.com|' /opt/ongrid/.env

# Restart the affected services
sudo docker compose -f /opt/ongrid/docker-compose.yml --env-file /opt/ongrid/.env up -d ongrid nginx

You don't need to redeploy edges; the agent re-reads the data-plane endpoint from the manager periodically.

See ONGRID_PUBLIC_URL.

2. Configure a default LLM provider

Out of the box, no provider is configured. The agent will refuse to think.

  1. Settings → Models. Pick one of:
    • OpenAI (gpt-5.4 default)
    • Anthropic (claude-opus-4-7 default)
    • Zhipu (glm-4.7 default)
    • DeepSeek (deepseek-v4-flash default)
    • Gemini (gemini-2.5-pro default)
    • Kimi (kimi-k2.6 default)
    • Custom OpenAI-compatible (vLLM, Ollama, OpenRouter, corporate relay…)
  2. Paste the API key. Optional: override the default model in "Advanced".
  3. Save. The provider is registered live; no restart is needed.
  4. On the same page set Default provider to the one you just wired.

Don't mix default_provider and per-route models

The "default" drives back-end LLM calls (alert investigation, translate, summarize). The model picker in the chat header is a per-thread override — useful for "try Opus on this one question" but the site default is what runs for cron jobs and incidents.

See the Routing & default and Budget & limits entries under Models in the sidebar.

3. Configure a notification channel

Even if you don't have alert rules yet, wire one channel so future incidents have somewhere to land.

The recommended starter pair:

  • Webhook channel to a generic incoming-webhook collector (you can always remove it later) — proves the notification path works.
  • One IM channel — Telegram is the easiest because you only need a bot token and a chat ID; Slack/Lark/DingTalk/WeCom take more setup.
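Before entering the Telegram values in the UI, the bot token and chat ID can be checked directly against the Telegram Bot API's `sendMessage` method. This is a sketch, assuming `curl` is installed; `tg_send` and the `DRY_RUN` switch are hypothetical names used only for illustration.

```bash
# Hypothetical helper: send a test message through the Telegram Bot API.
# With DRY_RUN set, only print the request instead of sending it.
tg_send() {
  token="$1"; chat_id="$2"; text="${3:-ongrid channel test}"
  url="https://api.telegram.org/bot${token}/sendMessage"
  if [ -n "${DRY_RUN:-}" ]; then
    printf 'POST %s chat_id=%s\n' "$url" "$chat_id"
    return 0
  fi
  # -f makes curl exit non-zero on an HTTP error such as a bad token.
  # --data-urlencode sends the fields as a form-encoded POST body.
  curl -fsS "$url" \
    --data-urlencode "chat_id=${chat_id}" \
    --data-urlencode "text=${text}"
}
```

If the message arrives in the chat, the same token and chat ID should work when pasted into the channel form.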

See channels overview.

4. Set the manager timezone

Timestamps in incident timelines and alert events follow the manager's timezone. The default inside the container is UTC; to make the UI match your team's local time, set TZ:

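yaml
# Set TZ in the compose environment. Either edit /opt/ongrid/docker-compose.yml
# or drop a /opt/ongrid/docker-compose.override.yml containing:
services:
  ongrid:
    environment:
      TZ: Asia/Shanghai

bash
# If you used an override file, also pass: -f /opt/ongrid/docker-compose.override.yml
# (Compose only auto-loads the override file when no -f flag is given.)
sudo docker compose -f /opt/ongrid/docker-compose.yml --env-file /opt/ongrid/.env up -d ongrid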

For the AI output locale (does the LLM answer in English or Chinese), set ONGRID_DEFAULT_LOCALE. Default en; valid values match your UI translations (en, zh-CN, ja, …). Channels can override per-channel; manual UI requests follow Accept-Language.

5. Decide Prometheus retention

The bundled compose file defaults to 90 days of retention with a 20 GB size cap. To change:

yaml
# /opt/ongrid/docker-compose.override.yml
# An override's command list replaces the base list, so repeat every flag you still need.
services:
  prometheus:
    command:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=10GB
      - --web.enable-remote-write-receiver
      - --web.enable-lifecycle
      - --web.external-url=/prometheus/
      - --web.route-prefix=/prometheus/
      - --config.file=/etc/prometheus/prometheus.yml

bash
sudo docker compose -f /opt/ongrid/docker-compose.yml -f /opt/ongrid/docker-compose.override.yml --env-file /opt/ongrid/.env up -d prometheus
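Compose only auto-loads docker-compose.override.yml when no -f flag is given, so commands that pass -f explicitly need the override listed too. The sketch below builds the file arguments and adds the override only when it exists. `compose_args` is a hypothetical helper name, and the stack directory is assumed to contain no spaces.

```bash
# Hypothetical helper: print the docker compose file arguments for a stack
# directory, adding the override file only when it exists.
compose_args() {
  dir="${1:-/opt/ongrid}"
  args="-f $dir/docker-compose.yml"
  if [ -f "$dir/docker-compose.override.yml" ]; then
    args="$args -f $dir/docker-compose.override.yml"
  fi
  printf '%s --env-file %s/.env\n' "$args" "$dir"
}

# Example (relies on word splitting, so the path must not contain spaces):
#   sudo docker compose $(compose_args) config          # print the merged result
#   sudo docker compose $(compose_args) up -d prometheus
```

Running the `config` subcommand first is a cheap way to confirm that the retention flags or TZ value made it into the merged configuration before anything restarts.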

For Loki retention edit /opt/ongrid/loki-config.yaml and restart the loki container. For Tempo edit /opt/ongrid/tempo-config.yaml.

6. Decide whether to keep built-in Loki / Tempo

If you already run managed log/trace backends (Grafana Cloud, Honeycomb, Splunk, your own VictoriaLogs / VictoriaTraces…), you can:

  • Keep the embedded ones as the data sink the agent queries through (cheapest path, no extra infra).
  • Swap them out by pointing the agent at the managed URL via ONGRID_LOG_QUERY_URL / ONGRID_TRACE_QUERY_URL, and reconfiguring each edge's promtail / otelcol to push there directly.

For a hybrid setup (edges push to both), drop a custom promtail.yaml / otelcol.yaml in /etc/ongrid-edge/ on each edge and the agent will pick it up.

See logs capability and traces capability.

7. Replace the self-signed TLS cert

The self-signed cert is fine for trial use. For production, replace the files in /opt/ongrid/certs/ with your own:

bash
sudo cp fullchain.pem /opt/ongrid/certs/tls.crt
sudo cp privkey.pem   /opt/ongrid/certs/tls.key
sudo chmod 600        /opt/ongrid/certs/tls.key
sudo chmod 644        /opt/ongrid/certs/tls.crt
sudo docker compose -f /opt/ongrid/docker-compose.yml restart nginx

install.sh and upgrade.sh never overwrite operator certs.
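A mismatched cert and key is a common reason for nginx failing to start after a swap, so it can be worth checking the pair before copying it in. This is a sketch using the `openssl` CLI; `cert_matches_key` is a hypothetical helper name, and it assumes an unencrypted private key.

```bash
# Hypothetical helper: succeed only if the certificate and private key
# share the same public key. Requires the openssl CLI.
cert_matches_key() {
  cert="$1"; key="$2"
  # For a chain file, openssl x509 reads the first (leaf) certificate.
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 2
  key_pub=$(openssl pkey -in "$key" -pubout) || return 2
  [ "$cert_pub" = "$key_pub" ]
}

# Example:
#   cert_matches_key fullchain.pem privkey.pem && echo "cert and key match"
```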

8. Back up /var/lib/ongrid and /opt/ongrid/.env

Everything stateful lives under those two paths:

  • /opt/ongrid/.env — secrets (JWT, MySQL, admin password, embed keys).
  • /var/lib/ongrid/mysql/ — all operational state. Anything you can't lose lives here: edges, alert rules, incidents, channel configs, audit log, custom skills, knowledge metadata.
  • /var/lib/ongrid/qdrant/ — vector embeddings (rebuildable from source docs, but expensive).
  • /var/lib/ongrid/prometheus/, loki/, tempo/ — telemetry; back up only if you need long retention.

A simple cron + rsync (or restic) of those two roots gets you disaster recovery. Restore = stop the stack, replace the dirs, start the stack.
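As one possible shape for that job, the sketch below archives both roots into a timestamped tarball. It uses `tar` rather than rsync or restic so that each run produces a single file. `ongrid_backup` and the paths in the crontab comment are hypothetical. The sketch does not stop the stack, so a MySQL directory copied while the database is running may not be consistent.

```bash
# Hypothetical helper: archive the env file and the data directory into
# DEST_DIR and print the archive path. All paths must be absolute.
ongrid_backup() (
  dest_dir="$1"
  env_file="${2:-/opt/ongrid/.env}"
  data_dir="${3:-/var/lib/ongrid}"
  umask 077   # the archive contains secrets; keep it owner-only
  mkdir -p "$dest_dir" || exit 1
  archive="$dest_dir/ongrid-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C / with relative member names avoids tar's leading-slash warning.
  tar -czf "$archive" -C / "${env_file#/}" "${data_dir#/}" || exit 1
  printf '%s\n' "$archive"
)

# Example crontab entry (root), assuming the function is wrapped in a
# script at the hypothetical path /usr/local/bin/ongrid-backup.sh:
#   15 3 * * * /usr/local/bin/ongrid-backup.sh /mnt/backup/ongrid
```

Keep the destination outside /var/lib/ongrid so the archive does not end up inside its own next run.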

9. Set up a real admin account

The bootstrap admin's email is whatever you (or install.sh) put in ONGRID_ADMIN_EMAIL. For a real team:

  1. Settings → Identity → Users → Invite user for each real operator.
  2. Assign each a role: admin, user, or viewer (ADR-022 RBAC).
    • admin — full control.
    • user — can chat with the agent, view incidents, mute alerts. Toolbag filtered to ClassSafe.
    • viewer — read-only chat (no write skills), read-only incidents.
  3. Demote the bootstrap admin if you want, or just stop using it.

10. Trigger the first incident (smoke test)

Force one of the built-in rules to fire. Simplest: stop one of your edges.

bash
sudo systemctl stop ongrid-edge

Within ONGRID_ALERT_EDGE_OFFLINE_THRESHOLD (default 90s) plus the evaluator interval (default 5m), so up to about 6.5 minutes with the defaults, the edge_offline rule fires. You should then see:

  • Alerts — new event.
  • Incidents — new incident grouping that event.
  • Channels — your wired channel receives a card.
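How long to wait before deciding the smoke test failed is simple arithmetic on the two settings above. The sketch below works it out; `to_seconds` and `worst_case_wait` are hypothetical helper names, and only the s, m and h suffixes are handled.

```bash
# Hypothetical helpers: convert a duration such as 90s, 5m or 1h to seconds
# and add the offline threshold to the evaluator interval.
to_seconds() {
  case "$1" in
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo $(( ${1%s} )) ;;
    *)  echo "$1" ;;   # bare number: already seconds
  esac
}

worst_case_wait() {
  echo $(( $(to_seconds "$1") + $(to_seconds "$2") ))
}

worst_case_wait 90s 5m   # prints 390, i.e. 6m30s with the defaults
```

If you change either setting, rerun the calculation with your values before concluding that the rule did not fire.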

If you set up an IM channel, reply to the bot with "investigate this". The incident investigator runs end-to-end and posts back a report.

Restart the edge:

bash
sudo systemctl start ongrid-edge

The incident automatically moves to mitigated when the rule stops firing; you then mark it resolved from the UI.

What's next