Upgrade

Two binaries to upgrade: the manager (docker-compose stack) and edges (one per host). Both follow the same conceptual flow: stage the new artefacts, swap them in atomically, fall back on the previous version automatically if the new one doesn't come up healthy.

TL;DR

Manager:

bash

VER=v0.10.1
gh release download "$VER" --repo ongridio/ongrid \
    -p 'ongrid-*-linux-amd64.tar.xz*'
sha256sum -c "ongrid-${VER}-linux-amd64.tar.xz.sha256"
tar xf "ongrid-${VER}-linux-amd64.tar.xz"
cd     "ongrid-${VER}-linux-amd64"
sudo ./upgrade.sh

Edges (all of them, in parallel, from the UI):

Edges → Upgrade all — manager pushes MethodFetchPackage to every connected edge, each downloads the staged bundle from https://<manager>/edge/, stages it, restarts, swaps, runs. Failures auto-roll-back on the next boot.

Or for a single edge from the shell:

bash

sudo systemctl restart ongrid-edge
# The ExecStartPre runs apply-pending-upgrade.sh which picks up any
# staged bundle and applies it before the new agent starts.

The rest of this page explains how all of that actually works, so you can debug it when it doesn't.

What you don't need to do


Edit `.env`	`upgrade.sh` appends new keys but never overwrites yours
Manage certs	`certs/` is untouched; your real TLS cert survives
Pull docker images	The tarball ships them — upgrade works offline
SSH into edges	Edge upgrade goes through the manager UI (ADR-024)
Run a schema migration	gorm `AutoMigrate` handles diff on manager boot

When the upgrade needs help

Situation	What to do
Pre-v0.7.45 install with named docker volumes	Re-run with `bash upgrade.sh --migrate-volumes` — the script uses an alpine container to copy the legacy volumes into the bind-mount layout
`/healthz` doesn't return 200 within 90s	The script WARNs but doesn't roll back. Inspect `docker compose -f /opt/ongrid/docker-compose.yml logs ongrid`
Want to roll back	Extract the previous version's tarball and re-run its own `upgrade.sh`. No schema migration usually means a clean downgrade

Downtime is ~30s to 2min depending on DB size — gorm AutoMigrate's schema diff dominates on large chat_messages / audit_log tables.

Manager upgrade: `upgrade.sh`

upgrade.sh lives in every release tarball and is meant to be run inside the freshly extracted newer tarball. It assumes there's already an install at ${ONGRID_INSTALL_DIR:-/opt/ongrid}/.

In order:

Preflight. Verifies docker + compose v2, finds the existing .env. Bails if there is none.
Determines old vs new version from the existing .env and the new tarball's VERSION file. Logs both — useful for support tickets.
docker compose down to release the data volumes for any migration step.
Re-asserts data dir ownership under ${ONGRID_DATA_DIR}/ (mysql, prom, loki, tempo, grafana, embeddings, qdrant). Image uids haven't changed in a long time but this is the only safe default — re-asserting is idempotent.
Re-stages configs from the new tarball into /opt/ongrid/: docker-compose.yml, prometheus.yml, prometheus-rules.yml, loki-config.yaml, tempo-config.yaml, grafana/, searxng/, nginx.conf, frontier.yaml, the new edge/ artefacts, the new VERSION file. .env is preserved — upgrade.sh never overwrites operator-edited env.
Loads new docker images (ongrid.tar, ongrid-web.tar, frontier.tar if present).
Bumps ONGRID_VERSION= in .env so the next compose-up picks the new image tag.
docker compose up -d with the new env.
Polls /healthz for 60s.
Banner with old → new version, the Web URL, useful next-step commands.

If anything fails before step 8, the operator can re-run after fixing the underlying issue — upgrade.sh is idempotent. If step 9 fails (/healthz never returns 200), the manager logs from docker compose logs ongrid will tell you what's broken.

For zero-downtime, run a blue/green setup with two manager hosts in front of a load balancer; the OSS single-host install is brief downtime during upgrade (~30-60s).

Where the upgrade artefacts come from: `make package`

This is the operator path — how you build the tarball if you're building from source instead of pulling a release. From a checkout of ongridio/ongrid:

bash

make package

Order matters; this is what make package does:

fetch-promtail — download promtail-<os>-<arch>.zip from the upstream Grafana release, extract into bin/<os>-<arch>/promtail. Cached.
fetch-otelcol — same for otelcol-contrib.
fetch-node-exporter — same for node_exporter.
fetch-process-exporter — same for process_exporter.
build-edge-all — cross-compile ongrid-edge for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64.
docker-build — build the ongrid:<version> image.
docker-build-broker — build singchia/frontier:<ver> from $FRONTIER_SRC (or skip if already in the local image store).
docker-build-web — build ongrid-web:<version> (frontend SPA + nginx).
build-edge-bundle — assemble edge-bundle-linux-amd64-<ver>.tar.gz from the loose binaries.
dist/package.sh — stages everything into dist/stage/ongrid-<version>-linux-amd64/, docker saves the three images, tar.xzs the whole thing, computes the sha256 sidecar, drops it under dist/out/.

The output:

dist/out/ongrid-v0.10.1-linux-amd64.tar.xz
dist/out/ongrid-v0.10.1-linux-amd64.tar.xz.sha256

That's the artefact you scp to the prod host and feed to ./upgrade.sh.

Offline RAG model

The BGE embedding model isn't a package dep — it's slow to fetch over CN networks, so it stays a one-off step. Run make fetch-embedding-model once before make package if you want ONGRID_EMBEDDING_PROVIDER=local to work out of the box on a restore.

Edge upgrade: ADR-024 stage-then-swap

The user path for upgrading an edge is "click Upgrade in the UI" or systemctl restart ongrid-edge after a bundle's been staged. What actually happens, top to bottom:

1. Manager triggers, edge fetches

On "Upgrade all edges" (or a single edge):

Manager reads dist/out/edge-bundle-<arch>-<ver>.tar.gz.sha256 from /opt/ongrid/edge/ (placed there by install.sh / upgrade.sh).
Manager sends MethodFetchPackage(url=https://<manager>/edge/<bundle>, sha256=<sha>, version=<ver>) over the tunnel to the target edges.
nginx serves the bundle bytes from the same /edge/ dir.

2. Edge stages

The receiving edge:

Downloads the bundle to /tmp/.
Verifies the sha256 against the manifest the manager sent.
Untars into /var/lib/ongrid-edge/.upgrade/incoming/.
Validates each <sha> <mode> <src> <dest> line of MANIFEST.txt (sha matches, src exists).
Writes a "ready" marker.
Exits clean. systemd restarts the unit per Restart=always.

3. systemd's ExecStartPre runs the swap

The unit ships with:

ini

ExecStartPre=-+/usr/local/lib/ongrid-edge/apply-pending-upgrade.sh
ExecStart=/usr/local/bin/ongrid-edge

The + prefix runs the hook as root despite User=ongrid-edge; the - lets a missing/failing script exit 0 without blocking the unit so the pre-upgrade binary always starts.

apply-pending-upgrade.sh does, in order:

Mode 1: auto-rollback if a prior upgrade ran but never wrote healthy_marker matching last_upgrade_ver. Restores every <dest>.previous over <dest>. Clears the staging dir.
Mode 2: bundle apply — for each MANIFEST.txt line:
- re-verify the sha256,
- cp -p $dest $dest.previous (snapshot for rollback),
- cp $src $dest.new on the same filesystem so the final rename is POSIX-atomic,
- chmod $mode $dest.new,
- mv -f $dest.new $dest.
Mode 3: legacy single-file apply for back-compat with edges that haven't been bundle-upgraded yet (just the agent binary, no manifest).
Records last_upgrade_at and last_upgrade_ver for the next-boot health check.

Then ExecStart=/usr/local/bin/ongrid-edge runs the freshly-swapped binary.

4. The new agent reports healthy (or doesn't)

The new agent, on successful connect:

Writes /var/lib/ongrid-edge/.upgrade/healthy_marker with the version it's running.
Reports register over the tunnel with its new version.

If that file exists and matches last_upgrade_ver by the next boot, apply-pending-upgrade.sh declares success, prunes every .previous to free disk, clears the staging dir.

If it doesn't (agent crashed, can't reach manager, anything else), the next boot triggers auto-rollback: every .previous is restored, the staging dir is wiped, the edge runs the previous working version again. The operator sees this in the UI (the edge keeps reporting an "old" version after a triggered upgrade — a clue that something's wrong with the new bundle).

Bundle invariants

Every file the edge swaps is in the bundle; every file in the bundle is a self-contained binary or script. This is non-negotiable:

The agent binary (ongrid-edge)
The plugin binaries (promtail, otelcol-contrib, node_exporter, process_exporter)
The hook script (apply-pending-upgrade.sh)

Plugin configs (promtail.yaml, otelcol.yaml) are not in the bundle — they live in /etc/ongrid-edge/ and are delivered over the tunnel as live config. That way an agent upgrade doesn't clobber a config the operator hand-edited.

Don't bake the bundle back into the docker image

If you're building from source, the bundle is built by dist/build-edge-bundle.sh on the host at package time, not in the container. ADR-032 enforces this — re-baking it into the image double-packs ~120 MB of incompressible bytes and breaks the sha chain.

Where rollbacks live

text

/usr/local/bin/ongrid-edge              # current
/usr/local/bin/ongrid-edge.previous     # last known good

/usr/local/lib/ongrid-edge/promtail
/usr/local/lib/ongrid-edge/promtail.previous

...etc.

/var/lib/ongrid-edge/.upgrade/
  last_upgrade_at                       # ISO timestamp
  last_upgrade_ver                      # version string
  healthy_marker                        # written by the new agent

A boot sequence after a triggered upgrade looks like:

text

boot 1 (pre-upgrade):
  apply-pending-upgrade.sh sees nothing staged; exit 0
  ongrid-edge v0.7.159 runs

(manager pushes bundle, edge stages it, edge exits clean)

boot 2 (apply):
  apply-pending-upgrade.sh
    mode 1: no healthy_marker yet, but no last_upgrade_at either → skip
    mode 2: applies bundle. /usr/local/bin/ongrid-edge.previous = v0.7.159
                            /usr/local/bin/ongrid-edge        = v0.10.1
            writes last_upgrade_at, last_upgrade_ver=v0.10.1
            clears healthy_marker
  ongrid-edge v0.10.1 runs
  v0.10.1 connects, writes healthy_marker=v0.10.1

(reboot, e.g. host updates)

boot 3 (settle):
  apply-pending-upgrade.sh
    mode 1: last_upgrade_ver=v0.10.1, healthy_marker=v0.10.1 → success
            prunes all .previous files
            clears last_upgrade_at, last_upgrade_ver
  ongrid-edge v0.10.1 runs

A failure scenario:

text

boot 2 (apply):
  apply-pending-upgrade.sh: applies bundle as above
  ongrid-edge v0.10.1 crashes immediately (e.g. config drift)
  systemd restarts repeatedly per Restart=always

boot 3 (rollback):
  apply-pending-upgrade.sh
    mode 1: last_upgrade_ver=v0.10.1, healthy_marker missing
            → roll back: restore .previous over each dest
              (ongrid-edge.previous → ongrid-edge, etc.)
            : > last_upgrade_at, rm -rf incoming/
  ongrid-edge v0.7.159 runs again (working)

Downgrading

Downgrading the manager: extract the older tarball, run ./upgrade.sh inside it. As long as no DB schema migration has happened in the intervening releases, this works. Check the release notes for any "breaking schema change" callouts before downgrading.

Downgrading edges: trigger an "upgrade" from the UI to the older bundle version. Manager logic doesn't care which direction the version is moving.

What's next

Reference / env — env var names whose defaults change between releases.
Air-gapped install — how to mirror artefacts to a private webserver so apply-pending-upgrade.sh can still pull bundles when the manager itself isn't on the internet.

Upgrade ​

TL;DR ​

What you don't need to do ​

When the upgrade needs help ​

Manager upgrade: upgrade.sh ​

Where the upgrade artefacts come from: make package ​

Edge upgrade: ADR-024 stage-then-swap ​

1. Manager triggers, edge fetches ​

2. Edge stages ​

3. systemd's ExecStartPre runs the swap ​

4. The new agent reports healthy (or doesn't) ​

Bundle invariants ​

Where rollbacks live ​

Downgrading ​

What's next ​

Upgrade

TL;DR

What you don't need to do

When the upgrade needs help

Manager upgrade: `upgrade.sh`

Where the upgrade artefacts come from: `make package`

Edge upgrade: ADR-024 stage-then-swap

1. Manager triggers, edge fetches

2. Edge stages

3. systemd's ExecStartPre runs the swap

4. The new agent reports healthy (or doesn't)

Bundle invariants

Where rollbacks live

Downgrading

What's next