Upgrade
Two things to upgrade: the manager (docker-compose stack) and the edges (one per host). Both follow the same conceptual flow: stage the new artefacts, swap them in atomically, and fall back to the previous version automatically if the new one doesn't come up healthy.
TL;DR
Manager:
VER=v0.7.160
gh release download "$VER" --repo ongridio/ongrid \
  -p 'ongrid-*-linux-amd64.tar.xz*'
sha256sum -c "ongrid-${VER}-linux-amd64.tar.xz.sha256"
tar xf "ongrid-${VER}-linux-amd64.tar.xz"
cd "ongrid-${VER}-linux-amd64"
sudo ./upgrade.sh
Edges (all of them, in parallel, from the UI):
- Edges → Upgrade all: the manager pushes MethodFetchPackage to every connected edge. Each edge downloads the staged bundle from https://<manager>/edge/, stages it, restarts, swaps, and runs. Failures auto-roll-back on the next boot.
Or for a single edge from the shell:
sudo systemctl restart ongrid-edge
# The ExecStartPre runs apply-pending-upgrade.sh which picks up any
# staged bundle and applies it before the new agent starts.
The rest of this page explains how all of that actually works, so you can debug it when it doesn't.
Manager upgrade: upgrade.sh
upgrade.sh lives in every release tarball and is meant to be run inside the freshly extracted newer tarball. It assumes there's already an install at ${ONGRID_INSTALL_DIR:-/opt/ongrid}/.
In order:
1. Preflight. Verifies docker + compose v2, finds the existing .env. Bails if there is none.
2. Determines old vs new version from the existing .env and the new tarball's VERSION file. Logs both, which is useful for support tickets.
3. docker compose down to release the data volumes for any migration step.
4. Re-asserts data dir ownership under ${ONGRID_DATA_DIR}/ (mysql, prom, loki, tempo, grafana, embeddings, qdrant). Image uids haven't changed in a long time, but this is the only safe default; re-asserting is idempotent.
5. Re-stages configs from the new tarball into /opt/ongrid/: docker-compose.yml, prometheus.yml, prometheus-rules.yml, loki-config.yaml, tempo-config.yaml, grafana/, searxng/, nginx.conf, frontier.yaml, the new edge/ artefacts, the new VERSION file. .env is preserved: upgrade.sh never overwrites operator-edited env.
6. Loads new docker images (ongrid.tar, ongrid-web.tar, frontier.tar if present).
7. Bumps ONGRID_VERSION= in .env so the next compose-up picks the new image tag.
8. docker compose up -d with the new env.
9. Polls /healthz for 60s.
10. Banner with old → new version, the Web URL, useful next-step commands.
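Steps 2 and 7 both touch the version line in .env. The following is a minimal sketch of that read-and-bump logic, assuming .env carries a single ONGRID_VERSION=<tag> line. The function names read_env_version and bump_env_version are hypothetical and are not taken from upgrade.sh, which may do this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not the actual upgrade.sh implementation.

# read_env_version ENV_FILE
# Prints the value of ONGRID_VERSION (empty output if the key is absent).
read_env_version() {
  local env_file=$1
  sed -n 's/^ONGRID_VERSION=//p' "$env_file" | tail -n 1
}

# bump_env_version ENV_FILE NEW_VERSION
# Rewrites only the ONGRID_VERSION= line; every other line is kept as-is.
# Appends the key if the file does not define it yet.
# Assumes NEW_VERSION contains no '|' or '&' (plain tags like v0.7.160).
bump_env_version() {
  local env_file=$1 new_ver=$2 tmp
  tmp=$(mktemp "${env_file}.XXXXXX") || return 1
  if grep -q '^ONGRID_VERSION=' "$env_file"; then
    sed "s|^ONGRID_VERSION=.*|ONGRID_VERSION=${new_ver}|" "$env_file" > "$tmp" \
      || { rm -f "$tmp"; return 1; }
  else
    { cat "$env_file"; printf 'ONGRID_VERSION=%s\n' "$new_ver"; } > "$tmp" \
      || { rm -f "$tmp"; return 1; }
  fi
  # Write back through the original file so its owner and mode are kept.
  cat "$tmp" > "$env_file" && rm -f "$tmp"
}
```

Writing through a temp file and then back into the original keeps the operator's ownership and permissions on .env, at the cost of the write-back not being atomic.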
If anything fails before step 8, the operator can re-run after fixing the underlying issue — upgrade.sh is idempotent. If step 9 fails (/healthz never returns 200), the manager logs from docker compose logs ongrid will tell you what's broken.
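When debugging a failed step 9 by hand, it can help to reproduce the bounded poll yourself. This is a sketch under assumptions, not the code in upgrade.sh: wait_healthy and probe_healthz are hypothetical names, and the health URL is whatever your install exposes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded health poll.

# wait_healthy TIMEOUT_SECS INTERVAL_SECS CMD [ARGS...]
# Re-runs CMD until it succeeds or TIMEOUT_SECS have elapsed.
# Returns 0 on the first success, 1 on timeout.
wait_healthy() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# probe_healthz URL
# curl -f turns any HTTP status >= 400 into a non-zero exit code.
probe_healthz() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}
```

For example, with HEALTHZ_URL set to your manager's /healthz endpoint: wait_healthy 60 2 probe_healthz "$HEALTHZ_URL" || docker compose logs --tail 100 ongrid.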
For zero-downtime upgrades, run a blue/green setup with two manager hosts behind a load balancer; the OSS single-host install incurs brief downtime during an upgrade (~30-60s).
Where the upgrade artefacts come from: make package
This is the operator path — how you build the tarball if you're building from source instead of pulling a release. From a checkout of ongridio/ongrid:
make package
Order matters; this is what make package does:
1. fetch-promtail: download promtail-<os>-<arch>.zip from the upstream Grafana release, extract into bin/<os>-<arch>/promtail. Cached.
2. fetch-otelcol: same for otelcol-contrib.
3. fetch-node-exporter: same for node_exporter.
4. fetch-process-exporter: same for process_exporter.
5. build-edge-all: cross-compile ongrid-edge for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64.
6. docker-build: build the ongrid:<version> image.
7. docker-build-broker: build singchia/frontier:<ver> from $FRONTIER_SRC (or skip if already in the local image store).
8. docker-build-web: build ongrid-web:<version> (frontend SPA + nginx).
9. build-edge-bundle: assemble edge-bundle-linux-amd64-<ver>.tar.gz from the loose binaries.
10. dist/package.sh: stages everything into dist/stage/ongrid-<version>-linux-amd64/, runs docker save on the three images, packs the whole thing as tar.xz, computes the sha256 sidecar, drops it under dist/out/.
The output:
dist/out/ongrid-v0.7.160-linux-amd64.tar.xz
dist/out/ongrid-v0.7.160-linux-amd64.tar.xz.sha256
That's the artefact you scp to the prod host and feed to ./upgrade.sh.
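The .sha256 sidecar is what sha256sum -c consumes in the TL;DR. As a rough illustration of how such a sidecar can be produced and re-checked after the scp, here is a sketch assuming GNU coreutils. The function names are hypothetical, and dist/package.sh may compute the sidecar differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a "<hash>  <basename>" sidecar file.

# write_sha256_sidecar FILE
# Writes FILE.sha256 next to FILE. The hash line names only the basename,
# so the pair can be copied to another directory or host together.
write_sha256_sidecar() {
  local file=$1
  ( cd "$(dirname "$file")" \
      && sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# verify_sha256_sidecar FILE
# Returns 0 only if FILE still matches the hash recorded in FILE.sha256.
verify_sha256_sidecar() {
  local file=$1
  ( cd "$(dirname "$file")" \
      && sha256sum --quiet -c "$(basename "$file").sha256" )
}
```

Running verify_sha256_sidecar on the prod host before ./upgrade.sh catches a truncated or corrupted transfer early.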
Offline RAG model
The BGE embedding model isn't a package dep — it's slow to fetch over CN networks, so it stays a one-off step. Run make fetch-embedding-model once before make package if you want ONGRID_EMBEDDING_PROVIDER=local to work out of the box on a restore.
Edge upgrade: ADR-024 stage-then-swap
The user path for upgrading an edge is "click Upgrade in the UI" or systemctl restart ongrid-edge after a bundle's been staged. What actually happens, top to bottom:
1. Manager triggers, edge fetches
On "Upgrade all edges" (or a single edge):
- Manager reads dist/out/edge-bundle-<arch>-<ver>.tar.gz.sha256 from /opt/ongrid/edge/ (placed there by install.sh / upgrade.sh).
- Manager sends MethodFetchPackage(url=https://<manager>/edge/<bundle>, sha256=<sha>, version=<ver>) over the tunnel to the target edges.
- nginx serves the bundle bytes from the same /edge/ dir.
2. Edge stages
The receiving edge:
- Downloads the bundle to /tmp/.
- Verifies the sha256 against the manifest the manager sent.
- Untars into /var/lib/ongrid-edge/.upgrade/incoming/.
- Validates each <sha> <mode> <src> <dest> line of MANIFEST.txt (sha matches, src exists).
- Writes a "ready" marker.
- Exits clean. systemd restarts the unit per Restart=always.
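The manifest validation in the staging step can be sketched in shell as follows. This is an illustration only: the agent performs the check itself, validate_manifest is a hypothetical name, and the sketch assumes <src> is a path relative to the staging directory.

```shell
#!/usr/bin/env bash
# validate_manifest STAGING_DIR
# Checks every "<sha> <mode> <src> <dest>" line of STAGING_DIR/MANIFEST.txt.
# Assumption: <src> is relative to STAGING_DIR.
# Returns 0 if every line passes, 1 on the first problem found.
validate_manifest() {
  local dir=$1 sha mode src dest actual
  if [ ! -f "$dir/MANIFEST.txt" ]; then
    echo "no MANIFEST.txt in $dir" >&2
    return 1
  fi
  # The "|| [ -n ... ]" clause handles a last line without a trailing newline.
  while read -r sha mode src dest || [ -n "$sha" ]; do
    [ -n "$sha" ] || continue            # skip blank lines
    if [ -z "$dest" ]; then
      echo "malformed line: $sha $mode $src" >&2
      return 1
    fi
    if [ ! -f "$dir/$src" ]; then
      echo "missing src: $src" >&2
      return 1
    fi
    actual=$(sha256sum "$dir/$src" | cut -d' ' -f1)
    if [ "$actual" != "$sha" ]; then
      echo "sha mismatch: $src" >&2
      return 1
    fi
  done < "$dir/MANIFEST.txt"
  return 0
}
```

Failing on the first bad line means a partially valid bundle is never marked ready.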
3. systemd's ExecStartPre runs the swap
The unit ships with:
ExecStartPre=-+/usr/local/lib/ongrid-edge/apply-pending-upgrade.sh
ExecStart=/usr/local/bin/ongrid-edge
The + prefix runs the hook as root despite User=ongrid-edge; the - tells systemd to treat a missing or failing script as success instead of failing the unit, so the pre-upgrade binary always starts.
apply-pending-upgrade.sh does, in order:
- Mode 1: auto-rollback if a prior upgrade ran but never wrote a healthy_marker matching last_upgrade_ver. Restores every <dest>.previous over <dest>. Clears the staging dir.
- Mode 2: bundle apply. For each MANIFEST.txt line:
  - re-verify the sha256,
  - cp -p $dest $dest.previous (snapshot for rollback),
  - cp $src $dest.new on the same filesystem so the final rename is POSIX-atomic,
  - chmod $mode $dest.new,
  - mv -f $dest.new $dest.
- Mode 3: legacy single-file apply for back-compat with edges that haven't been bundle-upgraded yet (just the agent binary, no manifest).
- Records last_upgrade_at and last_upgrade_ver for the next-boot health check.
Then ExecStart=/usr/local/bin/ongrid-edge runs the freshly-swapped binary.
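The per-file sequence in Mode 2 can be condensed into one function. The sketch below mirrors the listed commands but is not the shipped apply-pending-upgrade.sh; swap_file is a hypothetical name and the sha re-verification is left out for brevity.

```shell
#!/usr/bin/env bash
# swap_file SRC DEST MODE
# Snapshots DEST, stages SRC beside it as DEST.new, then renames into place.
# Because DEST.new sits in the same directory as DEST, the final mv is a
# rename on one filesystem rather than a cross-device copy.
swap_file() {
  local src=$1 dest=$2 mode=$3
  if [ -e "$dest" ]; then
    cp -p "$dest" "$dest.previous" || return 1   # snapshot for rollback
  fi
  cp "$src" "$dest.new"     || { rm -f "$dest.new"; return 1; }
  chmod "$mode" "$dest.new" || { rm -f "$dest.new"; return 1; }
  mv -f "$dest.new" "$dest"
}
```

Copying into $dest.new first is the point of the design: a reader of $dest sees either the complete old file or the complete new one, never a half-written binary.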
4. The new agent reports healthy (or doesn't)
The new agent, on successful connect:
- Writes /var/lib/ongrid-edge/.upgrade/healthy_marker with the version it's running.
- Reports register over the tunnel with its new version.
If that file exists and matches last_upgrade_ver by the next boot, apply-pending-upgrade.sh declares success, prunes every .previous to free disk, clears the staging dir.
If it doesn't (agent crashed, can't reach manager, anything else), the next boot triggers auto-rollback: every .previous is restored, the staging dir is wiped, the edge runs the previous working version again. The operator sees this in the UI (the edge keeps reporting an "old" version after a triggered upgrade — a clue that something's wrong with the new bundle).
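The settle-or-rollback decision comes down to comparing two small files. A minimal sketch of that comparison, assuming both files hold a bare version string; upgrade_verdict and restore_previous are hypothetical names, not functions from the real hook:

```shell
#!/usr/bin/env bash
# upgrade_verdict STATE_DIR
# Prints one of:
#   none      no upgrade is waiting on a health check
#   settle    healthy_marker matches last_upgrade_ver
#   rollback  an upgrade ran but the new agent never confirmed it
upgrade_verdict() {
  local state=$1 want got
  want=$(cat "$state/last_upgrade_ver" 2>/dev/null)
  if [ -z "$want" ]; then
    echo none
    return 0
  fi
  got=$(cat "$state/healthy_marker" 2>/dev/null)
  if [ "$got" = "$want" ]; then
    echo settle
  else
    echo rollback
  fi
}

# restore_previous DEST...
# Moves each DEST.previous back over DEST; files with no snapshot are skipped.
restore_previous() {
  local dest
  for dest in "$@"; do
    [ -e "$dest.previous" ] || continue
    mv -f "$dest.previous" "$dest" || return 1
  done
  return 0
}
```

A missing or stale healthy_marker yields rollback, which matches the fail-safe behaviour described above: the new version has to prove itself, otherwise the previous one comes back.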
Bundle invariants
Every file the edge swaps is in the bundle; every file in the bundle is a self-contained binary or script. This is non-negotiable:
- The agent binary (ongrid-edge)
- The plugin binaries (promtail, otelcol-contrib, node_exporter, process_exporter)
- The hook script (apply-pending-upgrade.sh)
Plugin configs (promtail.yaml, otelcol.yaml) are not in the bundle — they live in /etc/ongrid-edge/ and are delivered over the tunnel as live config. That way an agent upgrade doesn't clobber a config the operator hand-edited.
Don't bake the bundle back into the docker image
If you're building from source, the bundle is built by dist/build-edge-bundle.sh on the host at package time, not in the container. ADR-032 enforces this — re-baking it into the image double-packs ~120 MB of incompressible bytes and breaks the sha chain.
Where rollbacks live
/usr/local/bin/ongrid-edge # current
/usr/local/bin/ongrid-edge.previous # last known good
/usr/local/lib/ongrid-edge/promtail
/usr/local/lib/ongrid-edge/promtail.previous
...etc.
/var/lib/ongrid-edge/.upgrade/
  last_upgrade_at    # ISO timestamp
  last_upgrade_ver   # version string
  healthy_marker     # written by the new agent
A boot sequence after a triggered upgrade looks like:
boot 1 (pre-upgrade):
  apply-pending-upgrade.sh sees nothing staged; exit 0
  ongrid-edge v0.7.159 runs
  (manager pushes bundle, edge stages it, edge exits clean)
boot 2 (apply):
  apply-pending-upgrade.sh
    mode 1: no healthy_marker yet, but no last_upgrade_at either → skip
    mode 2: applies bundle. /usr/local/bin/ongrid-edge.previous = v0.7.159
                            /usr/local/bin/ongrid-edge = v0.7.160
            writes last_upgrade_at, last_upgrade_ver=v0.7.160
            clears healthy_marker
  ongrid-edge v0.7.160 runs
  v0.7.160 connects, writes healthy_marker=v0.7.160
  (reboot, e.g. host updates)
boot 3 (settle):
  apply-pending-upgrade.sh
    mode 1: last_upgrade_ver=v0.7.160, healthy_marker=v0.7.160 → success
            prunes all .previous files
            clears last_upgrade_at, last_upgrade_ver
  ongrid-edge v0.7.160 runs
A failure scenario:
boot 2 (apply):
  apply-pending-upgrade.sh: applies bundle as above
  ongrid-edge v0.7.160 crashes immediately (e.g. config drift)
  systemd restarts repeatedly per Restart=always
boot 3 (rollback):
  apply-pending-upgrade.sh
    mode 1: last_upgrade_ver=v0.7.160, healthy_marker missing
            → roll back: restore .previous over each dest
              (ongrid-edge.previous → ongrid-edge, etc.)
            : > last_upgrade_at, rm -rf incoming/
  ongrid-edge v0.7.159 runs again (working)
Downgrading
Downgrading the manager: extract the older tarball, run ./upgrade.sh inside it. As long as no DB schema migration has happened in the intervening releases, this works. Check the release notes for any "breaking schema change" callouts before downgrading.
Downgrading edges: trigger an "upgrade" from the UI to the older bundle version. Manager logic doesn't care which direction the version is moving.
What's next
- Reference / env: env var names whose defaults change between releases.
- Air-gapped install: how to mirror artefacts to a private webserver so apply-pending-upgrade.sh can still pull bundles when the manager itself isn't on the internet.