Introduction
Ongrid is an open-source, self-hostable AI agent for operations. Put a lightweight ongrid-edge agent on every host; the cloud reasons over your metrics, logs, traces, topology, and source code to pinpoint root cause — in plain language.
It is built for SRE, DevOps, and platform teams who already have signals (Prometheus, Loki, Tempo, journald, k8s) but spend their day stitching them together by hand.
What it solves
- High troubleshooting bar. Describe the symptom ("why is load spiking?", "who's dropping packets?"). The agent figures out which metric to look at, which logs to grep, which trace to walk, and runs the query for you.
- Alerts disconnected from root cause. On an alert the agent walks the topology for blast radius, correlates logs and traces, and pins down the source-code location behind the "why" — not just the symptom.
- Scattered signals. Metrics (Prometheus), logs (Loki), traces (Tempo), a vector knowledge base, and your source repos are unified and analyzed in a single session — no copy-pasting between five tabs.
- No exposed intranet. Every edge dials outbound on one tunnel; zero inbound ports on the host. The telemetry data plane is intentionally separated from the control plane (see architecture).
- Self-hostable. One docker compose brings up the full stack; point the model at any OpenAI-compatible endpoint. Air-gapped install bundle available — see air-gapped install.
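To make "OpenAI-compatible endpoint" concrete, here is a minimal sketch of the request shape such an endpoint accepts: a POST to `<base URL>/chat/completions` with a JSON body of `model` and `messages`. The function name `newChatRequest`, the base URL, and the model name are illustrative assumptions; this is not Ongrid's own client code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// chatMessage and chatRequest mirror the minimal chat-completions body
// that OpenAI-compatible servers accept.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest is a hypothetical helper: it builds (but does not send)
// a chat-completions request against any OpenAI-compatible base URL.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	// Tolerate a trailing slash on the configured base URL.
	url := strings.TrimRight(baseURL, "/") + "/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		// Local servers often need no key; hosted providers expect a bearer token.
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	// Assumed values: a local server on port 8000 and a placeholder model name.
	req, err := newChatRequest("http://localhost:8000/v1/", "", "my-model", "why is load spiking?")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Because only the base URL, key, and model name change between providers, switching from a hosted model to a local one is a configuration change and needs no new code.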
Who it's for
| If you are… | Ongrid gives you… |
|---|---|
| SRE on call | "Why did order-service start dropping at 14:02?" answered with the PromQL run, the LogQL run, the trace span, and the file:line in the repo that caused it. |
| Platform engineer | A single agent surface across host + k8s + your own services, with skills you can extend. Read-only by default; signed actions opt-in. |
| DevOps lead | Two-way conversations on Slack / Telegram / Larksuite / DingTalk / WeCom. Same agent reasoning on every channel. |
| Security-conscious operator | Edge → frontier → manager over an outbound geminio tunnel. Telemetry data plane carries Loki / OTLP push separately. Audit log on every tool call. |
| Self-hosting / privacy team | All state on your own filesystem. Bring your own model (OpenAI, Anthropic, GLM, DeepSeek, Gemini, Kimi, vLLM, OpenRouter…). Air-gapped supported. |
How it differs from…
…a chat dashboard
A chat dashboard wraps an LLM around a search box. Ongrid is a graph-kernel ReAct agent: the coordinator decomposes your question, calls 30+ host / observability / knowledge skills, spawns specialist sub-agents (incident-investigator, sre, network, compute, disk, ops), and returns a structured report — not just a transcript.
Ask: "Why did the order service start dropping requests at 14:02?"
Agent:
1. expand_topology(order-service) → 3 upstream, 5 downstream services
2. query_promql(rate(http_500[2m])) by service → spike in payments
3. search_logs(payments, 14:00..14:05) → "circuit breaker open"
4. query_traceql(payments.error_rate) → 412 errors from cardholder-api
5. read_repo(payments/circuit_breaker.go) → 5xx threshold = 3 in 30s
6. Conclusion: payments tripped circuit on cardholder-api 5xx burst.
Source: payments/circuit_breaker.go:42. Fix: bump threshold or
fix cardholder-api retry budget.
…a notebook agent
Notebook agents reason inside a sandbox. Ongrid reasons inside your infrastructure. The skills are real tools — bash, host_probe_*, query_promql, expand_topology, search_logs, read_repo — bound to specific hosts via the edge tunnel. Every tool call is audit-logged and (for write actions) gated behind an approval workflow.
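The audit-and-approval behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Ongrid's implementation: the types `ToolCall`, `AuditEntry`, and `Dispatcher`, their fields, and the choice to treat the `bash` call as a write action are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ToolCall is a hypothetical record of one skill invocation on one host.
type ToolCall struct {
	Host  string
	Skill string
	Args  string
	Write bool // true when the call would change state on the host
}

// AuditEntry is appended for every call, whether it ran or was blocked.
type AuditEntry struct {
	At      time.Time
	Call    ToolCall
	Allowed bool
}

// ErrNeedsApproval is returned when a write action has not been approved.
var ErrNeedsApproval = errors.New("write action requires approval")

// Dispatcher gates write calls behind Approve and logs every call.
type Dispatcher struct {
	Approve func(ToolCall) bool            // approval workflow hook (assumed)
	Run     func(ToolCall) (string, error) // executes the skill on the host
	Log     []AuditEntry
}

// Dispatch audits the call first, then runs it only if it is read-only
// or has been approved.
func (d *Dispatcher) Dispatch(c ToolCall) (string, error) {
	allowed := !c.Write || (d.Approve != nil && d.Approve(c))
	d.Log = append(d.Log, AuditEntry{At: time.Now(), Call: c, Allowed: allowed})
	if !allowed {
		return "", ErrNeedsApproval
	}
	return d.Run(c)
}

func main() {
	d := &Dispatcher{
		Approve: func(ToolCall) bool { return false }, // nobody has approved anything
		Run:     func(c ToolCall) (string, error) { return "ran " + c.Skill, nil },
	}
	out, err := d.Dispatch(ToolCall{Host: "web-1", Skill: "query_promql"})
	fmt.Println(out, err)

	// Marking this bash call as a write action is an assumption for the example.
	_, err = d.Dispatch(ToolCall{Host: "web-1", Skill: "bash", Args: "systemctl restart nginx", Write: true})
	fmt.Println(err, len(d.Log))
}
```

Writing the audit entry before the approval check means a blocked write attempt still leaves a trace, which is usually what an operator wants from an audit log.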
…a hosted SaaS
Ongrid is single-binary + docker-compose. You run it on one VPS, in your VPC, or fully air-gapped. There's no usage telemetry to a vendor; no per-host pricing; no data egress. License is Apache 2.0.
What's inside
- Cloud manager — Go service. MySQL persistence. Geminio service-end SDK to the frontier broker. Graph-kernel ReAct runtime (ONGRID_AGENT_KERNEL=graph). Roughly 10 bounded-context handlers.
- Edge agent (ongrid-edge) — single static Go binary plus sub-plugins (promtail for logs, otelcol-contrib for traces, node_exporter + process_exporter for metrics). All outbound.
- Web — React + Vite + TanStack Query SPA. Per-org / per-role gating. Built-in Grafana embed for Monitor panels.
- Observability — Prometheus, Loki, Tempo, Grafana, Qdrant ship in the compose. Swap any of them for managed services from Settings.
The data plane vs. control plane
This is the architectural commitment that makes the security story work, so we say it twice:
- Control plane = the tunnel from edge to manager. One outbound TCP connection per host to frontier:40012. Multiplexed request/response over geminio. No inbound port on the host.
- Data plane = log + trace ingestion. Loki push (/loki/api/v1/push) and OTLP push (/v1/traces) go through nginx on the manager's public URL. Each request is auth-gated by nginx auth_request → manager edgeauth so unenrolled hosts can't push.
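As a rough sketch of the data-plane gate: nginx's auth_request issues a subrequest and lets the original request through only when that subrequest returns a 2xx status; a 401 or 403 denies it. A manager-side check could therefore look like the following. The header name `X-Edge-Token`, the `/edgeauth` path, and the in-memory enrollment map are assumptions for illustration, not Ongrid's actual interface.

```go
package main

import (
	"fmt"
	"net/http"
)

// authStatus decides what the auth subrequest should answer:
// 204 lets nginx forward the push, 401 makes nginx reject it.
func authStatus(enrolled map[string]bool, token string) int {
	if token != "" && enrolled[token] {
		return http.StatusNoContent
	}
	return http.StatusUnauthorized
}

// edgeAuthHandler is a hypothetical handler for nginx auth_request to call.
// It reads the edge credential from an assumed header and returns only a
// status code; nginx ignores the response body of the subrequest.
func edgeAuthHandler(enrolled map[string]bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(authStatus(enrolled, r.Header.Get("X-Edge-Token")))
	}
}

func main() {
	// A real service would look enrollment up in its database;
	// a fixed read-only map keeps the sketch self-contained.
	enrolled := map[string]bool{"edge-abc": true}
	fmt.Println(authStatus(enrolled, "edge-abc"), authStatus(enrolled, "unknown"))

	mux := http.NewServeMux()
	mux.Handle("/edgeauth", edgeAuthHandler(enrolled))
	_ = mux // a real service would pass mux to http.ListenAndServe
}
```

Keeping the decision in a plain function (`authStatus`) separate from the HTTP handler makes the enrollment rule easy to unit-test without starting a server.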
Metrics currently still ride the tunnel as a push_host_metrics RPC; the migration to direct remote_write is on the roadmap. See the Telemetry data plane entry under Reference in the sidebar.
What this site covers
- Quickstart — 10-min install on a single Linux box; sign in; register your first edge; see metrics.
- Architecture — the 4-layer model, edge → frontier → manager flow, container map.
- Concepts — edge, device, alert rule, incident, investigation, channel, persona, skill, knowledge.
- Install — full install paths: docker compose, edge curl-pipe, first-boot checklist, upgrade, air-gapped.
- Channels — Slack, Telegram, Larksuite, DingTalk, WeCom, raw webhook.
- Capabilities — what skills the agent has out of the box (alerts, RCA, monitoring, logs, traces, topology, knowledge, WebShell).
- Models — provider matrix; routing rules; budget caps (see the Models section in the sidebar).
- Reference — every ONGRID_* env var; REST endpoints; CLI; alert rule schema; skill manifest format.
License & source
- Source: github.com/ongridio/ongrid
- License: Apache 2.0
- Latest release: GitHub Releases
- Issues / PRs welcome.
Next step
Run through the Quickstart — it takes about 10 minutes on a fresh Linux box and gets you to a real edge checking in.