Introduction

Ongrid is an open-source, self-hostable AI agent for operations. Put a lightweight ongrid-edge agent on every host; the cloud reasons over your metrics, logs, traces, topology, and source code to pinpoint root cause — in plain language.

It is built for SRE, DevOps, and platform teams who already have signals (Prometheus, Loki, Tempo, journald, k8s) but spend their day stitching them together by hand.

What it solves

High troubleshooting bar. Describe the symptom ("why is load spiking?", "who's dropping packets?"). The agent figures out which metric to look at, which logs to grep, which trace to walk, and runs the query for you.
Alerts disconnected from root cause. On an alert, the agent walks the topology to estimate blast radius, correlates logs and traces, and pins down the source-code location behind the "why", not just the symptom.
Scattered signals. Metrics (Prometheus), logs (Loki), traces (Tempo), a vector knowledge base, and your source repos are unified and analyzed in a single session — no copy-pasting between five tabs.
No exposed intranet. Every edge dials outbound on one tunnel; zero inbound ports on the host. The telemetry data plane is intentionally separated from the control plane (see architecture).
Self-hostable. One docker compose brings up the full stack; point the model at any OpenAI-compatible endpoint. Air-gapped install bundle available — see air-gapped install.

Who it's for

If you are…	Ongrid gives you…
SRE on call	"Why did `order-service` start dropping at 14:02?" answered with the PromQL run, the LogQL run, the trace span, and the file:line in the repo that caused it.
Platform engineer	A single agent surface across host + k8s + your own services, with skills you can extend. Read-only by default; signed actions opt-in.
DevOps lead	Two-way conversations on Slack / Telegram / Larksuite / DingTalk / WeCom. Same agent reasoning on every channel.
Security-conscious operator	Edge → frontier → manager over an outbound geminio tunnel. Telemetry data plane carries Loki / OTLP push separately. Audit log on every tool call.
Self-hosting / privacy team	All state on your own filesystem. Bring your own model (OpenAI, Anthropic, GLM, DeepSeek, Gemini, Kimi, vLLM, OpenRouter…). Air-gapped supported.

How it differs from…

…a chat dashboard

A chat dashboard wraps an LLM around a search box. Ongrid is a graph-kernel ReAct agent: the coordinator decomposes your question, calls 30+ host / observability / knowledge skills, spawns specialist sub-agents (incident-investigator, sre, network, compute, disk, ops), and returns a structured report — not just a transcript.

Ask:  "Why did the order service start dropping requests at 14:02?"

Agent:
  1. expand_topology(order-service) → 3 upstream, 5 downstream services
  2. query_promql(sum by (service) (rate(http_500[2m]))) → spike in payments
  3. search_logs(payments, 14:00..14:05) → "circuit breaker open"
  4. query_traceql(payments.error_rate) → 412 errors from cardholder-api
  5. read_repo(payments/circuit_breaker.go) → 5xx threshold = 3 in 30s
  6. Conclusion: payments tripped circuit on cardholder-api 5xx burst.
     Source: payments/circuit_breaker.go:42. Fix: bump threshold or
     fix cardholder-api retry budget.
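
The flow in the transcript above can be pictured as an act-observe loop in which each next step depends on what earlier steps returned. The following is a minimal sketch under stated assumptions, not Ongrid's actual runtime: the `SKILLS` registry, the canned skill bodies, and the rule-based `next_action` policy are all hypothetical stand-ins (in a real agent the model would choose the next skill).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stub skills: the names echo the transcript, the bodies return canned data.
def expand_topology(service: str) -> str:
    return f"{service}: 3 upstream, 5 downstream services"

def search_logs(service: str) -> str:
    return f"{service}: circuit breaker open"

SKILLS: dict[str, Callable[[str], str]] = {
    "expand_topology": expand_topology,
    "search_logs": search_logs,
}

Step = tuple[str, str, str]  # (skill name, argument, observation)

@dataclass
class Report:
    question: str
    steps: list[Step] = field(default_factory=list)

def next_action(history: list[Step]) -> Optional[tuple[str, str]]:
    """Stand-in for the model: choose the next skill from what was observed so far."""
    if not history:
        return ("expand_topology", "order-service")
    last_skill, _, _ = history[-1]
    if last_skill == "expand_topology":
        return ("search_logs", "payments")
    return None  # enough evidence gathered; stop and report

def investigate(question: str, max_steps: int = 10) -> Report:
    """Run the act-observe loop until the policy stops or the step budget runs out."""
    report = Report(question)
    for _ in range(max_steps):
        action = next_action(report.steps)
        if action is None:
            break
        skill_name, arg = action
        skill = SKILLS.get(skill_name)
        observation = skill(arg) if skill else f"unknown skill: {skill_name}"
        report.steps.append((skill_name, arg, observation))
    return report
```

Calling `investigate("Why is order-service dropping requests?")` returns a `Report` whose `steps` list records each skill call with its observation, which is the raw material for a structured report rather than a bare transcript.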

…a notebook agent

Notebook agents reason inside a sandbox. Ongrid reasons inside your infrastructure. The skills are real tools — bash, host_probe_*, query_promql, expand_topology, search_logs, read_repo — bound to specific hosts via the edge tunnel. Every tool call is audit-logged and (for write actions) gated behind an approval workflow.
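
One way to picture the audit and approval behaviour described above is a single wrapper that every tool call passes through. This is an illustrative sketch only: `run_tool`, `AUDIT_LOG`, `ApprovalRequired`, and the example tool names are hypothetical and do not reflect Ongrid's real interfaces or audit storage.

```python
import time
from typing import Any, Callable

# Hypothetical in-memory audit trail; a real system would persist this.
AUDIT_LOG: list[dict[str, Any]] = []

class ApprovalRequired(Exception):
    """Raised when a write action is attempted without an approval."""

def run_tool(
    name: str,
    fn: Callable[..., Any],
    *,
    host: str,
    write: bool = False,
    approved: bool = False,
    **kwargs: Any,
) -> Any:
    """Run a tool bound to a host, recording every attempt in the audit log."""
    entry: dict[str, Any] = {
        "ts": time.time(),
        "tool": name,
        "host": host,
        "write": write,
        "args": kwargs,
    }
    # Write actions are refused unless an approval was granted beforehand.
    if write and not approved:
        entry["outcome"] = "blocked"
        AUDIT_LOG.append(entry)
        raise ApprovalRequired(f"{name} on {host} needs approval")
    try:
        result = fn(**kwargs)
    except Exception:
        entry["outcome"] = "error"
        AUDIT_LOG.append(entry)
        raise
    entry["outcome"] = "ok"
    AUDIT_LOG.append(entry)
    return result
```

In this sketch a read-only probe runs immediately, while a write action raises `ApprovalRequired` until it is called with `approved=True`; blocked and failed attempts are logged as well as successful ones.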

…a hosted SaaS

Ongrid is single-binary + docker-compose. You run it on one VPS, in your VPC, or fully air-gapped. There's no usage telemetry to a vendor; no per-host pricing; no data egress. License is Apache 2.0.

What's inside

Cloud manager — Go service. MySQL persistence. Geminio service-end SDK to the frontier broker. Graph-kernel ReAct runtime (ONGRID_AGENT_KERNEL=graph). Roughly 10 bounded-context handlers.
Edge agent (ongrid-edge) — single static Go binary plus sub-plugins (promtail for logs, otelcol-contrib for traces, node_exporter + process_exporter for metrics). All outbound.
Web — React + Vite + TanStack Query SPA. Per-org / per-role gating. Built-in Grafana embed for Monitor panels.
Observability — Prometheus, Loki, Tempo, Grafana, Qdrant ship in the compose. Swap any of them for managed services from Settings.

The data plane vs. control plane

This is the architectural commitment that makes the security story work, so we say it twice:

Control plane = the tunnel from edge to manager. One outbound TCP connection per host to frontier:40012. Multiplexed request/response over geminio. No inbound port on the host.
Data plane = log + trace ingestion. Loki push (/loki/api/v1/push) and OTLP push (/v1/traces) go through nginx on the manager's public URL. Each request is auth-gated by nginx auth_request → manager edgeauth so unenrolled hosts can't push.

Metrics currently still ride the tunnel as a push_host_metrics RPC; the migration to direct remote_write is on the roadmap. See the Telemetry data plane entry under Reference in the sidebar.
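
To make the Loki push path described above concrete, the sketch below builds (but does not send) a request in the JSON format accepted by Loki's `/loki/api/v1/push` endpoint. In a real deployment promtail performs this push. The manager URL, the label set, and the bearer-token header are assumptions for illustration; the source does not specify which credential the `auth_request` check expects.

```python
import json
import time
import urllib.request

def build_loki_push(
    base_url: str,
    token: str,
    labels: dict[str, str],
    lines: list[str],
) -> urllib.request.Request:
    """Build a Loki JSON push request for the manager's public URL."""
    now_ns = time.time_ns()
    payload = {
        "streams": [
            {
                "stream": labels,
                # Each value is [timestamp in nanoseconds as a string, log line].
                "values": [[str(now_ns + i), line] for i, line in enumerate(lines)],
            }
        ]
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed credential scheme; replace with whatever edge auth requires.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the push; an unenrolled host would be expected to receive an authorization error from nginx before the request reaches Loki.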

What this site covers

Quickstart — 10-min install on a single Linux box; sign in; register your first edge; see metrics.
Architecture — the 4-layer model, edge → frontier → manager flow, container map.
Concepts — edge, device, alert rule, incident, investigation, channel, persona, skill, knowledge.
Install — full install paths: docker compose, edge curl-pipe, first-boot checklist, upgrade, air-gapped.
Channels — Slack, Telegram, Larksuite, DingTalk, WeCom, raw webhook.
Capabilities — what skills the agent has out of the box (alerts, RCA, monitoring, logs, traces, topology, knowledge, WebShell).
Models — provider matrix; routing rules; budget caps (see the Models section in the sidebar).
Reference — every ONGRID_* env var; REST endpoints; CLI; alert rule schema; skill manifest format.

License & source

Source: github.com/ongridio/ongrid
License: Apache 2.0
Latest release: GitHub Releases
Issues / PRs welcome.

Next step

Run through the Quickstart — it takes about 10 minutes on a fresh Linux box and gets you to a real edge checking in.
