Concepts
This page introduces the nouns Ongrid uses, in dependency order. If you've done the Quickstart, the concrete shape will already be familiar; here we put names on it.
Edge
An edge is one running ongrid-edge process — one per host. It is the unit of enrollment, telemetry, and remote control.
Each edge has:
- a server-generated access key (public-ish, identifies the edge) and secret key (shown once, used to authenticate the tunnel),
- a long-lived outbound TCP connection to the frontier broker on port 40012,
- a fleet of supervised plugin subprocesses (promtail, node_exporter, process_exporter, otelcol-contrib),
- a heartbeat that ticks every few seconds; when it stops, the edge is marked offline after ONGRID_ALERT_EDGE_OFFLINE_THRESHOLD (default 90s).
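The heartbeat rule can be pictured as a simple timeout check. This is a minimal sketch, not Ongrid's implementation: the function name `is_offline` is hypothetical, and whether the real comparison is strict or inclusive at exactly 90 seconds is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the documented default of ONGRID_ALERT_EDGE_OFFLINE_THRESHOLD (90s).
OFFLINE_THRESHOLD = timedelta(seconds=90)


def is_offline(last_heartbeat: datetime, now: datetime,
               threshold: timedelta = OFFLINE_THRESHOLD) -> bool:
    """Return True once the silence since the last heartbeat exceeds the threshold.

    Assumption: the comparison is strict, so exactly 90s of silence is still online.
    """
    return now - last_heartbeat > threshold


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=30)    # heartbeat seen 30s ago
stale = now - timedelta(seconds=120)    # heartbeat seen 2 minutes ago
print(is_offline(recent, now), is_offline(stale, now))
```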
Edges are addressable by integer edge_id (used in URLs and in agent tool calls) or by name (used in the UI and by find_topology_node). You can list them via list_edges, run a command on a specific one via bash with edge_id=N, or scope a PromQL query to one via query_promql with a host label.
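As a rough illustration of those three operations, the sketch below builds tool-call payloads as plain dictionaries. The skill names and `edge_id` come from this page; the envelope shape, the `command` and `query` argument names, the `up` metric, and the host name `web-01` are assumptions made for the example, not Ongrid's documented schema.

```python
import json


def tool_call(skill: str, **arguments) -> dict:
    """Wrap a skill name and its arguments in a generic envelope (hypothetical shape)."""
    return {"skill": skill, "arguments": arguments}


def scope_to_host(metric: str, host: str) -> str:
    """Build a PromQL selector restricted to a single value of the host label."""
    # Escape backslashes and double quotes so the label value stays a valid string.
    escaped = host.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{host="{escaped}"}}'


calls = [
    tool_call("list_edges"),
    tool_call("bash", edge_id=3, command="uptime"),
    tool_call("query_promql", query=scope_to_host("up", "web-01")),
]
print(json.dumps(calls, indent=2))
```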
Importantly, an edge is read-only by default. The agent's host-side skills (bash, host_probe_*, query_processes, query_logs_tail, host_read_file) only inspect; they don't mutate host state. Write actions go through the WebShell flow with audit logging.
See edge install and platforms / linux-edge for the concrete deploy paths.
Device
A device is a softer concept than an edge. Where an edge maps 1:1 to an ongrid-edge process, a device is anything Ongrid knows about that participates in a topology — including services, virtual hosts, k8s pods, and external endpoints discovered through SD or imported manually.
For now the UI surfaces devices implicitly through the Topology view. A device is a node in the topology graph; edges, services, and discovered external systems are all nodes. The expand_topology and find_topology_node skills operate over devices — blast radius reasoning depends on this graph.
TIP
If you only have edges so far, your topology graph is a single layer of host nodes. As skills like discover_services and integrations with k8s SD populate it, you'll see service nodes appear above the hosts and external endpoints to the side.
Alert rule
An alert rule is a declarative spec that fires an alert event when a condition holds on a stream of telemetry.
Ongrid's rule model is two-dimensional: a rule has a kind (what data type it evaluates) and a scope (where it evaluates).
The 14 built-in kinds, grouped by data plane:
| Plane | Kinds |
|---|---|
| Host metrics | host_cpu, host_mem, host_disk, host_load, edge_offline, prom_ingest_fail |
| Metric/PromQL | promql_threshold, promql_burn_rate, promql_absence |
| Logs | log_match, log_volume |
| Traces | trace_latency, trace_error_rate |
| Composite | composite_and (any of the above, AND-ed) |
Rules carry:
- a condition in the kind's native query language,
- a threshold + for duration (how long the condition must hold before firing),
- a severity (critical / warning / info),
- a routing: which channels receive notifications, with optional cooldown.
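The "threshold + for duration" pair is easiest to see in code. The sketch below is an assumption-laden illustration rather than Ongrid's evaluator: the class `ForDurationGate` and its fields are hypothetical, and it assumes a "greater than" comparison and that any non-breaching sample resets the timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForDurationGate:
    threshold: float                      # the condition holds while value > threshold
    for_seconds: float                    # how long it must hold before firing
    breach_since: Optional[float] = None  # timestamp of the first breaching sample

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample; return True when the rule should be firing."""
        if value <= self.threshold:
            self.breach_since = None      # condition broke, so restart the clock
            return False
        if self.breach_since is None:
            self.breach_since = ts
        return ts - self.breach_since >= self.for_seconds


gate = ForDurationGate(threshold=90.0, for_seconds=60.0)
samples = [(0, 95.0), (30, 96.0), (45, 80.0), (60, 97.0), (90, 98.0), (120, 99.0)]
fired = [gate.observe(ts, value) for ts, value in samples]
print(fired)
```

The dip at t=45 resets the timer, so the rule only fires at t=120, after the condition has held continuously for 60 seconds from t=60.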
The 6 built-in host rules (host_cpu, host_mem, host_disk, host_load, edge_offline, prom_ingest_fail) are seeded with sensible defaults from ONGRID_ALERT_* env vars and you can flip them on/off from the UI. Custom rules live in MySQL and can be edited freely.
See alerts capability; the rule schema is documented under Alert rule schema in the Reference sidebar.
Incident
An incident is a higher-order grouping of alert events that belong to the same operational story. Where an alert fires every time the condition holds, an incident is created once and updated as related alerts arrive.
Incidents are the unit of:
- investigation: one incident, one or more investigations.
- paging: channel routing happens at incident-creation time, not per-alert.
- timeline: the incident detail page is the chronological record of every alert event, agent action, and operator note tied to the same story.
- closure: incidents move through open → investigating → mitigated → resolved (or false_positive). State changes are audit-logged.
The grouping logic is intentionally simple right now: by {rule_id, edge_id} tuple with a 1-hour activity window. The roadmap is to make this configurable (group_by) per rule.
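A minimal sketch of that grouping rule follows. The `Incident` and `Grouper` names are hypothetical, and the code assumes the 1-hour window slides with the most recent activity (each new alert extends it); the page does not say whether the window is measured from the first or the last event.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 3600  # the 1-hour activity window


@dataclass
class Incident:
    rule_id: int
    edge_id: int
    last_activity: float
    events: list = field(default_factory=list)


class Grouper:
    def __init__(self) -> None:
        self.latest = {}    # (rule_id, edge_id) -> most recent Incident for that key
        self.created = []   # every incident ever opened, in creation order

    def ingest(self, rule_id: int, edge_id: int, ts: float) -> Incident:
        """Attach an alert event to a live incident, or open a new one."""
        key = (rule_id, edge_id)
        incident = self.latest.get(key)
        if incident is None or ts - incident.last_activity > WINDOW_SECONDS:
            incident = Incident(rule_id, edge_id, ts)
            self.latest[key] = incident
            self.created.append(incident)
        incident.events.append(ts)
        incident.last_activity = ts
        return incident


grouper = Grouper()
first = grouper.ingest(rule_id=1, edge_id=7, ts=0)
same = grouper.ingest(rule_id=1, edge_id=7, ts=1800)    # within the window
other = grouper.ingest(rule_id=1, edge_id=8, ts=1900)   # different edge, new incident
later = grouper.ingest(rule_id=1, edge_id=7, ts=9000)   # window expired, new incident
print(len(grouper.created))
```

Because paging happens at incident-creation time, only the three `created` entries here would notify channels; the alert at t=1800 just extends the timeline of the first incident.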
Investigation
An investigation is a structured agent run attached to an incident. The output is a Markdown investigation report — a ranked list of likely root causes plus evidence.
A typical investigation walks through five sub-agent roles:
- incident-investigator (coordinator) — decomposes the incident into hypotheses.
- sre specialist — checks SLO burn, recent deploys, alert correlation.
- compute / disk / network specialists — host-level probing on the affected edges.
- ops specialist — knowledge base lookup, runbook matching.
- reviewer (critic loop) — re-reads the draft report and asks for missing evidence before signoff.
Each step is recorded in the reasoning timeline in the UI — every tool call, every model token spent, every sub-agent dispatch. Auditability is the point.
You can also launch an investigation manually from any incident or from the chat ("investigate the order-service drop at 14:02"). The output goes to the same place: an investigation report tied to an incident.
See RCA capability; the persona that runs investigations is documented under Incident investigator in the Agents sidebar.
Channel
A channel is a configured outbound destination for notifications. "Channel" covers two slightly different shapes:
- Webhook channels — Slack incoming webhooks, Larksuite (Feishu) webhooks, DingTalk webhooks, WeCom group robots, generic webhook. These are one-way — Ongrid posts a card and that's it.
- IM channels — Telegram bot, Larksuite app, DingTalk app, WeCom app, Slack app. These are two-way — the same channel both delivers alerts AND lets the user reply, ask the agent a question, trigger an investigation.
Each channel has:
- a name (free-form), type (slack / feishu / dingtalk / wecom / webhook / telegram), and endpoint material (URL + secret, or app token + chat ID, or bot token + allow-from list…).
- a scope — which org / which roles can see notifications routed here. Plus an optional allow_from sender allowlist (Telegram specifically) so random people can't talk to your bot.
- a default locale — what language the agent replies in on this channel, independent of the UI locale.
Channels are first-class: alert rules reference them by name, the agent can send proactive messages to them, and you can wire one channel to multiple rules without re-pasting webhook URLs.
See channels overview.
IM channel (two-way)
A subtype of "channel" worth calling out separately. An IM channel turns a chat surface (Telegram, Larksuite, DingTalk, WeCom, Slack) into a full agent interface.
What two-way means concretely:
- The agent can post — alerts, investigation reports, scheduled digests.
- The user can reply — questions ("why was the disk full?"), commands ("/list edges"), follow-ups ("investigate that").
- Every message is bound to the same user_agent and org as the Web UI session, with the same RBAC, audit log, and skill registry.
- A sender_allowlist (Telegram) or app-permission gate (the others) decides who's allowed to talk to the bot in a group.
Per-channel locale matters here. The user-facing UI locale might be English; a Telegram group might still want replies in Chinese; you set that per-channel.
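The gate and the locale choice can be sketched together. Everything below is illustrative: the `IMChannel` class, its field names, and the sender IDs are hypothetical, and two behaviours are assumed rather than documented: an empty allowlist admits nobody, and the UI locale is used only when the channel has no locale of its own.

```python
from dataclasses import dataclass, field


@dataclass
class IMChannel:
    name: str
    default_locale: str = ""                      # language the agent replies in here
    allow_from: set = field(default_factory=set)  # sender IDs allowed to talk to the bot

    def accepts(self, sender_id: str) -> bool:
        # Assumption: deny by default, so an empty allowlist lets nobody in.
        return sender_id in self.allow_from


def reply_locale(channel: IMChannel, ui_locale: str) -> str:
    # The channel's own locale wins; the UI locale is only a fallback (assumed).
    return channel.default_locale or ui_locale


ops_group = IMChannel(name="ops-telegram", default_locale="zh", allow_from={"1001", "1002"})
print(ops_group.accepts("1001"), ops_group.accepts("9999"))
print(reply_locale(ops_group, ui_locale="en"))
```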
See channels / telegram for the most flexible example.
Persona / agent
A persona is a configurable agent identity — a YAML/JSON declaration of:
- which model the agent prefers (with site default as fallback),
- which skills are allowed (skills carry a scope: host, manager, org, and a class: safe, payload_read, payload_write; see RBAC),
- a system prompt in the agent's voice,
- optional sub-agent declarations (a coordinator persona can spawn specialist personas).
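To make the four parts of a declaration concrete, here is a sketch of a persona as an in-memory object. It is not the actual persona format (that lives under Agent persona format in the Reference sidebar): the `Persona` class, its field names, the model strings, and the prompt text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

SITE_DEFAULT_MODEL = "site-default-model"  # placeholder, not a real model name


@dataclass
class Persona:
    name: str
    system_prompt: str
    model: Optional[str] = None                       # preferred model, if any
    allowed_skills: set = field(default_factory=set)  # skill names this persona may call
    sub_agents: list = field(default_factory=list)    # names of specialist personas

    def resolve_model(self, site_default: str = SITE_DEFAULT_MODEL) -> str:
        """Use the persona's preferred model, falling back to the site default."""
        return self.model or site_default

    def may_call(self, skill_name: str) -> bool:
        return skill_name in self.allowed_skills


disk = Persona(
    name="disk",
    system_prompt="You diagnose storage problems on a single host.",
    allowed_skills={"bash", "host_read_file", "query_promql"},
)
print(disk.resolve_model(), disk.may_call("bash"), disk.may_call("web_search"))
```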
Ongrid ships several personas out of the box:
- coordinator — the default; decomposes user questions, routes to specialists or runs skills directly.
- incident-investigator — incident-mode coordinator; walks topology, correlates signals, drafts a report.
- sre, network, compute, disk, ops — specialists invoked by the coordinators.
- reviewer — critic loop for incident-investigator's draft.
You can author your own. The persona format is documented under Agent persona format in the Reference sidebar. Custom personas land in MySQL; they're picked up at the next agent run.
Skill
A skill is a callable tool the agent can invoke. Every skill declares:
- a name (e.g. query_promql, bash, expand_topology),
- a JSON schema for arguments and result,
- a scope: host (runs on a specific edge over the tunnel), manager (runs in the manager process), or org (cross-edge manager skill),
- a class: safe, payload_read, payload_write. Read-only classes are unrestricted; write classes go through approval.
- an optional activation keyword: if set, the skill stays out of the prompt unless the user's query mentions a keyword. This is the toolbag deferral mechanism that keeps the skill registry from blowing past the model's context window.
Skills are stored in a registry. The agent kernel (graph kernel by default) resolves which skills are visible for the current turn based on scope, RBAC, and activation keywords. Approximately 30 skills ship in the box (bash, host_probe_*, query_promql, query_logs, query_traceql, expand_topology, find_topology_node, read_repo, search_knowledge, web_search, …).
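The per-turn resolution can be sketched as a filter over the registry. This is an illustration under stated assumptions, not the kernel's code: the `Skill` class and `visible_skills` function are hypothetical, RBAC is reduced to a set of permitted classes, keyword matching is a plain substring test, and the scope and class assigned to each skill below are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    scope: str            # "host", "manager", or "org"
    cls: str              # "safe", "payload_read", or "payload_write"
    keywords: tuple = ()  # activation keywords; empty means always in the prompt


def visible_skills(registry, query: str, allowed_classes: set) -> list:
    """Return the names of skills to expose to the model for this turn."""
    q = query.lower()
    names = []
    for skill in registry:
        if skill.cls not in allowed_classes:
            continue  # filtered out by (simplified) RBAC
        if skill.keywords and not any(k in q for k in skill.keywords):
            continue  # deferred: the query did not mention an activation keyword
        names.append(skill.name)
    return names


registry = [
    Skill("bash", "host", "safe"),
    Skill("query_promql", "manager", "payload_read"),
    Skill("query_traceql", "manager", "payload_read", keywords=("trace", "span")),
]
readonly = {"safe", "payload_read"}
print(visible_skills(registry, "why is the disk full?", readonly))
print(visible_skills(registry, "show me the slowest trace", readonly))
```

The first query leaves `query_traceql` out of the prompt; the second mentions "trace" and pulls it back in, which is the deferral behaviour described above.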
See skills capability; the manifest format is documented under Skill manifest in the Reference sidebar. Custom skills can be loaded from ONGRID_SKILLS_EXTERNAL_DIRS.
Knowledge base
A knowledge base is a collection of organisation documents that the agent can search through search_knowledge. Documents are embedded once and stored in Qdrant; retrieval is hybrid (vector + BM25).
Sources are layered:
- vault: the built-in, read-only knowledge bundled by Ongrid. Roughly 100 Markdown playbooks covering network diagnostics, Linux internals, Prometheus / Loki / Tempo recipes, common incident patterns. The vault syncs from the public ongridio/vault repo on demand.
- upload: Markdown, TXT, PDF, DOCX files uploaded by an org admin. Goes through docextract and lands in the upload tree.
- manual: entries authored directly in the UI editor.
- repo: entire Git repositories synced with read-only SSH keys (ADR-023). Useful for ingesting your own runbook repo.
The agent calls search_knowledge(query, k=N) and gets back ranked chunks with metadata. Embeddings are computed by an offline ONNX-bundled BGE model by default (fast-bge-small-zh-v1.5); you can swap in a hosted embedder (OpenAI, Zhipu, GLM…) via Settings.
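This page does not say how the vector and BM25 result lists are merged. One common way to combine two rankings is reciprocal rank fusion (RRF), sketched below purely as an example of what "hybrid" can mean; it is an assumption, not a description of Ongrid's retriever. The document names and the constant 60 are illustrative.

```python
def reciprocal_rank_fusion(rankings, k: int = 3, rrf_k: int = 60) -> list:
    """Merge several ranked lists; each hit contributes 1 / (rrf_k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties are broken by name for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))[:k]


vector_hits = ["disk-full.md", "inode-exhaustion.md", "loki-retention.md"]
bm25_hits = ["inode-exhaustion.md", "disk-full.md", "df-vs-du.md"]
top = reciprocal_rank_fusion([vector_hits, bm25_hits], k=3)
print(top)
```

Documents that appear in both lists accumulate two contributions and rise to the top, which is the practical benefit of hybrid retrieval: a chunk matched both semantically and lexically outranks one matched only one way.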
See knowledge base capability.
Putting it together
A typical operational loop:
    edge ─▶ telemetry ─▶ alert rule ─▶ alert event ─▶ incident
                                                         │
                                                ┌────────┴────────┐
                                                ▼                 ▼
                                          investigation        channel
                                         (agent + skills)   (Slack / TG /
                                                │               IM …)
                                                ▼
                                          investigation
                                             report

- An edge ships telemetry.
- An alert rule evaluates it and produces an alert event.
- The event is grouped into an incident (or attaches to an open one).
- The incident routes to channels for notification.
- An investigation is auto-started (or operator-triggered); the agent uses skills and the knowledge base to produce a report.
- A persona decides the voice + tool palette the investigation runs with.
The whole pipeline is observable: every step shows up in the UI timeline, every skill call is in the audit log, every token spend is in the LLM budget panel.