Topology

Topology is the graph that turns "an incident on host edge-prod-04" into "an incident that takes out payments and search and email." Without it the LLM can only reason about the single firing series.

ADR-025

The current topology BaseTools (expand_topology, find_topology_node) landed 2026-05-18. Before that, the LLM had a flat device list and no way to walk relationships; "what depends on X" required the operator to know the answer already.

The data model

Four tables, all in MySQL:

| Table | Holds |
| --- | --- |
| `topology_nodes` | One row per node. Has a `type` (`device` / `service` / `cluster` / ...), a `name`, and a free-form `props_json`. |
| `topology_node_types` | The closed set of allowed `node.type` values. |
| `topology_relations` | Directed edge `src_node_id -> dst_node_id` with a `type`. |
| `topology_relation_types` | Closed set of edge types, each tagged with a `semantics` field (`hard_dep` / `runtime_dep` / `traffic` / `annotation` / `observation`). |

The semantics tag is the key idea. An edge tagged hard_dep propagates failure (if src dies, dst is affected); an edge tagged annotation does not. The blast-radius walk uses this to filter.
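
As a rough illustration of how the semantics tag can drive the filter, here is a minimal Go sketch. The `Relation` struct, the function names, the `owned_by` relation type, and node id 300 are assumptions made for this example, not the manager's actual code; the set of propagating tags (`hard_dep`, `runtime_dep`, `traffic`) is the one the `only_propagating` option of `expand_topology` uses.

```go
package main

import "fmt"

// Semantics mirrors the `semantics` column of topology_relation_types.
type Semantics string

const (
	HardDep     Semantics = "hard_dep"
	RuntimeDep  Semantics = "runtime_dep"
	Traffic     Semantics = "traffic"
	Annotation  Semantics = "annotation"
	Observation Semantics = "observation"
)

// Relation is a hypothetical in-memory view of one topology_relations row
// joined with the semantics tag of its relation type.
type Relation struct {
	SrcNodeID int64
	DstNodeID int64
	Type      string
	Semantics Semantics
}

// PropagatesFailure reports whether a failure on the source side of an edge
// with this tag is assumed to affect the destination side.
func PropagatesFailure(s Semantics) bool {
	switch s {
	case HardDep, RuntimeDep, Traffic:
		return true
	default:
		return false
	}
}

// FilterPropagating keeps only the edges a blast-radius walk should follow.
func FilterPropagating(rels []Relation) []Relation {
	out := make([]Relation, 0, len(rels))
	for _, r := range rels {
		if PropagatesFailure(r.Semantics) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rels := []Relation{
		{SrcNodeID: 142, DstNodeID: 71, Type: "deployed_on", Semantics: RuntimeDep},
		// "owned_by" and node 300 are made up for the example.
		{SrcNodeID: 142, DstNodeID: 300, Type: "owned_by", Semantics: Annotation},
	}
	for _, r := range FilterPropagating(rels) {
		fmt.Printf("%d -> %d (%s)\n", r.SrcNodeID, r.DstNodeID, r.Type)
	}
}
```

Keeping the decision in one small function means a new semantics value only has to be classified in one place.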

See internal/manager/biz/topology/.

Where the data comes from

Three sources, layered:

- Auto from spans: Tempo's service_graph processor emits service_a -> service_b edges with the routes_to relation type. The manager mirrors these into topology_relations on a sync tick.
- Auto from edges: every registered edge (a host such as edge-prod-04, not a graph edge) becomes a type=device node; the node id is back-linked to the host_devices.node_id column so that expand_topology(device_id=X) can resolve a device to its node.
- Manual: operators add type=service / type=cluster nodes and edges through the SPA's /topology page. This is used for nodes you want to address but that aren't directly observed (a managed database, a third-party API).

Tools

`expand_topology`

Walks outward from a node and returns every reachable node plus how it was reached. The walk is breadth-first with a default depth of 2 and a cap of 5. The default direction is both, because a blast radius has two sides: what could break this node, and what this node breaks.

```json
{
  "node_id": 142,
  "depth": 2,
  "only_propagating": true,
  "direction": "downstream"
}
```

Or, when you start from a device id:

```json
{ "device_id": 17, "depth": 3 }
```

The tool resolves device_id → device.node_id automatically. only_propagating=true (default) walks only hard_dep / runtime_dep / traffic edges; flip to false to include observation / annotation edges (useful for the "show me everything related to X" cases).
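
To make the defaults concrete, the following Go sketch decodes the arguments and fills in the documented values (depth 2, cap 5, `only_propagating` true, direction both). The `ExpandArgs` struct and `ParseExpandArgs` are hypothetical names; requiring exactly one of `node_id` / `device_id` and clamping an over-large depth instead of rejecting it are assumptions, not confirmed behavior of the tool.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	defaultDepth = 2
	maxDepth     = 5
)

// ExpandArgs is a hypothetical Go shape for the expand_topology arguments.
// Pointer fields distinguish "omitted" from an explicit zero or false.
type ExpandArgs struct {
	NodeID          *int64 `json:"node_id,omitempty"`
	DeviceID        *int64 `json:"device_id,omitempty"`
	Depth           int    `json:"depth,omitempty"`
	OnlyPropagating *bool  `json:"only_propagating,omitempty"`
	Direction       string `json:"direction,omitempty"`
}

// ParseExpandArgs decodes the JSON and fills in the documented defaults.
func ParseExpandArgs(raw []byte) (ExpandArgs, error) {
	var a ExpandArgs
	if err := json.Unmarshal(raw, &a); err != nil {
		return a, err
	}
	// Assumption: exactly one starting point must be given.
	if (a.NodeID == nil) == (a.DeviceID == nil) {
		return a, errors.New("exactly one of node_id or device_id is required")
	}
	if a.Depth <= 0 {
		a.Depth = defaultDepth
	}
	if a.Depth > maxDepth {
		a.Depth = maxDepth // assumption: clamp rather than reject
	}
	if a.OnlyPropagating == nil {
		t := true
		a.OnlyPropagating = &t
	}
	switch a.Direction {
	case "":
		a.Direction = "both"
	case "both", "upstream", "downstream":
		// already a known value
	default:
		return a, fmt.Errorf("unknown direction %q", a.Direction)
	}
	return a, nil
}

func main() {
	a, err := ParseExpandArgs([]byte(`{ "device_id": 17, "depth": 3 }`))
	if err != nil {
		panic(err)
	}
	fmt.Println(*a.DeviceID, a.Depth, *a.OnlyPropagating, a.Direction)
}
```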

Returned hits carry the path metadata the LLM needs to reason about impact:

```json
{
  "center":  { "node_id": 142, "node_name": "payments-api", "node_type": "service", "hops": 0, "propagates_failure": false },
  "max_hops": 2,
  "reachable_count": 7,
  "reachable": [
    { "node_id": 71, "node_name": "edge-prod-04", "node_type": "device", "hops": 1,
      "relation_type": "deployed_on", "semantics_tag": "runtime_dep", "reached_via": "downstream",
      "propagates_failure": true,
      "via_node_id": 142, "via_node_name": "payments-api" }
  ]
}
```

The flat list (no nested per-neighbor struct) is intentional — keeps the JSON cheap to embed in the prompt. See expand_topology_basetool.go.
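
The walk itself can be sketched as a plain breadth-first search that emits one flat hit per newly reached node. This is a simplified in-memory illustration, not the code in expand_topology_basetool.go: the `Edge` and `Hit` types, the `Expand` signature, and node ids 500 and 900 are hypothetical, and it assumes that "downstream" follows an edge from src to dst while "upstream" follows it from dst to src.

```go
package main

import "fmt"

// Edge is a hypothetical in-memory form of a topology_relations row joined
// with the semantics tag of its relation type.
type Edge struct {
	Src, Dst  int64
	Type      string
	Semantics string
}

// Hit is one flat entry of the reachable list.
type Hit struct {
	NodeID       int64
	Hops         int
	RelationType string
	SemanticsTag string
	ReachedVia   string // "upstream" or "downstream"
	ViaNodeID    int64
}

func propagates(sem string) bool {
	return sem == "hard_dep" || sem == "runtime_dep" || sem == "traffic"
}

// Expand does a breadth-first walk from center and records each node once,
// at the hop count where it was first reached.
func Expand(edges []Edge, center int64, depth int, direction string, onlyPropagating bool) []Hit {
	if depth <= 0 {
		depth = 2 // documented default
	}
	if depth > 5 {
		depth = 5 // documented cap
	}
	seen := map[int64]bool{center: true}
	frontier := []int64{center}
	var hits []Hit
	for hop := 1; hop <= depth && len(frontier) > 0; hop++ {
		var next []int64
		for _, cur := range frontier {
			for _, e := range edges {
				if onlyPropagating && !propagates(e.Semantics) {
					continue
				}
				var to int64
				var via string
				switch {
				case e.Src == cur && direction != "upstream":
					to, via = e.Dst, "downstream"
				case e.Dst == cur && direction != "downstream":
					to, via = e.Src, "upstream"
				default:
					continue
				}
				if seen[to] {
					continue
				}
				seen[to] = true
				hits = append(hits, Hit{to, hop, e.Type, e.Semantics, via, cur})
				next = append(next, to)
			}
		}
		frontier = next
	}
	return hits
}

func main() {
	edges := []Edge{
		{142, 71, "deployed_on", "runtime_dep"},
		{500, 142, "routes_to", "traffic"},   // node 500 is made up
		{142, 900, "owned_by", "annotation"}, // skipped when onlyPropagating is true
	}
	for _, h := range Expand(edges, 142, 2, "both", true) {
		fmt.Printf("%+v\n", h)
	}
}
```

Scanning every edge for each frontier node keeps the sketch short; a real implementation would more likely index edges by node or query per hop.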

`find_topology_node`

This is the "I have a human-given name, get me a node_id" pre-step. The persona runs it before expand_topology whenever the prompt mentions a service or host by name:

```text
User: "what does loki-write depend on?"
Agent:
  → find_topology_node{ name: "loki-write" }
    ← { node_id: 219, node_type: "service", name: "loki-write" }
  → expand_topology{ node_id: 219, direction: "upstream" }
    ← { reachable_count: 4, ... }
```
Both are registered as ScopeManager BaseTools (no edge_id argument) — the topology DB lives manager-side.
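
The name-to-id step can be pictured with a small Go sketch. How the real tool matches names is not described here, so the case-insensitive exact match below is purely an assumption, and `Node` and `FindNode` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical in-memory view of a topology_nodes row.
type Node struct {
	ID   int64
	Type string
	Name string
}

// FindNode returns the first node whose name matches, ignoring case.
// Assumption: the real tool may match differently (prefix, fuzzy, by type).
func FindNode(nodes []Node, name string) (Node, bool) {
	for _, n := range nodes {
		if strings.EqualFold(n.Name, name) {
			return n, true
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{ID: 142, Type: "service", Name: "payments-api"},
		{ID: 219, Type: "service", Name: "loki-write"},
	}
	if n, ok := FindNode(nodes, "loki-write"); ok {
		// The resolved id is what the agent would pass to expand_topology.
		fmt.Printf("expand_topology{ node_id: %d, direction: %q }\n", n.ID, "upstream")
	}
}
```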

Blast-radius walk in practice

The investigator persona's prompt includes an explicit instruction: "after you identify the firing service, call expand_topology to see what else is affected." This is how the report's related_alerts and the "业务影响 / Business impact" section get populated: the agent walks the graph upstream and downstream from the firing device and cross-checks which other incidents fired in the same window on those nodes.

The relevant code path:

1. correlate_incident returns metric, log, and trace summaries plus the incident's device_id.
2. expand_topology { device_id, direction: both, depth: 2 } returns the reachable set.
3. Pass-2 extraction reads the worker's narrative and pulls pinpointed_target (patient zero) and related_alerts (the cascade).
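
The cross-check between the reachable set and other incidents in the same window can be illustrated as follows. In the system described here the agent does this by reasoning over tool output, so this Go sketch only shows the underlying idea; the `Incident` type, the function names, the sample data, and the 10-minute window are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Incident is a hypothetical record of an alert that fired on a topology node.
type Incident struct {
	ID      string
	NodeID  int64
	FiredAt time.Time
}

// RelatedIncidents returns incidents on reachable nodes that fired within
// window of the firing incident's timestamp, before or after it.
func RelatedIncidents(firing Incident, reachable []int64, all []Incident, window time.Duration) []Incident {
	inSet := make(map[int64]bool, len(reachable))
	for _, id := range reachable {
		inSet[id] = true
	}
	var related []Incident
	for _, inc := range all {
		if inc.ID == firing.ID || !inSet[inc.NodeID] {
			continue
		}
		delta := inc.FiredAt.Sub(firing.FiredAt)
		if delta < 0 {
			delta = -delta
		}
		if delta <= window {
			related = append(related, inc)
		}
	}
	return related
}

// sampleIncidents builds made-up data for the demo below.
func sampleIncidents() (Incident, []Incident) {
	t0 := time.Date(2026, 5, 18, 12, 0, 0, 0, time.UTC)
	firing := Incident{ID: "inc-1", NodeID: 71, FiredAt: t0}
	return firing, []Incident{
		firing,
		{ID: "inc-2", NodeID: 142, FiredAt: t0.Add(2 * time.Minute)}, // reachable, in window
		{ID: "inc-3", NodeID: 142, FiredAt: t0.Add(3 * time.Hour)},   // reachable, too late
		{ID: "inc-4", NodeID: 999, FiredAt: t0.Add(time.Minute)},     // not reachable
	}
}

func main() {
	firing, all := sampleIncidents()
	reachable := []int64{142, 219} // node ids as returned by expand_topology
	for _, inc := range RelatedIncidents(firing, reachable, all, 10*time.Minute) {
		fmt.Println(inc.ID)
	}
}
```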
