Budget & limits

Ongrid enforces a global per-UTC-day token cap across every provider. The default is unlimited; one env var turns it on:

bash
ONGRID_LLM_DAILY_TOKEN_LIMIT=2000000   # 2 million tokens per UTC day

A value <= 0 disables the cap. It is a single global value, not per-provider; that is the MVP scope. When tenants land, the limit moves to per-org settings and this knob stays as a global safety net.

How it's wired

Three pieces, in internal/pkg/llm/:

go
// 1. The interface
type BudgetChecker interface {
    Check(ctx context.Context, userID uint64, estPromptTokens int) error
    Record(ctx context.Context, userID uint64, usage Usage) error
}

// 2. The MVP implementation
budget := llm.NewInMemoryBudget(cfg.LLM.DailyTokenLimit)

// 3. The eino callback that bridges to the graph kernel
handler := llm.NewBudgetCallbackHandler(budget, userID)

The graph-kernel runtime installs the callback handler in its eino callbacks chain. On every ChatModel OnStart:

  1. Estimate prompt tokens: len(text)/4 (conservative).
  2. BudgetChecker.Check(ctx, userID, estPromptTokens).
  3. On rejection, store ErrBudgetExceeded in the context so the downstream node can short-circuit and return the error to its caller.

On OnEnd, the actual Usage.TotalTokens is recorded against the current UTC-day bucket.

ErrBudgetExceeded

go
// internal/pkg/llm/budget.go:37
func (b *InMemoryBudget) Check(ctx context.Context, userID uint64, estPromptTokens int) error {
    if b.dailyLimit <= 0 {
        return nil
    }
    b.mu.Lock()
    defer b.mu.Unlock()
    key := b.dayKey()
    if b.used[key]+estPromptTokens > b.dailyLimit {
        return ErrBudgetExceeded
    }
    return nil
}

The error propagates to:

  • The chat send endpoint — returns HTTP 429 with a { "error": "budget_exceeded", "message": "..." } body the chat UI renders in-line.
  • The RCA investigator worker — the report row lands as status=failed with status_reason="budget_exceeded".
  • The translate path — falls back to "translation unavailable (budget exceeded)" and the original text is shown.

InMemoryBudget caveats

The MVP implementation is in-memory:

go
type InMemoryBudget struct {
    mu         sync.Mutex
    dailyLimit int            // tokens per UTC day; <=0 means unlimited
    used       map[string]int // key = "YYYY-MM-DD" (UTC)
    now        func() time.Time
}

Consequences:

  • No persistence — a manager restart resets the day's counter. If you actually want a hard daily cap that survives restarts, swap the implementation. The BudgetChecker interface is the seam.
  • Single-process — if you run multiple managers behind a load balancer (you shouldn't yet, but if), each has its own counter.
  • Global, not per-user: userID flows through the interface so a future MySQL usage_daily table is a drop-in, but today the cap is the same number for everyone.

The pivot to single-tenant deferred the per-user backend; the interface is forward-compatible for when multi-user comes back.

Token estimation

BudgetCallbackHandler.OnStart estimates prompt tokens by character count / 4. This is intentionally conservative — real tokenisation varies by provider / model, and the budget is supposed to err on the side of refusing borderline calls rather than going over.

On OnEnd, the actual Usage.TotalTokens returned by the provider is recorded — so the budget tracks ground truth even when the estimate was off.

If the provider doesn't return token counts (some custom endpoints don't), the callback falls back to a response-meta heuristic; see OnEndUsesResponseMetaFallback in the tests.
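One detail worth knowing when reading the estimate: in Go, `len` on a string counts bytes, not characters, so the two readings of "length / 4" diverge on non-ASCII prompts. The sketch below contrasts them; both function names are hypothetical, and which variant the real callback uses is not confirmed here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateByBytes applies the /4 heuristic to the UTF-8 byte length.
func estimateByBytes(text string) int {
	return len(text) / 4
}

// estimateByRunes applies the /4 heuristic to the character (rune) count.
func estimateByRunes(text string) int {
	return utf8.RuneCountInString(text) / 4
}

func main() {
	ascii := "abcdefgh"        // 8 bytes, 8 runes
	cjk := "日本語のテキスト" // 24 bytes, 8 runes
	fmt.Println(estimateByBytes(ascii), estimateByRunes(ascii))
	fmt.Println(estimateByBytes(cjk), estimateByRunes(cjk))
}
```

For ASCII text the two agree; for multi-byte text the byte-based variant gives the higher number, so it rejects borderline calls sooner.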

Observing the budget

bash
curl -s localhost:9100/metrics | grep llm_budget
# llm_budget_daily_limit_tokens 2000000
# llm_budget_used_tokens_today 412847
# llm_budget_rejections_total 3

The metrics are wired by BudgetCallbackHandler.Stats(). The self-obs Prom dashboard renders them as a daily-spend graph plus an alert at 80% of the cap.

Disabling for one workload

There's no "disable budget for the investigator" knob. If RCA is hitting the cap and you need it to keep running, raise the cap; that's what the knob is for. The alternative (per-workload quotas) is parked along with multi-tenancy.

See also