RFC 0005: Hot reload — snapshot semantics, fail-forward

Status: Accepted, implemented. Author: Andrii Tsok. Depends on: RFC 0001 §4.3, RFC 0004.

1. Problem

Long-running serve-mode processes accumulate operational pressure: TLS certificates expire, webhook secrets and JWKS rotate, policies tighten, MCP children wedge. Restarting drops in-flight requests and resets rate-limit state; not restarting means running with stale credentials. We need in-place reload for the rotating parts without turning the runtime into a dynamic-everything system.

2. Decision

Trigger. SIGHUP on Unix; an mtime bump on --reload-file everywhere (Windows has no SIGHUP equivalent, so a touched file is the portable channel, and it composes with k8s ConfigMap projections).
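The file-based trigger can be sketched as a small poller that remembers the last modification time it saw. This is an illustration only, not the runtime's actual watcher: the ReloadFileWatcher type and its methods are hypothetical names, and treating any change in mtime (rather than only a newer one) as a trigger is an assumption of this sketch.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Hypothetical watcher for the --reload-file trigger: remembers the last
/// modification time it observed and reports when that value changes.
pub struct ReloadFileWatcher {
    path: PathBuf,
    last_seen: Option<SystemTime>,
}

impl ReloadFileWatcher {
    /// Record the current mtime (if the file exists) as the baseline.
    pub fn new(path: impl Into<PathBuf>) -> Self {
        let path = path.into();
        let last_seen = read_mtime(&path).ok();
        Self { path, last_seen }
    }

    /// Check the file once. A missing or unreadable file is "no trigger".
    pub fn poll(&mut self) -> bool {
        match read_mtime(&self.path) {
            Ok(mtime) => self.observe(mtime),
            Err(_) => false,
        }
    }

    /// Pure comparison step, split out so it can be exercised without a
    /// filesystem. Returns true when the mtime differs from the last one seen
    /// (including the first time the file appears).
    pub fn observe(&mut self, mtime: SystemTime) -> bool {
        let changed = self.last_seen != Some(mtime);
        self.last_seen = Some(mtime);
        changed
    }
}

fn read_mtime(path: &Path) -> io::Result<SystemTime> {
    fs::metadata(path)?.modified()
}
```

A serve loop would call poll() on a timer and start a reload whenever it returns true; on Unix the same reload entry point would also be reachable from the SIGHUP handler.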

Scope — what reloads: TLS cert/key (+ client-auth CA), prepared auth (bearer/HMAC bindings, OIDC JWKS re-parse), the policy manifest (including Rego recompile), per-server MCP allowlists and child respawn, the intel client (bearer re-read), and the HTTP route + rate-limit-bucket table.

Scope — what deliberately does not: the workflow graph itself, the handler registry, bind address, feature set. Structural change is a deploy, with validation and signing in the path. Reload rotates parameters of the declared structure, never the structure.

Consistency: per-accept snapshots. Each reloadable component lives behind an atomic-swap cell (ArcSwap). A connection captures its snapshot at accept; a reload mid-request completes against the old state, new connections see the new state. No locks on the hot path, no torn reads (routes and their buckets swap as one unit — observing new routes with old buckets would be a correctness bug, so they share a cell).
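The per-accept snapshot rule can be illustrated with a minimal sketch. To stay dependency-free it substitutes a std RwLock<Arc<T>> for ArcSwap, so unlike the real cell it takes a brief lock when capturing a snapshot. SwapCell, RouteTable, and Connection are hypothetical names chosen for this example, not the runtime's types.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for an atomic-swap cell. The RFC uses ArcSwap; this std-only
/// version takes a short read lock to clone the Arc, which ArcSwap avoids.
pub struct SwapCell<T> {
    current: RwLock<Arc<T>>,
}

impl<T> SwapCell<T> {
    pub fn new(value: T) -> Self {
        Self { current: RwLock::new(Arc::new(value)) }
    }

    /// Capture the current state. The returned Arc is unaffected by later swaps.
    pub fn snapshot(&self) -> Arc<T> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publish a new state for subsequent snapshots.
    pub fn swap(&self, value: T) {
        *self.current.write().unwrap() = Arc::new(value);
    }
}

/// Hypothetical bundle: routes and their rate-limit buckets live in one value,
/// so no reader can pair one generation of routes with another's buckets.
pub struct RouteTable {
    pub generation: u64,
    pub routes: Vec<String>,
    pub bucket_capacity: Vec<u32>, // one entry per route, same index
}

/// A connection keeps the snapshot it captured at accept time.
pub struct Connection {
    table: Arc<RouteTable>,
}

impl Connection {
    pub fn accept(cell: &SwapCell<RouteTable>) -> Self {
        Self { table: cell.snapshot() }
    }

    pub fn generation(&self) -> u64 {
        self.table.generation
    }

    pub fn route_count(&self) -> usize {
        self.table.routes.len()
    }
}
```

A connection accepted before a swap keeps reading generation 1 until it finishes, while one accepted afterwards sees generation 2; the old table is freed when its last connection drops its Arc.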

Fail-forward. Every stage of a reload validates before it swaps. A bad Rego program, an unreadable cert, a JWKS that doesn't parse — each logs reload.failed with its stage and keeps the old state live. SIGHUP is "try this", not "commit this". The single per-server exception: an MCP child that respawns fine gets its new allowlist even if a sibling fails — servers are independent trust domains by RFC 0004, and holding one server's rotation hostage to another's spawn failure helps nobody.
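A sketch of the validate-before-swap rule for a single stage follows. The Policy type, the parse_policy validator, and the "policy" stage label are assumptions made for illustration (a plain integer stands in for a compiled manifest), and a Mutex<Arc<T>> stands in for the swap cell.

```rust
use std::sync::{Arc, Mutex};

/// Outcome of one reload stage, loosely mirroring the reload.succeeded and
/// reload.failed audit events. The shape of this enum is hypothetical.
#[derive(Debug, PartialEq)]
pub enum StageOutcome {
    Succeeded,
    Failed { stage: &'static str, reason: String },
}

/// Hypothetical policy state; `max_requests` stands in for a compiled manifest.
#[derive(Debug, PartialEq)]
pub struct Policy {
    pub max_requests: u32,
}

/// Validation step: build the candidate completely before anything is published.
fn parse_policy(raw: &str) -> Result<Policy, String> {
    let max_requests = raw.trim().parse::<u32>().map_err(|e| e.to_string())?;
    if max_requests == 0 {
        return Err("max_requests must be positive".to_string());
    }
    Ok(Policy { max_requests })
}

/// Swap only when validation succeeds; on failure the live state is untouched
/// and the caller receives the failing stage and reason to log.
pub fn reload_policy(live: &Mutex<Arc<Policy>>, raw: &str) -> StageOutcome {
    match parse_policy(raw) {
        Ok(next) => {
            *live.lock().unwrap() = Arc::new(next);
            StageOutcome::Succeeded
        }
        Err(reason) => StageOutcome::Failed { stage: "policy", reason },
    }
}
```

Because the candidate is fully constructed before the assignment, there is no intermediate state to roll back: a failed reload is simply a swap that never happened.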

Rate-limit counters reset on swap, by design: an operator tightening limits mid-flood should not find the flood grandfathered in.
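One way to picture the reset is that a reload builds an entirely new bucket table rather than editing limits in place, so every counter starts at zero under the new limits. The Bucket type and build_buckets function below are hypothetical, and a simple fixed counter stands in for whatever rate-limiting algorithm the runtime actually uses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical per-route bucket: a limit plus a count of admitted requests.
pub struct Bucket {
    pub limit: u32,
    used: AtomicU32,
}

impl Bucket {
    pub fn new(limit: u32) -> Self {
        Self { limit, used: AtomicU32::new(0) }
    }

    /// Admit a request if the bucket still has room; never exceeds `limit`.
    pub fn try_acquire(&self) -> bool {
        self.used
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.limit { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    pub fn used(&self) -> u32 {
        self.used.load(Ordering::Acquire)
    }
}

/// Build the table a reload would swap in. Every bucket is freshly created,
/// so usage recorded under the old limits is not carried over.
pub fn build_buckets(limits: &[(&str, u32)]) -> HashMap<String, Bucket> {
    limits
        .iter()
        .map(|(route, limit)| (route.to_string(), Bucket::new(*limit)))
        .collect()
}
```

In this sketch, a route that had exhausted a limit of 3 before the reload is admitted again afterwards, but only up to the new limit of 2; nothing counted under the old table influences the new one.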

3. Alternatives considered

4. Consequences

Reload state plumbing (ReloadHandles on the engine, swap cells in the HTTP server) is the price; it bought us credential rotation with zero dropped requests, and the audit stream narrates every reload stage (reload.started → reload.succeeded / reload.failed).