agentd — Operations Guide

Audience: operators rolling agentd into production or pre-production environments. This guide covers the deployment shapes we support, how to build the right artifact for each, how auth / TLS / logging are wired, and what to expect at runtime (signals, exit codes, healthchecks, drain).

Status: authoritative as of the current release. Matches the binary produced from crates/agentd/ at this commit. If something in the code disagrees with this doc, the code wins; please file a doc fix.


1. Deployment shapes at a glance

There are two axes: how the workflow gets into the binary and how the binary is invoked.

1.1 How the workflow reaches the runtime

| Shape | Artifact | How | Best for |
| --- | --- | --- | --- |
| External | Generic agentd binary + a separate .toml | agentd --config /etc/agentd/wf.toml | Dev, iterating on a workflow without rebuilds, multi-tenant shells where the same binary fronts several workflows over time. |
| Embedded | Purpose-built binary (wf baked in) | AGENTD_EMBED_CONFIG=/abs/path cargo build --release produces a binary that holds the TOML as include_str! | Single-purpose appliances, containers where the config lives in the image, reproducible deploys where "the binary IS the workflow". |

Embedded mode is validated twice: once at cargo build (a fast structural check in build.rs: name present, no duplicate or dangling node ids, edges and HTTP routes point at real nodes, start-node entry_nodes exist), and once at startup (the full workflow::validate: acyclicity, reachability, fan-in/fan-out, start-node source constraints, policy references). External mode skips the build-time pass and runs only the startup pass.
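To make the build-time pass concrete, here is a minimal sketch of the duplicate-id and dangling-edge part of such a structural check. The `Workflow` struct, the function name, and the message wording are hypothetical stand-ins for illustration, not the types or strings build.rs actually uses.

```rust
use std::collections::HashSet;

/// Hypothetical, minimal view of a workflow: node ids plus (from, to) edges.
pub struct Workflow<'a> {
    pub nodes: Vec<&'a str>,
    pub edges: Vec<(&'a str, &'a str)>,
}

/// Returns every structural issue found: duplicate node ids, and edges whose
/// endpoints are not declared nodes. An empty vector means the check passed.
pub fn structural_issues(wf: &Workflow) -> Vec<String> {
    let mut issues = Vec::new();
    let mut seen = HashSet::new();
    for id in &wf.nodes {
        // `insert` returns false when the id was already present.
        if !seen.insert(*id) {
            issues.push(format!("duplicate node id '{id}'"));
        }
    }
    for (from, to) in &wf.edges {
        for end in [from, to] {
            if !seen.contains(*end) {
                issues.push(format!(
                    "edge {from} -> {to} references unknown node '{end}'"
                ));
            }
        }
    }
    issues
}
```

A real build script would stop at the first issue and panic, as described in §2; collecting all issues here just makes the sketch easier to test.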

1.2 How the binary is invoked

The mode is inferred from the workflow unless you override it:

| Mode | Inferred when | Behaviour |
| --- | --- | --- |
| one-shot | No [[http_routes]] entries | Pick a start node (auto or --start), read --input FILE as the trigger payload, run the DAG, emit the outcome JSON to stdout, exit with 0 (Success) or 5 (any non-success). |
| serve | At least one [[http_routes]] | Bind --bind (default 127.0.0.1:8080), accept requests, dispatch matching routes to the engine, stay up until SIGTERM / SIGINT. |

Override with --mode once|serve or AGENTD_MODE=once|serve.
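The inference-plus-override rule can be sketched as a small pure function. This is an illustration under stated assumptions: `Mode`, `resolve_mode`, and the error strings are hypothetical names, and the "serve needs at least one route" check mirrors the usage error listed in §5.6 and §9.1.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Once,
    Serve,
}

fn parse_mode(s: &str) -> Option<Mode> {
    match s {
        "once" => Some(Mode::Once),
        "serve" => Some(Mode::Serve),
        _ => None,
    }
}

/// `flag` is the value of --mode, `env` the value of AGENTD_MODE, and
/// `route_count` the number of [[http_routes]] entries in the workflow.
pub fn resolve_mode(
    flag: Option<&str>,
    env: Option<&str>,
    route_count: usize,
) -> Result<Mode, String> {
    // An explicit override beats inference; the flag beats the env var.
    let mode = match flag.or(env) {
        Some(raw) => parse_mode(raw)
            .ok_or_else(|| format!("invalid mode '{raw}' (expected once|serve)"))?,
        None if route_count > 0 => Mode::Serve,
        None => Mode::Once,
    };
    // Forcing serve on a workflow without routes is a usage error.
    if mode == Mode::Serve && route_count == 0 {
        return Err("serve mode requires at least one [[http_routes]]".to_string());
    }
    Ok(mode)
}
```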

No mode subcommands exist: agentd serve is not a command. The single entry point (runtime.rs) decides what to do based on flags and workflow shape. This was an explicit R1 pivot; the help text in agentd --help is the authoritative list of flags.


2. Build modes

The binary's capability surface is frozen at compile time by Cargo features. Four common recipes:

A. Default — tooling for a standalone deploy, HTTP auth on

cargo build --release -p agentd

Features: tools-fs + tools-env + tools-data + trigger-http + auth. No HTTP-outbound, no shell, no MCP, no TLS. Good for local and for behind-a-reverse-proxy deployments where TLS terminates at the proxy.

B. Hardened webhook receiver — TLS + HMAC + full tool-less posture

cargo build --release -p agentd \
  --no-default-features \
  --features "tools-fs,tools-data,trigger-http,auth,server-tls"

Drops tools-env (no env reads) and leaves tools-http / tools-shell out. This is the canonical shape for an externally reachable agent that accepts signed webhooks and writes fixtures. server-tls is already in the feature list above; it provides in-process TLS (§4.2) and mTLS (§4.3).

C. Kitchen sink — everything wired

cargo build --release -p agentd --features "tools-http,tools-shell,tools-mcp,server-tls"

Pulls every tool family in. Use sparingly: every feature you leave out is code that cannot end up in the binary.

D. Embedded appliance — workflow baked in + minimal features

AGENTD_EMBED_CONFIG=/abs/path/to/wf.toml \
  cargo build --release -p agentd --no-default-features \
  --features "tools-fs,trigger-http,auth,server-tls"

The resulting binary runs the baked-in workflow if invoked with no --config. Operators can still point --config /alt.toml at an external file to override for debugging without rebuilding.

The build-time validator speaks the same error dictionary as the runtime validator — any failure there means cargo build exits with a cargo:warning=agent: … line and a panic message describing the first offending issue. Nothing else in build.rs runs on a dirty tree: the embedded path sets cargo:rerun-if-env-changed=AGENTD_EMBED_CONFIG and cargo:rerun-if-changed=<abs path>, so incremental builds only re-validate when the file or env actually changes.

2.1 Feature reference

| Feature | Pulls in | Enables |
| --- | --- | --- |
| tools-fs | | read_file, write_file, create_dir |
| tools-env | | read_env |
| tools-data | | parse_json, template_render, json_select |
| tools-http | | http_request outbound |
| tools-shell | | shell_run (allowlisted commands only) |
| tools-mcp | | call_mcp_tool, read_mcp_resource, trigger-mcp plumbing |
| trigger-http | | HttpServer, routes, one-shot-vs-serve switch |
| trigger-mcp | | MCP triggers |
| intel-unix | | Intelligence JSON-RPC Unix client (LLM adapter) |
| intel-http | | Intelligence JSON-RPC HTTP client |
| auth | sha2, hmac | Bearer + HMAC-SHA256 webhook verification |
| server-tls | rustls, rustls-pemfile (implies auth) | In-process TLS termination + mTLS client-cert verification |
| schema | jsonschema | Validate llm_infer output against a JSON Schema file (off → JSON-only check) |
| intel-remote | ureq | Remote LLM providers (Anthropic / OpenAI / Gemini / openai-compatible) |

The default feature set is tools-fs + tools-env + tools-data + trigger-http + auth. Everything else is opt-in.


3. Runtime artifacts

3.1 Binary

One self-contained binary: target/release/agentd. Linux x86_64 and aarch64 are both supported. Apart from sigaction (via libc) and what the Rust standard library itself uses, the binary makes no direct libc calls. A default build links glibc dynamically, so it runs on distroless images that ship glibc; for scratch images, cross-compile against musl to get a fully static binary.

Release profile: opt-level="s", lto=true, strip=true, panic="abort". Expect ~6–8 MB on x86_64.

3.2 Filesystem footprint

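No state files by default. The runtime itself is stateless: no cache, no spool, no DB, so restart is free. The exceptions are opt-in: checkpoints under --state-dir (§3.7), run records written via --record (§3.6), and the audit sink (§6.4). Whatever the workflow itself writes via write_file is governed by the manifest's policy.fs allowlist.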

3.3 Processes

3.4 Network

3.5 Healthcheck + metrics

Two always-live endpoints in serve mode (no auth, not rate-limited):

3.6 Run records + inspection

Metrics tell you the aggregate; a run record tells you what one run actually did. In one-shot mode, --record PATH (or AGENTD_RECORD) writes a structured JSON account of the run — the per-node trace with each node's output and timing, the cost (llm calls / tokens / policy denials), the wall-clock, and the outcome (or the error that aborted it). It is written whether the run completed, failed, timed out, or errored.

agentd --config wf.toml --input event.json --record /tmp/run.json
agentd inspect /tmp/run.json

agentd inspect renders the record as a readable timeline:

run exec-00000001  workflow=demo  status=completed
  start=main  3 ms  1 llm call(s) / 128 tokens  0 policy denial(s)
  path:
     1. classify [llm_infer] continue →alpha  2 ms
        output: {"content":"…","parsed":{"decision":"alpha"}}
     2. done [terminate] terminate  0 ms

The record is plain JSON keyed for machine consumption. Its execution_id matches the execution_id field in the audit log, so a record and its audit events line up. A browser inspector at agentd.dev/inspect renders the same file visually (paste or upload — it runs entirely client-side, nothing is uploaded). Records may contain node outputs verbatim — treat a record file with the same care as the data it processed.

3.7 Human-in-the-loop + durable execution

A pause_for_approval node suspends a run for a person. When the engine reaches one it writes a checkpoint under --state-dir (or AGENTD_STATE_DIR) — the accumulated node outputs and where to resume — and stops with a paused outcome and exit code 7. A human reviews (the record / audit trail), then continues the run by id:

# Runs until the approval gate, then checkpoints and pauses.
agentd --config deploy.toml --state-dir /var/lib/agentd/state --input change.json
#   → {"status":"paused","run_id":"exec-…","last_node":"approve"};  exit 7

# Later, after review — continue from the node after the pause.
agentd --config deploy.toml --state-dir /var/lib/agentd/state --resume exec-…
#   → {"status":"completed", …};  the checkpoint is retired

The resume re-enters the same traversal at the pause node's successor with the checkpoint's node outputs restored; it gets a fresh deadline. Resuming the same workflow is enforced (a checkpoint records its workflow name). A pause_for_approval node without a --state-dir is a configuration error — there is nowhere to persist the run. Run ids are unique across processes, so concurrent paused runs never collide.

Crash-recovery (--checkpoint-each-node). The pauses above are declared. For unattended durability, --checkpoint-each-node (with --state-dir) writes a progress checkpoint after every node, recording the accumulated outputs and the next node to run. A clean terminal (completed / declared-failed) retires the checkpoint; a timeout, an errored abort, or a crash leaves it, so the run is recoverable:

agentd --config wf.toml --state-dir /var/lib/agentd/state --checkpoint-each-node --input ev.json
# … process is killed mid-run …
agentd --state-dir /var/lib/agentd/state --list-checkpoints
#   exec-…  recoverable  workflow=wf  resume_node=enrich
agentd --config wf.toml --state-dir /var/lib/agentd/state --resume-incomplete
#   resumes every recoverable run for this workflow from its last node

Recovery is at-least-once for the interrupted node: a crash during a side-effecting node (before its checkpoint) re-runs that node on resume, repeating its side effect. Design nodes whose retry is safe, or put an idempotency key in the work. Checkpointing after every node is an I/O cost paid only by runs that opt in.

Checkpoints contain node outputs verbatim: a state directory deserves the same protection as the data the workflow handles. Exit code 7 lets a supervisor (systemd, a queue worker) distinguish "awaiting approval" from success (0) or failure (5).


4. Security posture

4.0 Workflow signing (supply-chain)

With the signing feature compiled in, the runtime verifies a detached Ed25519 signature over the workflow TOML before the DAG validator runs. Fail-closed when [signing].required = true (or --signing-required / AGENTD_SIGNING_REQUIRED=1).

openssl genpkey -algorithm Ed25519 -out agent-signing.key
openssl pkey -in agent-signing.key -pubout -out agent-signing.pub
openssl pkeyutl -sign -inkey agent-signing.key \
    -rawin -in workflow.toml | base64 -w 0 > workflow.toml.sig

[signing] block:

[signing]
required = true
public_key_file = "/etc/agentd/signing.pub"
algorithm = "ed25519"                   # default; only value supported in v1

Audit events on agentd::audit: signing.verified, signing.sig_missing, signing.sig_malformed, signing.pubkey_malformed, signing.verification_failed, signing.bypassed (warn), signing.unsupported. Every event carries key_fingerprint = "<16-hex>" when the pubkey is loadable, so log readers can pin the key without seeing the PEM.

Embedded workflows: pair AGENTD_EMBED_CONFIG=/abs/workflow.toml with AGENTD_EMBED_CONFIG_SIG=/abs/workflow.toml.sig at build time; build.rs decodes the base64 once and bakes raw signature bytes into the binary.

Full design: rfcs/0002-signed-workflows.md.

4.05 Resource budgets

The [budget] block caps resources process-wide (agentd is a micro-agent: one workflow per process, so per-workflow and per-process are the same unit). Limits are applied at startup via POSIX setrlimit plus a runtime counter.

[budget]
max_memory_mb     = 512     # RLIMIT_AS; allocations fail past the limit (the process aborts)
max_cpu_secs      = 300     # RLIMIT_CPU; SIGXCPU then SIGKILL
max_run_time_secs = 60      # clamps --timeout-secs
max_fs_write_mb   = 200     # cumulative write_file bytes

Notes:

4.0b Secret injection model (env vars only)

The harness has one supported secret-injection mechanism: environment variables, read at request time. Every secret-carrying auth field has a *_env variant that names an env var:

| Surface | Config field | Re-read when |
| --- | --- | --- |
| Static bearer tokens | [auth.bearer.<name>].tokens_env | Every request |
| HMAC webhook secret | [auth.hmac.<name>].secret_env | Every request |
| Intelligence HTTP bearer | --intel-http-bearer-file / AGENTD_INTEL_HTTP_BEARER | Startup + SIGHUP |

Rotating an env-var secret needs no SIGHUP: the request-path code calls std::env::var(...) on every check, so a new value in the process environment applies to the next request. Because a process's environment is normally set by whatever launches it, fleet rotation is an orchestrator-native operation: with systemd, systemctl daemon-reload && systemctl restart if the variable lives in an EnvironmentFile; on k8s, a rolling update if it lives in a Secret.

Non-env secret surfaces (TLS cert/key, OIDC JWKS) read from files. These are SIGHUP-refreshed — replace the file on disk, send SIGHUP, the harness re-reads atomically.

No vendor SDKs inside the harness. Any KMS / HashiCorp Vault / AWS Secrets Manager / Azure Key Vault / GCP Secret Manager integration lives in the orchestrator, not in-process:

What's explicitly NOT supported: a [secrets] TOML block. If the operator adds one, TOML parse rejects it at startup with an "unknown field secrets" error pointing at this doc. That's deliberate — we don't want a situation where some workflows pull secrets via the harness and others via the orchestrator, because that creates two secret-rotation paths with different semantics.

4.1 Authentication

Three mechanisms, each opt-in per route:

Routes attach auth via auth = "bearer:prod" or auth = "hmac:webhooks" in the [[http_routes]] block. An unknown binding name fails the server at spawn time — not at first request. This is intentional: misconfigurations should take down the serve loop immediately, not silently accept unauth'd traffic.

After successful verification the engine receives the principal as trigger.principal = { kind: "bearer"|"hmac", name: "<binding>" }. Workflows can branch on this via json_select.

See configuration.md §Auth for full grammar. Verifier semantics are covered in capabilities.md §Auth.

4.2 TLS (single-direction, termination in-process)

Requires --features server-tls. Minimal, operator-driven:

[server.tls]
cert_file = "/etc/agentd/tls/server.pem"
key_file  = "/etc/agentd/tls/server.key"

Supported cert formats: any PEM that rustls-pemfile understands — PKCS1, PKCS8, SEC1 keys; RSA and ECDSA certs. Chain support is "whatever PEM you concatenate". The cert file must contain at least one -----BEGIN CERTIFICATE----- block.

Crypto provider: aws-lc-rs (installed once per process via OnceLock). No runtime cipher-suite selection — we take rustls 0.23's safe default (TLS 1.2 and 1.3, server cipher preference).

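Rotation: swap the PEM files on disk and send SIGHUP; the cert and key are re-read as part of hot reload (see §4.0b and §5.4). Embedded builds have no reload path (SIGHUP is a no-op there), so they need a restart instead. SIGTERM drains cleanly (see §5.5), so a rolling restart in k8s behind a Service with terminationGracePeriodSeconds set above your drain_timeout_secs also finishes without dropping in-flight requests.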

Failure modes:

4.3 mTLS (client-cert verification)

[server.tls.client_auth]
mode    = "required"
ca_file = "/etc/agentd/tls/client-ca.pem"

Only mode = "required" is wired today. Clients without a valid cert chained to ca_file get their TLS handshake rejected — no HTTP layer is reached. Successful mTLS attaches a principal = { kind: "mtls", name: "sha256:<64-hex>" } to the trigger context, where the fingerprint is SHA-256 of the peer cert's DER bytes (the leaf, not the CA). Workflows can pin by fingerprint via json_select or condition nodes on trigger.principal.name.

mode = "optional" is reserved but intentionally rejected today — the loader returns tls.client_auth.mode: only 'required' is supported in this build. Add it when there's a real use case.

Peer identity extraction is fingerprint-only — we don't ship an x509 parser. If you need CN / SAN, add x509-parser and extend accept_tls in http_tls.rs; one to two screenfuls of code.

4.4 Cert management

We don't ship a cert-gen path in-tree — rcgen is a dev-dep for tests, not a runtime API. Bring your own PKI. The file shapes we consume:

For throwaway dev certs, the openssl req one-liner or mkcert both produce output agentd accepts.

4.5 Rate limiting

Per-route token bucket — capacity tokens, refills at refill_per_sec:

[[http_routes]]
path = "/webhook/noisy"
# …
[http_routes.rate_limit]
capacity        = 20
refill_per_sec  = 5

Requests that exhaust the bucket get 429 Too Many Requests. The bucket is in-process and per-route, so horizontal scaling doesn't share state. At ingress volumes where a fleet-wide limiter matters, put an upstream rate-limit (nginx, cloud LB) in front and treat these as a backstop. Numbers are validated at spawn: capacity > 0, refill_per_sec > 0, both ≤ a sanity ceiling; bad numbers fail the server start.
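For intuition, a per-route token bucket of this kind can be sketched as follows. It is an illustration, not the server's code: the struct and method names are hypothetical, the clock is passed in as seconds so the behaviour is deterministic, and starting with a full bucket is an assumption.

```rust
/// Token bucket with an injected clock (seconds as f64) so it is testable.
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last: f64,
}

impl TokenBucket {
    /// Rejects zero values, mirroring the "validated at spawn" rule.
    /// Assumption: the bucket starts full.
    pub fn new(capacity: u32, refill_per_sec: u32, now: f64) -> Result<Self, String> {
        if capacity == 0 || refill_per_sec == 0 {
            return Err("capacity and refill_per_sec must be > 0".to_string());
        }
        Ok(Self {
            capacity: f64::from(capacity),
            refill_per_sec: f64::from(refill_per_sec),
            tokens: f64::from(capacity),
            last: now,
        })
    }

    /// Returns true if the request may proceed, false if it should get a 429.
    pub fn try_take(&mut self, now: f64) -> bool {
        // Refill for the time elapsed since the last call, capped at capacity.
        let elapsed = (now - self.last).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With capacity = 20 and refill_per_sec = 5, a burst of 20 requests passes immediately, after which the route sustains about 5 requests per second.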

4.55 MCP servers (multi-server registry)

Workflows can compose multiple MCP stdio backends. Declare each under [[mcp_servers]]:

[[mcp_servers]]
name = "github"
command = ["/usr/local/bin/mcp-github", "--repo", "agentd-dev/agentd"]
allow_tools = ["create_issue", "comment_on_*"]
allow_resources = ["issue://**"]

[[mcp_servers]]
name = "linear"
command = ["/usr/local/bin/mcp-linear"]
allow_tools = ["create_ticket"]
allow_resources = ["linear://projects/*"]

Nodes route to a server by name:

[[nodes]]
id = "file_issue"
type = "call_mcp_tool"
server = "github"               # names the target entry
tool   = "create_issue"
args_from = "build.payload"

Resolution rules (enforced by the validator + runtime):

| Node server field | Declared servers | Behaviour |
| --- | --- | --- |
| Some("name") | name exists | Route to that server. |
| Some("name") | name missing | Validation error, runtime error. |
| None | exactly one | Route to it (back-compat). |
| None | zero or >1 | Validation error. |
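The resolution rules above map directly onto a match. The sketch below is illustrative only; `resolve_server` and its error strings are hypothetical, and the declared servers are reduced to their names.

```rust
/// Resolves which declared MCP server a node routes to.
/// `node_server` is the node's optional `server` field; `declared` holds the
/// names of all [[mcp_servers]] entries.
pub fn resolve_server<'a>(
    node_server: Option<&str>,
    declared: &[&'a str],
) -> Result<&'a str, String> {
    match node_server {
        // Named server: it must exist among the declared entries.
        Some(name) => declared
            .iter()
            .copied()
            .find(|d| *d == name)
            .ok_or_else(|| format!("unknown MCP server '{name}'")),
        // No name: only unambiguous when exactly one server is declared.
        None => match declared {
            [only] => Ok(*only),
            [] => Err("node names no server and none are declared".to_string()),
            _ => Err("node names no server but several are declared".to_string()),
        },
    }
}
```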

Legacy --mcp-stdio CMD ARGS flag still works. It maps to an implicit { name = "default", command = [...], allow_tools = ["*"], allow_resources = ["*"] } entry — the pre-registry "single server with permissive allowlist" semantic preserved. Mixing --mcp-stdio with a TOML entry named default is a conflict and fails at startup.

Per-server allowlists. Each [[mcp_servers]] entry carries its own allow_tools + allow_resources. Empty allowlists deny-by-default per the fail-closed stance. The global [policy.mcp] block is still applied to the legacy --mcp-stdio "default" server only — new TOML-declared servers source their allowlist from their own entry.

Reload semantics (SIGHUP or --reload-file):

4.6 Policy (the tool allowlist)

The manifest's [policy] section is an allowlist covering policy.fs.read, policy.fs.write, policy.env, policy.http.allow, policy.shell.allow, and policy.mcp. Once the block is declared it is fail-closed: anything not matched is denied. Omitting [policy] entirely means "allow everything", which is only appropriate for dev. In production, declare the block; matchers support *, prefix/**, prefix/*, and literal paths.

A policy denial returns node status denied, logs an audit event, and terminates that execution branch. The engine does not retry denied nodes.
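As a rough illustration of the four matcher forms, the sketch below assumes these semantics: * matches anything, prefix/** matches any depth below the prefix, prefix/* matches exactly one segment below it, and anything else is compared literally. Those semantics and the function names are assumptions for illustration; the runtime's matcher is the authority. The sketch also does not normalise paths (for example `..`), which a real matcher would need to consider.

```rust
/// Matches one allowlist pattern against a path (assumed semantics, see above).
pub fn pattern_matches(pattern: &str, path: &str) -> bool {
    if pattern == "*" {
        return true; // bare wildcard: anything
    }
    if let Some(prefix) = pattern.strip_suffix("/**") {
        // Any non-empty remainder below `prefix/`, at any depth.
        return path
            .strip_prefix(prefix)
            .and_then(|rest| rest.strip_prefix('/'))
            .map_or(false, |rest| !rest.is_empty());
    }
    if let Some(prefix) = pattern.strip_suffix("/*") {
        // Exactly one path segment below `prefix/`.
        return path
            .strip_prefix(prefix)
            .and_then(|rest| rest.strip_prefix('/'))
            .map_or(false, |rest| !rest.is_empty() && !rest.contains('/'));
    }
    pattern == path // literal
}

/// A declared allowlist is fail-closed: no matching pattern means denied,
/// so an empty list denies everything.
pub fn allowed(allowlist: &[&str], path: &str) -> bool {
    allowlist.iter().any(|p| pattern_matches(p, path))
}
```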


5. Lifecycle — startup, serving, shutdown

5.1 Startup sequence

  1. Parse argv + AGENTD_* env vars (argv wins). Bad flag → exit 2.
  2. Resolve workflow (--config > embedded). Missing → exit 2.
  3. Merge [logging] block + env + CLI into ResolvedLogging; install tracing subscriber. Any install error → exit 5.
  4. Full validation pass. Fail → emit validation report JSON to stdout, exit 5.
  5. Build engine, register tools, register intelligence / MCP clients if flagged.
  6. If --validate-only, print {ok: true} and exit 0.
  7. Infer mode → once or serve.
    • once: pick start node, run, print outcome, exit 0 / 5.
    • serve: validate auth refs, build rate-limit buckets, load TLS certs, bind TCP listener, install signal handlers, enter accept loop.

Any pre-subscriber error goes to plain stderr. Post-subscriber errors flow through tracing at the level / target you configured.

5.2 Serve-mode request lifecycle

TCP accept
  → (if TLS) rustls handshake, extract peer-cert fingerprint
  → parse HTTP/1.1 request (max 16 KiB headers, 1 MiB body)
  → route lookup (METHOD, PATH); miss → 404/405
  → auth verify (bearer / HMAC / mTLS identity)        ← route-specific
  → rate-limit bucket take                             ← route-specific
  → input_schema validate                              ← route-specific
  → engine.run(workflow, start_node, trigger_payload)
  → map ExecutionOutcome → HTTP status + JSON body
  → close connection

One OS thread per connection (no keep-alive). An InFlightGuard increments / decrements the in-flight counter around the engine call so graceful drain knows when it's safe to exit.
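The guard pattern described here is ordinary RAII over an atomic counter. A minimal sketch, with the constructor name `enter` and the memory ordering chosen for illustration rather than taken from the source:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Increments the in-flight counter on creation and decrements it on drop,
/// so the count is restored on every exit path out of the engine call.
pub struct InFlightGuard<'a> {
    counter: &'a AtomicUsize,
}

impl<'a> InFlightGuard<'a> {
    pub fn enter(counter: &'a AtomicUsize) -> Self {
        counter.fetch_add(1, Ordering::SeqCst);
        Self { counter }
    }
}

impl Drop for InFlightGuard<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::SeqCst);
    }
}
```

A connection thread would hold the guard in a local binding for the duration of the engine call, so an early return cannot leak a count.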

5.3 Shutdown + hot reload

Install: crate::signals::install_shutdown_handlers() sets up POSIX sigaction for three signals. Handlers are signal-safe (only flip AtomicBool flags). SA_RESTART is not set, so a blocked accept() returns EINTR and the accept loop immediately observes the flags.

| Signal | Effect |
| --- | --- |
| SIGTERM / SIGINT | Begin graceful drain (see §5.5). |
| SIGHUP | Hot reload (see §5.4). Keeps serving; no drain. |

5.4 Hot reload (SIGHUP)

kill -HUP $PID re-reads --config and swaps the reloadable subsystems atomically without dropping in-flight requests. The first reload pass shipped TLS + auth; a follow-up extended the surface to cover everything a live-config rotation could reasonably want to change.

What reloads:

What still needs a restart:

Changes to any of the above require a rolling restart.

Failure modes. Reload is fail-forward: if a single stage fails (Rego compile error, MCP child won't start, bad new JWKS), the old value for that stage stays live, an audit event records the specific failure, and the rest of the reload continues. The process does not exit. Stages emit reload.tls / reload.auth / reload.policy / reload.mcp_allowlist / reload.mcp_respawn / reload.intel / reload.routes on success, and reload.failed stage=<name> / reload.mcp_respawn_failed on failure. Top-level: reload.started and reload.succeeded.

Embedded builds (AGENTD_EMBED_CONFIG=...) have no on-disk source to re-read — SIGHUP emits reload.skipped and is a no-op.

# Rotate TLS certs + OIDC JWKS + policy bundle without dropping traffic:
cp new-server.pem /etc/agentd/tls/server.pem
cp new-server.key /etc/agentd/tls/server.key
curl -s https://jwks-provider/jwks > /etc/agentd/jwks.json
vim /etc/agentd/workflow.toml         # edit [policy], [[http_routes]], etc.
systemctl kill -s HUP agentd.service
# or: kill -HUP "$(pgrep -x agentd)"   # -x matches the exact name, so ssh-agent and similar are not signalled

# Rotate the intelligence bearer (operator writes the new token,
# no workflow.toml change):
vault read -field=token secret/intel > /etc/agentd/intel.bearer
systemctl kill -s HUP agentd.service

5.5 Graceful shutdown

Sequence on first signal:

  1. Shutdown flag flips to true.
  2. Accept loop exits; listener is dropped; no new connections.
  3. Server waits up to drain_timeout_secs (default 30; override via --drain-timeout-secs or AGENTD_DRAIN_TIMEOUT_SECS) for the in-flight counter to reach zero.
  4. Drain complete: log drain complete, exit 0.
  5. Drain timed out: log drain timed out (forced exit), exit 5.
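Steps 3 to 5 amount to a bounded wait on the in-flight counter. The following is a sketch under assumptions: the polling approach, the function names, and the poll interval are illustrative, and the server may well wait differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Waits until `in_flight` reaches zero or `timeout` elapses.
/// Returns true for "drain complete", false for "drain timed out".
pub fn wait_for_drain(in_flight: &AtomicUsize, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if in_flight.load(Ordering::SeqCst) == 0 {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}

/// Maps the drain result to the documented exit codes (0 clean, 5 forced).
pub fn exit_code_after_drain(drained: bool) -> i32 {
    if drained {
        0
    } else {
        5
    }
}
```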

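kill -9 / crash: nothing to clean up, because the serve loop keeps no state. A crashed process loses only its in-flight requests. If you need at-least-once semantics in serve mode, put a durable queue upstream; for one-shot runs, --checkpoint-each-node (§3.7) leaves a recoverable checkpoint.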

5.6 Exit codes

| Code | Constant | Meaning |
| --- | --- | --- |
| 0 | EXIT_OK | Success. One-shot completed successfully; serve-mode drained cleanly. |
| 2 | EXIT_USAGE | Argv / env error, missing workflow, unknown flag, invalid bind address, serve mode without [[http_routes]]. |
| 5 | EXIT_SEMANTIC | Validation failure; engine error; tracing install failure; serve-mode drain timed out; one-shot returned a non-success outcome (failed / timeout / cancelled / denied). |

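These match runtime::EXIT_OK / EXIT_USAGE / EXIT_SEMANTIC. If you script around the binary, read the constants from there, not from this table. One-shot runs that stop at a pause_for_approval node exit with 7 instead (see §3.7); that code has no row in the table above.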
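A supervisor or queue worker wrapping the binary might classify exit statuses along these lines. This is a sketch: `Disposition` and `classify` are hypothetical names, and the code 7 branch reflects the paused outcome described in §3.7.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Done,             // 0: success or clean drain
    FixInvocation,    // 2: usage error, retrying unchanged will not help
    AwaitingApproval, // 7: paused at pause_for_approval, resume later
    Failed,           // 5, or any code this sketch does not know
    Killed,           // no exit code: terminated by a signal
}

/// `code` is what `std::process::ExitStatus::code()` returns; on Unix it is
/// `None` when the process was terminated by a signal.
pub fn classify(code: Option<i32>) -> Disposition {
    match code {
        Some(0) => Disposition::Done,
        Some(2) => Disposition::FixInvocation,
        Some(7) => Disposition::AwaitingApproval,
        Some(_) => Disposition::Failed,
        None => Disposition::Killed,
    }
}
```

In a wrapper this would typically be fed from `std::process::Command::new("agentd")...status()?.code()`.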


6. Logging & observability

6.1 Precedence

Resolved at startup in this order; the last non-empty source wins:

default  →  workflow [logging]  →  AGENTD_LOG* env vars  →  --log-* flags

Defaults: level = "warn", format = "text", target = "stderr", enabled = true. --quiet / AGENTD_QUIET=1 forces enabled = false regardless.
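Read as layers, the precedence is: built-in default, then workflow [logging], then AGENTD_LOG* env vars, then --log-* flags, with later non-empty layers overriding earlier ones (this ordering agrees with the quick reference in §7). A minimal sketch for the level and enabled knobs; the function names are hypothetical:

```rust
/// Resolves the log level. Later layers override earlier ones; a missing or
/// empty value means "not set" and leaves the current value in place.
pub fn resolve_level(workflow: Option<&str>, env: Option<&str>, flag: Option<&str>) -> String {
    let mut level = "warn"; // built-in default
    for layer in [workflow, env, flag] {
        if let Some(v) = layer {
            if !v.is_empty() {
                level = v;
            }
        }
    }
    level.to_string()
}

/// `quiet` (--quiet or AGENTD_QUIET=1) forces logging off regardless of the
/// workflow setting; otherwise the workflow value applies, defaulting to true.
pub fn resolve_enabled(workflow_enabled: Option<bool>, quiet: bool) -> bool {
    if quiet {
        return false;
    }
    workflow_enabled.unwrap_or(true)
}
```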

6.2 Targets

6.3 Formats

6.4 Audit sink with redaction

When [logging.audit] is declared, audit events (target agentd::audit) also flow to a dedicated JSONL sink with field-level redaction — separate from the main log stream so compliance retention / shipping can diverge. Built-in redaction always masks: token, secret, password, authorization, api_key, bearer, jwt, cookie, session, and reason (the latter because auth-denial reasons frequently echo the bad token prefix).

[logging.audit]
target = "file:/var/log/agent/audit.jsonl"  # default if omitted
redact_fields = ["custom_sensitive_field"]  # add to built-in list
include_reason = false                       # default; flip to pass reason through

Parent dirs are created on first open (mkdir -p). Redaction is case-insensitive on field names. The emitted records match the shape tracing-subscriber::fmt::Json uses (timestamp, level, target, fields) so downstream collectors don't need a separate parser.
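Field-level redaction of this kind can be sketched as below. The assumptions are loud ones: field names are matched exactly (case-insensitively) rather than by substring, the mask string "[REDACTED]" is made up, and the function names are hypothetical. Only the built-in field list and the include_reason behaviour come from the text above.

```rust
/// Built-in field names that are always masked (from the list in this section).
const BUILT_IN: &[&str] = &[
    "token", "secret", "password", "authorization", "api_key",
    "bearer", "jwt", "cookie", "session", "reason",
];

fn listed(list: &[&str], name: &str) -> bool {
    list.iter().any(|f| f.eq_ignore_ascii_case(name))
}

/// Returns the fields with sensitive values masked.
/// `extra` corresponds to redact_fields; `include_reason` lets `reason` through.
pub fn redact(
    fields: &[(&str, &str)],
    extra: &[&str],
    include_reason: bool,
) -> Vec<(String, String)> {
    fields
        .iter()
        .map(|(name, value)| {
            let reason_passthrough = include_reason && name.eq_ignore_ascii_case("reason");
            let masked = !reason_passthrough && (listed(BUILT_IN, name) || listed(extra, name));
            // "[REDACTED]" is a placeholder chosen for this sketch.
            let shown = if masked { "[REDACTED]" } else { *value };
            (name.to_string(), shown.to_string())
        })
        .collect()
}
```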

6.45 Direct OTLP exporter

With the otel Cargo feature compiled in, declare [otel] to push spans over OTLP gRPC to an OpenTelemetry collector (Tempo, Jaeger, otelcol, Datadog agent, Honeycomb, etc.):

[otel]
endpoint = "http://otel-collector:4317"     # required
service_name = "agent"                       # default
protocol = "grpc"                            # only value today
sample_ratio = 1.0                           # 0.0..1.0, default 1.0
[otel.resource_attrs]
"deployment.environment" = "prod"
region = "eu-west-1"

Notes:

Dep footprint: enabling otel pulls ~50 crates (tokio, tonic, hyper, opentelemetry_sdk, prost, etc.). Pick per deployment.

6.5 Trace-context propagation (W3C)

If an inbound HTTP request carries a traceparent header matching the W3C Trace Context spec, agentd parses it and emits trace_id, parent_id, trace_flags, sampled as structured fields on the request span. Every downstream event (workflow.run, per-node spans, audit events) inherits them under the JSON log format.

For builds without the otel feature, this is the recommended integration for OTLP-backed observability stacks: pipe --log-format json into your collector's filelog receiver with the trace-id fields mapped to OTLP trace attributes. Builds with the otel feature can use the in-process OTLP exporter instead (§6.45).

traceparent without a valid 32-hex trace-id, 16-hex parent-id, or with the all-zero sentinel is silently ignored (logs proceed without the fields) rather than rejected — matches the spec's "pass through unknown versions" requirement.
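The acceptance rule can be illustrated with a small parser for the W3C header shape version-traceid-parentid-flags. This is a sketch, not agentd's parser: `TraceContext` and `parse_traceparent` are hypothetical names, and returning `None` stands in for "silently ignored".

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: String,
    pub parent_id: String,
    pub trace_flags: u8,
    pub sampled: bool,
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses a traceparent header value; `None` means "ignore the header".
pub fn parse_traceparent(header: &str) -> Option<TraceContext> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() < 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    // Version is two hex digits; "ff" is invalid per the spec.
    if !is_lower_hex(version, 2) || version == "ff" {
        return None;
    }
    // Version 00 has exactly four fields; later versions may append more.
    if version == "00" && parts.len() != 4 {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // The all-zero ids are the "invalid" sentinels.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let trace_flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceContext {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        trace_flags,
        sampled: (trace_flags & 0x01) == 0x01,
    })
}
```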

6.6 Events you should know about


7. Configuration precedence (quick reference)

| Knob | Sources (highest wins) |
| --- | --- |
| Workflow source | --config → AGENTD_CONFIG → embedded → error |
| Mode | --mode → AGENTD_MODE → inferred from [[http_routes]] |
| Bind address | --bind → AGENTD_HTTP_BIND → 127.0.0.1:8080 |
| Run timeout | --timeout-secs → AGENTD_TIMEOUT_SECS → 120 |
| Drain timeout | --drain-timeout-secs → AGENTD_DRAIN_TIMEOUT_SECS → 30 |
| Log level | --log-level → AGENTD_LOG → workflow [logging].level → warn |
| Log format | --log-format → AGENTD_LOG_FORMAT → workflow [logging].format → text |
| Log target | --log-target → AGENTD_LOG_TARGET → workflow [logging].target → stderr |
| Logging enabled | --quiet / AGENTD_QUIET=1 → workflow [logging].enabled → true |
| Start node (one-shot) | --start → AGENTD_START → sole manual start → sole start overall → error |
| Input (one-shot) | --input → AGENTD_INPUT → Value::Null |

Full flag list: agentd --help. Full env-var list: configuration.md.


8. Canonical deployments

8.1 Local one-shot

agentd --config wf.toml --input payload.json

No server. Reads payload.json as the trigger, runs, prints outcome JSON, exits.

8.2 Plain HTTP behind a reverse proxy

agentd --config wf.toml --bind 127.0.0.1:8080 \
      --log-level info --log-format json

Put nginx / Caddy / a cloud LB in front for TLS. Inside the workflow, use auth = "bearer:prod" for simple token auth; rate-limit per-route as needed.

8.3 Publicly reachable hardened webhook (build recipe B)

agentd --config wf.toml --bind 0.0.0.0:8443 \
      --log-level info --log-format json --log-target stderr \
      --drain-timeout-secs 60

Workflow:

[server.tls]
cert_file = "/etc/agentd/tls/server.pem"
key_file  = "/etc/agentd/tls/server.key"

[server.tls.client_auth]
mode    = "required"
ca_file = "/etc/agentd/tls/client-ca.pem"

[auth.hmac.webhooks]
secret_env      = "WEBHOOK_SECRET"   # names the env var; read on every request (§4.0b)
header          = "X-Hub-Signature-256"
prefix          = "sha256="
timestamp_header = "X-Timestamp"
tolerance_secs  = 300

[[http_routes]]
method     = "POST"
path       = "/webhook/github"
start_node = "on_push"
auth       = "hmac:webhooks"
[http_routes.rate_limit]
capacity       = 60
refill_per_sec = 10

Net effect: TLS terminates in-process; mTLS restricts the client surface to certs signed by the CA; HMAC verifies webhook payloads; the rate limit throttles storms.

8.4 Container image (GHCR)

Pre-built multi-arch images (linux/amd64 + linux/arm64) publish to ghcr.io/agentd-dev/agentd:

| Tag | Meaning |
| --- | --- |
| latest | Latest v* tagged release. |
| 1.0.0, 1.0, 1 | Specific semver (matches tag v1.0.0). |

Released tags are cosign-signed (keyless OIDC from GitHub Actions) and carry an SPDX SBOM attestation. Verify:

cosign verify ghcr.io/agentd-dev/agentd:1.0.0 \
  --certificate-identity-regexp 'https://github.com/agentd-dev/source-code/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

cosign verify-attestation --type spdxjson ghcr.io/agentd-dev/agentd:1.0.0 \
  --certificate-identity-regexp 'https://github.com/agentd-dev/source-code/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

Image base: gcr.io/distroless/cc-debian12:nonroot (uid:gid 65532:65532, no shell). Release binary built with --all-features — narrower surfaces require a purpose-built image.

8.5 Kubernetes pod

Rough shape:

spec:
  terminationGracePeriodSeconds: 45           # > drain_timeout_secs
  containers:
    - name: agentd
      image: ghcr.io/agentd-dev/agentd:1.0.0
      args:
        - --config=/etc/agentd/wf.toml
        - --bind=0.0.0.0:8080
        - --drain-timeout-secs=30
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet: { path: /healthz, port: 8080 }
      livenessProbe:
        httpGet: { path: /healthz, port: 8080 }
      volumeMounts:
        - mountPath: /etc/agentd
          name: workflow
      env:
        - name: AGENTD_LOG
          value: info
        - name: AGENTD_LOG_FORMAT
          value: json

/healthz is wired as an always-live endpoint (no auth, not rate limited). terminationGracePeriodSeconds must exceed drain_timeout_secs or k8s will SIGKILL mid-drain.

8.6 Debian / RPM packages + systemd

Pre-built .deb and .rpm attach to each v* release on GitHub. Both drop a hardened systemd unit with DynamicUser, ProtectSystem=strict, empty CapabilityBoundingSet, MemoryDenyWriteExecute, and a restrictive SystemCallFilter.

# Debian / Ubuntu
sudo apt install ./agentd_1.0.0_amd64.deb

# RHEL / Fedora / Rocky
sudo dnf install ./agentd-1.0.0-1.x86_64.rpm

sudo cp my-workflow.toml /etc/agentd/workflow.toml
sudo systemctl enable --now agentd

Config knobs live in /etc/default/agentd as AGENTD_ARGS=.... See packaging/README.md for full unit details, drop-in overrides, and the locked-down filesystem / syscall / network posture.

Build a package locally:

cargo install cargo-deb cargo-generate-rpm
cargo build --release --manifest-path crates/agentd/Cargo.toml --all-features
cargo deb --manifest-path crates/agentd/Cargo.toml --no-build   # → target/debian/
cargo generate-rpm -p crates/agentd                              # → target/generate-rpm/

9. Runbook basics

9.1 Startup fails — where to look

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| agent: no workflow configured | No --config, no embedded | Pass --config or rebuild with AGENTD_EMBED_CONFIG. |
| failed to parse <path>: … | TOML syntax or unknown field (all structs are deny_unknown_fields) | Read the error; fix the file. |
| workflow X: duplicate node id 'Y' | Validator caught an authoring mistake | Rename / remove the offender. |
| workflow X: cycle at Z | Validator caught a cycle | DAG only; break the cycle. |
| http_route #N points at unknown start_node 'Y' | Route refers to a start that isn't declared | Declare it in [[start_nodes]]. |
| auth: binding 'prod' referenced by routes not declared in [auth.bearer] | Typo in route auth = | Match the binding name. |
| tls: open cert_file /x: … | Wrong path / permissions | Check ls -l. |
| tls: <path> contains no certificates | Empty or malformed PEM | openssl x509 -in … -noout -text sanity check. |
| tls: <path> has no recognised private key | Key is corrupt / wrong format | Regenerate as PKCS8. |
| bind 127.0.0.1:8080: Address already in use | Another process on the port | ss -ltnp |
| serve mode requires at least one [[http_routes]] | --mode serve forced on a workflow with no routes | Drop the override or add a route. |

9.2 Mid-flight issues

9.3 Graceful restart / rollout

The drain sequence is the only supported shutdown path. A rolling restart looks like:

  1. Send SIGTERM to old pod / process.
  2. Wait for drain complete log line (bounded by drain_timeout_secs).
  3. Start the new pod / process.
  4. Readiness probe flips → receive traffic.

Set terminationGracePeriodSeconds (k8s) / TimeoutStopSec (systemd) higher than drain_timeout_secs, or SIGKILL will interrupt drain.


10. What this doc does NOT cover