07 — Observability

Metrics, logging, tracing, and health monitoring for the bext platform.

Current State

  • Logging: tracing crate with JSON output, env-based log level filter
  • Metrics: Analytics plugin (per-request counters, Prometheus or JSON export)
  • Health: /health endpoint with cache stats
  • Monitoring: /metrics endpoint with JSC pool + cache metrics

Target State

Comprehensive observability covering:

  1. Structured metrics with dimensional labels (per-app, per-route, per-plugin)
  2. Distributed tracing with OpenTelemetry
  3. Health checks with dependency monitoring
  4. Alerting hooks for anomaly detection

Metrics

Core Metrics

# Request metrics
bext_requests_total{app, method, status, route}
bext_request_duration_ms{app, route, quantile}
bext_request_size_bytes{app, direction}  # direction: request | response

# Cache metrics
bext_cache_hits_total{app, layer}       # layer: isr | fragment | layout | tenant
bext_cache_misses_total{app, layer}
bext_cache_entries{app, layer}
bext_cache_bytes{app, layer}
bext_cache_evictions_total{app, layer}
bext_cache_invalidations_total{app, method}  # method: tag | path | gc

# Isolate metrics
bext_isolate_count{app}
bext_isolate_memory_bytes{app}
bext_isolate_render_duration_ms{app, quantile}
bext_isolate_errors_total{app, error_type}

# Compression metrics
bext_compression_ratio{app, encoding}   # encoding: gzip | brotli
bext_compression_duration_us{app, encoding}

# Plugin metrics
bext_plugin_duration_us{plugin, hook}   # hook: on_request | on_response | etc.
bext_plugin_errors_total{plugin, hook}
bext_plugin_fuel_consumed{plugin}

# Deploy metrics
bext_deploys_total{app, status}         # status: success | failed | rolled_back
bext_deploy_duration_ms{app}

# Flow engine metrics
bext_flow_active_runs
bext_flow_completed_total
bext_flow_failed_total
bext_flow_step_duration_ms{flow_name}

Prometheus Endpoint

GET /metrics
Content-Type: text/plain; version=0.0.4

bext_requests_total{app="marketing",method="GET",status="200",route="/about"} 42531
bext_request_duration_ms{app="marketing",route="/about",quantile="0.5"} 2.1
bext_request_duration_ms{app="marketing",route="/about",quantile="0.99"} 48.3
bext_cache_hits_total{app="marketing",layer="isr"} 39842
bext_cache_misses_total{app="marketing",layer="isr"} 2689
...
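
The sample lines above follow the Prometheus text exposition format: metric name, an optional label set in braces, then the value. A minimal sketch of how one such line could be rendered is shown below. The function names `render_sample` and `escape_label_value` are hypothetical, not part of the existing bext code; the escaping rules (backslash, double quote, newline) are the ones the text format requires for label values.

```rust
/// Escape a label value for the Prometheus text format:
/// backslash, double quote and newline must be escaped.
fn escape_label_value(v: &str) -> String {
    let mut s = String::with_capacity(v.len());
    for ch in v.chars() {
        match ch {
            '\\' => s.push_str("\\\\"),
            '"' => s.push_str("\\\""),
            '\n' => s.push_str("\\n"),
            c => s.push(c),
        }
    }
    s
}

/// Render a single sample line, e.g.
/// `bext_cache_hits_total{app="marketing",layer="isr"} 39842`.
/// (Hypothetical helper; label order is preserved as given.)
pub fn render_sample(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let mut out = String::from(name);
    if !labels.is_empty() {
        out.push('{');
        for (i, (k, v)) in labels.iter().enumerate() {
            if i > 0 {
                out.push(',');
            }
            out.push_str(k);
            out.push_str("=\"");
            out.push_str(&escape_label_value(v));
            out.push('"');
        }
        out.push('}');
    }
    out.push(' ');
    // f64's Display impl prints whole numbers without a trailing ".0".
    out.push_str(&value.to_string());
    out
}
```

Unlabelled metrics such as bext_flow_active_runs render as just the name and the value, since the brace block is skipped when the label slice is empty.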

Structured JSON Metrics

GET /metrics?format=json

{
  "timestamp": "2026-03-28T15:30:00Z",
  "uptime_secs": 86400,
  "apps": {
    "marketing": {
      "requests": { "total": 42531, "rps": 12.4 },
      "cache": { "hit_rate": 0.937, "entries": 2481, "bytes": 155189248 },
      "isolate": { "workers": 4, "memory_mb": 48, "avg_render_ms": 3.2 },
      "errors": { "total": 12, "rate": 0.0003 }
    }
  },
  "plugins": {
    "analytics": { "calls": 42531, "avg_us": 12 },
    "security-headers": { "calls": 42531, "avg_us": 3 }
  },
  "system": {
    "memory_mb": 256,
    "cpu_percent": 15.2,
    "open_fds": 128
  }
}
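
Fields such as cache.hit_rate and errors.rate in the JSON above are ratios derived from raw counters. The sketch below assumes hit_rate is computed as hits / (hits + misses), which is consistent with the sample numbers in this document (39842 hits and 2689 misses give roughly 0.937). The helper name `ratio` is hypothetical.

```rust
/// Ratio of `part` to `total`, returning 0.0 when there is no data yet
/// so a freshly started server does not report NaN.
pub fn ratio(part: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        part as f64 / total as f64
    }
}

/// Cache hit rate from the hit and miss counters (assumed definition).
pub fn hit_rate(hits: u64, misses: u64) -> f64 {
    ratio(hits, hits + misses)
}
```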

Logging

Log Format

Structured JSON logs with trace context:

{
  "timestamp": "2026-03-28T15:30:42.123Z",
  "level": "info",
  "target": "bext_server::handler",
  "message": "request completed",
  "app": "marketing",
  "method": "GET",
  "path": "/about",
  "status": 200,
  "duration_ms": 2.1,
  "cache": "hit",
  "trace_id": "abc123def456",
  "span_id": "789ghi"
}

Log Levels

Level   What gets logged
error   Request failures, isolate crashes, plugin errors
warn    Cache evictions, slow renders (>100ms), config issues
info    Request completions, deploys, cache invalidations
debug   Cache hits/misses, transform timing, plugin calls
trace   Full request/response bodies, WASM fuel consumption

Per-App Log Filtering

[apps.marketing]
log_level = "info"

[apps.api]
log_level = "debug"               # More verbose for API debugging
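
One way the per-app setting could be resolved at log time is sketched below, using only the standard library. The `Level` and `LogFilter` types are hypothetical and stand in for whatever the tracing-based implementation ends up using. The sketch also assumes that apps without an explicit log_level fall back to a global default.

```rust
use std::collections::HashMap;

/// Log levels ordered from least to most verbose, so the derived
/// ordering gives Error < Warn < Info < Debug < Trace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

impl Level {
    /// Parse a config value such as "info" (case-insensitive).
    pub fn parse(s: &str) -> Option<Level> {
        match s.to_ascii_lowercase().as_str() {
            "error" => Some(Level::Error),
            "warn" => Some(Level::Warn),
            "info" => Some(Level::Info),
            "debug" => Some(Level::Debug),
            "trace" => Some(Level::Trace),
            _ => None,
        }
    }
}

/// Hypothetical filter: a global fallback plus per-app overrides.
pub struct LogFilter {
    fallback: Level,
    per_app: HashMap<String, Level>,
}

impl LogFilter {
    pub fn new(fallback: Level) -> Self {
        Self { fallback, per_app: HashMap::new() }
    }

    /// Record the `log_level` from an `[apps.<name>]` section.
    pub fn set_app(&mut self, app: &str, level: Level) {
        self.per_app.insert(app.to_string(), level);
    }

    /// An event is emitted when it is no more verbose than the app's level.
    pub fn enabled(&self, app: &str, level: Level) -> bool {
        let max = self.per_app.get(app).copied().unwrap_or(self.fallback);
        level <= max
    }
}
```

With the config above, a debug event from the api app would pass the filter, while the same event from marketing would be dropped.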

Health Checks

/health Endpoint

{
  "status": "healthy",
  "version": "0.5.0",
  "uptime_secs": 86400,
  "checks": {
    "database": { "status": "healthy", "latency_ms": 2 },
    "cache": { "status": "healthy", "entries": 4573, "hit_rate": 0.93 },
    "isolates": { "status": "healthy", "active": 8, "max": 100 },
    "flow_engine": { "status": "healthy", "active_runs": 3 },
    "plugins": { "status": "healthy", "loaded": 4 },
    "disk": { "status": "healthy", "data_dir_mb": 512, "free_mb": 10240 }
  },
  "apps": {
    "marketing": { "status": "running", "version": "abc1234" },
    "dashboard": { "status": "running", "version": "def5678" },
    "api": { "status": "running", "version": "ghi9012" }
  }
}

Liveness vs Readiness

Endpoint        Purpose                 Checks
/health/live    Is the process alive?   Always 200
/health/ready   Can it serve traffic?   DB connected, at least 1 isolate ready
/health         Full health check       All subsystems
Implementation Tasks

OB-1: Metrics System

Tasks:

  • Create bext-core/src/metrics.rs with metric registry
  • Counter, Gauge, Histogram types (lockless, atomic)
  • Per-app metric labels
  • Prometheus text format export
  • JSON format export
  • Request middleware that records metrics
  • Cache operation metrics (hit/miss counters already exist; add latency)
  • Isolate metrics (render time, memory)
  • Plugin metrics (execution time, errors)
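
A minimal sketch of the lockless Counter, Gauge, and Histogram types from the task list, built on std atomics. The type and method names are assumptions, not the final bext-core API. The histogram uses fixed integer bucket bounds, which is one possible design; the quantile labels shown under Core Metrics could instead be served by a summary-style estimator.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Monotonically increasing counter, e.g. bext_requests_total.
#[derive(Default)]
pub struct Counter(AtomicU64);

impl Counter {
    pub fn inc(&self) {
        self.add(1);
    }
    pub fn add(&self, n: u64) {
        // Relaxed is enough: only the count matters, not ordering
        // relative to other memory operations.
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Value that can go up and down, e.g. bext_isolate_count.
#[derive(Default)]
pub struct Gauge(AtomicI64);

impl Gauge {
    pub fn set(&self, v: i64) {
        self.0.store(v, Ordering::Relaxed);
    }
    pub fn add(&self, delta: i64) {
        self.0.fetch_add(delta, Ordering::Relaxed);
    }
    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Fixed-bucket histogram. `bounds` are inclusive upper bounds in
/// ascending order; one extra bucket catches everything above the last.
pub struct Histogram {
    bounds: Vec<u64>,
    buckets: Vec<AtomicU64>,
    sum: AtomicU64,
    count: AtomicU64,
}

impl Histogram {
    pub fn new(bounds: &[u64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            buckets: (0..=bounds.len()).map(|_| AtomicU64::new(0)).collect(),
            sum: AtomicU64::new(0),
            count: AtomicU64::new(0),
        }
    }

    /// Record one observation (for example a duration in ms).
    pub fn observe(&self, v: u64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| v <= b)
            .unwrap_or(self.bounds.len());
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
        self.sum.fetch_add(v, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Per-bucket (non-cumulative) counts, overflow bucket last.
    pub fn bucket_counts(&self) -> Vec<u64> {
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).collect()
    }

    pub fn sum(&self) -> u64 {
        self.sum.load(Ordering::Relaxed)
    }

    pub fn count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}
```

All methods take `&self`, so these types can be shared across worker threads behind an `Arc` without a mutex. Note that a histogram snapshot is not atomic as a whole: bucket counts, sum, and count are read independently.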

OB-2: Structured Logging

Tasks:

  • Per-app log level configuration
  • Request context in all log entries (app, trace_id), recorded as structured fields via the tracing crate
  • Log rotation / output to file (optional)
  • Access log format (combined or JSON)
  • Slow request logging (threshold configurable)

OB-3: Health Check System

Tasks:

  • /health/live endpoint (always 200)
  • /health/ready endpoint (checks critical dependencies)
  • /health full health with per-subsystem status
  • Configurable health check thresholds
  • Per-app health status

OB-4: OpenTelemetry Integration (Future)

Note: Only config scaffolding exists so far (the TelemetryConfig struct in config.rs); there is no opentelemetry crate dependency yet.

Tasks:

  • OTLP exporter for traces
  • Trace context propagation (incoming W3C trace headers)
  • Span creation for key operations (render, cache, plugin)
  • Integration with tracing crate (already used)
  • Config: [telemetry] otlp_endpoint = "http://..."
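
For the trace context propagation task, incoming W3C trace headers arrive as a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses version 00 of that header with the standard library only. The `TraceContext` struct and `parse_traceparent` function are hypothetical names; a real integration would more likely rely on the opentelemetry crate's propagator once that dependency is added.

```rust
/// Parsed W3C trace context (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: String,  // 32 lowercase hex chars
    pub parent_id: String, // 16 lowercase hex chars
    pub sampled: bool,     // bit 0 of the trace-flags byte
}

/// Parse a version-00 `traceparent` header value.
/// Returns None for malformed input or unsupported versions.
pub fn parse_traceparent(header: &str) -> Option<TraceContext> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);

    // The spec requires lowercase hex of fixed length for every field.
    let is_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    if version != "00" || !is_hex(trace_id, 32) || !is_hex(parent_id, 16) || !is_hex(flags, 2) {
        return None;
    }

    // All-zero trace or parent IDs are defined as invalid.
    let all_zero = |s: &str| s.bytes().all(|b| b == b'0');
    if all_zero(trace_id) || all_zero(parent_id) {
        return None;
    }

    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceContext {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flags & 0x01) == 1,
    })
}
```

When parsing fails, a reasonable fallback is to start a fresh trace rather than reject the request, so that a malformed header from an upstream proxy never affects serving.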