Monitoring

bext exposes a Prometheus-compatible metrics endpoint and a health check endpoint, so you can monitor it with an existing Prometheus and Grafana stack.

Configuration

# bext.config.toml
[metrics]
enabled = true                   # expose /__bext/metrics
path = "/__bext/metrics"         # customize the endpoint path
include_process = true           # include process_* metrics (RSS, CPU, FDs)

[health]
enabled = true                   # expose /__bext/health
path = "/__bext/health"          # customize the endpoint path
include_details = true           # include component-level health in response

Both endpoints are excluded from access logs and rate limiting by default.
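
Once enabled, you can verify both endpoints directly; this sketch assumes bext is listening on port 3061, as in the scrape config further down, so adjust host and port to your deployment:

# Quick sanity check (adjust host/port to your deployment)
curl -s http://localhost:3061/__bext/metrics | head -n 5
curl -si http://localhost:3061/__bext/health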

Metrics Endpoint

Scrape /__bext/metrics with Prometheus. The response is in the standard Prometheus exposition format.
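
A scrape response looks like the following excerpt (the metric values and the path_group value are illustrative):

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path_group="/blog/:slug"} 18230
http_requests_total{method="GET",status="404",path_group="/blog/:slug"} 12
# HELP render_pool_active Render workers currently rendering
# TYPE render_pool_active gauge
render_pool_active 3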

Available Metrics

HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| http_requests_total | counter | method, status, path_group | Total HTTP requests |
| http_request_duration_seconds | histogram | method, path_group | Request latency (buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s) |
| http_request_size_bytes | histogram | method | Request body size |
| http_response_size_bytes | histogram | method | Response body size |
| http_active_connections | gauge | (none) | Currently open connections |
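
For example, to graph request throughput per route group, aggregate over the path_group label:

# Requests per second, broken down by route group
sum by (path_group) (rate(http_requests_total[5m]))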

Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| cache_hits_total | counter | layer | Cache hits (l1 = in-memory, l2 = Redis) |
| cache_misses_total | counter | layer | Cache misses |
| cache_entries | gauge | layer | Current cache entry count |
| cache_evictions_total | counter | layer | Entries evicted |
| cache_stampede_coalesced_total | counter | (none) | Requests coalesced by stampede guard |
| isr_revalidations_total | counter | status | ISR background revalidations (success, error) |
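
The status label on isr_revalidations_total lets you watch for failing background revalidations; for example, the failure ratio over the last five minutes:

# Fraction of ISR revalidations that fail
sum(rate(isr_revalidations_total{status="error"}[5m]))
  / sum(rate(isr_revalidations_total[5m]))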

SSR / V8 Pool Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| render_pool_active | gauge | (none) | Render workers currently rendering |
| render_pool_idle | gauge | (none) | Render workers available |
| render_pool_total | gauge | (none) | Total render workers configured |
| render_duration_seconds | histogram | (none) | SSR render time |
| render_pool_wait_seconds | histogram | (none) | Time spent waiting for an available worker |
| render_oom_kills_total | counter | (none) | Workers killed due to memory limits |
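
Because render_pool_wait_seconds is a histogram, its _bucket series can be fed to histogram_quantile; rising wait times are an early sign the pool is undersized:

# P95 time a request spends waiting for a render worker
histogram_quantile(0.95, sum by (le) (rate(render_pool_wait_seconds_bucket[5m])))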

Plugin Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| plugin_invocations_total | counter | plugin, hook | Plugin hook invocations |
| plugin_duration_seconds | histogram | plugin, hook | Plugin execution time |
| plugin_errors_total | counter | plugin | Plugin errors/timeouts |
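
To spot a misbehaving plugin, compare its errors against its invocations:

# Per-plugin error ratio over the last 5 minutes
sum by (plugin) (rate(plugin_errors_total[5m]))
  / sum by (plugin) (rate(plugin_invocations_total[5m]))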

Process Metrics (when include_process = true)

| Metric | Type | Description |
|--------|------|-------------|
| process_resident_memory_bytes | gauge | RSS memory usage |
| process_cpu_seconds_total | counter | Total CPU time consumed |
| process_open_fds | gauge | Open file descriptor count |
| process_uptime_seconds | gauge | Server uptime |
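
Because process_cpu_seconds_total is a counter of consumed CPU time, its rate gives the average number of CPU cores in use:

# Average CPU cores consumed over the last 5 minutes
rate(process_cpu_seconds_total[5m])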

Example Prometheus Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: bext
    scrape_interval: 15s
    static_configs:
      - targets: ["bext-host:3061"]
    metrics_path: /__bext/metrics

Health Check Endpoint

GET /__bext/health returns a JSON health report:

{
  "status": "healthy",
  "uptime_secs": 86412,
  "checks": {
    "render_pool": "ok",
    "isr_cache": "ok",
    "redis": "ok",
    "tls_certs": "ok",
    "disk_space": "ok"
  }
}

Returns 200 when all checks pass and 503 when any component reports a degraded state. Use this for load balancer health checks and Kubernetes liveness probes, as in the example below.
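
A Kubernetes liveness probe against the endpoint could look like this sketch (the port and timing values are illustrative):

# Pod spec excerpt: probe the bext health endpoint
livenessProbe:
  httpGet:
    path: /__bext/health
    port: 3061
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3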

Grafana Dashboard

A sample Grafana dashboard JSON is available in the bext repository at contrib/grafana-dashboard.json. Import it into Grafana and set the Prometheus data source.

Key panels include:

- Request rate (rate(http_requests_total[5m]))

- P50/P95/P99 latency (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])), substituting 0.95 or 0.50 for the other quantiles; see the aggregation note after this list)

- Error rate (rate(http_requests_total{status=~"5.."}[5m]))

- Cache hit ratio (rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])))

- V8 pool saturation (render_pool_active / render_pool_total)
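
When scraping several bext instances, aggregate the histogram buckets across instances before taking the quantile; otherwise each instance reports its own percentile:

# Fleet-wide P99 latency across all scraped bext instances
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))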

Alerting Rules

Recommended Prometheus alerting rules:

# alerts.yml
groups:
  - name: bext
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "bext error rate above 5%"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "bext P95 latency above 1 second"

      - alert: RenderPoolExhausted
        expr: render_pool_idle == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All render workers busy — SSR requests are queuing"

      - alert: CacheHitRateLow
        expr: rate(cache_hits_total[10m]) / (rate(cache_hits_total[10m]) + rate(cache_misses_total[10m])) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 50% — check ISR TTLs and cache capacity"

OpenTelemetry (Pro Feature)

With a Pro or Enterprise license, bext exports traces and metrics via OTLP:

# bext.config.toml
[telemetry]
otlp_endpoint = "http://otel-collector:4317"
otlp_protocol = "grpc"          # grpc | http
sample_rate = 0.1                # sample 10% of traces
service_name = "bext-production"

Each request creates a span with request_id, path, status, and duration. Child spans are created for SSR render, cache lookup, plugin hooks, and upstream proxy calls, giving you a full distributed trace for every request.
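
On the receiving side, a minimal OpenTelemetry Collector pipeline for these traces could look like the following sketch; the debug exporter (shipped with recent Collector releases) is illustrative, so swap in your real backend exporter:

# otel-collector.yaml: accept OTLP over gRPC from bext and print traces
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]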