Monitoring

bext exposes a Prometheus-compatible metrics endpoint and a health check endpoint, so you can monitor it with an existing Prometheus and Grafana stack.

Configuration

# bext.config.toml
[metrics]
enabled = true                   # expose /__bext/metrics
path = "/__bext/metrics"         # customize the endpoint path
include_process = true           # include process_* metrics (RSS, CPU, FDs)

[health]
enabled = true                   # expose /__bext/health
path = "/__bext/health"          # customize the endpoint path
include_details = true           # include component-level health in response

Both endpoints are excluded from access logs and rate limiting by default.
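
Once enabled, you can verify both endpoints directly; this sketch assumes bext is listening on port 3061, as in the scrape config further down, so adjust host and port to your deployment:

# Quick sanity check (adjust host/port to your deployment)
curl -s http://localhost:3061/__bext/metrics | head -n 5
curl -si http://localhost:3061/__bext/health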

Metrics Endpoint

Scrape /__bext/metrics with Prometheus. The response is in the standard Prometheus exposition format.
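
A scrape response looks like the following excerpt (the metric values and the path_group value are illustrative):

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path_group="/blog/:slug"} 18230
http_requests_total{method="GET",status="404",path_group="/blog/:slug"} 12
# HELP render_pool_active Render workers currently rendering
# TYPE render_pool_active gauge
render_pool_active 3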

Available Metrics

HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| http_requests_total | counter | method, status, path_group | Total HTTP requests |
| http_request_duration_seconds | histogram | method, path_group | Request latency (buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s) |
| http_request_size_bytes | histogram | method | Request body size |
| http_response_size_bytes | histogram | method | Response body size |
| http_active_connections | gauge | (none) | Currently open connections |
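
For example, to graph request throughput per route group, aggregate over the path_group label:

# Requests per second, broken down by route group
sum by (path_group) (rate(http_requests_total[5m]))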

Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| cache_hits_total | counter | layer | Cache hits (l1 = in-memory, l2 = Redis) |
| cache_misses_total | counter | layer | Cache misses |
| cache_entries | gauge | layer | Current cache entry count |
| cache_evictions_total | counter | layer | Entries evicted |
| cache_stampede_coalesced_total | counter | (none) | Requests coalesced by stampede guard |
| isr_revalidations_total | counter | status | ISR background revalidations (success, error) |
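
The status label on isr_revalidations_total lets you watch for failing background revalidations; for example, the failure ratio over the last five minutes:

# Fraction of ISR revalidations that fail
sum(rate(isr_revalidations_total{status="error"}[5m]))
  / sum(rate(isr_revalidations_total[5m]))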

SSR / V8 Pool Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| render_pool_active | gauge | (none) | Render workers currently rendering |
| render_pool_idle | gauge | (none) | Render workers available |
| render_pool_total | gauge | (none) | Total render workers configured |
| render_duration_seconds | histogram | (none) | SSR render time |
| render_pool_wait_seconds | histogram | (none) | Time spent waiting for an available worker |
| render_oom_kills_total | counter | (none) | Workers killed due to memory limits |
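
Because render_pool_wait_seconds is a histogram, its _bucket series can be fed to histogram_quantile; rising wait times are an early sign the pool is undersized:

# P95 time a request spends waiting for a render worker
histogram_quantile(0.95, sum by (le) (rate(render_pool_wait_seconds_bucket[5m])))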

Plugin Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| plugin_invocations_total | counter | plugin, hook | Plugin hook invocations |
| plugin_duration_seconds | histogram | plugin, hook | Plugin execution time |
| plugin_errors_total | counter | plugin | Plugin errors/timeouts |
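
To spot a misbehaving plugin, compare its errors against its invocations:

# Per-plugin error ratio over the last 5 minutes
sum by (plugin) (rate(plugin_errors_total[5m]))
  / sum by (plugin) (rate(plugin_invocations_total[5m]))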

Process Metrics (when include_process = true)

| Metric | Type | Description |
|--------|------|-------------|
| process_resident_memory_bytes | gauge | RSS memory usage |
| process_cpu_seconds_total | counter | Total CPU time consumed |
| process_open_fds | gauge | Open file descriptor count |
| process_uptime_seconds | gauge | Server uptime |
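
Because process_cpu_seconds_total is a counter of consumed CPU time, its rate gives the average number of CPU cores in use:

# Average CPU cores consumed over the last 5 minutes
rate(process_cpu_seconds_total[5m])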

Example Prometheus Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: bext
    scrape_interval: 15s
    static_configs:
      - targets: ["bext-host:3061"]
    metrics_path: /__bext/metrics

Health Check Endpoint

GET /__bext/health returns a JSON health report:

{
  "status": "healthy",
  "uptime_secs": 86412,
  "checks": {
    "render_pool": "ok",
    "isr_cache": "ok",
    "redis": "ok",
    "tls_certs": "ok",
    "disk_space": "ok"
  }
}

Returns 200 when all checks pass and 503 when any component reports a degraded state. Use this for load balancer health checks and Kubernetes liveness probes, as in the example below.
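
A Kubernetes liveness probe against the endpoint could look like this sketch (the port and timing values are illustrative):

# Pod spec excerpt: probe the bext health endpoint
livenessProbe:
  httpGet:
    path: /__bext/health
    port: 3061
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3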

Grafana Dashboard

A sample Grafana dashboard JSON is available in the bext repository at contrib/grafana-dashboard.json. Import it into Grafana and set the Prometheus data source.

Key panels include:

- Request rate (rate(http_requests_total[5m]))

- P50/P95/P99 latency (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])), substituting 0.95 or 0.50 for the other quantiles; see the aggregation note after this list)

- Error rate (rate(http_requests_total{status=~"5.."}[5m]))

- Cache hit ratio (rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])))

- V8 pool saturation (render_pool_active / render_pool_total)
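
When scraping several bext instances, aggregate the histogram buckets across instances before taking the quantile; otherwise each instance reports its own percentile:

# Fleet-wide P99 latency across all scraped bext instances
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))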

Alerting Rules

Recommended Prometheus alerting rules:

# alerts.yml
groups:
  - name: bext
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "bext error rate above 5%"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "bext P95 latency above 1 second"

      - alert: RenderPoolExhausted
        expr: render_pool_idle == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All render workers busy — SSR requests are queuing"

      - alert: CacheHitRateLow
        expr: rate(cache_hits_total[10m]) / (rate(cache_hits_total[10m]) + rate(cache_misses_total[10m])) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 50% — check ISR TTLs and cache capacity"

OpenTelemetry (Pro Feature)

With a Pro or Enterprise license, bext exports traces and metrics via OTLP:

# bext.config.toml
[telemetry]
otlp_endpoint = "http://otel-collector:4317"
otlp_protocol = "grpc"          # grpc | http
sample_rate = 0.1                # sample 10% of traces
service_name = "bext-production"

Each request creates a span with request_id, path, status, and duration. Child spans are created for SSR render, cache lookup, plugin hooks, and upstream proxy calls, giving you a full distributed trace for every request.
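
On the receiving side, a minimal OpenTelemetry Collector pipeline for these traces could look like the following sketch; the debug exporter (shipped with recent Collector releases) is illustrative, so swap in your real backend exporter:

# otel-collector.yaml: accept OTLP over gRPC from bext and print traces
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]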