Phase 6: Worker Lifecycle Management

Goal

Smart lifecycle management for JSC render workers — automatic rotation, crash recovery with backoff, per-worker metrics, and file-watch triggered restarts. Borrowed from FrankenPHP's battle-tested worker mode.

Current State

  • Shared 4-worker JSC render pool (configurable)
  • Workers are long-lived — never rotated
  • No max-requests-per-worker limit
  • No crash recovery with backoff
  • No per-worker metrics (requests served, memory, uptime)
  • Plugin hot-reload exists but doesn't cover JS workers

Why This Matters

Long-lived JS workers accumulate:

  • Memory leaks in user code (closures, event listeners, growing caches)
  • Stale state from module-level variables that persist across requests
  • V8/JSC fragmentation — the heap compactor can't fully defragment

FrankenPHP's MAX_REQUESTS pattern forces periodic worker restart, keeping memory bounded. Combined with exponential backoff on crashes, the system self-heals without operator intervention.

Design

Worker States

       ┌──────────┐
       │  BOOTING │──── startup error ──▶ FAILED ──▶ backoff ──▶ BOOTING
       └────┬─────┘
            │ ready
            ▼
       ┌──────────┐
       │  ACTIVE  │──── request ──▶ process ──▶ ACTIVE
       └────┬─────┘
            │ max_requests reached OR memory limit OR watch trigger
            ▼
       ┌──────────┐
       │ DRAINING │──── finish in-flight ──▶ STOPPED ──▶ BOOTING (new worker)
       └──────────┘

Rotation Policy

[render]
jsc_workers = 4

[render.lifecycle]
max_requests = 10000              # Rotate after N renders (0 = never)
max_memory_mb = 512               # Rotate if RSS exceeds N MB (0 = no limit)
max_uptime_hours = 24             # Rotate after N hours (0 = never)
drain_timeout_ms = 5000           # Wait for in-flight requests before force-kill

When a worker hits any threshold:

  1. Mark worker as DRAINING — stop accepting new requests
  2. Wait for in-flight requests to complete (up to drain_timeout_ms)
  3. Stop the old worker
  4. Boot a fresh worker in its place
  5. Mark new worker as ACTIVE

During draining, the pool routes requests to remaining active workers. With 4 workers, at most 1 is draining at a time (staggered rotation).

Crash Recovery with Backoff

When a worker crashes (JSC exception, segfault, OOM kill):

Attempt 1: restart immediately
Attempt 2: wait 100ms
Attempt 3: wait 300ms
Attempt 4: wait 900ms
Attempt 5: wait 2.7s
...
Max backoff: 60s
Reset backoff after: 60s of healthy operation

[render.lifecycle]
max_consecutive_failures = 10     # Give up after N consecutive crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000          # Reset failure count after N ms healthy

After max_consecutive_failures, the worker slot stays empty and an alert is emitted. The pool operates with reduced capacity until an operator investigates.

Per-Worker Metrics

Each worker tracks:

struct WorkerMetrics {
    id: u32,
    state: WorkerState,
    requests_served: u64,
    requests_failed: u64,
    total_render_time_ms: u64,
    avg_render_time_ms: f64,
    peak_memory_bytes: u64,
    current_memory_bytes: u64,
    started_at: Instant,
    last_request_at: Option<Instant>,
    restarts: u32,
    consecutive_failures: u32,
}

Exposed via:

  • GET /__bext/workers — JSON array of all worker metrics
  • GET /metrics — Prometheus gauges per worker
  • bext ps CLI — tabular worker status

File-Watch Restart

Extend the existing plugin hot-reload pattern to JS workers:

[render.lifecycle]
watch_dirs = ["src/", "components/"]  # Restart workers on file change
watch_debounce_ms = 500               # Debounce rapid changes

On file change:

  1. Debounce for 500ms (batch rapid saves)
  2. Drain all workers gracefully
  3. Boot fresh workers with updated code
  4. Log: [bext] Workers restarted (file change: src/components/Header.tsx)

This is particularly useful for bext dev mode but also works in production for hot-deploying JS changes without a full redeploy.

Implementation

Worker Pool Refactor

pub struct WorkerPool {
    workers: Vec<Arc<RwLock<Worker>>>,
    config: WorkerLifecycleConfig,
    metrics: Arc<PoolMetrics>,  // pool-wide aggregates (type name illustrative)
    rotation_tx: mpsc::Sender<RotationEvent>,
}

struct Worker {
    id: u32,
    state: WorkerState,
    metrics: WorkerMetrics,
    isolate: JscIsolate,  // Or handle to JSC context
}

enum RotationEvent {
    MaxRequests(u32),      // Worker ID
    MaxMemory(u32),
    MaxUptime(u32),
    Crash(u32, String),    // Worker ID, error message
    FileChange(Vec<PathBuf>),
    ManualRestart,         // CLI trigger
}

Rotation Coordinator

Background task that handles rotation events:

async fn rotation_coordinator(
    pool: Arc<WorkerPool>,
    mut rx: mpsc::Receiver<RotationEvent>,
) {
    while let Some(event) = rx.recv().await {
        match event {
            RotationEvent::MaxRequests(id) | RotationEvent::MaxMemory(id) | RotationEvent::MaxUptime(id) => {
                pool.rotate_worker(id).await;
            }
            RotationEvent::Crash(id, err) => {
                tracing::error!(worker_id = id, error = %err, "worker crashed");
                pool.restart_with_backoff(id).await;
            }
            RotationEvent::FileChange(paths) => {
                tracing::info!(files = ?paths, "restarting all workers (file change)");
                pool.restart_all().await;
            }
            RotationEvent::ManualRestart => {
                pool.restart_all().await;
            }
        }
    }
}

Memory Monitoring

Check worker RSS periodically (every 10 seconds):

async fn memory_monitor(pool: Arc<WorkerPool>, config: &WorkerLifecycleConfig) {
    let mut interval = tokio::time::interval(Duration::from_secs(10));
    loop {
        interval.tick().await;
        for worker in &pool.workers {
            let mut w = worker.write().await;  // write lock: metrics are updated below
            if w.state == WorkerState::Active {
                let rss = w.isolate.memory_usage();
                w.metrics.current_memory_bytes = rss;
                w.metrics.peak_memory_bytes = w.metrics.peak_memory_bytes.max(rss);

                if config.max_memory_mb > 0 && rss > config.max_memory_mb * 1024 * 1024 {
                    let _ = pool.rotation_tx.send(RotationEvent::MaxMemory(w.id)).await;
                }
            }
        }
    }
}

Staggered Rotation

To avoid rotating all workers simultaneously:

fn should_rotate(&self, worker: &Worker) -> bool {
    if self.config.max_requests == 0 {
        return false;
    }
    // Stagger thresholds: spread each worker's limit across the band
    // max_requests..max_requests + 10%, keyed by worker ID, so workers
    // don't all hit their limit in the same window.
    let spread = self.config.max_requests / 10;
    let threshold = self.config.max_requests + (worker.id as u64 * spread / self.workers.len() as u64);
    worker.metrics.requests_served >= threshold
}

Config Reference

[render]
jsc_workers = 4                       # Number of workers

[render.lifecycle]
max_requests = 10000                  # Rotate after N renders (0 = never)
max_memory_mb = 512                   # Rotate if RSS > N MB (0 = no limit)
max_uptime_hours = 24                 # Rotate after N hours (0 = never)
drain_timeout_ms = 5000               # Grace period for in-flight requests
watch_dirs = []                       # File paths to watch for changes
watch_debounce_ms = 500               # Debounce file change events

[render.lifecycle.recovery]
max_consecutive_failures = 10         # Give up after N crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000              # Reset failure count after healthy period

CLI Integration

bext ps                               # Show worker status table
bext workers restart                  # Gracefully restart all workers
bext workers restart --worker 2       # Restart specific worker
bext workers stats                    # Detailed per-worker metrics

Example bext ps output:

Workers (4/4 active)
  ID  State    Requests  Memory   Uptime    Avg Render
  0   ACTIVE   3,421     142 MB   2h 15m    4.2ms
  1   ACTIVE   3,380     138 MB   2h 15m    3.8ms
  2   DRAINING 3,412     256 MB   2h 15m    4.1ms  (rotating: memory)
  3   ACTIVE   3,399     145 MB   2h 15m    3.9ms

Testing Plan

Test                        Type         What it validates
Max requests rotation       Unit         Worker rotated after N requests
Max memory rotation         Unit         Worker rotated when RSS exceeds limit
Max uptime rotation         Unit         Worker rotated after N hours
Graceful drain              Integration  In-flight request completes before worker stops
Crash backoff               Unit         Exponential backoff between restart attempts
Backoff reset               Unit         Healthy period resets failure counter
Max failures gives up       Unit         Worker slot stays empty after N crashes
Staggered rotation          Unit         Workers don't all rotate simultaneously
File watch restart          Integration  File change triggers worker restart
File watch debounce         Unit         Rapid saves produce single restart
Pool degradation            Unit         Pool works with reduced workers during rotation
Worker metrics              Unit         Counters increment correctly
Manual restart CLI          Integration  bext workers restart rotates all workers

Done Criteria

  • Workers rotate after max_requests renders
  • Workers rotate when RSS exceeds max_memory_mb
  • Workers rotate after max_uptime_hours
  • Graceful draining with in-flight request completion
  • Crash recovery with exponential backoff
  • Backoff reset after healthy period
  • Per-worker metrics (requests, memory, uptime, avg render time)
  • /__bext/workers JSON endpoint
  • Prometheus gauges per worker
  • bext ps CLI shows worker table
  • bext workers restart CLI command
  • File-watch triggered restart (with debounce)
  • Staggered rotation (workers don't rotate at the same time)
  • All tests passing