Phase 6: Worker Lifecycle Management

Goal

Smart lifecycle management for JSC render workers — automatic rotation, crash recovery with backoff, per-worker metrics, and file-watch triggered restarts. Borrowed from FrankenPHP's battle-tested worker mode.

Current State

  • Shared 4-worker JSC render pool (configurable)
  • Workers are long-lived — never rotated
  • No max-requests-per-worker limit
  • No crash recovery with backoff
  • No per-worker metrics (requests served, memory, uptime)
  • Plugin hot-reload exists but doesn't cover JS workers

Why This Matters

Long-lived JS workers accumulate:

  • Memory leaks in user code (closures, event listeners, growing caches)
  • Stale state from module-level variables that persist across requests
  • V8/JSC fragmentation — the heap compactor can't fully defragment

FrankenPHP's MAX_REQUESTS pattern forces periodic worker restart, keeping memory bounded. Combined with exponential backoff on crashes, the system self-heals without operator intervention.

Design

Worker States

       ┌──────────┐
       │  BOOTING │──── startup error ──▶ FAILED ──▶ backoff ──▶ BOOTING
       └────┬─────┘
            │ ready
            ▼
       ┌──────────┐
       │  ACTIVE  │──── request ──▶ process ──▶ ACTIVE
       └────┬─────┘
            │ max_requests reached OR memory limit OR watch trigger
            ▼
       ┌──────────┐
       │ DRAINING │──── finish in-flight ──▶ STOPPED ──▶ BOOTING (new worker)
       └──────────┘

Rotation Policy

[render]
jsc_workers = 4

[render.lifecycle]
max_requests = 10000              # Rotate after N renders (0 = never)
max_memory_mb = 512               # Rotate if RSS exceeds N MB (0 = no limit)
max_uptime_hours = 24             # Rotate after N hours (0 = never)
drain_timeout_ms = 5000           # Wait for in-flight requests before force-kill

When a worker hits any threshold:

  1. Mark worker as DRAINING — stop accepting new requests
  2. Wait for in-flight requests to complete (up to drain_timeout_ms)
  3. Stop the old worker
  4. Boot a fresh worker in its place
  5. Mark new worker as ACTIVE

During draining, the pool routes requests to remaining active workers. With 4 workers, at most 1 is draining at a time (staggered rotation).

Crash Recovery with Backoff

When a worker crashes (JSC exception, segfault, OOM kill):

Attempt 1: restart immediately
Attempt 2: wait 100ms
Attempt 3: wait 300ms
Attempt 4: wait 900ms
Attempt 5: wait 2.7s
...
Max backoff: 60s
Reset backoff after: 60s of healthy operation

[render.lifecycle]
max_consecutive_failures = 10     # Give up after N consecutive crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000          # Reset failure count after N ms healthy

After max_consecutive_failures, the worker slot stays empty and an alert is emitted. The pool operates with reduced capacity until an operator investigates.

Per-Worker Metrics

Each worker tracks:

struct WorkerMetrics {
    id: u32,
    state: WorkerState,
    requests_served: u64,
    requests_failed: u64,
    total_render_time_ms: u64,
    avg_render_time_ms: f64,
    peak_memory_bytes: u64,
    current_memory_bytes: u64,
    started_at: Instant,
    last_request_at: Option<Instant>,
    restarts: u32,
    consecutive_failures: u32,
}

Exposed via:

  • GET /__bext/workers — JSON array of all worker metrics
  • GET /metrics — Prometheus gauges per worker
  • bext ps CLI — tabular worker status

File-Watch Restart

Extend the existing plugin hot-reload pattern to JS workers:

[render.lifecycle]
watch_dirs = ["src/", "components/"]  # Restart workers on file change
watch_debounce_ms = 500               # Debounce rapid changes

On file change:

  1. Debounce for 500ms (batch rapid saves)
  2. Drain all workers gracefully
  3. Boot fresh workers with updated code
  4. Log: [bext] Workers restarted (file change: src/components/Header.tsx)

This is particularly useful for bext dev mode but also works in production for hot-deploying JS changes without a full redeploy.

Implementation

Worker Pool Refactor

pub struct WorkerPool {
    workers: Vec<Arc<RwLock<Worker>>>,
    config: WorkerLifecycleConfig,
    metrics: Arc<PoolMetrics>,  // pool-wide aggregates (type name illustrative)
    rotation_tx: mpsc::Sender<RotationEvent>,
}

struct Worker {
    id: u32,
    state: WorkerState,
    metrics: WorkerMetrics,
    isolate: JscIsolate,  // Or handle to JSC context
}

enum RotationEvent {
    MaxRequests(u32),      // Worker ID
    MaxMemory(u32),
    MaxUptime(u32),
    Crash(u32, String),    // Worker ID, error message
    FileChange(Vec<PathBuf>),
    ManualRestart,         // CLI trigger
}

Rotation Coordinator

Background task that handles rotation events:

async fn rotation_coordinator(
    pool: Arc<WorkerPool>,
    mut rx: mpsc::Receiver<RotationEvent>,
) {
    while let Some(event) = rx.recv().await {
        match event {
            RotationEvent::MaxRequests(id) | RotationEvent::MaxMemory(id) | RotationEvent::MaxUptime(id) => {
                pool.rotate_worker(id).await;
            }
            RotationEvent::Crash(id, err) => {
                tracing::error!(worker_id = id, error = %err, "worker crashed");
                pool.restart_with_backoff(id).await;
            }
            RotationEvent::FileChange(paths) => {
                tracing::info!(files = ?paths, "restarting all workers (file change)");
                pool.restart_all().await;
            }
            RotationEvent::ManualRestart => {
                pool.restart_all().await;
            }
        }
    }
}

Memory Monitoring

Check worker RSS periodically (every 10 seconds):

async fn memory_monitor(pool: Arc<WorkerPool>, config: &WorkerLifecycleConfig) {
    let mut interval = tokio::time::interval(Duration::from_secs(10));
    loop {
        interval.tick().await;
        for worker in &pool.workers {
            let mut w = worker.write().await;  // write lock: metrics are updated below
            if w.state == WorkerState::Active {
                let rss = w.isolate.memory_usage();
                w.metrics.current_memory_bytes = rss;
                w.metrics.peak_memory_bytes = w.metrics.peak_memory_bytes.max(rss);

                if config.max_memory_mb > 0 && rss > config.max_memory_mb * 1024 * 1024 {
                    let _ = pool.rotation_tx.send(RotationEvent::MaxMemory(w.id)).await;
                }
            }
        }
    }
}

Staggered Rotation

To avoid rotating all workers simultaneously:

fn should_rotate(&self, worker: &Worker) -> bool {
    if self.config.max_requests == 0 {
        return false;
    }
    // Stagger thresholds: spread each worker's limit across the band
    // max_requests..max_requests + 10%, keyed by worker ID, so workers
    // don't all hit their limit in the same window.
    let spread = self.config.max_requests / 10;
    let threshold = self.config.max_requests + (worker.id as u64 * spread / self.workers.len() as u64);
    worker.metrics.requests_served >= threshold
}

Config Reference

[render]
jsc_workers = 4                       # Number of workers

[render.lifecycle]
max_requests = 10000                  # Rotate after N renders (0 = never)
max_memory_mb = 512                   # Rotate if RSS > N MB (0 = no limit)
max_uptime_hours = 24                 # Rotate after N hours (0 = never)
drain_timeout_ms = 5000               # Grace period for in-flight requests
watch_dirs = []                       # File paths to watch for changes
watch_debounce_ms = 500               # Debounce file change events

[render.lifecycle.recovery]
max_consecutive_failures = 10         # Give up after N crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000              # Reset failure count after healthy period

CLI Integration

bext ps                               # Show worker status table
bext workers restart                  # Gracefully restart all workers
bext workers restart --worker 2       # Restart specific worker
bext workers stats                    # Detailed per-worker metrics

Example bext ps output:

Workers (4/4 active)
  ID  State    Requests  Memory   Uptime    Avg Render
  0   ACTIVE   3,421     142 MB   2h 15m    4.2ms
  1   ACTIVE   3,380     138 MB   2h 15m    3.8ms
  2   DRAINING 3,412     256 MB   2h 15m    4.1ms  (rotating: memory)
  3   ACTIVE   3,399     145 MB   2h 15m    3.9ms

Testing Plan

Test                        Type         What it validates
Max requests rotation       Unit         Worker rotated after N requests
Max memory rotation         Unit         Worker rotated when RSS exceeds limit
Max uptime rotation         Unit         Worker rotated after N hours
Graceful drain              Integration  In-flight request completes before worker stops
Crash backoff               Unit         Exponential backoff between restart attempts
Backoff reset               Unit         Healthy period resets failure counter
Max failures gives up       Unit         Worker slot stays empty after N crashes
Staggered rotation          Unit         Workers don't all rotate simultaneously
File watch restart          Integration  File change triggers worker restart
File watch debounce         Unit         Rapid saves produce single restart
Pool degradation            Unit         Pool works with reduced workers during rotation
Worker metrics              Unit         Counters increment correctly
Manual restart CLI          Integration  bext workers restart rotates all workers

Done Criteria

  • Workers rotate after max_requests renders
  • Workers rotate when RSS exceeds max_memory_mb
  • Workers rotate after max_uptime_hours
  • Graceful draining with in-flight request completion
  • Crash recovery with exponential backoff
  • Backoff reset after healthy period
  • Per-worker metrics (requests, memory, uptime, avg render time)
  • /__bext/workers JSON endpoint
  • Prometheus gauges per worker
  • bext ps CLI shows worker table
  • bext workers restart CLI command
  • File-watch triggered restart (with debounce)
  • Staggered rotation (workers don't rotate at the same time)
  • All tests passing