Phase 6: Worker Lifecycle Management
Goal
Smart lifecycle management for JSC render workers — automatic rotation, crash recovery with backoff, per-worker metrics, and file-watch triggered restarts. Borrowed from FrankenPHP's battle-tested worker mode.
Current State
- Shared 4-worker JSC render pool (configurable)
- Workers are long-lived — never rotated
- No max-requests-per-worker limit
- No crash recovery with backoff
- No per-worker metrics (requests served, memory, uptime)
- Plugin hot-reload exists but doesn't cover JS workers
Why This Matters
Long-lived JS workers accumulate:
- Memory leaks in user code (closures, event listeners, growing caches)
- Stale state from module-level variables that persist across requests
- V8/JSC fragmentation — the heap compactor can't fully defragment
FrankenPHP's MAX_REQUESTS pattern forces periodic worker restart, keeping memory bounded. Combined with exponential backoff on crashes, the system self-heals without operator intervention.
Design
Worker States
┌──────────┐
│ BOOTING │──── startup error ──▶ FAILED ──▶ backoff ──▶ BOOTING
└────┬─────┘
│ ready
▼
┌──────────┐
│ ACTIVE │──── request ──▶ process ──▶ ACTIVE
└────┬─────┘
│ max_requests reached OR memory limit OR watch trigger
▼
┌──────────┐
│ DRAINING │──── finish in-flight ──▶ STOPPED ──▶ BOOTING (new worker)
└──────────┘
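The state machine above can be sketched as a plain enum plus a transition function. This is an illustrative sketch only; the enum and event names are assumptions, not the shipped types:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkerState {
    Booting,
    Active,
    Draining,
    Stopped,
    Failed,
}

#[derive(Debug, Clone, Copy)]
enum WorkerEvent {
    Ready,
    StartupError,
    BackoffElapsed,
    RotationTriggered, // max_requests, memory limit, or watch trigger
    InFlightDrained,
}

/// Returns the next state, or None for an illegal transition.
fn transition(state: WorkerState, event: WorkerEvent) -> Option<WorkerState> {
    use WorkerEvent::*;
    use WorkerState::*;
    match (state, event) {
        (Booting, Ready) => Some(Active),
        (Booting, StartupError) => Some(Failed),
        (Failed, BackoffElapsed) => Some(Booting),
        (Active, RotationTriggered) => Some(Draining),
        (Draining, InFlightDrained) => Some(Stopped), // pool boots a fresh worker next
        _ => None,
    }
}
```

Making illegal transitions unrepresentable (returning None) keeps the coordinator honest: a crash mid-drain can't silently flip a worker back to ACTIVE.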
Rotation Policy
[render]
jsc_workers = 4
[render.lifecycle]
max_requests = 10000 # Rotate after N renders (0 = never)
max_memory_mb = 512 # Rotate if RSS exceeds N MB (0 = no limit)
max_uptime_hours = 24 # Rotate after N hours (0 = never)
drain_timeout_ms = 5000 # Wait for in-flight requests before force-kill
When a worker hits any threshold:
- Mark the worker as DRAINING — stop accepting new requests
- Wait for in-flight requests to complete (up to drain_timeout_ms)
- Stop the old worker
- Boot a fresh worker in its place
- Mark the new worker as ACTIVE
During draining, the pool routes requests to remaining active workers. With 4 workers, at most 1 is draining at a time (staggered rotation).
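The rotation triggers reduce to a pure threshold check. The sketch below is illustrative: the config field names mirror [render.lifecycle], but the function and struct themselves are assumptions:

```rust
// Assumed config shape; field names follow [render.lifecycle].
struct LifecycleConfig {
    max_requests: u64,     // 0 = never
    max_memory_mb: u64,    // 0 = no limit
    max_uptime_hours: u64, // 0 = never
}

#[derive(Debug, PartialEq)]
enum RotateReason {
    MaxRequests,
    MaxMemory,
    MaxUptime,
}

/// Returns why a worker should start draining, or None if it's healthy.
fn rotation_reason(
    cfg: &LifecycleConfig,
    requests_served: u64,
    rss_bytes: u64,
    uptime_secs: u64,
) -> Option<RotateReason> {
    if cfg.max_requests > 0 && requests_served >= cfg.max_requests {
        return Some(RotateReason::MaxRequests);
    }
    if cfg.max_memory_mb > 0 && rss_bytes > cfg.max_memory_mb * 1024 * 1024 {
        return Some(RotateReason::MaxMemory);
    }
    if cfg.max_uptime_hours > 0 && uptime_secs >= cfg.max_uptime_hours * 3600 {
        return Some(RotateReason::MaxUptime);
    }
    None
}
```

Note the zero-means-disabled convention: each check is guarded so a 0 in the config never triggers rotation.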
Crash Recovery with Backoff
When a worker crashes (JSC exception, segfault, OOM kill):
Attempt 1: restart immediately
Attempt 2: wait 100ms
Attempt 3: wait 300ms
Attempt 4: wait 900ms
Attempt 5: wait 2.7s
...
Max backoff: 60s
Reset backoff after: 60s of healthy operation
[render.lifecycle]
max_consecutive_failures = 10 # Give up after N consecutive crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000 # Reset failure count after N ms healthy
After max_consecutive_failures, the worker slot stays empty and an alert is emitted. The pool operates with reduced capacity until an operator investigates.
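A minimal sketch of the backoff schedule, assuming delay = backoff_initial_ms * backoff_multiplier^(attempt - 2), capped at backoff_max_ms, with the first restart immediate (the exact formula is an assumption; only the config names come from above):

```rust
/// Delay in ms before restart attempt `attempt` (1-based).
/// Attempt 1 restarts immediately; later attempts grow geometrically
/// and are capped at `max_ms`.
fn backoff_delay_ms(attempt: u32, initial_ms: u64, max_ms: u64, multiplier: f64) -> u64 {
    if attempt <= 1 {
        return 0; // first restart is immediate
    }
    let exp = (attempt - 2) as i32;
    let delay = initial_ms as f64 * multiplier.powi(exp);
    (delay as u64).min(max_ms)
}
```

With the defaults (initial 100ms, multiplier 3.0, cap 60s), delays run 0, 100ms, 300ms, 900ms, 2.7s, and hit the 60s ceiling around the eighth attempt.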
Per-Worker Metrics
Each worker tracks:
struct WorkerMetrics {
    id: u32,
    state: WorkerState,
    requests_served: u64,
    requests_failed: u64,
    total_render_time_ms: u64,
    avg_render_time_ms: f64,
    peak_memory_bytes: u64,
    current_memory_bytes: u64,
    started_at: Instant,
    last_request_at: Option<Instant>,
    restarts: u32,
    consecutive_failures: u32,
}
Exposed via:
- GET /__bext/workers — JSON array of all worker metrics
- GET /metrics — Prometheus gauges per worker
- bext ps CLI — tabular worker status
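A hedged sketch of how the per-worker Prometheus gauges might be rendered in the text exposition format; the metric names (bext_worker_*) and snapshot struct are assumptions, not the real exporter:

```rust
// Minimal snapshot of the fields we export; assumed shape.
struct WorkerSnapshot {
    id: u32,
    requests_served: u64,
    current_memory_bytes: u64,
}

/// Renders one gauge line per metric per worker, labeled by worker ID,
/// in the Prometheus text exposition format.
fn prometheus_lines(workers: &[WorkerSnapshot]) -> String {
    let mut out = String::new();
    for w in workers {
        out.push_str(&format!(
            "bext_worker_requests_total{{worker=\"{}\"}} {}\n",
            w.id, w.requests_served
        ));
        out.push_str(&format!(
            "bext_worker_memory_bytes{{worker=\"{}\"}} {}\n",
            w.id, w.current_memory_bytes
        ));
    }
    out
}
```

Keeping the worker ID as a label (rather than in the metric name) lets dashboards aggregate across workers or break them out per slot.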
File-Watch Restart
Extend the existing plugin hot-reload pattern to JS workers:
[render.lifecycle]
watch_dirs = ["src/", "components/"] # Restart workers on file change
watch_debounce_ms = 500 # Debounce rapid changes
On file change:
- Debounce for 500ms (batch rapid saves)
- Drain all workers gracefully
- Boot fresh workers with updated code
- Log:
[bext] Workers restarted (file change: src/components/Header.tsx)
This is particularly useful for bext dev mode but also works in production for hot-deploying JS changes without full redeploy.
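The debounce is trailing-edge: a restart fires only after watch_debounce_ms of quiet following the last change event. A pure sketch over event timestamps for illustration (the real watcher would be async; the function is hypothetical):

```rust
/// Counts how many restarts a trailing-edge debounce would fire for a
/// sorted list of file-change timestamps (ms). An event "settles" (and
/// triggers a restart) when no further event arrives within `window_ms`.
fn debounced_restarts(event_times_ms: &[u64], window_ms: u64) -> u32 {
    let mut restarts = 0;
    for (i, &t) in event_times_ms.iter().enumerate() {
        let quiet = match event_times_ms.get(i + 1) {
            Some(&next) => next - t >= window_ms,
            None => true, // last event always settles eventually
        };
        if quiet {
            restarts += 1;
        }
    }
    restarts
}
```

So a burst of saves at t=0, 100, 200 with a 500ms window produces a single restart, while two saves 1s apart produce two.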
Implementation
Worker Pool Refactor
pub struct WorkerPool {
    workers: Vec<Arc<RwLock<Worker>>>,
    config: WorkerLifecycleConfig,
    metrics: Arc<PoolMetrics>,
    rotation_tx: mpsc::Sender<RotationEvent>,
}
struct Worker {
    id: u32,
    state: WorkerState,
    metrics: WorkerMetrics,
    isolate: JscIsolate, // Or handle to JSC context
}
enum RotationEvent {
    MaxRequests(u32), // Worker ID
    MaxMemory(u32),
    MaxUptime(u32),
    Crash(u32, String), // Worker ID, error message
    FileChange(Vec<PathBuf>),
    ManualRestart, // CLI trigger
}
Rotation Coordinator
Background task that handles rotation events:
async fn rotation_coordinator(
    pool: Arc<WorkerPool>,
    mut rx: mpsc::Receiver<RotationEvent>,
) {
    while let Some(event) = rx.recv().await {
        match event {
            RotationEvent::MaxRequests(id)
            | RotationEvent::MaxMemory(id)
            | RotationEvent::MaxUptime(id) => {
                pool.rotate_worker(id).await;
            }
            RotationEvent::Crash(id, err) => {
                tracing::error!(worker_id = id, error = %err, "worker crashed");
                pool.restart_with_backoff(id).await;
            }
            RotationEvent::FileChange(paths) => {
                tracing::info!(files = ?paths, "restarting all workers (file change)");
                pool.restart_all().await;
            }
            RotationEvent::ManualRestart => {
                pool.restart_all().await;
            }
        }
    }
}
Memory Monitoring
Check worker RSS periodically (every 10 seconds):
async fn memory_monitor(pool: Arc<WorkerPool>, config: &WorkerLifecycleConfig) {
    let mut interval = tokio::time::interval(Duration::from_secs(10));
    loop {
        interval.tick().await;
        for worker in &pool.workers {
            // Write lock: we update the metrics fields below.
            let mut w = worker.write().await;
            if w.state == WorkerState::Active {
                let rss = w.isolate.memory_usage();
                w.metrics.current_memory_bytes = rss;
                w.metrics.peak_memory_bytes = w.metrics.peak_memory_bytes.max(rss);
                if config.max_memory_mb > 0 && rss > config.max_memory_mb * 1024 * 1024 {
                    let _ = pool.rotation_tx.send(RotationEvent::MaxMemory(w.id)).await;
                }
            }
        }
    }
}
Staggered Rotation
To avoid rotating all workers simultaneously:
fn should_rotate(&self, worker: &Worker) -> bool {
    if self.config.max_requests == 0 {
        return false;
    }
    // Stagger: offset each worker's threshold by up to +10%,
    // scaled by worker ID, so rotations don't all fire at once.
    let jitter = self.config.max_requests / 10;
    let threshold = self.config.max_requests
        + (worker.id as u64 * jitter / self.workers.len() as u64);
    worker.metrics.requests_served >= threshold
}
Config Reference
[render]
jsc_workers = 4 # Number of workers
[render.lifecycle]
max_requests = 10000 # Rotate after N renders (0 = never)
max_memory_mb = 512 # Rotate if RSS > N MB (0 = no limit)
max_uptime_hours = 24 # Rotate after N hours (0 = never)
drain_timeout_ms = 5000 # Grace period for in-flight requests
watch_dirs = [] # File paths to watch for changes
watch_debounce_ms = 500 # Debounce file change events
[render.lifecycle.recovery]
max_consecutive_failures = 10 # Give up after N crashes
backoff_initial_ms = 100
backoff_max_ms = 60000
backoff_multiplier = 3.0
healthy_reset_ms = 60000 # Reset failure count after healthy period
CLI Integration
bext ps # Show worker status table
bext workers restart # Gracefully restart all workers
bext workers restart --worker 2 # Restart specific worker
bext workers stats # Detailed per-worker metrics
Example bext ps output:
Workers (3/4 active)
ID State Requests Memory Uptime Avg Render
0 ACTIVE 3,421 142 MB 2h 15m 4.2ms
1 ACTIVE 3,380 138 MB 2h 15m 3.8ms
2 DRAINING 3,412 256 MB 2h 15m 4.1ms (rotating: memory)
3 ACTIVE 3,399 145 MB 2h 15m 3.9ms
Testing Plan
| Test | Type | What it validates |
|---|---|---|
| Max requests rotation | Unit | Worker rotated after N requests |
| Max memory rotation | Unit | Worker rotated when RSS exceeds limit |
| Max uptime rotation | Unit | Worker rotated after N hours |
| Graceful drain | Integration | In-flight request completes before worker stops |
| Crash backoff | Unit | Exponential backoff between restart attempts |
| Backoff reset | Unit | Healthy period resets failure counter |
| Max failures gives up | Unit | Worker slot stays empty after N crashes |
| Staggered rotation | Unit | Workers don't all rotate simultaneously |
| File watch restart | Integration | File change triggers worker restart |
| File watch debounce | Unit | Rapid saves produce single restart |
| Pool degradation | Unit | Pool works with reduced workers during rotation |
| Worker metrics | Unit | Counters increment correctly |
| Manual restart CLI | Integration | bext workers restart rotates all |
Done Criteria
- Workers rotate after max_requests renders
- Workers rotate when RSS exceeds max_memory_mb
- Workers rotate after max_uptime_hours
- Graceful draining with in-flight request completion
- Crash recovery with exponential backoff
- Backoff reset after healthy period
- Per-worker metrics (requests, memory, uptime, avg render time)
- /__bext/workers JSON endpoint
- Prometheus gauges per worker
- bext ps CLI shows worker table
- bext workers restart CLI command
- File-watch triggered restart (with debounce)
- Staggered rotation (workers don't rotate at the same time)
- All tests passing