Runtime Recovery
Overview
This page describes the baseline recovery strategy implemented in the current phase.
The goal of Phase 1 recovery is not sophistication. It is to be explicit, observable, debuggable, and verifiable:
- unexpected FFmpeg exits trigger automatic restart
- manual stop does not trigger restart
- long periods without output trigger recovery
- recovery should preserve existing websocket clients whenever possible
- retries stop after the max restart count is reached
- a session is only destroyed when no websocket clients remain
At this stage, the runtime explicitly distinguishes between:
- FFmpeg process restart
- session teardown
These are not the same operation and must not be treated as one.
Goals
The current recovery strategy is designed to address:
- unexpected FFmpeg process exits
- invalid or unavailable upstream RTSP sources
- cases where FFmpeg still exists but produces no output
- uncontrolled infinite restart loops
- accidental websocket client cleanup during recovery
- accidental session destroy during process restart
Session Lifecycle Boundaries
1. Recovery restart is not session destroy
When a session enters recovery restart:
- the old FFmpeg process is stopped
- a new FFmpeg process is started after the old process exits
- existing websocket clients are preserved
- the session itself is not destroyed
In other words, restart is intended to recover stream output, not to end the session.
2. Manual stop is allowed to clear clients
When a session is intentionally stopped:
- the FFmpeg process is stopped
- further automatic restart is disabled
- websocket clients attached to the session are cleared and detached
- the session becomes eligible for final teardown
This means the session has been logically ended by runtime intent.
3. Session destroy is owned by StreamManager
Session teardown is not decided by FFmpeg process exit alone. It is decided by StreamManager.
The current rule is:
- a session may only be destroyed when no websocket clients remain
- if clients are still attached, the session must not be destroyed
- idle recovery / restart must not accidentally tear down a live session
This means:
- process exit != session end
- process restart != session destroy
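The teardown rule above can be expressed as a single predicate. The following is a minimal sketch, not the project's actual code: the `SessionView` shape and the `canDestroySession` name are hypothetical and only illustrate that process liveness plays no part in the decision.

```typescript
// Hypothetical, simplified view of a session (not the project's real type).
interface SessionView {
  clientCount: number;   // websocket clients currently attached
  processAlive: boolean; // whether an FFmpeg process is currently running
}

// A session may be destroyed only when no websocket clients remain.
// processAlive is deliberately ignored: process exit != session end.
function canDestroySession(session: SessionView): boolean {
  return session.clientCount === 0;
}
```

Under this sketch, a session whose FFmpeg process has exited but which still has clients attached is kept, so a later restart can resume output for those clients.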
Automatic Restart
When FFmpeg exits unexpectedly, the session enters the automatic restart flow.
The basic sequence is:
- record exit information
- determine whether this was a manual stop
- if not manual, wait for STREAM_RESTART_DELAY_MS
- increment restartCount
- attempt to start FFmpeg again
- stop retrying after STREAM_MAX_RESTARTS
Relevant state fields include:
- restartCount
- lastRestartAt
- lastExitCode
- lastExitSignal
Under the current lifecycle rules, automatic restart is expected to:
- preserve existing websocket clients whenever possible
- recover FFmpeg stream output
- avoid destroying the session by default
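The restart flow can be sketched as a pure decision function. This is an illustration under stated assumptions rather than the runtime's implementation: `decideOnExit`, `ExitContext`, and `ExitAction` are hypothetical names, and the only details taken from this page are the manual-stop check, the restart delay, the restart counter, and the `restartCount >= STREAM_MAX_RESTARTS` limit.

```typescript
// Hypothetical result type describing what to do after FFmpeg exits.
type ExitAction =
  | { kind: "none" }                                              // manual stop: do nothing
  | { kind: "restart"; delayMs: number; nextRestartCount: number } // schedule a restart
  | { kind: "errored" };                                          // retry limit reached

// Hypothetical input; field names are assumptions for this sketch.
interface ExitContext {
  manualStop: boolean;    // true when the stop was intentional
  restartCount: number;   // restarts attempted so far
  maxRestarts: number;    // value of STREAM_MAX_RESTARTS
  restartDelayMs: number; // value of STREAM_RESTART_DELAY_MS
}

function decideOnExit(ctx: ExitContext): ExitAction {
  // A manual stop never triggers a restart.
  if (ctx.manualStop) return { kind: "none" };
  // Stop retrying once the limit is reached.
  if (ctx.restartCount >= ctx.maxRestarts) return { kind: "errored" };
  // Otherwise wait for the configured delay, then start FFmpeg again.
  return {
    kind: "restart",
    delayMs: ctx.restartDelayMs,
    nextRestartCount: ctx.restartCount + 1,
  };
}
```

Keeping the decision separate from timers and process spawning makes each branch easy to log and to cover in a lifecycle test without starting FFmpeg.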
Manual Stop Does Not Restart
This is a key rule in the current phase.
If a session is intentionally stopped by runtime logic, for example:
- the last websocket client disconnects
- the stream manager explicitly stops the session
- stop() is called intentionally
then FFmpeg exit must not trigger another restart.
This avoids:
- restarting a session that has already been logically ended
- consuming resources with no active clients
- interfering with manager cleanup behavior
At the same time, manual stop is allowed to:
- clear websocket client bindings
- move the session toward final teardown
Idle Recovery
Background
Some failures do not appear as a direct process exit. FFmpeg may remain alive while upstream is already unusable, or the process may no longer produce valid stdout data.
Relying only on exit/error events is not enough to detect this class of failure.
Detection Rule
The manager sweeps active sessions using STREAM_SWEEP_INTERVAL_MS.
A session is treated as idle/stalled when all the following are true:
- it still has websocket clients
- it is in running state
- lastDataAt has not been updated for too long
- the time elapsed since lastDataAt exceeds STREAM_IDLE_TIMEOUT_MS
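The detection rule can be sketched as a predicate evaluated on each sweep. This is a minimal sketch with assumptions: `isStalled` and `IdleCheckInput` are hypothetical names, and `lastDataAt` is assumed to be an epoch timestamp in milliseconds.

```typescript
// Hypothetical subset of a session snapshot used for the idle check.
interface IdleCheckInput {
  clientCount: number; // attached websocket clients
  state: string;       // lifecycle state, e.g. "running"
  lastDataAt: number;  // assumed: epoch milliseconds of the last stdout data
}

// Returns true only when all conditions of the detection rule hold.
function isStalled(
  session: IdleCheckInput,
  now: number,
  idleTimeoutMs: number, // value of STREAM_IDLE_TIMEOUT_MS
): boolean {
  return (
    session.clientCount > 0 &&
    session.state === "running" &&
    now - session.lastDataAt > idleTimeoutMs
  );
}
```

Passing `now` in explicitly, instead of calling `Date.now()` inside the function, keeps the check deterministic under test.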
Recovery Action
In Phase 1, the simplest and safest action is used:
- trigger restart()
- restart() only recovers the FFmpeg process
- existing websocket clients are preserved during recovery
- restart does not destroy the session
This allows the idle recovery path to reuse the existing restart lifecycle instead of introducing a second recovery state machine.
Max Restart Protection
To avoid endless retry loops, the runtime enforces a maximum restart count.
When:
restartCount >= STREAM_MAX_RESTARTS
the session:
- stops attempting automatic restart
- enters an error state
- logs an error event
This helps prevent:
- endless noisy retry loops
- repeated process creation with no recovery chance
- uncontrolled resource usage
Note:
- reaching the max restart limit moves the session into errored
- this does not automatically mean immediate session teardown
- final cleanup is still decided by StreamManager based on client presence and cleanup flow
Related Configuration
The main recovery-related env variables are:
- STREAM_IDLE_TIMEOUT_MS
- STREAM_SWEEP_INTERVAL_MS
- STREAM_RESTART_DELAY_MS
- STREAM_MAX_RESTARTS
These should usually differ by environment.
Local debugging
Use smaller values to make recovery easier to verify, for example:
- shorter idle timeout
- shorter restart delay
- smaller max restart count
Stable runtime environments
Use more conservative values to avoid overly sensitive recovery behavior, for example:
- a longer idle timeout, so brief upstream pauses do not trigger recovery
- a longer restart delay, so a failing source is not retried in rapid succession
- bounded automatic recovery attempts
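One way to read these variables is a small loader that validates each value and falls back to a default. This is a sketch only: `readIntEnv`, `loadRecoveryConfig`, and `RecoveryConfig` are hypothetical names, and the fallback numbers are placeholders chosen for illustration, not the project's actual defaults.

```typescript
// Hypothetical config shape for the four recovery-related variables.
interface RecoveryConfig {
  idleTimeoutMs: number;
  sweepIntervalMs: number;
  restartDelayMs: number;
  maxRestarts: number;
}

type Env = Record<string, string | undefined>;

// Parses a non-negative integer, using the fallback when the variable is unset.
function readIntEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

// Fallback values below are illustrative placeholders, not project defaults.
function loadRecoveryConfig(env: Env = process.env): RecoveryConfig {
  return {
    idleTimeoutMs: readIntEnv(env, "STREAM_IDLE_TIMEOUT_MS", 15_000),
    sweepIntervalMs: readIntEnv(env, "STREAM_SWEEP_INTERVAL_MS", 5_000),
    restartDelayMs: readIntEnv(env, "STREAM_RESTART_DELAY_MS", 2_000),
    maxRestarts: readIntEnv(env, "STREAM_MAX_RESTARTS", 5),
  };
}
```

Failing fast on a malformed value avoids a silent fallback that would make recovery behave differently from what the environment file suggests.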
Current Limitations
Phase 1 intentionally keeps recovery simple and does not yet include:
- complex backoff
- rolling-window retry accounting
- multi-stage recovery policies
- shared-upstream-aware recovery
- direct integration with metrics / alerting systems
- multi-priority recovery orchestration
- a more complex session state machine
At this stage, the priorities are:
- clear lifecycle boundaries
- avoiding accidental websocket client cleanup
- clear manager cleanup ownership
- recovery paths that are easy to trace and debug
Debugging Tips
When recovery behavior looks wrong, inspect in this order:
- /healthz
- session snapshot fields such as state / restartCount / lastRestartAt / lastDataAt
- FFmpeg stderr logs
- exit / restart / idle recovery logs
- RTSP source validity
- whether websocket clients are still attached
- whether session destroy was triggered unexpectedly
The most important log fields are:
- streamId
- sessionId
- pid
- reason
If the problem is “recovery completed but clients still receive no media”, check these first:
- whether clients are still attached after restart
- whether FFmpeg restarted successfully after the old process exited
- whether lastDataAt continues to move forward
- whether clientCount was unexpectedly reduced to zero during recovery
Local Testing and Verification
To keep runtime recovery behavior stable as the project evolves, rtsp-ws-bridge includes lifecycle-oriented regression tests.
These tests do not aim to validate FFmpeg encoding itself. Their main purpose is to protect the most important runtime guarantees of Phase 1:
- FFmpeg session lifecycle correctness
- restart / recovery semantics
- websocket client retention and cleanup behavior
- stream manager orchestration behavior
- /healthz response structure stability
Current test approach
The current test setup uses:
- Node built-in test runner: node:test
- TypeScript execution via: tsx
- Test directory: apps/rtsp-ws-bridge/test/
Why this approach is used:
- no heavy test framework is required
- it matches the current TypeScript / Node setup of the repository
- it is sufficient for Phase 1 lifecycle regression coverage
- it can be extended later without introducing early tooling complexity
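For orientation, a lifecycle test in this style might look like the sketch below. It is not taken from the repository: `FakeSession` is a hypothetical stand-in used only to show the shape of a node:test case, whereas the real tests exercise the project's own session and manager classes.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical stand-in for a session; the real class is not shown on this page.
class FakeSession {
  readonly clients = new Set<string>();
  restartCount = 0;
  state: "running" | "stopped" = "running";

  // Recovery restart: the process is replaced, clients are kept.
  restart(): void {
    this.restartCount += 1;
    this.state = "running";
  }

  // Manual stop: the session ends and client bindings are cleared.
  stop(): void {
    this.state = "stopped";
    this.clients.clear();
  }
}

test("recovery restart preserves websocket clients", () => {
  const session = new FakeSession();
  session.clients.add("client-a");
  session.restart();
  assert.equal(session.clients.size, 1);
  assert.equal(session.restartCount, 1);
});

test("manual stop clears websocket clients", () => {
  const session = new FakeSession();
  session.clients.add("client-a");
  session.stop();
  assert.equal(session.clients.size, 0);
  assert.equal(session.state, "stopped");
});
```

Each case asserts one lifecycle guarantee, so a failure points directly at the boundary that was broken.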
Common commands
Run all rtsp-ws-bridge tests:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge test
Run typecheck:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge typecheck
Run lint:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge lint
Build the app:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge build
Recommended local execution order
Before submitting changes, it is recommended to run:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge test
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge typecheck
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge lint
If the current change touches runtime logic, also run:
pnpm --filter @ad-stream-bridge/rtsp-ws-bridge build
Current test coverage
ffmpeg-session.lifecycle.test.ts
Mainly covers:
- whether recovery restart preserves websocket clients
- whether manual stop clears websocket clients
- whether unexpected exit triggers automatic restart
- whether the session enters errored after the max restart limit is reached
stream-manager.lifecycle.test.ts
Mainly covers:
- whether the first client attach triggers session start
- whether multiple clients on the same stream reuse the same session
- whether the last client disconnect triggers stop + destroy
- whether the websocket error path also triggers cleanup
stream-manager.idle-recovery.test.ts
Mainly covers:
- whether idle timeout triggers restart
- whether no-client sessions are destroyed instead of restarted
- whether non-running sessions avoid accidental idle recovery
health-route.test.ts
Mainly covers:
- whether /healthz returns 200
- whether the top-level response structure is complete
- whether bridge / sessions keep a stable structure
- whether streamManager data is correctly exposed by the route
Debugging tips
If a test fails, inspect in this order:
- determine whether the test itself is wrong or the runtime behavior has changed
- if a session lifecycle test fails, inspect:
  - state
  - restartCount
  - lastRestartAt
  - lastStartedAt
  - lastStoppedAt
  - lastDataAt
  - clientCount
- if a manager test fails, check:
  - whether a session was created more than once
  - whether start() was triggered multiple times
  - whether stop() only happens when no clients remain
  - whether websocket close / error paths reached cleanup
- if a /healthz test fails, check:
  - whether the route is registered
  - whether streamManager.getRuntimeStats() changed shape
  - whether streamManager.getAllSessionSnapshots() still returns serializable data
Note
These tests are primarily designed to protect lifecycle semantics, not to replace real RTSP / FFmpeg environment validation.
In other words:
- tests protect runtime behavior and boundaries
- real local RTSP source validation confirms external dependency behavior
Both are important, but they serve different purposes and should not be treated as the same thing.