URL: https://mkdocs.justinsforge.com/memory/handoffs/spawn-tmux-harness-2026-04-30/
Spawn / tmux Self-Healing Harness
Closes the forge-tmux-anchor item in the Pending section of reference_reboot_resilience. Composes with the existing reboot resilience stack (fleet probe, pre-reboot save, boot notify, post-boot sessions, /resume skill).
Why this exists
On 2026-04-30 ~14:39 CDT the remote-bridge bot returned to Telegram:
Spawn failed (rc=1).
BOOT FAILED, '❯' prompt never appeared. Check auth / API key.
error creating /tmp/tmux-1000/default (Address already in use)
no server running on /tmp/tmux-1000/default
Cause: a previous tmux server crashed and left /tmp/tmux-1000/default as a stale Unix socket file. New tmux new-session calls tried to bind that socket and failed. One stale socket poisoned every future /spawn until manually cleaned up.
The home-base session stayed healthy through this because it uses tmux -L homebase (its own socket), so the rot was contained to the shared default socket. That isolation is the model.
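The failure mode above can be reproduced at the OS level without tmux. The following is a minimal sketch, not the harness itself: the helper names (`make_stale_socket`, `classify_socket`, `bind_errno`) are hypothetical, and it only assumes that a Unix socket file left behind by a dead listener refuses connections and blocks a fresh bind on the same path.

```python
import errno
import os
import socket
import tempfile


def make_stale_socket(path: str) -> None:
    """Bind a Unix socket at `path`, then close it without unlinking the file."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    srv.close()  # the listener is gone, but the socket file stays on disk


def classify_socket(path: str) -> str:
    """Return 'missing', 'live', or 'stale' for a Unix socket path."""
    if not os.path.exists(path):
        return "missing"
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return "live"
    except ConnectionRefusedError:
        return "stale"  # the file exists, but nothing is listening
    finally:
        client.close()


def bind_errno(path: str) -> int:
    """Try to bind a fresh listener at `path`; return 0 or the errno raised."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        srv.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "default")
        make_stale_socket(path)
        print(classify_socket(path))
        print(bind_errno(path) == errno.EADDRINUSE)
```

Unlinking the stale file is enough to make the next bind succeed, which is why the manual `rm` unblocks spawning.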
Goal
Make spawning resilient to crashes. One stale socket (or a crashed tmux server) must never block a future spawn. Plays into the reboot resilience drill: post-boot Opus 4.7 sessions must spawn cleanly after every reboot.
Surfaces involved
- forge/skills/spawn/ (the /spawn skill, Justin's primary entrypoint)
- forge/scripts/forge_spawn*.py (search for the actual spawn implementation; helpers in forge/scripts/forge_telegram_*remote*.py)
- forge/scripts/forge_telegram_remote_bridge_bot.py (the bot that surfaced the failure)
- forge/scripts/forge_post_boot_sessions.sh (post-boot Opus session spawner; audit + align)
- forge/scripts/forge_fleet_probe.sh (add socket-health row)
- New: forge-tmux-socket-sweep.service + .timer
Required changes
1. Pre-flight socket sweep
Before every spawn, walk /tmp/tmux-1000/* (and any other socket dir tmux is using). For each socket, run tmux -S <path> list-sessions with a 2s timeout. If the command fails because no server is listening, rm the socket. Idempotent, cheap, no human in the loop. Log each removal. Skip live sockets (the ping handles this correctly).
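One way the sweep could look, as a sketch under stated assumptions rather than the actual forge implementation. The names `sweep_stale_sockets`, `server_is_dead`, and `DEAD_MARKERS` are hypothetical. The "no server running" marker comes from the failure log above; "error connecting" is an assumed second marker that should be checked against the installed tmux. A probe that times out is treated as not provably dead, so the socket is left alone.

```python
import logging
import stat
import subprocess
from pathlib import Path
from typing import Callable, List

log = logging.getLogger("tmux_socket_sweep")

# Assumption: stderr fragments that indicate nothing is listening on the socket.
DEAD_MARKERS = ("no server running", "error connecting")


def server_is_dead(sock: Path, timeout: float = 2.0) -> bool:
    """Ping the socket with `tmux -S <path> list-sessions` under a timeout."""
    try:
        result = subprocess.run(
            ["tmux", "-S", str(sock), "list-sessions"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung server is not provably dead; leave it alone
    if result.returncode == 0:
        return False
    return any(marker in result.stderr for marker in DEAD_MARKERS)


def sweep_stale_sockets(
    socket_dir, is_dead: Callable[[Path], bool] = server_is_dead
) -> List[Path]:
    """Remove dead sockets in `socket_dir`; return the paths that were removed."""
    removed: List[Path] = []
    directory = Path(socket_dir)
    if not directory.is_dir():
        return removed
    for entry in sorted(directory.iterdir()):
        try:
            if not stat.S_ISSOCK(entry.lstat().st_mode):
                continue  # never touch regular files or directories
        except FileNotFoundError:
            continue  # vanished between listing and stat
        if is_dead(entry):
            entry.unlink(missing_ok=True)
            log.info("removed stale tmux socket %s", entry)
            removed.append(entry)
    return removed
```

A caller would run `sweep_stale_sockets("/tmp/tmux-1000")` before each spawn. Running it twice in a row is safe: the second pass finds nothing to remove.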
2. Per-session socket isolation
Stop using the shared default socket for new spawns. Use tmux -L spawn-<sanitized-session-name> so each spawn gets its own socket. One crashed server cannot poison the others. Mirror the homebase pattern.
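A sketch of how the per-spawn socket label might be derived. The sanitization rule (lowercase, runs of anything outside `a-z0-9_-` collapsed to a single hyphen, capped length) and the names `sanitize`, `socket_label`, and `new_session_argv` are assumptions for illustration; the source only specifies the `spawn-<sanitized-session-name>` shape.

```python
import re
from typing import List

MAX_NAME_LEN = 48  # assumption: keeps the socket path well under the OS limit


def sanitize(session_name: str) -> str:
    """Reduce a session name to characters that are safe in a socket filename."""
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", session_name.lower()).strip("-")
    cleaned = cleaned[:MAX_NAME_LEN].rstrip("-")
    if not cleaned:
        raise ValueError(f"session name {session_name!r} has no usable characters")
    return cleaned


def socket_label(session_name: str) -> str:
    """Label passed to `tmux -L`, one per spawn."""
    return f"spawn-{sanitize(session_name)}"


def new_session_argv(session_name: str, command: str) -> List[str]:
    """Build the argv for a detached session on its own socket."""
    name = sanitize(session_name)
    return ["tmux", "-L", f"spawn-{name}", "new-session", "-d", "-s", name, command]
```

For example, `socket_label("Opus 4.7 / resume")` yields `spawn-opus-4-7-resume`, so a crash in that server leaves every other spawn's socket untouched.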
3. Background janitor (boot + timer)
Add a systemd user service + timer:
- forge-tmux-socket-sweep.service — oneshot, runs the sweep
- forge-tmux-socket-sweep.timer — every 10 min
- Service ordering: Before=forge-post-boot-sessions.service so post-boot Opus spawns hit a clean tree at boot.
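- Also fire as a oneshot at boot (After=network.target, WantedBy=multi-user.target). Those targets belong to the system manager; if this ships as a user unit as stated above, the boot hook would likely need WantedBy=default.target instead.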
4. Spawn retry-once-with-cleanup
If the first tmux new-session attempt fails with a socket error (Address already in use, error creating, lost server), run the pre-flight sweep again and retry exactly once. Only escalate to the user on a second consecutive failure. Surface the cleanup in the success message (spawned after cleaning 1 stale socket) so it is not silent.
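The retry policy can be sketched as a small wrapper around two injected callables. `SpawnResult`, `spawn_with_retry`, and the callable signatures are hypothetical; only the three error markers and the success-message wording come from the text above.

```python
from dataclasses import dataclass
from typing import Callable

# Socket-error fragments listed in the handoff.
SOCKET_ERROR_MARKERS = ("Address already in use", "error creating", "lost server")


@dataclass
class SpawnResult:
    ok: bool
    stderr: str = ""


def is_socket_error(stderr: str) -> bool:
    return any(marker in stderr for marker in SOCKET_ERROR_MARKERS)


def spawn_with_retry(
    spawn: Callable[[], SpawnResult], sweep: Callable[[], int]
) -> str:
    """Spawn once; on a socket error, sweep and retry exactly once.

    `sweep` returns the number of stale sockets it removed.
    Raises RuntimeError when the failure should be escalated to the user.
    """
    first = spawn()
    if first.ok:
        return "spawned"
    if not is_socket_error(first.stderr):
        raise RuntimeError(f"spawn failed: {first.stderr}")  # not ours to fix
    cleaned = sweep()
    second = spawn()
    if second.ok:
        noun = "socket" if cleaned == 1 else "sockets"
        return f"spawned after cleaning {cleaned} stale {noun}"
    raise RuntimeError(f"spawn failed twice: {second.stderr}")
```

Keeping `spawn` and `sweep` as parameters means the policy can be exercised with fakes, without a tmux server in the loop.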
5. Audit forge_post_boot_sessions.sh
It currently spawns onto tmux -L homebase, which is fine. Confirm no internal path falls back to the shared default socket. If any do, switch them to per-spawn isolation.
6. Socket-health row in forge_fleet_probe.sh
Add: a count of socket entries under /tmp/tmux-*/ where tmux -S <socket> list-sessions fails. This is one extra row in the Telegram boot-notify summary, and it surfaces rot before it bites a spawn.
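The count itself is small enough to sketch. forge_fleet_probe.sh is a shell script, so this Python version only illustrates the logic; the function names, the glob pattern `/tmp/tmux-*/*`, and the row wording are all assumptions. Unlike the sweep, a probe that times out is counted as failing here, since the row only reports and never deletes.

```python
import glob
import os
import stat
import subprocess
from typing import Callable


def tmux_probe_fails(sock: str, timeout: float = 2.0) -> bool:
    """True when `tmux -S <sock> list-sessions` exits non-zero or times out."""
    try:
        proc = subprocess.run(
            ["tmux", "-S", sock, "list-sessions"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return True  # report-only: a hung server is worth surfacing
    return proc.returncode != 0


def socket_health_row(
    pattern: str = "/tmp/tmux-*/*",
    probe_fails: Callable[[str], bool] = tmux_probe_fails,
) -> str:
    """Build a one-line summary of tmux socket health for the boot notify."""
    sockets = [
        path for path in sorted(glob.glob(pattern))
        if stat.S_ISSOCK(os.lstat(path).st_mode)  # ignore non-socket files
    ]
    dead = sum(1 for path in sockets if probe_fails(path))
    return f"tmux sockets: {len(sockets)} total, {dead} dead"
```

A healthy fleet would report zero dead sockets, which matches the reboot check in the validation list.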
Validation
- Manually create a stale socket: tmux -L stale new-session -d -s test 'sleep 60'; tmux -L stale kill-server. Confirm /tmp/tmux-1000/stale exists. Run /spawn. Verify the stale socket is gone and the spawn succeeded.
- Kill a non-homebase tmux server mid-spawn. Verify the harness still spawns into its own isolated socket without touching homebase.
- Reboot Console. Verify the boot-time sweep ran, post-boot Opus sessions came up cleanly, fleet probe shows 0 dead sockets.
- Confirm homebase was not touched at any point.
Out of scope
- Don't change the home-base session; it is already healthy and isolated.
- Don't change tmux.conf defaults; the fix is purely orchestration-layer.
Immediate unblock for Justin
rm /tmp/tmux-1000/default on Console clears today's poisoned socket. Do this first so /spawn works while the harness is being built.
Composition with reboot resilience
| Resilience piece | Interaction |
|---|---|
| Pre-reboot save | Snapshots active session UUIDs. Harness ensures those UUIDs can actually spawn-resume after boot. |
| Boot notify | Will surface socket-health row once item 6 ships. |
| Post-boot sessions | Depends on harness for clean spawn. Add After=forge-tmux-socket-sweep.service to its unit. |
| /resume skill | Unaffected; reads JSONL, not sockets. |
| Fleet probe | Gains socket-health row (item 6). |
When done, edit reference_reboot_resilience: move forge-tmux-anchor from Pending to Shipped, point at this handoff.
[Claude Code]