Finn Recovery + Forge Hardening, Shipped 2026-04-30
URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-recovery-shipped-2026-04-30/
Status: Closed loop on initial Phase 0 recovery + three prevention fixes shipped + AVX passthrough planned. Companion to finn-power-loss-2026-04-30 (initial report) and finn-power-resilience-proposal-2026-04-30 (proposal). This handoff documents what shipped today.
TL;DR
Finn outage 12:10 to 12:40 CDT. The initial Phase 0 recovery left the bots dead: at 12:11 CDT (during the outage) npm auto-installed a Bun-bundled claude binary that requires AVX2, the Console VM's CPU model (x86-64-v2-AES) does not expose AVX2, and so every claude -p call died with SIGILL. Today's session unblocked the bots, hardened the recovery layer, cleaned up an orphan mount unit on Finn, and patched a latent script bug uncovered by the new restart policy. Console CPU passthrough (cpu: host) is the final layer; reboot in progress at session end.
Forensic Update
The original "wall power outage" hypothesis is unproven. Vector (Windows PC) was online at 12:39:44 CDT, mid-outage. Justin confirmed Vector and Finn share the same circuit and same outlet, both via separate Tapo smart plugs. New prime suspect: Tapo smart plug glitch. The Tapo on Finn's leg may have toggled off (firmware, schedule, voice misfire) while the Vector Tapo stayed on. No other hardware-fault evidence in dmesg, SMART, or sensors. On current evidence the three candidate causes (Tapo glitch, hard kernel hang, PSU brownout) cannot be told apart with any confidence; the UPS + sensors + IPMI baseline (proposal sections 3 and 5) would give us a definitive answer next time.
Independent finding worth noting: last -x on Finn shows a cluster of 4 unclean reboots on 2026-03-07 (within 10 min) and another on 2026-03-11. The 50-day uptime since Mar 11 was the calm window, not the steady state. Whatever is happening on this hardware is recurrent, not one-shot.
What Shipped Today
1. Bots unblocked, npm churn neutralized
| Where | Change |
|---|---|
| Symlink | ~/.nvm/versions/node/v20.20.0/bin/claude → ~/.local/bin/claude (was → broken Bun-bundled claude.exe). |
| 14 forge scripts | All CLAUDE_BIN constants repointed from the nvm path to ~/.local/bin/claude directly. Files: forge_telegram_inbox_brain.py, forge_telegram_remote_bridge.py, forge_morning_report.py, forge_heartbeat_random.py, forge_evening_winddown.py, forge_dispatcher_critic.py, forge_eval_harness.py, forge_food_lookup.py, forge_fitness_weekly_retro.py, forge_habits_alias_backfill.py, forge_memory_auto_capture.py, forge_memory_auto_dream.py, forge_training_recommendation.py, forge_wellness_daily_summary.py. (forge_telegram_brain.py and forge_orient.py import core.CLAUDE_BIN from forge_telegram_inbox_brain so they inherit.) Also forge_tmux_anchor_session.sh. |
| Result | claude -p works under all 14 scripts. Verified live: brain returns "PONG" via the same path. The npm symlink CAN still be clobbered by a future npm install -g @anthropic-ai/claude-code, but the scripts no longer depend on it. |
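
The repoint above boils down to each script naming a known-good binary instead of trusting whatever the nvm symlink currently points at. A minimal sketch of a startup check in that spirit; `resolve_claude_bin` and the candidate tuple are hypothetical names for illustration, not code from the forge scripts:

```python
import os
from pathlib import Path

# Path taken from the table above; the helper itself is an assumption.
PREFERRED = Path.home() / ".local" / "bin" / "claude"


def resolve_claude_bin(candidates=(PREFERRED,)):
    """Return the first candidate that is an existing, executable file, else None.

    A dangling symlink fails is_file(), so it is skipped.
    """
    for path in candidates:
        path = Path(path)
        if path.is_file() and os.access(path, os.X_OK):
            return str(path)
    return None
```

A check like this only catches a missing or dangling binary. It would not have caught today's SIGILL, because the Bun-bundled file existed and was executable; only actually running it reveals the missing AVX2.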
2. Forge oneshot units self-heal
Added Restart=on-failure / RestartSec=30s / StartLimitIntervalSec=600 / StartLimitBurst=10 drop-ins to 10 forge oneshot services that lacked retry policy. Every long-running Type=simple forge service already had Restart=on-failure. The two units that bit us today (forge-context-prefetch, forge-habits-instantiate) are now self-healing.
Drop-in path: /etc/systemd/system/<unit>.service.d/restart.conf. Units patched:
- forge-context-prefetch, forge-habits-instantiate, forge-habits-alias-backfill, forge-habits-streaks
- forge-mkdocs-build, forge-time-daily, forge-bot-watchdog
- forge-telegram-hourly-checkin, forge-telegram-followup-unscheduled, forge-telegram-weekly-review-prep
Skipped: forge-tmux-anchor (special case, deferred per proposal section B7).
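
The drop-in contents are not reproduced in this handoff, so the following is only a sketch of what a restart.conf with the four settings above might look like, rendered from Python. It assumes the two StartLimit keys sit under [Unit], which is where current systemd documents them; `render_restart_dropin` and `dropin_path` are hypothetical helpers:

```python
def render_restart_dropin(restart_sec="30s", interval_sec=600, burst=10):
    """Render restart.conf text for a oneshot unit (illustrative only)."""
    return (
        "[Unit]\n"
        f"StartLimitIntervalSec={interval_sec}\n"
        f"StartLimitBurst={burst}\n"
        "\n"
        "[Service]\n"
        "Restart=on-failure\n"
        f"RestartSec={restart_sec}\n"
    )


def dropin_path(unit):
    """Build the drop-in path used in this handoff for a given unit name."""
    return f"/etc/systemd/system/{unit}.service.d/restart.conf"
```

After writing a file like this, systemd needs a `systemctl daemon-reload` before the new policy takes effect.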
3. Finn mount cleanup
/etc/systemd/system/mnt-pve-fast\x2dstorage.mount was an orphan from a prior storage rename. It pointed at the same UUID as the live Workspace storage at /mnt/pve/workspace. Failed at boot with a device-readiness race; nothing else referenced it. Disabled, removed wants symlink, deleted file. Zero .mount units in /etc/systemd/system/ on Finn now. The real Workspace mount (3.5TB, all brand assets) and /mnt/storage (24TB, Plex) are untouched and working.
4. Bonus latent bug caught + fixed
The new restart policy on forge-context-prefetch revealed that the script was writing its cache successfully but then exiting 1. Cause: set -o pipefail + tmux ls (which returns 1 when no tmux server exists, normal for systemd contexts) at forge_context_prefetch.sh:68. Wrapped tmux ls in { ... || true; } to swallow the harmless failure. Verified: systemctl start forge-context-prefetch.service now exits clean.
Same pattern checked across all forge .sh scripts; this was the only instance.
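
The pipefail interaction is easy to reproduce in isolation. The sketch below assumes bash is on PATH and uses `false` as a stand-in for tmux ls with no server running; it shows the exit status of the same pipeline in three situations rather than the actual contents of forge_context_prefetch.sh:

```python
import subprocess


def bash_status(script):
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script]).returncode


# Without pipefail the pipeline reports the status of its last stage (cat).
plain = bash_status("false | cat")

# With pipefail a failing earlier stage makes the whole pipeline fail.
strict = bash_status("set -o pipefail; false | cat")

# Guarding the failing stage with '|| true' restores a clean exit.
guarded = bash_status("set -o pipefail; { false || true; } | cat")
```

The guard is deliberately narrow: it swallows only the one command known to fail harmlessly, so pipefail still protects every other pipeline in the script.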
Finn Console State (closing)
| Surface | State |
|---|---|
| Failed units (Console + Finn) | 0 |
| 4 Telegram bots | running, brain calls working |
| 6 production CTs | healthy, 60+ min uptime |
| Plex GPU patch | confirmed at conf level; transcode test still pending |
| Frigate cameras | 1 camera (front_floodlight) — confirmed configured count, not a regression |
| /mnt/pve/workspace | mounted, 3.5TB used |
| /mnt/storage | mounted, 10TB used |
| Cloudflare tunnels | mkdocs / dashboard / usage all 200 or 302 (Access redirect) |
| Tailscale | Console (100.97.43.104), Finn (100.112.22.2), media-server (100.86.48.50) all alive |
Pending (not shipped, queued)
Imminent (this session)
- Console AVX passthrough. Change /etc/pve/qemu-server/103.conf from cpu: x86-64-v2-AES to cpu: host, reboot Console. Exposes Finn's i9-13900H AVX/AVX2/AVX_VNNI to Console. The npm-bundled claude binary works again, defense-in-depth alongside the script repoints. Reboot kills the active Claude Code session; auto-resume hooked up via forge-finn-recovery-resume.service so this chat respawns after boot.
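
Whether the passthrough took effect can be confirmed from inside the Console VM by looking at the CPU flags the guest kernel reports. A minimal sketch, assuming the usual Linux x86 /proc/cpuinfo layout with a "flags" line; the function names are illustrative:

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx2(cpuinfo_text):
    """True when the guest CPU advertises AVX2."""
    return "avx2" in cpu_flags(cpuinfo_text)
```

On the VM itself this would be called as `has_avx2(open("/proc/cpuinfo").read())`. It should be False under x86-64-v2-AES and True once cpu: host is active.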
Next sessions
- MS-01 BMC LAN configure. Plug the BMC NIC, set static IP (suggest 192.168.86.68), put creds in NordPass, advertise via Tailscale subnet route. Single most valuable upgrade for the away scenario.
- Plex GPU device pin. Replace dev1: /dev/dri/card0,gid=44 in /etc/pve/lxc/101.conf with /dev/dri/by-path/... symlink so the next kernel bump does not break transcoding. Test by bouncing CT 101 once after change.
- Tiered onboot ordering. Set startup: order=N,up=15 on all production CTs (tier 1: adguard. tier 2: media-server, n8n. tier 3: plex, immich, frigate). Avoids the load-spike that produced "SSH refused" on this recovery.
- Boot-notify service. New forge-boot-notify.service on both Finn and Console. Posts boot timestamp + kernel via forge_notify to @forge_notify_outbound_bot. We were blind today; would have known instantly.
- Fleet probe. New forge_fleet_probe.sh + 5-min systemd timer. ICMP every CT, HTTPS every public surface, alert on 2x consecutive fail.
- UPS + NUT. APC Back-UPS Pro 1500VA-class, NUT daemon on Finn, Home Assistant integration. The only real fix for actual wall power loss and the only path to forensic clarity next time.
- Encrypted secrets backup. Daily cron, GPG-encrypted snapshot of ~/.forge-secrets/ to Google Drive. Passphrase already exists at ~/.forge-secrets/.backup-passphrase.
- forge-tmux-anchor hardening (B7). Convert to Type=forking + tmux-resurrect. Defer, focused session.
- Sensors + smartctl baseline. apt install lm-sensors ipmitool. sensors-detect. Daily smartctl -t short cron.
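
The fleet probe's "alert on 2x consecutive fail" rule can be sketched as a small per-target counter. This is one possible shape of the logic, not the planned forge_fleet_probe.sh; the class name and target labels are hypothetical:

```python
from collections import defaultdict


class ConsecutiveFailTracker:
    """Track back-to-back probe failures per target."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.fails = defaultdict(int)

    def record(self, target, ok):
        """Record one probe result; return True only when the alert should fire.

        The alert fires once, on the probe that reaches the threshold, and
        stays quiet on further failures until a success resets the counter.
        """
        if ok:
            self.fails[target] = 0
            return False
        self.fails[target] += 1
        return self.fails[target] == self.threshold
```

Because a timer-driven probe is a fresh process every five minutes, the counters would have to be persisted between runs (a small state file, for example) for this rule to work.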
Investigation (low priority)
- Tapo smart plug event log for 12:10 to 12:40 CDT. The Tapo app exposes per-plug history; if Finn's plug shows an off-on event in that window, we have the smoking gun.
- Eight Sleep cloud data for the same window as a second independent witness (their data path does not depend on Finn).
Files Changed Today
forge/scripts/forge_context_prefetch.sh # tmux ls + pipefail fix
forge/scripts/forge_telegram_inbox_brain.py # CLAUDE_BIN repoint
forge/scripts/forge_telegram_remote_bridge.py # ditto
forge/scripts/forge_morning_report.py # ditto
forge/scripts/forge_heartbeat_random.py # ditto
forge/scripts/forge_evening_winddown.py # ditto
forge/scripts/forge_dispatcher_critic.py # ditto
forge/scripts/forge_eval_harness.py # ditto
forge/scripts/forge_food_lookup.py # ditto
forge/scripts/forge_fitness_weekly_retro.py # ditto
forge/scripts/forge_habits_alias_backfill.py # ditto
forge/scripts/forge_memory_auto_capture.py # ditto
forge/scripts/forge_memory_auto_dream.py # ditto
forge/scripts/forge_training_recommendation.py # ditto
forge/scripts/forge_wellness_daily_summary.py # ditto
forge/scripts/forge_tmux_anchor_session.sh # ditto
forge/scripts/forge_finn_recovery_resume.sh # NEW: auto-respawn this Claude session post-boot
# Outside the repo, written via systemd / Proxmox:
~/.nvm/versions/node/v20.20.0/bin/claude # symlink target swapped to ~/.local/bin/claude
/etc/systemd/system/forge-context-prefetch.service.d/restart.conf # NEW
/etc/systemd/system/forge-habits-instantiate.service.d/restart.conf # NEW
/etc/systemd/system/forge-habits-alias-backfill.service.d/restart.conf # NEW
/etc/systemd/system/forge-habits-streaks.service.d/restart.conf # NEW
/etc/systemd/system/forge-mkdocs-build.service.d/restart.conf # NEW
/etc/systemd/system/forge-time-daily.service.d/restart.conf # NEW
/etc/systemd/system/forge-bot-watchdog.service.d/restart.conf # NEW
/etc/systemd/system/forge-telegram-hourly-checkin.service.d/restart.conf # NEW
/etc/systemd/system/forge-telegram-followup-unscheduled.service.d/restart.conf # NEW
/etc/systemd/system/forge-telegram-weekly-review-prep.service.d/restart.conf # NEW
/etc/systemd/system/forge-finn-recovery-resume.service # NEW (auto-resume oneshot)
# On Finn (via Tailscale because LAN ssh got rate-limited mid-session):
/etc/systemd/system/mnt-pve-fast\x2dstorage.mount # DELETED (orphan)
How to Resume This Chat After Console Reboot
Console reboot kills the tmux session finn-recovery_Opus47 that wraps this Claude Code process. The session JSONL persists at ~/.claude/projects/-home-justinwieb-forge/e5162cae-5d60-4ef4-9892-fa6f91ddafa9.jsonl.
A oneshot systemd service forge-finn-recovery-resume.service is enabled to fire at boot. It executes forge_finn_recovery_resume.sh, which spawns:
tmux new-session -d -s finn-recovery_Opus47 \
"claude --model claude-opus-4-7 --dangerously-skip-permissions --resume e5162cae-5d60-4ef4-9892-fa6f91ddafa9"
Justin reconnects from any device by:
1. Open Claude Code (claude.ai/code or VS Code extension)
2. Find session "finn-recovery_Opus47" / project /home/justinwieb/forge
3. Resume
The session state (this entire conversation) loads automatically.
References
- Finn power loss + recovery report 2026-04-30 (initial Phase 0)
- Finn power resilience + DR proposal 2026-04-30
- FORGE-DOCTRINE.md sections 3, 8, 9, 10, 11, 12.
- Telegram bot robustness stack
Author. Claude Code (Opus 4.7) on Console, tmux session finn-recovery_Opus47, 2026-04-30 14:00 CDT. Reboot triggered at end of this session.