Finn Recovery + Forge Hardening, Shipped 2026-04-30

URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-recovery-shipped-2026-04-30/

Status: Closed loop on initial Phase 0 recovery + three prevention fixes shipped + AVX passthrough planned. Companion to finn-power-loss-2026-04-30 (initial report) and finn-power-resilience-proposal-2026-04-30 (proposal). This handoff documents what shipped today.

TL;DR

Finn was down from 12:10 to 12:40 CDT. The initial Phase 0 recovery left the bots dead: at 12:11 CDT, during the outage, npm auto-installed a Bun-bundled claude binary that requires AVX2. The Console VM's CPU model (x86-64-v2-AES) does not expose AVX2, so every claude -p call died with SIGILL. Today's session unblocked the bots, hardened the recovery layer, cleaned up an orphan mount unit on Finn, and patched a latent script bug uncovered by the new restart policy. Console CPU passthrough (cpu: host) is the final layer; the reboot was in progress at session end.

Forensic Update

The original "wall power outage" hypothesis is unproven. Vector (Windows PC) was online at 12:39:44 CDT, mid-outage. Justin confirmed that Vector and Finn share the same circuit and the same outlet, each through its own Tapo smart plug. New prime suspect: a Tapo smart plug glitch. The Tapo on Finn's leg may have toggled off (firmware, schedule, voice misfire) while the Vector Tapo stayed on. There is no other hardware-fault evidence in dmesg, SMART, or sensors. That said, the available evidence cannot separate a Tapo glitch from a hard kernel hang or a PSU brownout; forensic confidence across the three is roughly equal. A UPS + sensors + IPMI baseline (proposal sections 3 and 5) would give a definitive answer next time.

Independent finding worth noting: last -x on Finn shows a cluster of 4 unclean reboots on 2026-03-07 (within 10 min) and another on 2026-03-11. The 50-day uptime since Mar 11 was the calm window, not the steady state. Whatever is happening on this hardware is recurrent, not one-shot.

What Shipped Today

1. Bots unblocked, npm churn neutralized

| Where | Change |
| --- | --- |
| Symlink | `~/.nvm/versions/node/v20.20.0/bin/claude` → `~/.local/bin/claude` (was → broken Bun-bundled `claude.exe`). |
| 14 forge scripts | All `CLAUDE_BIN` constants repointed from the nvm path to `~/.local/bin/claude` directly. Files: `forge_telegram_inbox_brain.py`, `forge_telegram_remote_bridge.py`, `forge_morning_report.py`, `forge_heartbeat_random.py`, `forge_evening_winddown.py`, `forge_dispatcher_critic.py`, `forge_eval_harness.py`, `forge_food_lookup.py`, `forge_fitness_weekly_retro.py`, `forge_habits_alias_backfill.py`, `forge_memory_auto_capture.py`, `forge_memory_auto_dream.py`, `forge_training_recommendation.py`, `forge_wellness_daily_summary.py`. (`forge_telegram_brain.py` and `forge_orient.py` import `core.CLAUDE_BIN` from `forge_telegram_inbox_brain` so they inherit.) Also `forge_tmux_anchor_session.sh`. |
| Result | `claude -p` works under all 14 scripts. Verified live: brain returns "PONG" via the same path. The npm symlink CAN still be clobbered by a future `npm install -g @anthropic-ai/claude-code`, but the scripts no longer depend on it. |
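Because the nvm-side symlink can still be clobbered, a periodic drift check may be worth having. The following is a minimal sketch, not something that shipped today: the function name `claude_symlink_ok` and its optional home-directory argument are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical drift check: does the nvm-side `claude` still resolve to the
# standalone binary, or has a global npm install replaced it?
# Exit codes: 0 = still aligned, 1 = diverged, 2 = one of the paths is missing.
claude_symlink_ok() {
  local home_dir="${1:-$HOME}"
  local standalone="$home_dir/.local/bin/claude"
  local nvm_link="$home_dir/.nvm/versions/node/v20.20.0/bin/claude"

  # Both paths must exist before their resolved targets can be compared.
  [ -e "$standalone" ] && [ -e "$nvm_link" ] || return 2

  [ "$(readlink -f "$nvm_link")" = "$(readlink -f "$standalone")" ]
}
```

A non-zero result would not break the bots, since the scripts call `~/.local/bin/claude` directly; it would only signal that the symlink needs to be re-pointed.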

2. Forge oneshot units self-heal

Added drop-ins setting Restart=on-failure, RestartSec=30s, StartLimitIntervalSec=600, and StartLimitBurst=10 to the 10 forge oneshot services that lacked a retry policy. Every long-running Type=simple forge service already had Restart=on-failure. The two units that bit us today (forge-context-prefetch, forge-habits-instantiate) are now self-healing.

Drop-in path: /etc/systemd/system/<unit>.service.d/restart.conf.

Units patched:

- forge-context-prefetch, forge-habits-instantiate, forge-habits-alias-backfill, forge-habits-streaks
- forge-mkdocs-build, forge-time-daily, forge-bot-watchdog
- forge-telegram-hourly-checkin, forge-telegram-followup-unscheduled, forge-telegram-weekly-review-prep

Skipped: forge-tmux-anchor (special case, deferred per proposal section B7).
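For reference, a sketch of how one such drop-in could be generated. The helper name `write_restart_dropin` and its root-prefix argument are hypothetical, and the section layout is an assumption: the shipped restart.conf files may be laid out differently. As far as the systemd documentation describes it, the StartLimit* settings belong in [Unit], and Restart=on-failure on a Type=oneshot service needs systemd 244 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the restart drop-in for one unit.
# $2 is an optional root prefix so the function can target a scratch directory.
write_restart_dropin() {
  local unit="$1" root="${2:-}"
  local dir="$root/etc/systemd/system/${unit}.service.d"
  mkdir -p "$dir"
  cat > "$dir/restart.conf" <<'EOF'
# Assumed layout: start rate limiting lives in [Unit] on current systemd.
[Unit]
StartLimitIntervalSec=600
StartLimitBurst=10

# Restart=on-failure with Type=oneshot needs systemd 244 or newer.
[Service]
Restart=on-failure
RestartSec=30s
EOF
}
```

After writing drop-ins under the real /etc, `systemctl daemon-reload` is needed before the new policy takes effect. With these values a failing unit is retried every 30 seconds until it hits 10 starts inside a 10-minute window, after which it stays failed.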

3. Finn mount cleanup

/etc/systemd/system/mnt-pve-fast\x2dstorage.mount was an orphan from a prior storage rename. It pointed at the same UUID as the live Workspace storage at /mnt/pve/workspace and failed at boot with a device-readiness race; nothing else referenced it. The unit was disabled, its wants symlink removed, and the file deleted. There are now zero .mount units in /etc/systemd/system/ on Finn. The real Workspace mount (3.5TB, all brand assets) and /mnt/storage (24TB, Plex) are untouched and working.
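The `\x2d` in the unit filename is systemd's escaping of the hyphen in the mount path: the name decodes to /mnt/pve/fast-storage. The real tool for this is `systemd-escape --path --suffix=mount <path>`. The sketch below is a simplified re-implementation for illustration only; the function name `mount_unit_name` is hypothetical, and it skips several edge cases that the real tool handles.

```shell
#!/usr/bin/env bash
# Simplified re-implementation of `systemd-escape --path --suffix=mount`.
# Not handled here: the root path "/", repeated slashes, a leading ".".
mount_unit_name() {
  local path="${1#/}"          # drop the leading slash
  path="${path%/}"             # drop a trailing slash, if any
  local out="" i ch hex
  for ((i = 0; i < ${#path}; i++)); do
    ch="${path:i:1}"
    case "$ch" in
      /) out+="-" ;;                           # path separator becomes "-"
      [a-zA-Z0-9:_.]) out+="$ch" ;;            # safe characters pass through
      *) printf -v hex '\\x%02x' "'$ch"        # anything else becomes \xNN
         out+="$hex" ;;
    esac
  done
  printf '%s.mount\n' "$out"
}
```

Calling `mount_unit_name /mnt/pve/fast-storage` reproduces the orphan's filename, while /mnt/pve/workspace maps to a name with no escapes at all.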

4. Bonus latent bug caught + fixed

The new restart policy on forge-context-prefetch revealed that the script was writing its cache successfully but then exiting 1. Cause: set -o pipefail combined with tmux ls at forge_context_prefetch.sh:68. tmux ls returns 1 when no tmux server exists, which is normal in systemd contexts. The fix wraps tmux ls in { ... || true; } to swallow the harmless failure. Verified: systemctl start forge-context-prefetch.service now exits cleanly.

Same pattern checked across all forge .sh scripts; this was the only instance.
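The failure mode is easy to reproduce in isolation. This sketch assumes the tmux ls call sat in a pipeline and uses a stand-in function, `tmux_ls_no_server`, in place of the real command so it runs on a host without tmux. The two `count_sessions_*` functions are illustrative and are not the contents of forge_context_prefetch.sh.

```shell
#!/usr/bin/env bash
# Stand-in for `tmux ls` when no server is running: nothing on stdout, exit 1.
tmux_ls_no_server() { echo "no server running" >&2; return 1; }

# Unguarded: under pipefail the pipeline takes the failing command's status,
# even though `wc -l` itself succeeded and printed a count.
count_sessions_unguarded() (
  set -o pipefail
  tmux_ls_no_server 2>/dev/null | wc -l
)

# Guarded: the brace group always succeeds, so the pipeline status is wc's.
count_sessions_guarded() (
  set -o pipefail
  { tmux_ls_no_server 2>/dev/null || true; } | wc -l
)
```

Both functions print a count of zero; only the unguarded one returns 1, which is what made systemd mark the unit as failed after the cache had already been written.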

Finn Console State (closing)

| Surface | State |
| --- | --- |
| Failed units (Console + Finn) | 0 |
| 4 Telegram bots | running, brain calls working |
| 6 production CTs | healthy, 60+ min uptime |
| Plex GPU patch | confirmed at conf level; transcode test still pending |
| Frigate cameras | 1 camera (`front_floodlight`); this is the configured count, not a regression |
| `/mnt/pve/workspace` | mounted, 3.5TB used |
| `/mnt/storage` | mounted, 10TB used |
| Cloudflare tunnels | mkdocs / dashboard / usage all 200 or 302 (Access redirect) |
| Tailscale | Console (100.97.43.104), Finn (100.112.22.2), media-server (100.86.48.50) all alive |

Pending (not shipped, queued)

Imminent (this session)

Console AVX passthrough. Change /etc/pve/qemu-server/103.conf from cpu: x86-64-v2-AES to cpu: host, then reboot Console. This exposes Finn's i9-13900H AVX/AVX2/AVX_VNNI to Console, so the npm-bundled claude binary should work again, giving defense in depth alongside the script repoints. The reboot kills the active Claude Code session; auto-resume is wired up via forge-finn-recovery-resume.service so this chat respawns after boot.
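Once Console is back up, the passthrough can be confirmed from inside the VM by looking for the flag in /proc/cpuinfo. A minimal sketch follows; the helper name `cpu_has_flag` is hypothetical, and the optional second argument exists only so the check can be pointed at a sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given CPU flag is listed in cpuinfo.
cpu_has_flag() {
  local flag="$1" cpuinfo="${2:-/proc/cpuinfo}"
  # Inspect only the first "flags" line; -w matches whole words, so asking
  # for "avx" is not satisfied by "avx2" or "avx_vnni".
  grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}
```

On Console, `cpu_has_flag avx2` should fail under x86-64-v2-AES and succeed under cpu: host. On the Finn side, `qm config 103` should show which cpu line is currently in effect.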

Next sessions

- MS-01 BMC LAN configure. Plug in the BMC NIC, set a static IP (suggest 192.168.86.68), put creds in NordPass, advertise via Tailscale subnet route. Single most valuable upgrade for the away scenario.
- Plex GPU device pin. Replace dev1: /dev/dri/card0,gid=44 in /etc/pve/lxc/101.conf with a /dev/dri/by-path/... symlink so the next kernel bump does not break transcoding. Test by bouncing CT 101 once after the change.
- Tiered onboot ordering. Set startup: order=N,up=15 on all production CTs (tier 1: adguard; tier 2: media-server, n8n; tier 3: plex, immich, frigate). Avoids the load spike that produced "SSH refused" on this recovery.
- Boot-notify service. New forge-boot-notify.service on both Finn and Console. Posts boot timestamp + kernel via forge_notify to @forge_notify_outbound_bot. We were blind today; with this we would have known instantly.
- Fleet probe. New forge_fleet_probe.sh + 5-min systemd timer. ICMP to every CT, HTTPS to every public surface, alert on 2x consecutive fail.
- UPS + NUT. APC Back-UPS Pro 1500VA-class, NUT daemon on Finn, Home Assistant integration. The only real fix for actual wall power loss and the only path to forensic clarity next time.
- Encrypted secrets backup. Daily cron, GPG-encrypted snapshot of ~/.forge-secrets/ to Google Drive. Passphrase already exists at ~/.forge-secrets/.backup-passphrase.
- forge-tmux-anchor hardening (B7). Convert to Type=forking + tmux-resurrect. Deferred to a focused session.
- Sensors + smartctl baseline. Run apt install lm-sensors ipmitool, then sensors-detect, and add a daily smartctl -t short cron job.
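The fleet probe's "alert on 2x consecutive fail" rule needs a little state between timer runs. One possible shape is sketched below; forge_fleet_probe.sh does not exist yet, so the function name `record_probe_result`, the per-target counter files, and the ALERT output format are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "alert on 2x consecutive fail" rule.
# Arguments: target name, probe exit status (0 = success), state directory.
# State is one small counter file per target under the state directory.
record_probe_result() {
  local target="$1" status="$2" state_dir="$3"
  local file="$state_dir/$target.fails" fails=0

  [ -f "$file" ] && fails=$(cat "$file")
  if [ "$status" -eq 0 ]; then
    fails=0                      # probe succeeded: reset the streak
  else
    fails=$((fails + 1))         # probe failed: extend the streak
  fi
  printf '%s\n' "$fails" > "$file"

  # Alert exactly once, on the second consecutive failure.
  if [ "$fails" -eq 2 ]; then
    echo "ALERT $target"
  fi
}
```

Alerting only when the streak equals 2, rather than at 2 or more, keeps a long outage from re-alerting every five minutes. A single flaky probe never alerts, because the next success resets the counter.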

Investigation (low priority)

Tapo smart plug event log for 12:10 to 12:40 CDT. The Tapo app exposes per-plug history; if Finn's plug shows an off-on event in that window, we have the smoking gun.
Eight Sleep cloud data for the same window as a second independent witness (their data path does not depend on Finn).

Files Changed Today

forge/scripts/forge_context_prefetch.sh                    # tmux ls + pipefail fix
forge/scripts/forge_telegram_inbox_brain.py                # CLAUDE_BIN repoint
forge/scripts/forge_telegram_remote_bridge.py              # ditto
forge/scripts/forge_morning_report.py                      # ditto
forge/scripts/forge_heartbeat_random.py                    # ditto
forge/scripts/forge_evening_winddown.py                    # ditto
forge/scripts/forge_dispatcher_critic.py                   # ditto
forge/scripts/forge_eval_harness.py                        # ditto
forge/scripts/forge_food_lookup.py                         # ditto
forge/scripts/forge_fitness_weekly_retro.py                # ditto
forge/scripts/forge_habits_alias_backfill.py               # ditto
forge/scripts/forge_memory_auto_capture.py                 # ditto
forge/scripts/forge_memory_auto_dream.py                   # ditto
forge/scripts/forge_training_recommendation.py             # ditto
forge/scripts/forge_wellness_daily_summary.py              # ditto
forge/scripts/forge_tmux_anchor_session.sh                 # ditto
forge/scripts/forge_finn_recovery_resume.sh                # NEW: auto-respawn this Claude session post-boot

# Outside the repo, written via systemd / Proxmox:
~/.nvm/versions/node/v20.20.0/bin/claude                   # symlink target swapped to ~/.local/bin/claude
/etc/systemd/system/forge-context-prefetch.service.d/restart.conf       # NEW
/etc/systemd/system/forge-habits-instantiate.service.d/restart.conf     # NEW
/etc/systemd/system/forge-habits-alias-backfill.service.d/restart.conf  # NEW
/etc/systemd/system/forge-habits-streaks.service.d/restart.conf         # NEW
/etc/systemd/system/forge-mkdocs-build.service.d/restart.conf           # NEW
/etc/systemd/system/forge-time-daily.service.d/restart.conf             # NEW
/etc/systemd/system/forge-bot-watchdog.service.d/restart.conf           # NEW
/etc/systemd/system/forge-telegram-hourly-checkin.service.d/restart.conf       # NEW
/etc/systemd/system/forge-telegram-followup-unscheduled.service.d/restart.conf # NEW
/etc/systemd/system/forge-telegram-weekly-review-prep.service.d/restart.conf   # NEW
/etc/systemd/system/forge-finn-recovery-resume.service                  # NEW (auto-resume oneshot)

# On Finn (via Tailscale because LAN ssh got rate-limited mid-session):
/etc/systemd/system/mnt-pve-fast\x2dstorage.mount          # DELETED (orphan)

How to Resume This Chat After Console Reboot

Console reboot kills the tmux session finn-recovery_Opus47 that wraps this Claude Code process. The session JSONL persists at ~/.claude/projects/-home-justinwieb-forge/e5162cae-5d60-4ef4-9892-fa6f91ddafa9.jsonl.

A oneshot systemd service forge-finn-recovery-resume.service is enabled to fire at boot. It executes forge_finn_recovery_resume.sh, which spawns:

tmux new-session -d -s finn-recovery_Opus47 \
  "claude --model claude-opus-4-7 --dangerously-skip-permissions --resume e5162cae-5d60-4ef4-9892-fa6f91ddafa9"

Justin reconnects from any device as follows:

1. Open Claude Code (claude.ai/code or VS Code extension).
2. Find session "finn-recovery_Opus47" / project /home/justinwieb/forge.
3. Resume.

The session state (this entire conversation) loads automatically.
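One refinement worth considering for the resume script is an idempotence guard, so that a second run of the oneshot does not try to create a duplicate tmux session. The sketch below is an assumption about how that could look, not the contents of forge_finn_recovery_resume.sh; the function name `resume_claude_session` is hypothetical, while the session name, session ID, and claude flags are the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical guard: start the session only if tmux does not already have
# one with that name, so re-running the unit is harmless.
resume_claude_session() {
  local name="finn-recovery_Opus47"
  local session_id="e5162cae-5d60-4ef4-9892-fa6f91ddafa9"

  if tmux has-session -t "$name" 2>/dev/null; then
    echo "session $name already running"
    return 0
  fi
  tmux new-session -d -s "$name" \
    "claude --model claude-opus-4-7 --dangerously-skip-permissions --resume $session_id"
}
```

`tmux has-session` exits non-zero both when the session is missing and when no tmux server is running, so the same check covers the fresh-boot case.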

References

Finn power loss + recovery report 2026-04-30 (initial Phase 0)
Finn power resilience + DR proposal 2026-04-30
FORGE-DOCTRINE.md sections 3, 8, 9, 10, 11, 12.
Telegram bot robustness stack

Author. Claude Code (Opus 4.7) on Console, tmux session finn-recovery_Opus47, 2026-04-30 14:00 CDT. Reboot triggered at end of this session.