Finn Power Resilience + DR Proposal, 2026-04-30

URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-power-resilience-proposal-2026-04-30/

Status: Proposal. No destructive action taken. Awaiting Justin's review before any change. Companion to finn-power-loss-2026-04-30.

1. Forensic Verdict

After deeper investigation, the original handoff's "wall power outage" hypothesis is unproven, and one piece of evidence actively contradicts it.

| Evidence | What it says |
| --- | --- |
| No dmesg thermal / MCE / watchdog / NMI / panic events in the hour before 12:10:29 CDT | Consistent with sudden power loss OR sudden hard hang. Does not distinguish the two. |
| Last log line on boot -1 was apparmor noise at 12:10:29, no shutdown sequence | Same: power cut OR instant freeze. |
| Boot -1 had 50 days of clean uptime | Argues against a chronic instability pattern. |
| `last -x` shows multiple `crash` markers on 2026-03-07 (cluster of 4 unclean reboots within 10 min) and 2026-03-11 | There is prior crash history. The 50-day window after Mar 11 was the calm between bumps. |
| Vector (Windows PC) was online and trying to spawn Claude at 12:39:44 CDT, inside Finn's outage window | If Vector and Finn share house power, this is a contradiction. Most likely: it was not whole-house power. Either Finn's circuit only, or a Finn-specific fault (PSU, hard kernel hang, BIOS-level event). |
| SMART on both NVMe drives: PASSED | No imminent storage failure. |
| No UPS exists. `lsusb` shows no APC / CyberPower / Eaton vendor IDs. | Confirmed gap. |
| Sensors / lm-sensors not configured. ipmitool not installed. | We have no temperature or PSU baseline going forward. Worth fixing. |
| Finn's BIOS: version 1.27, dated 04/03/2025, system "Venus Series" (Minisforum MS-01) | BIOS is recent. |
| Finn CPU exposes AVX / AVX2 / AVX_VNNI | Capability is there for the Console VM if we let it through. See Section 4. |

Conclusion. Whether it was Finn's circuit, the box's PSU, or a hard freeze that produced no log signal, the recovery gaps are the same. Stop chasing root cause beyond this. Do invest in independent witnesses going forward (UPS log, BMC SEL, sensors baseline) so the next event is forensically interpretable.

2. The Three Big Asks

Ask A: Auto-power-on after wall AC loss (the BIOS question)

Yes, the MS-01 supports this through a BIOS setting: Advanced > Power & Performance > "Restore on AC Power Loss" (or "AC Power Loss Recovery"), set to Power On (the alternatives are "Stay Off" and "Last State"). With "Power On" set, the host powers up automatically as soon as wall power is restored, with no human keypress.

This setting can be checked / changed two ways:

- In-BIOS during boot (DEL key on POST). Requires a monitor + keyboard at the rack.
- Via the BMC web UI / KVM-over-IP. MS-01 has a dedicated BMC port that exposes a web UI accessible over LAN. From there: power control, BIOS access, video console, all from anywhere with VPN reach. This is the high-value path. See Section 5.

I cannot verify the current value from the host OS side (it is not exposed via dmidecode on this BIOS revision). It must be confirmed in BIOS or BMC. Action item: Justin verifies the setting on next visit to the rack, or after BMC LAN is configured.

Ask B: Make Finn return to 100% on cold reboot, no human involvement

Today's gaps and the proposed fix:

| # | Gap | Fix | Risk if applied |
| --- | --- | --- | --- |
| B1 | CT 107 (immich), CT 108 (frigate), CT 110 (minecraft) had `onboot: 0` (already fixed for 107 / 108 in Phase 0; minecraft stays off by design). | Verify across all CTs; commit `onboot: 1` and add `startup: order=N,up=15` for tiered boot. | Low. Each CT change is idempotent and documented. |
| B2 | No tiered startup. All CTs started at once on recovery. Load spiked to 5.4 within seconds, SSH refused under the load. | Phased order. Tier 1: 105 adguard. Tier 2: 102 media-server, 106 n8n. Tier 3: 101 plex, 107 immich, 108 frigate. 15 s gap between tiers. | Low. |
| B3 | Plex CT 101 failed to autostart after the recovery: kernel renumbered iGPU node from `card1` to `card0`. Patched by hand once. Will recur on next kernel bump. | Pin `dev1` line in `/etc/pve/lxc/101.conf` to a stable symlink: `/dev/dri/by-path/pci-0000:00:02.0-card,gid=44` (verify with `ls -la /dev/dri/by-path/`). | Medium. Touches the same conf line that Phase 0 patched. Test by bouncing CT 101 once after change. |
| B4 | `forge-context-prefetch` and `forge-habits-instantiate` units failed at recovery because n8n CT was still booting when their timers fired. They have no retry. After manual run both succeeded. | Add `Restart=on-failure` + `RestartSec=30s` + `StartLimitIntervalSec=600` + `StartLimitBurst=10` to both, and to every long-running forge-* service. Sweep all units. | Low. Pure systemd config edit. |
| B5 | No boot-time notification. We did not learn Finn rebooted until Justin noticed something off. | New `forge-boot-notify.service` (oneshot, WantedBy=multi-user.target, runs late) on both Finn and Console. Posts "Finn boot at $TS, kernel $K, uptime restored" via `forge_notify` to `@forge_notify_outbound_bot`. | Low. |
| B6 | No periodic health probe. We discovered camera count was wrong only because we looked. | New `forge_fleet_probe.sh` + 5-min systemd timer. ICMP every CT, HTTPS every public surface, alert on 2x consecutive fail. | Low. |
| B7 | `forge-tmux-anchor.service` exits cleanly (Type=oneshot) but the underlying tmux session is fragile. Right now it shows as alive because the post-boot script recreated it; after a panic-style reboot the whole tmux state is lost. | Convert to `Type=forking`, store state under `/var/lib/forge/tmux-anchor`, restore via `tmux-resurrect` plugin. | Medium. Tmux semantics under systemd are finicky. Test on a non-anchor session first. |
| B8 | No `sensors` baseline. No `ipmitool` installed. No SMART self-test cron. | Install `lm-sensors`, run `sensors-detect`. Install `ipmitool`. Cron daily `smartctl -t short` on each NVMe. | Low. Read-only + cron jobs. |
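For B4, one way to apply the retry policy is a systemd drop-in per unit instead of editing the unit files in place. The sketch below is illustrative only: the `.service` suffixes, the drop-in file name `10-retry.conf`, and the helper name `write_retry_dropin` are my assumptions, not taken from the forge tree. It writes into a scratch directory so the result can be reviewed before anything lands under `/etc/systemd/system`. Note that the `StartLimit*` keys belong in `[Unit]` while `Restart*` belongs in `[Service]`, and that if these units are `Type=oneshot`, `Restart=on-failure` needs a reasonably recent systemd (244 or newer, as far as I know).

```shell
#!/bin/sh
# Sketch only: write a retry drop-in for one unit into a chosen root directory.
# Unit names and the drop-in file name (10-retry.conf) are illustrative assumptions.
write_retry_dropin() {
  unit=$1                          # e.g. forge-context-prefetch.service
  root=${2:-/etc/systemd/system}   # pass a scratch dir to preview safely
  dir="$root/$unit.d"
  mkdir -p "$dir" || return 1
  cat > "$dir/10-retry.conf" <<'EOF'
[Unit]
# Allow up to 10 start attempts within a 600 s window
StartLimitIntervalSec=600
StartLimitBurst=10

[Service]
# Retry 30 s after a failed run (e.g. n8n not up yet)
Restart=on-failure
RestartSec=30s
EOF
  echo "$dir/10-retry.conf"
}

# Preview into a scratch directory; nothing under /etc is touched.
scratch=$(mktemp -d)
for u in forge-context-prefetch.service forge-habits-instantiate.service; do
  write_retry_dropin "$u" "$scratch"
done
```

Once reviewed, the same files would be copied under `/etc/systemd/system` and followed by `systemctl daemon-reload`.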

Ordering for execution. B1, B2, B4, and B8 form batch one, the lowest-risk changes. B3 (Plex GPU pin) needs a single test bounce. B5 and B6 are new services. B7 is the trickiest; defer it to a focused session.
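The "alert on 2x consecutive fail" rule in B6 is the only part of the probe with real logic, so here is a minimal sketch of it. `STATE_DIR`, its default path, and `record_result` are hypothetical names; the actual `forge_fleet_probe.sh` does not exist yet and may be structured differently. The function keeps one small counter file per target, resets it on success, and prints an alert line once the failure streak reaches two.

```shell
#!/bin/sh
# Sketch of the "alert on 2x consecutive fail" rule for B6.
# STATE_DIR and record_result are hypothetical names, not taken from the forge tree.
STATE_DIR=${STATE_DIR:-/tmp/forge-fleet-probe}

# record_result NAME RC: track consecutive failures for one target.
# Prints an ALERT line once the failure streak reaches 2; a success resets the streak.
record_result() {
  name=$1
  rc=$2
  counter="$STATE_DIR/$name.fails"
  mkdir -p "$STATE_DIR" || return 1
  if [ "$rc" -eq 0 ]; then
    rm -f "$counter"
    return 0
  fi
  fails=0
  [ -f "$counter" ] && fails=$(cat "$counter")
  fails=$((fails + 1))
  echo "$fails" > "$counter"
  if [ "$fails" -ge 2 ]; then
    echo "ALERT $name failed $fails consecutive probes"
  fi
  return 0
}

# Example wiring (commented out; the address and URL are placeholders):
#   ping -c 1 -W 2 192.0.2.10 >/dev/null 2>&1; record_result ct105 $?
#   curl -fsS -o /dev/null --max-time 10 https://example.invalid/; record_result public-site $?
```

With a 5-minute timer this means a target has to be down for two runs in a row before anyone is paged, which filters out single dropped pings. The alert line would be handed to `forge_notify` rather than printed.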

Ask C: Justin-away, Finn bricked. What does recovery look like?

Today, the answer is bad. Here is the upgrade path:

Tier 1, hardware-level remote console

Configure the MS-01 BMC LAN port. Plug it into vmbr0 with a static IP (e.g. 192.168.86.68). Set BMC creds in NordPass. Enable the web UI. Make it Tailscale-reachable by adding the BMC IP to a Tailscale subnet route advertised by the same Finn host (or by any other always-on machine with a route to vmbr0). With BMC online, Justin can: power-cycle Finn, see the BIOS, attach a virtual ISO, boot from rescue media, watch boot output. All without being home. This is the single most valuable upgrade for the away scenario.
Document the BMC IP, default creds reset, and recovery procedure in system-map/fleet.md.

Tier 2, OS-level remote shell

Tailscale on Finn already exists (100.112.22.2). Confirmed. From any device with Tailscale logged in as Justin, SSH to Finn at 100.112.22.2 works regardless of home WAN state. Document this.
Tailscale on the BMC. If BMC supports a static route, route the BMC subnet through Finn's Tailscale. If not, run a small Tailscale subnet router (not an exit node; the goal is reaching a LAN address, not routing internet traffic) on a $10 Raspberry Pi or any always-on LXC that can reach the BMC. Then BMC is reachable from the road.
Tailscale on Vector and Sol (already exists per memory). So even if Console is dead, Justin can reach Finn from his Mac or Windows PC.

Tier 3, agent-level recovery (the "Claude on the box" part)

Today's gap. Console (the dev VM that hosts Claude Code) is itself a Finn LXC + VM. If Finn is bricked, Console is bricked. Claude-on-Console cannot help recover Finn.
Fix. Spawn a backup Claude Code session on Vector (Windows PC) with full forge tree access via SMB or git clone https://github.com/<your-forge-repo>. Vector has its own power, network, and Claude binary. As long as Vector is on, Justin can run Claude there to drive recovery via Tailscale + SSH to Finn / BMC. Document the Vector Claude bootstrap path.
Backup option: Sol (Mac) running Claude as a second human-attended workstation.
Worst case: phone. Tailscale on phone + Termius (or similar) gives Justin direct SSH to Finn. Slow but possible. Document the SSH alias on the phone.

Tier 4, the data layer (in case Finn loses its disks too)

Forge tree is in git. Confirm it is pushed to GitHub right now. Add a pre-commit hook that warns if origin/main is more than 1 day behind the local branch.
Secrets are NOT in git (correct). They live in ~/.forge-secrets/. Add an encrypted backup of that directory to Google Drive on a daily cron. GPG-encrypt with a passphrase Justin keeps in NordPass. The backup-passphrase file already exists at ~/.forge-secrets/.backup-passphrase, that is the seed for this.
Notion is the second-brain. It is cloud, independent of Finn. Not at risk in this scenario.
Plex / immich / media lives on Finn's storage. If Finn dies, the data is on the disks. In a "Finn box dead" scenario Justin can pull the NVMe + HDD into another machine. Document drive layout in system-map/fleet.md.
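The freshness check on the forge tree could look roughly like the sketch below. It rests on my reading of "behind by more than 1 day": the oldest local commit that `origin/main` does not yet have is more than 24 hours old. The function names are hypothetical, and the check uses the remote-tracking ref as of the last fetch, so it makes no network call.

```shell
#!/bin/sh
# Sketch of a pre-commit warning for unpushed work older than one day.
# Function names are illustrative; MAX_AGE encodes the 1-day threshold in seconds.
MAX_AGE=86400

# oldest_unpushed_epoch: commit time (epoch seconds) of the oldest local commit
# that origin/main does not have yet; prints nothing when fully pushed.
oldest_unpushed_epoch() {
  git log origin/main..HEAD --format=%ct 2>/dev/null | tail -n 1
}

# is_stale OLDEST NOW: succeeds when OLDEST is more than MAX_AGE seconds before NOW.
is_stale() {
  [ -n "$1" ] && [ $(( $2 - $1 )) -gt "$MAX_AGE" ]
}

# warn_if_stale: body of the hook; always returns 0 so it warns rather than blocks.
warn_if_stale() {
  oldest=$(oldest_unpushed_epoch)
  if is_stale "$oldest" "$(date +%s)"; then
    echo "warning: origin/main is missing commits older than 1 day; run git push" >&2
  fi
  return 0
}
```

To use it, `.git/hooks/pre-commit` would source this file and call `warn_if_stale`. Keeping the hook non-blocking matches the "warns" wording above.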

3. UPS Recommendation

This is the only real fix for actual wall power loss, and it doubles as a clean-shutdown path that produces forensic logs.

| Spec | Recommendation |
| --- | --- |
| Capacity | 1500VA / 900W class, line-interactive with AVR. Finn idle ~30W, peak ~120W. UPS needs to also cover the Finn switch, and ideally the modem + AP. Real-world budget ~250W. |
| Specific model | APC Back-UPS Pro 1500VA (BR1500MS2): pure sine wave, USB to host, ~$250-300 retail. NUT-compatible. Or CyberPower CP1500PFCLCD (~$200, also NUT-compatible). Both ship with USB cable. |
| Daemon | NUT (Network UPS Tools) on Finn. USB to UPS. Console talks to Finn's NUT over LAN as a secondary (slave) client. Both shut down at 30% battery via `pct shutdown` + `qm shutdown` + `shutdown -h` (`pct shutdown` rather than `pct stop`, which is a hard stop). |
| Battery runtime | At ~250W draw, expect 8-12 min. Plenty for graceful shutdown. |
| Visibility | Home Assistant integration (HACS NUT integration). Graphs battery / load / wattage. UPS event log replaces guesswork. |
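To make the 30% rule concrete, here is a minimal sketch of the decision itself. The UPS name `finnups` and the function names are hypothetical; `ups.status` and `battery.charge` are standard NUT variable names, and `OB` is the NUT status flag for "on battery". The pure decision is split from the polling so it can be checked without a UPS attached.

```shell
#!/bin/sh
# Sketch of the 30% battery rule. The UPS name "finnups" and the function names are
# hypothetical; battery.charge and ups.status are standard NUT variable names.
THRESHOLD=30

# should_shutdown STATUS CHARGE: succeeds when the UPS is on battery (status
# contains the OB flag) and charge is at or below THRESHOLD percent.
should_shutdown() {
  case " $1 " in
    *" OB "*) [ "$2" -le "$THRESHOLD" ] ;;
    *) return 1 ;;
  esac
}

# poll_ups: read live values with upsc (requires a running NUT server).
# A missing charge reading is treated as 100 so a read error never triggers shutdown.
poll_ups() {
  status=$(upsc finnups@localhost ups.status 2>/dev/null)
  charge=$(upsc finnups@localhost battery.charge 2>/dev/null)
  if should_shutdown "$status" "${charge:-100}"; then
    echo "shutdown"
  else
    echo "ok"
  fi
}
```

In a real NUT setup `upsmon` normally makes this call itself once the driver raises the low-battery flag, so this is an illustration of the rule and not a replacement for `upsmon`.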

If the budget allows a second UPS for the network rack (router + AP + modem), that buys 10-15 min of internet during a blip and prevents the LAN partition that took down DNS today.

4. Console AVX Fix

Root cause confirmed. /etc/pve/qemu-server/103.conf pins Console to cpu: x86-64-v2-AES, a CPU model that deliberately omits AVX and AVX2 (introduced at the x86-64-v3 level) as well as AVX-512 (x86-64-v4). The host CPU exposes AVX, AVX2, and AVX_VNNI (see Section 1).

Fix. Change the line to cpu: host (full passthrough), then stop and start Console so QEMU relaunches with the new CPU model. This exposes AVX, AVX2, and AVX_VNNI to anything running inside Console.

Why this matters now. The current claude CLI binary (2.1.123) happens to run on Console anyway; Anthropic appears to have relaxed the AVX requirement in this build. But Justin's incident note ("CLI claude binary requires AVX") was true on a prior version, and any future AVX-requiring tool (most ML workloads, several Python wheels, llama.cpp, etc.) will break. Doing the passthrough now removes the entire class of bug.

Risk. cpu: host ties the VM to the host's CPU model. If Finn is replaced with a different CPU family later, the VM may need to be cold-migrated. Acceptable; document the dependency in system-map/fleet.md.

Action.

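# Run from a machine other than Console (e.g. Vector or Sol): an SSH session started on Console dies with VM 103.
ssh finn 'qm shutdown 103'
# Block until VM 103 is actually stopped (give up after 120 s)
ssh finn 'qm wait 103 --timeout 120'
# Swap the pinned CPU model for host passthrough
ssh finn 'sed -i "s/^cpu: x86-64-v2-AES/cpu: host/" /etc/pve/qemu-server/103.conf'
ssh finn 'qm start 103'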

Console reboots, ~60 s of session disruption. Justin should run this when no critical interactive Claude session is in progress.
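To confirm the change took effect, the guest-visible CPU flags can be compared before and after. A minimal sketch, assuming it is run inside Console; `report_cpu_flags` is a hypothetical helper name, and the three flags are the ones named in this section.

```shell
#!/bin/sh
# Sketch: report whether the AVX-family flags named above are visible to the guest.
# report_cpu_flags is a hypothetical helper name.
report_cpu_flags() {
  cpuinfo=${1:-/proc/cpuinfo}
  # First "flags" line; pad with spaces so whole-word matching is simple.
  flags=" $(grep -m 1 '^flags' "$cpuinfo" | cut -d: -f2) "
  for want in avx avx2 avx_vnni; do
    case "$flags" in
      *" $want "*) echo "$want present" ;;
      *) echo "$want MISSING" ;;
    esac
  done
}

# Run inside Console before and after the change.
if [ -r /proc/cpuinfo ]; then
  report_cpu_flags
fi
```

Under x86-64-v2-AES all three should read MISSING; after the switch to cpu: host they should read present if the passthrough worked.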

5. Recommended Sequence

If Justin agrees with this proposal, the execution order I'd recommend:

1. Today: B1 + B2 + B4 + B8 (no-risk batch, all on Finn host, can be done in one pass with diffs shown before write).
2. Today: B5 boot-notify + B6 fleet-probe (two new services + one timer).
3. This week: BMC LAN configure (Justin physically at the rack to plug the BMC NIC + set static IP in BIOS / BMC web UI).
4. This week: AVX fix (one-shot Console reboot, when Justin is OK with ~60 s of Claude downtime).
5. This week: Plex GPU symlink pin (B3) (single CT bounce, low risk after BMC is up so Justin has remote console if Plex breaks).
6. Next week: UPS purchase + NUT config.
7. Next week: GitHub repo coverage check + secrets-encrypted-backup cron.
8. Next week: B7 tmux-anchor hardening + Vector Claude bootstrap doc.

Nothing in steps 1-2 is destructive or hard-to-reverse. Each diff is shown first.
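"Each diff is shown first" can be done mechanically. The sketch below applies a sed expression to a copy of the file on stdout and prints a unified diff, leaving the original untouched. `preview_edit` is a hypothetical helper name, not an existing forge tool.

```shell
#!/bin/sh
# Sketch of "show the diff first": run a sed expression against a file and print a
# unified diff, leaving the original file untouched. preview_edit is a hypothetical name.
preview_edit() {
  file=$1
  expr=$2
  # diff exits 1 when the files differ, which is the expected case here.
  sed "$expr" "$file" | diff -u "$file" - || true
}

# Example (commented out): preview the Console CPU change from Section 4.
#   preview_edit /etc/pve/qemu-server/103.conf 's/^cpu: x86-64-v2-AES/cpu: host/'
```

Only after the diff is approved would the same expression be rerun with `sed -i`.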

6. What I Need From Justin Before Continuing

Greenlight on Section 5 step 1 (the no-risk batch). I'll show every diff before writing.
Confirm whether Vector and Finn share a power circuit. This decides whether the Vector evidence really disproves the wall-power hypothesis or not. (Same room? Same outlet strip? Same breaker?)
Decide whether to chase forensics further (pull Eight Sleep cloud data, router event log, Garmin sleep payload around 12:10) or move on with the recovery work.
Confirm whether the MS-01 BMC has ever been configured. If yes, what is its IP, and are the creds in NordPass? If no, schedule a rack visit to plug the BMC NIC and set static IP.

7. References

Finn power loss / recovery report 2026-04-30
Fleet Names
Notify
Tailscale
FORGE-DOCTRINE.md, Section 10 (incident schema), Section 11 (context boundaries for downstream agents)

Author. Claude Code (Opus 4.7) on Console, 2026-04-30 13:30 CDT, picking up from the Phase 0 handoff. No destructive action taken in this session.