Finn (MS-01) Power Loss + Recovery Report, 2026-04-30
URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-power-loss-2026-04-30/
Status: Initial recovery complete. Deeper auto-start hardening + forensics + prevention work pending. Worker picking this up should start at "Open Worklist" below.
TL;DR
- Event: Hard power loss to Finn (MS-01 Proxmox host) at 12:10:29 CDT, recovered at 12:40:38 CDT (~30 min outage). Not a software crash, not a kernel-upgrade event, not OOM. Power was cut.
- Blast radius: All 6 production LXCs went down. plex (CT 101) failed to autostart on recovery due to a kernel-renumbered iGPU node (`/dev/dri/card1` to `card0`). 3 CTs (immich, frigate, minecraft) had `onboot=0` so didn't auto-recover.
- Current state: plex patched and running. immich + frigate now running with `onboot=1`. minecraft still off by design. Console host fine. 2 forge-* services on Console need restart. Telegram bot fleet needs end-to-end verify.
- Root cause hypothesis (most likely): wall power blip / breaker trip / UPS exhaustion. No software signal points to host fault. Confirm with Home Assistant + AdGuard logs as independent witnesses.
Forensic Evidence
Boot timeline
Boot -1: f848c49e... started Wed 2026-03-11 19:28:52 CDT
ended Thu 2026-04-30 12:10:29 CDT (50d uptime)
Boot 0: 62b02158... started Thu 2026-04-30 12:40:38 CDT (current)
30-minute gap between last log line on boot -1 and first log line on boot 0.
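The gap can be recomputed from the two timestamps in the timeline, which is handy when re-checking against `journalctl --list-boots` output. This is a minimal sketch, assuming GNU `date`; the function name `outage_minutes` is made up for illustration, and both timestamps are parsed as UTC only so the difference does not depend on the local timezone.

```shell
# Sketch: minutes between the last log line of one boot and the first
# log line of the next. Requires GNU date (-d). outage_minutes is a
# hypothetical helper name, not an existing forge tool.
outage_minutes() (
    last_log_epoch=$(date -u -d "$1" +%s) || exit 1
    first_log_epoch=$(date -u -d "$2" +%s) || exit 1
    # Integer division: partial minutes are dropped.
    echo $(( (first_log_epoch - last_log_epoch) / 60 ))
)

# Timestamps copied from the boot timeline above.
outage_minutes "2026-04-30 12:10:29" "2026-04-30 12:40:38"   # prints 30
```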
Why it's a power cut, not a crash
- Last log line on boot -1: normal apparmor `DENIED sendmsg` noise from rsyslogd on lxc-107 at 12:10:29. No shutdown sequence. No kernel panic. No OOM kill of init. No watchdog trigger. Just instant silence.
- `last -x` shows the prior session ended in `crash` (Linux's marker for an unclean reboot, i.e. nothing wrote a clean wtmp shutdown record).
- `apt history.log` has zero entries on 2026-04-30. The kernel jump from `6.17.2-1-pve` to `6.17.13-4-pve` is from a queued install during a prior `apt upgrade` weeks ago; the reboot just happened to load the newer kernel that was already on disk.
- Free memory at recovery: 11G/31G used, 8G swap free. No memory pressure.
- ZFS / disks: `zpool status -x` clean. `df` healthy (rpool 12%, workspace nvme 51%).
- Boot -1 had 50 days continuous uptime. Pattern is a one-shot event, not chronic instability.
What we have NOT yet checked (worker should do this)
- Home Assistant power / energy integration log for 12:00-12:45 (an independent power-state witness only if HA was on a different power path; confirm that first).
- AdGuard query log around 12:10 (DNS going silent is the signature of a power cut on the local network).
- Finn's `dmesg`, looking for pre-12:10 thermal warnings, ECC corrections, MCE events, or PSU under-voltage hints in IPMI / sensors. Note that `dmesg` only covers the current boot; the pre-cut kernel log lives in `journalctl -k -b -1`.
- `sensors` output baseline now to see if the box runs hot.
- Whether a UPS exists at all on the Finn rack. (User has not mentioned one; Section 3 amendment says Finn = Proxmox host without specifying power.) If yes, query its logs. If no, this is the single biggest gap.
What Was Recovered (already done)
| Item | Action | Verified |
|---|---|---|
| Finn host | Booted clean on 6.17.13-4-pve | uptime confirms |
| CT 101 plex | `/dev/dri/card1` to `card0` patch in `/etc/pve/lxc/101.conf`, started | `pct status` running |
| CT 107 immich | Started, set `onboot=1` | conf + status confirmed |
| CT 108 frigate | Started, set `onboot=1` | conf + status confirmed |
| CT 102 media-server, 105 adguard, 106 n8n | Auto-recovered on boot | reachable, uptime confirms |
What's Still Broken / Unverified
Console (host running this Claude Code session)
- `forge-context-prefetch.service`: failed
- `forge-habits-instantiate.service`: failed
- `forge-tmux-anchor.service`: exited (Type=oneshot, may be normal, but no tmux sessions are alive: `tmux ls` errors with "no such file or directory"). Justin's home base session is gone.
Finn LXCs (services inside containers, not yet checked)
- n8n (CT 106): docker stack inside might or might not have auto-started; n8n workflow engine status not verified.
- media-server (CT 102): arr-stack docker compose project; needs `docker ps` audit.
- frigate (CT 108): docker compose, camera RTSP reconnects can take minutes; needs verification + camera health check.
- adguard (CT 105): docker container quirk per memory; verify resolving DNS.
- plex (CT 101): verify transcoding works (GPU passthrough was just patched, so this is the highest-risk service).
- immich (CT 107): verify web UI + photo backup queue.
- homeassistant (CT 109?): uptime says 6 min, no users; verify integrations are reconnected (Eight Sleep, Hevy bridges, Garmin, Ring, etc.).
Telegram bot fleet
All 4 bots (capture, coordinator, notify, remote-bridge) show `active (running)` on Console, but none has been pinged end-to-end. They were down during the 30-min outage, so on recovery they may have a webhook/polling backlog. Worker should send a test message to each and confirm reaction + reply.
Public surfaces
- `mkdocs.justinsforge.com`: depends on `forge-mkdocs.service` (running), Cloudflare tunnel on Console (need to verify `cloudflared` status).
- `dashboard.justinsforge.com`: depends on `forge-dashboard.service` (running) + tunnel.
- `usage.justinsforge.com`: same.
- Any sites under `sites/` deployed via Cloudflare Pages: independent of Finn, should be unaffected.
- Notion-driven flows (orient, triage, habits): depend on Console cron + n8n; n8n's status is the question mark.
Cron / timer landmines
- 12:00 CT cron jobs that run on Finn or in LXCs: any that fired between 12:00 and 12:10:29 may have written partial state. Audit `forge-time-daily` (12:00 fire), morning report flow (8:50), and habits (04:00); all have already run today, and only the 17:00 and 23:30 jobs are still pending.
Recommended Solutions
A. Make recovery automatic next time (the explicit ask)
- Audit `onboot` across all CTs. Every production CT should have `onboot: 1` and a sane `startup: order=N,up=K,down=K` value. Phased boot order avoids the load-storm we just hit (load 5.4 from simultaneous starts caused SSH refusal).
  - Tier 1 (must boot first, no deps): 105 adguard (DNS for everything else)
  - Tier 2 (core): 106 n8n, 102 media-server
  - Tier 3 (UI): 101 plex, 107 immich, 108 frigate
  - Set order/up to spread starts ~15s apart.
- Audit `systemctl is-enabled` for every forge-* unit on Console. Anything `disabled` or `static` that should boot at startup gets enabled (a `static` unit has no `[Install]` section, so it needs one added or must be pulled in by an enabled unit).
- Restart-on-failure policy. Every long-running forge-* service should have `Restart=on-failure`, `RestartSec=10s`, and `StartLimitIntervalSec=300`, `StartLimitBurst=5` (the two `StartLimit*` settings belong in the `[Unit]` section). Currently inconsistent; needs a sweep.
- Boot watchdog notification. Add a `forge-boot-notify.service` (Type=oneshot, WantedBy=multi-user.target, runs late) that posts a Telegram alert via `forge_notify` on every Finn and Console boot. We were blind today; with this we would have known instantly.
- Cron-driven health probe. Every 5 min, `forge_fleet_probe.sh` pings every CT + every public surface, posts a failure to `@forge_notify_outbound_bot` if anything is down for 2 consecutive checks. Lighter than a real monitor, catches degraded states.
- Tmux anchor reliability. `forge-tmux-anchor.service` exits cleanly because it is a oneshot, but the underlying tmux session apparently does not survive boot. Either persist with `tmux-resurrect` or convert to a `Type=forking` service that creates the named session and never exits.
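The tiering above could be applied with `pct set`, which accepts `--onboot` and `--startup`. The sketch below is a dry run by default and only prints the commands. Two things in it are assumptions rather than anything decided in this report: each CT gets its own `order` value (so starts stay strictly sequential inside a tier), and the `APPLY` switch and `plan_startup` name are hypothetical.

```shell
# "vmid order" pairs following the tiers above. Distinct order values
# within a tier are an assumption, chosen to keep starts sequential.
TIERS="105 1
106 2
102 3
101 4
107 5
108 6"

# Print the pct commands (dry run). Set APPLY=1 to execute them on Finn.
# up=15 is the ~15 s spacing suggested above.
plan_startup() {
    echo "$TIERS" | while read -r vmid order; do
        cmd="pct set $vmid --onboot 1 --startup order=$order,up=15"
        if [ "${APPLY:-0}" = "1" ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
}

plan_startup
```

Reviewing the printed commands before running with `APPLY=1` keeps the sweep auditable; minecraft is left out on purpose because it is off by design.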
B. Prevent a recurrence
- UPS is the only real fix for wall-power loss. Software cannot prevent power being cut. Recommendation: APC Back-UPS Pro 1500VA-class on the Finn rack, NUT daemon talking USB to Finn and a slave config to Console (if both share the rack). NUT shutdown rules: at 30% battery, graceful `pct shutdown` of all CTs (`pct stop` is a hard stop), then `shutdown -h`. Logs and graphing via Home Assistant integration. Cost: ~$250-400.
- Hardware health baseline. Even if UPS is the answer, run `smartctl -a` on Finn's NVMe drives now to rule out impending storage failure that a power blip could have masked. Set up daily SMART self-test cron.
- Proxmox health alerting. Enable `pvesh set /cluster/notifications` or wire `pve-zsync` / `pve-firewall` notifications through n8n to Telegram for any CRITICAL host event.
- Replication / failover. Out of scope without a second host. Note as a Phase X consideration.
C. Defense against the specific bug we hit
- `/dev/dri/card*` renumbering after kernel upgrade. Pin by-path or by-id instead of card name. In CT 101's conf, replace `dev1: /dev/dri/card0,gid=44` with `dev1: /dev/dri/by-path/pci-0000:00:02.0-card,gid=44` (or whatever stable symlink applies to MS-01's iGPU). Verify with `ls -la /dev/dri/by-path/`. This stops the Plex GPU breakage from happening again on the next kernel bump.
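To find which stable symlink currently points at the iGPU node, the `ls -la` check can be turned into a lookup. A minimal sketch, assuming GNU `readlink -f`; `stable_dri_path` is a hypothetical helper name, and the second argument exists only so the function can be pointed at a test directory.

```shell
# Print the by-path symlink(s) that currently resolve to a DRI node.
# $1: node name such as card0
# $2: directory to scan (defaults to /dev/dri/by-path)
stable_dri_path() (
    node=$1
    dir=${2:-/dev/dri/by-path}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue              # skip non-symlinks
        target=$(readlink -f "$link")           # resolve to the real node
        if [ "$(basename "$target")" = "$node" ]; then
            echo "$link"
        fi
    done
    exit 0
)

# On Finn, after the patch: stable_dri_path card0
```

Whatever path it prints is the value to put on the `dev1:` line; the PCI address in the example above is only a guess until this has been run on MS-01.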
Open Worklist for the Worker
Pick up here. Each item is independently shippable.
Phase 1: verify what we already restarted (~10 min)
- End-to-end ping each Telegram bot: send a test message to `@forge_inbox_capture_bot`, `@forge_lifeos_coordinator_bot`, `@forge_notify_outbound_bot`, `@forge_remote_bridge_bot`. Confirm reactions + replies.
- `ssh n8n "docker ps"`: confirm n8n is up; check `https://n8n.justinkrystal.com` (or whatever the tunnel hostname is) responds.
- `ssh frigate "docker ps; curl -s localhost:5000/api/stats | jq .cameras | head"`: confirm cameras streaming.
- `ssh plex "curl -s localhost:32400/identity"`: confirm Plex API live; trigger a transcode test if possible to validate GPU patch.
- Open `https://mkdocs.justinsforge.com`, `https://dashboard.justinsforge.com`, `https://usage.justinsforge.com`: confirm 200.
- `systemctl restart forge-context-prefetch forge-habits-instantiate` on Console; check journal for the underlying error if they fail again.
- Re-spawn `forge-tmux-anchor` session: `systemctl restart forge-tmux-anchor` then `tmux ls` to confirm.
Phase 2: independent forensic witnesses (~10 min)
- Pull HA Eight Sleep + Ring + any energy entity timestamps around 12:10 to confirm whole-network power loss vs Finn-only.
- AdGuard query log: any DNS gap between 12:10 and 12:40?
- `dmesg -T` on Finn (or `journalctl -k -b -1` for the pre-cut boot, since `dmesg` only shows the current one) for any thermal / MCE / hardware events in the last 24h before the cut. Also `sensors` baseline.
- `smartctl -a /dev/nvme0n1` and any other drives.
- Check whether a UPS is physically attached (USB devices: `lsusb`, `ls /dev/usb/hiddev*`). If yes, identify model and grab event log.
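For the kernel-log check, a filter over the previous boot's messages keeps the output short. This is a sketch: the pattern list is an assumption about what thermal, machine-check, and power events tend to look like, not an exhaustive catalogue, and `filter_hw_events` is a hypothetical name.

```shell
# Keep only lines that look like thermal, machine-check, or power events.
# The pattern list is an assumption; widen it if nothing matches.
filter_hw_events() {
    grep -Ei 'thermal|throttl|mce|machine check|hardware error|voltage' || true
}

# On Finn, kernel messages from the boot that ended at 12:10:29:
#   journalctl -k -b -1 --no-pager | filter_hw_events
```

An empty result is itself evidence for the power-cut hypothesis, since it means the kernel logged no hardware complaint before going silent.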
Phase 3: wire auto-recovery (~30 min)
- Sweep `/etc/pve/lxc/*.conf`: confirm `onboot: 1` on all production CTs; add `startup: order=N,up=15` per the tiering above.
- On Console: `systemctl list-unit-files 'forge-*'` and check `is-enabled` per unit. Enable any that are runtime-only.
- Audit each forge-* service unit file for `Restart=on-failure` + `RestartSec` + `StartLimit*`. Fix in place; commit.
- Build `forge-boot-notify.service` on both Finn and Console. Use `forge_notify` (already in MEMORY.md). `WantedBy=multi-user.target`, runs near the end of boot.
- Build `forge_fleet_probe.sh` + a 5-min systemd timer. Light: ICMP every CT, HTTPS every public surface, alert on 2x consecutive fail.
- Patch CT 101 `dev1` to use `/dev/dri/by-path/...` symlink; reboot CT to validate; document the symlink path in `reference_media_server_stack.md` and `MEMORY.md`.
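The "alert on 2x consecutive fail" rule is the only stateful part of the probe, so it is worth sketching on its own. Everything named here is an assumption: the state directory, the function names, and the choice to alert exactly once (on the second failure) rather than on every failing run. How the `ALERT` line reaches Telegram is left to `forge_notify`, whose interface is not described in this report.

```shell
STATE_DIR=${STATE_DIR:-/var/tmp/forge_fleet_probe}   # hypothetical location

# Record one probe result and print ALERT on the second failure in a row.
# $1: target label, used as a file name, so no slashes (e.g. "plex")
# $2: "ok" or "fail"
record_result() (
    target=$1
    result=$2
    mkdir -p "$STATE_DIR"
    counter="$STATE_DIR/$target.fails"
    if [ "$result" = "ok" ]; then
        rm -f "$counter"                 # any success resets the streak
        exit 0
    fi
    fails=$(( $(cat "$counter" 2>/dev/null || echo 0) + 1 ))
    echo "$fails" > "$counter"
    if [ "$fails" -eq 2 ]; then
        echo "ALERT $target down for 2 consecutive checks"
    fi
    exit 0
)

# Probe helpers: ICMP for CTs, HTTPS status code for public surfaces.
probe_ping() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then echo ok; else echo fail; fi
}
probe_https() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
    if [ "$code" = "200" ]; then echo ok; else echo fail; fi
}
```

A timer run would then look like `record_result plex "$(probe_ping plex)"` per CT and `record_result mkdocs "$(probe_https https://mkdocs.justinsforge.com)"` per surface, with any output lines forwarded as alerts.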
Phase 4: prevention proposal (deliverable, not action) (~15 min)
- Write a follow-up handoff `finn-power-resilience-proposal-2026-04-30.md` with: UPS recommendation (specific model + price), NUT config sketch, SMART monitoring cron, Proxmox notification wiring. Justin will decide what to buy.
Phase 5: checkpointing
- Update `LESSONS.md` with the kernel-upgrade-renumbers-DRI-cards lesson (Section 10.4 incident schema).
- Update `MEMORY.md` reference entry for `reference_media_server_stack.md` to note the DRI by-path pin.
- Append a final section to this handoff documenting what shipped vs deferred.
Files / Paths Touched So Far
- `/etc/pve/lxc/101.conf` (Finn): `card1` to `card0` patch on `dev1` line.
- `/etc/pve/lxc/107.conf` (Finn): `onboot: 1` set.
- `/etc/pve/lxc/108.conf` (Finn): `onboot: 1` set.
References
- Doctrine: FORGE-DOCTRINE.md, Section 3 (naming) + Section 10 (self-iteration / incident schema).
- Fleet Names — Finn = Proxmox host on MS-01.
- Media-server stack — needs DRI symlink note added.
- Notify — used for boot-notify service.
Worker handoff signature. This handoff was written by Claude Code (Opus 4.7) on Console at 2026-04-30 13:05 CDT, immediately after the recovery work in Phase 0. Justin is continuing with the worker that picks this up.