Finn (MS-01) Power Loss + Recovery Report, 2026-04-30

URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-power-loss-2026-04-30/

Status: Initial recovery complete. Deeper auto-start hardening + forensics + prevention work pending. Worker picking this up should start at "Open Worklist for the Worker" below.

TL;DR

Event: Hard power loss to Finn (MS-01 Proxmox host) at 12:10:29 CDT, recovered at 12:40:38 CDT (~30 min outage). Not a software crash, not a kernel-upgrade event, not OOM. Power was cut.
Blast radius: All 6 production LXCs went down. plex (CT 101) failed to autostart on recovery because the kernel renumbered its iGPU node (/dev/dri/card1 became /dev/dri/card0). 3 CTs (immich, frigate, minecraft) had onboot=0, so they did not auto-recover.
Current state: plex patched and running. immich + frigate now running with onboot=1. minecraft still off by design. Console host fine. 2 forge-* services on Console need restart. Telegram bot fleet needs end-to-end verify.
Root cause hypothesis (most likely): wall power blip / breaker trip / UPS exhaustion. No software signal points to host fault. Confirm with Home Assistant + AdGuard logs as independent witnesses.

Forensic Evidence

Boot timeline

Boot -1: f848c49e... started Wed 2026-03-11 19:28:52 CDT
                    ended  Thu 2026-04-30 12:10:29 CDT  (50d uptime)
Boot  0: 62b02158... started Thu 2026-04-30 12:40:38 CDT  (current)

30-minute gap between last log line on boot -1 and first log line on boot 0.
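
The gap can be recomputed from the two timestamps above (journalctl --list-boots reports them). A minimal sketch, assuming GNU date; the function names are made up for illustration:

```shell
# Seconds between the last log line of one boot and the first of the next.
# Both arguments are parsed as UTC so the caller's timezone cannot skew the
# result; only the difference matters. Requires GNU date (-d).
outage_seconds() {
  local end_prev start_next
  end_prev=$(date -u -d "$1" +%s) || return 1
  start_next=$(date -u -d "$2" +%s) || return 1
  echo $(( start_next - end_prev ))
}

# Format a number of seconds as "Mm Ss".
fmt_min_sec() {
  echo "$(( $1 / 60 ))m $(( $1 % 60 ))s"
}
```

Usage: fmt_min_sec "$(outage_seconds '2026-04-30 12:10:29' '2026-04-30 12:40:38')" prints 30m 9s.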

Why it's a power cut, not a crash

Last log line on boot -1: normal apparmor DENIED sendmsg noise from rsyslogd on lxc-107 at 12:10:29. No shutdown sequence. No kernel panic. No OOM kill of init. No watchdog trigger. Just instant silence.
last -x shows the prior session ended in crash (Linux's marker for an unclean reboot, i.e. nothing wrote a clean wtmp shutdown record).
apt history.log has zero entries on 2026-04-30. The kernel jump from 6.17.2-1-pve to 6.17.13-4-pve is from a queued install during a prior apt upgrade weeks ago; the reboot just happened to load the newer kernel that was already on disk.
Memory at recovery: 11G of 31G used, 8G swap free. No memory pressure.
ZFS / disks: zpool status -x clean. df healthy (rpool 12%, workspace nvme 51%).
Boot -1 had 50 days continuous uptime. Pattern is a one-shot event, not chronic instability.
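
The "no shutdown sequence, no panic, no OOM" check can be made repeatable. A rough sketch, assuming the tail of the previous boot's journal is piped in; the grep patterns are illustrative guesses at typical systemd and kernel messages, not an exhaustive list, and the function name is hypothetical:

```shell
# Classify the tail of a previous boot's journal (read from stdin).
#   software-fault : a panic / OOM / watchdog line is present
#   clean-shutdown : systemd shutdown markers are present
#   abrupt-stop    : neither, i.e. the log just stops (consistent with power loss)
# The patterns are assumptions; tune them against real logs before trusting them.
classify_journal_tail() {
  local tail_text
  tail_text=$(cat)
  if printf '%s\n' "$tail_text" | grep -Eqi 'kernel panic|out of memory|oom-kill|watchdog.*(reset|timeout)'; then
    echo "software-fault"
  elif printf '%s\n' "$tail_text" | grep -Eqi 'systemd-shutdown|reached target.*(shutdown|power-off|reboot)|journal stopped'; then
    echo "clean-shutdown"
  else
    echo "abrupt-stop"
  fi
}
```

Usage on Finn: journalctl -b -1 -n 50 --no-pager | classify_journal_tail. An abrupt-stop result only says the log is silent; it does not prove a power cut on its own.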

What we have NOT yet checked (worker should do this)

Home Assistant power / energy integration log for 12:00-12:45, as an independent power-state witness. First confirm whether HA is on a different power path from Finn.
AdGuard query log around 12:10 (DNS dies → power cut signature on local network).
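Finn's kernel log from the boot that ended at 12:10 (journalctl -k -b -1; dmesg only covers the current boot), looking for thermal warnings, ECC corrections, or MCE events. Also check IPMI / sensors for PSU under-voltage hints.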
sensors output baseline now to see if the box runs hot.
Whether a UPS exists at all on the Finn rack. (User has not mentioned one; Section 3 amendment says Finn = Proxmox host without specifying power.) If yes, query its logs. If no, this is the single biggest gap.

What Was Recovered (already done)

| Item | Action | Verified |
| --- | --- | --- |
| Finn host | Booted clean on 6.17.13-4-pve | uptime confirms |
| CT 101 plex | `/dev/dri/card1` to `card0` patch in `/etc/pve/lxc/101.conf`, started | `pct status` running |
| CT 107 immich | Started, set `onboot=1` | conf + status confirmed |
| CT 108 frigate | Started, set `onboot=1` | conf + status confirmed |
| CT 102 media-server, 105 adguard, 106 n8n | Auto-recovered on boot | reachable, uptime confirms |
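
The plex failure mode (a config pointing at a DRI node that no longer exists) can be checked for any CT before starting it. A minimal sketch, assuming the passthrough line uses Proxmox's "devN: /path[,options]" form, as the "dev1 line" wording in this handoff suggests; the function names are hypothetical:

```shell
# Print every /dev/dri/* path referenced by a devN: line in an LXC config.
# Assumes the "devN: /path[,options]" syntax.
dri_nodes_in_conf() {
  sed -n -E 's|^dev[0-9]+:[[:space:]]*(/dev/dri/[^,[:space:]]+).*|\1|p' "$1"
}

# Print only the referenced nodes that do not exist on the host right now.
# Empty output means every referenced node is present.
missing_dri_nodes() {
  local node
  dri_nodes_in_conf "$1" | while IFS= read -r node; do
    [ -e "$node" ] || echo "$node"
  done
}
```

Usage on Finn: missing_dri_nodes /etc/pve/lxc/101.conf. Before the patch this would have listed /dev/dri/card1.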

What's Still Broken / Unverified

Console (host running this Claude Code session)

forge-context-prefetch.service — failed
forge-habits-instantiate.service — failed
forge-tmux-anchor.service — exited (Type=oneshot, may be normal, but no tmux sessions are alive: tmux ls errors with "no such file or directory"). Justin's home base session is gone.

Finn LXCs (services inside containers, not yet checked)

n8n (CT 106): the docker stack inside may not have auto-started; n8n workflow engine status not verified.
media-server (CT 102): arr-stack docker compose project; needs docker ps audit.
frigate (CT 108): docker compose, camera RTSP reconnects can take minutes; needs verification + camera health check.
adguard (CT 105): docker container quirk per memory; verify it is resolving DNS.
plex (CT 101): verify transcoding works (GPU passthrough was just patched, so this is the highest-risk service).
immich (CT 107): verify web UI + photo backup queue.
homeassistant (CT 109?): uptime says 6 min, no users; verify integrations are reconnected (Eight Sleep, Hevy bridges, Garmin, Ring, etc.).
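
For the docker-based CTs above, the docker ps audit can be reduced to "which containers are not up". A small sketch that filters name/status pairs; the function name is hypothetical and the ssh aliases are whatever the worker already has configured:

```shell
# Read "name<TAB>status" lines on stdin, as produced by
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
# and print the containers that are not running or report unhealthy.
containers_not_up() {
  awk -F '\t' '$2 !~ /^Up/ || $2 ~ /unhealthy/ { print $1 }'
}
```

Usage: ssh n8n "docker ps -a --format '{{.Names}}\t{{.Status}}'" | containers_not_up. Empty output means every container in that CT reports Up; it says nothing about whether the service inside is actually answering.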

Telegram bot fleet

All 4 bots (capture, coordinator, notify, remote-bridge) show active (running) on Console, but none has been pinged end-to-end. They were down during the 30-min outage, so on recovery they may have a webhook/polling backlog. Worker should send a test message to each and confirm reaction + reply.

Public surfaces

mkdocs.justinsforge.com — depends on forge-mkdocs.service (running), Cloudflare tunnel on Console (need to verify cloudflared status).
dashboard.justinsforge.com — depends on forge-dashboard.service (running) + tunnel.
usage.justinsforge.com — same.
Any sites under sites/ deployed via Cloudflare Pages — independent of Finn, should be unaffected.
Notion-driven flows (orient, triage, habits) — depend on Console cron + n8n; n8n's status is the question mark.

Cron / timer landmines

12:00 CDT cron jobs that run on Finn or in LXCs: any that fired between 12:00 and 12:10:29 may have written partial state. Audit forge-time-daily (12:00 fire), the morning report flow (8:50), and habits (04:00). All three have already fired today; only the 17:00 and 23:30 jobs are still pending.

Open Worklist for the Worker

Pick up here. Each item is independently shippable.

Phase 1: verify what we already restarted (~10 min)

End-to-end ping each Telegram bot: send a test message to @forge_inbox_capture_bot, @forge_lifeos_coordinator_bot, @forge_notify_outbound_bot, @forge_remote_bridge_bot. Confirm reactions + replies.
ssh n8n "docker ps" — confirm n8n is up; check https://n8n.justinkrystal.com (or whatever the tunnel hostname is) responds.
ssh frigate "docker ps; curl -s localhost:5000/api/stats | jq .cameras | head" — confirm cameras streaming.
ssh plex "curl -s localhost:32400/identity" — confirm Plex API live; trigger a transcode test if possible to validate GPU patch.
Open https://mkdocs.justinsforge.com, https://dashboard.justinsforge.com, https://usage.justinsforge.com — confirm 200.
systemctl restart forge-context-prefetch forge-habits-instantiate on Console; check journal for the underlying error if they fail again.
Re-spawn forge-tmux-anchor session: systemctl restart forge-tmux-anchor then tmux ls to confirm.
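
The "confirm 200" step for the public surfaces can be scripted instead of opened by hand. A minimal sketch using curl; the function names are hypothetical, and it treats anything other than a plain 200 (including redirects) as a failure:

```shell
# Print the HTTP status code for a URL; curl prints 000 if the request fails.
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# Print "OK <url>" or "FAIL <code> <url>" per URL; return non-zero if any failed.
check_surfaces() {
  local url code rc=0
  for url in "$@"; do
    code=$(http_status "$url")
    if [ "$code" = "200" ]; then
      echo "OK $url"
    else
      echo "FAIL $code $url"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: check_surfaces https://mkdocs.justinsforge.com https://dashboard.justinsforge.com https://usage.justinsforge.com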

Phase 2: independent forensic witnesses (~10 min)

Pull HA Eight Sleep + Ring + any energy entity timestamps around 12:10 to confirm whole-network power loss vs Finn-only.
AdGuard query log: any DNS gap between 12:10 and 12:40?
journalctl -k -b -1 on Finn for any thermal / MCE / hardware events in the last 24h before the cut (dmesg -T only shows the current boot). Also sensors baseline.
smartctl -a /dev/nvme0n1 and any other drives.
Check whether a UPS is physically attached (USB devices: lsusb, ls /dev/usb/hiddev*). If yes, identify model and grab event log.
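
For the HA and AdGuard witnesses, the question is the same: where is the largest hole in the event timestamps? A sketch that works on any sorted list of epoch seconds, one per line. How to extract those timestamps from the HA or AdGuard logs is not covered here, since this handoff does not describe their formats; the function name is hypothetical:

```shell
# Read one epoch-seconds timestamp per line (sorted ascending) on stdin and
# print "gap_seconds gap_start gap_end" for the largest gap between neighbours.
# Timestamps are echoed back as the original strings, untouched by arithmetic.
largest_gap() {
  awk 'BEGIN { gap = -1 }
       NR > 1 && $1 - prev > gap { gap = $1 - prev; from = prev; to = $1 }
       { prev = $1 }
       END { if (NR > 1) print gap, from, to }'
}
```

If the whole network lost power, both witnesses should show a gap that brackets 12:10-12:40. If only Finn lost power, a witness on a separate power path should show no comparable gap.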

Phase 3: wire auto-recovery (~30 min)

Sweep /etc/pve/lxc/*.conf: confirm onboot: 1 on all production CTs; add startup: order=N,up=15 per tier. (The tiering is not spelled out in this handoff; define the start order before applying it.)
On Console: systemctl list-unit-files 'forge-*' and check is-enabled per unit. Enable any that are runtime-only.
Audit each forge-* service unit file for Restart=on-failure + RestartSec + StartLimit*. Fix in place; commit.
Build forge-boot-notify.service on both Finn and Console. Use forge_notify (already in MEMORY.md). WantedBy=multi-user.target, runs near the end of boot.
Build forge_fleet_probe.sh + a 5-min systemd timer. Light: ICMP every CT, HTTPS every public surface, alert on 2x consecutive fail.
Patch CT 101 dev1 to use /dev/dri/by-path/... symlink; reboot CT to validate; document the symlink path in reference_media_server_stack.md and MEMORY.md.
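
The "alert on 2x consecutive fail" rule for forge_fleet_probe.sh needs a little state between timer runs. One possible sketch, keeping a per-target failure counter in a file; STATE_DIR, FAIL_THRESHOLD, and record_probe are hypothetical names, and the actual ICMP/HTTPS probing and the forge_notify call are left out:

```shell
# Where per-target failure counters live (hypothetical location).
STATE_DIR="${STATE_DIR:-/tmp/forge_fleet_probe}"
# Number of consecutive failures that triggers an alert.
FAIL_THRESHOLD="${FAIL_THRESHOLD:-2}"

# usage: record_probe <target-name> <ok|fail>
# Echoes OK, WARN (failing but under threshold), or ALERT (threshold reached).
record_probe() {
  local file count
  mkdir -p "$STATE_DIR"
  file="$STATE_DIR/$1.fails"
  if [ "$2" = "ok" ]; then
    rm -f "$file"            # any success resets the streak
    echo "OK"
    return 0
  fi
  count=$(( $(cat "$file" 2>/dev/null || echo 0) + 1 ))
  echo "$count" > "$file"
  if [ "$count" -ge "$FAIL_THRESHOLD" ]; then
    echo "ALERT"
  else
    echo "WARN"
  fi
}
```

As written, this keeps returning ALERT on every run while a target stays down. If that is too noisy on a 5-min timer, compare with -eq instead of -ge so the alert fires once per streak.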

Phase 4: prevention proposal (deliverable, not action) (~15 min)

Write a follow-up handoff finn-power-resilience-proposal-2026-04-30.md with: UPS recommendation (specific model + price), NUT config sketch, SMART monitoring cron, Proxmox notification wiring. Justin will decide what to buy.

Phase 5: checkpointing

Update LESSONS.md with the kernel-upgrade-renumbers-DRI-cards lesson (Section 10.4 incident schema).
Update MEMORY.md reference entry for reference_media_server_stack.md to note the DRI by-path pin.
Append a final section to this handoff documenting what shipped vs deferred.

Files / Paths Touched So Far

/etc/pve/lxc/101.conf (Finn): card1 to card0 patch on dev1 line.
/etc/pve/lxc/107.conf (Finn): onboot: 1 set.
/etc/pve/lxc/108.conf (Finn): onboot: 1 set.
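
To see which CTs still lack the onboot flag after these changes, the config directory can be scanned. A minimal sketch with a hypothetical function name; it reads only the top section of each file, on the assumption that any later [name] sections are snapshot or pending state rather than live config:

```shell
# Print the IDs of containers whose live config lacks "onboot: 1".
# Defaults to /etc/pve/lxc; pass another directory to test against copies.
cts_without_onboot() {
  local dir="${1:-/etc/pve/lxc}" conf
  for conf in "$dir"/*.conf; do
    [ -e "$conf" ] || continue
    # Stop at the first [section] header; exit 0 only if onboot: 1 was seen.
    awk '/^\[/ { exit }
         /^onboot:[ \t]*1[ \t]*$/ { found = 1 }
         END { exit !found }' "$conf" \
      || basename "$conf" .conf
  done
}
```

Usage on Finn: cts_without_onboot. Expect minecraft's ID in the output, since it is off by design; anything else listed needs pct set <id> --onboot 1.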

References

Doctrine: FORGE-DOCTRINE.md, Section 3 (naming) + Section 10 (self-iteration / incident schema).
Fleet Names — Finn = Proxmox host on MS-01.
Media-server stack — needs DRI symlink note added.
Notify — used for boot-notify service.

Worker handoff signature. This handoff was written by Claude Code (Opus 4.7) on Console at 2026-04-30 13:05 CDT, immediately after the recovery work listed under "What Was Recovered" (Phase 0). Justin is continuing with the worker that picks this up.