
Finn (MS-01) Power Loss + Recovery Report, 2026-04-30

URL: https://mkdocs.justinsforge.com/memory/handoffs/finn-power-loss-2026-04-30/

Status: Initial recovery complete. Deeper auto-start hardening + forensics + prevention work pending. Worker picking this up should start at "Open Worklist" below.


TL;DR

  • Event: Hard power loss to Finn (MS-01 Proxmox host) at 12:10:29 CDT, recovered at 12:40:38 CDT (~30 min outage). Not a software crash, not a kernel-upgrade event, not OOM. Power was cut.
  • Blast radius: All 6 production LXCs went down. plex (CT 101) failed to autostart on recovery due to a kernel-renumbered iGPU node (/dev/dri/card1 to card0). 3 CTs (immich, frigate, minecraft) had onboot=0 so didn't auto-recover.
  • Current state: plex patched and running. immich + frigate now running with onboot=1. minecraft still off by design. Console host fine. 2 forge-* services on Console need restart. Telegram bot fleet needs end-to-end verify.
  • Root cause hypothesis (most likely): wall power blip / breaker trip / UPS exhaustion. No software signal points to host fault. Confirm with Home Assistant + AdGuard logs as independent witnesses.

Forensic Evidence

Boot timeline

Boot -1: f848c49e... started Wed 2026-03-11 19:28:52 CDT
                    ended  Thu 2026-04-30 12:10:29 CDT  (50d uptime)
Boot  0: 62b02158... started Thu 2026-04-30 12:40:38 CDT  (current)

30-minute gap between last log line on boot -1 and first log line on boot 0.

Why it's a power cut, not a crash

  • Last log line on boot -1: normal apparmor DENIED sendmsg noise from rsyslogd on lxc-107 at 12:10:29. No shutdown sequence. No kernel panic. No OOM kill of init. No watchdog trigger. Just instant silence.
  • last -x shows the prior session ended in crash (Linux's marker for an unclean reboot, i.e. nothing wrote a clean wtmp shutdown record).
  • apt history.log has zero entries on 2026-04-30. The kernel jump from 6.17.2-1-pve to 6.17.13-4-pve is from a queued install during a prior apt upgrade weeks ago; the reboot just happened to load the newer kernel that was already on disk.
  • Memory at recovery: 11G of 31G used, 8G swap free. No memory pressure.
  • ZFS / disks: zpool status -x clean. df healthy (rpool 12%, workspace nvme 51%).
  • Boot -1 had 50 days continuous uptime. Pattern is a one-shot event, not chronic instability.
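The evidence above can be re-collected with a few read-only commands. The following is a minimal sketch, assuming GNU date and a persistent systemd journal on Finn (which the boot list implies); the function names gap_seconds and collect_boot_evidence are illustrative, not existing forge tooling.

```shell
#!/usr/bin/env bash
# Sketch: re-collect the power-cut evidence on Finn.
# Function names are hypothetical; every command here is read-only.

# Seconds between two timestamps (GNU date syntax assumed).
gap_seconds() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# Print the evidence used in this report. Run as root on Finn.
collect_boot_evidence() {
    journalctl --list-boots | tail -n 2        # boundaries of boot -1 and boot 0
    journalctl -b -1 -n 20 --no-pager          # last lines before the silence
    last -x | head -n 10                       # "crash" marks an unclean end
    grep -c '^Start-Date: 2026-04-30' /var/log/apt/history.log  # expect 0
    free -h                                    # memory pressure check
    zpool status -x                            # pool health
}
```

For the two boot boundaries in the timeline, gap_seconds '2026-04-30 12:10:29' '2026-04-30 12:40:38' prints 1809, i.e. 30 minutes and 9 seconds of silence.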

What we have NOT yet checked (worker should do this)

  • Home Assistant power / energy integration log for 12:00-12:45. This is only an independent power-state witness if HA sits on a different power path than Finn; confirm that first.
  • AdGuard query log around 12:10 (DNS dies → power cut signature on local network).
  • Finn's kernel log from the previous boot (journalctl -k -b -1; dmesg only covers the current boot) for pre-12:10 thermal warnings, ECC corrections, or MCE events, plus any PSU under-voltage hints in IPMI / sensors.
  • sensors output baseline now to see if the box runs hot.
  • Whether a UPS exists at all on the Finn rack. (User has not mentioned one; Section 3 amendment says Finn = Proxmox host without specifying power.) If yes, query its logs. If no, this is the single biggest gap.

What Was Recovered (already done)

Item | Action | Verified
Finn host | Booted clean on 6.17.13-4-pve | uptime confirms
CT 101 plex | Patched /dev/dri/card1 to card0 in /etc/pve/lxc/101.conf, started | pct status running
CT 107 immich | Started, set onboot=1 | conf + status confirmed
CT 108 frigate | Started, set onboot=1 | conf + status confirmed
CT 102 media-server, 105 adguard, 106 n8n | Auto-recovered on boot | reachable, uptime confirms

What's Still Broken / Unverified

Console (host running this Claude Code session)

  • forge-context-prefetch.service — failed
  • forge-habits-instantiate.service — failed
  • forge-tmux-anchor.service — exited (Type=oneshot, may be normal, but no tmux sessions are alive: tmux ls errors with "no such file or directory"). Justin's home base session is gone.

Finn LXCs (services inside containers, not yet checked)

  • n8n (CT 106): docker stack inside might or might not have auto-started; n8n workflow engine status not verified.
  • media-server (CT 102): arr-stack docker compose project; needs docker ps audit.
  • frigate (CT 108): docker compose, camera RTSP reconnects can take minutes; needs verification + camera health check.
  • adguard (CT 105): docker container quirk per memory; verify resolving DNS.
  • plex (CT 101): verify transcoding works (GPU passthrough was just patched, so this is the highest-risk service).
  • immich (CT 107): verify web UI + photo backup queue.
  • homeassistant (CT 109?): uptime says 6 min, no users; verify integrations are reconnected (Eight Sleep, Hevy bridges, Garmin, Ring, etc.).

Telegram bot fleet

All 4 bots (capture, coordinator, notify, remote-bridge) show active (running) on Console, but none has been pinged end-to-end. They were down during the 30-min outage, so on recovery they may have a webhook/polling backlog. Worker should send a test message to each and confirm reaction + reply.

Public surfaces

  • mkdocs.justinsforge.com — depends on forge-mkdocs.service (running), Cloudflare tunnel on Console (need to verify cloudflared status).
  • dashboard.justinsforge.com — depends on forge-dashboard.service (running) + tunnel.
  • usage.justinsforge.com — same.
  • Any sites under sites/ deployed via Cloudflare Pages — independent of Finn, should be unaffected.
  • Notion-driven flows (orient, triage, habits) — depend on Console cron + n8n; n8n's status is the question mark.

Cron / timer landmines

  • Cron jobs scheduled for 12:00 CDT that run on Finn or in LXCs: any that fired between 12:00 and 12:10:29 may have written partial state. Audit forge-time-daily (12:00 fire), the morning report flow (08:50), and habits (04:00). All three have already fired today, so only the 17:00 and 23:30 jobs are still pending.

A. Make recovery automatic next time (the explicit ask)

  1. Audit onboot across all CTs. Every production CT should have onboot: 1 and a sane startup: order=N,up=K,down=K value. A phased boot order avoids the load storm we just hit (load 5.4 from simultaneous starts caused SSH refusal). Set order/up to spread starts ~15s apart across these tiers:
     • Tier 1 (must boot first, no deps): 105 adguard (DNS for everything else)
     • Tier 2 (core): 106 n8n, 102 media-server
     • Tier 3 (UI): 101 plex, 107 immich, 108 frigate
  2. Audit systemctl is-enabled for every forge-* unit on Console. Anything disabled or static that should start at boot gets enabled.
  3. Restart-on-failure policy. Every long-running forge-* service should have Restart=on-failure and RestartSec=10s in [Service], plus StartLimitIntervalSec=300 and StartLimitBurst=5 in [Unit]. Currently inconsistent; needs a sweep.
  4. Boot watchdog notification. Add a forge-boot-notify.service (Type=oneshot, WantedBy=multi-user.target, runs late) that posts a Telegram alert via forge_notify on every Finn and Console boot. We were blind today; with this we would have known instantly.
  5. Cron-driven health probe. Every 5 min, forge_fleet_probe.sh pings every CT + every public surface and posts a failure to @forge_notify_outbound_bot if anything is down for 2 consecutive checks. Lighter than a real monitor, but it catches degraded states.
  6. Tmux anchor reliability. forge-tmux-anchor.service exits cleanly because it is Type=oneshot, but the underlying tmux session apparently does not survive boot. Either persist it with tmux-resurrect or convert the unit to a Type=forking service that creates the named session and never exits.
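The tiered boot order above can be turned into concrete pct commands. This is a minimal sketch: plan_startup is a hypothetical helper, the CT order follows the tiers listed above, and 15 seconds is the suggested spacing. It only prints commands so they can be reviewed before anything on Finn changes.

```shell
#!/usr/bin/env bash
# Sketch: print the pct commands for a phased boot order.
# plan_startup is a hypothetical helper; nothing is applied automatically.

# Usage: plan_startup DELAY_SECONDS CTID...
# CTIDs are given in boot order; each one gets the next order number, and
# up=DELAY makes Proxmox wait that long before starting the next guest.
plan_startup() {
    local delay=$1 order=1 ctid
    shift
    for ctid in "$@"; do
        echo "pct set ${ctid} --onboot 1 --startup order=${order},up=${delay}"
        order=$(( order + 1 ))
    done
}

# Tier 1: 105 adguard. Tier 2: 106 n8n, 102 media-server.
# Tier 3: 101 plex, 107 immich, 108 frigate.
plan_startup 15 105 106 102 101 107 108
```

Once the printed commands look right, they can be pasted or piped into a root shell on Finn. minecraft is left out on purpose because it stays off by design.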

B. Prevent a recurrence

  • UPS is the only real fix for wall-power loss. Software cannot prevent power being cut. Recommendation: an APC Back-UPS Pro 1500VA-class unit on the Finn rack, with the NUT daemon talking USB to Finn and a slave config on Console (if both share the rack). NUT shutdown rules: at 30% battery, gracefully shut down all CTs (pct shutdown, not pct stop, which kills the container without a clean shutdown), then shutdown -h. Logs and graphing via the Home Assistant integration. Cost: ~$250-400.
  • Hardware health baseline. Even if UPS is the answer, run smartctl -a on Finn's NVMe drives now to rule out impending storage failure that a power blip could have masked. Set up daily SMART self-test cron.
  • Proxmox health alerting. Enable pvesh set /cluster/notifications or wire pve-zsync/pve-firewall notifications through n8n to Telegram for any CRITICAL host event.
  • Replication / failover. Out of scope without a second host. Note as a Phase X consideration.
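The low-battery shutdown sequence could look roughly like the sketch below. How NUT triggers it (for example through SHUTDOWNCMD in upsmon.conf) depends on the UPS that ends up being bought, so that wiring is left out. The function names and the 60-second timeout are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: shutdown sequence for a UPS low-battery hook on Finn.
# Function names and the 60-second timeout are assumptions.

# Read `pct list` output on stdin and print the VMIDs that are running.
running_cts() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Ask every running CT to shut down cleanly, then halt the host.
graceful_host_shutdown() {
    local ctid
    for ctid in $(pct list | running_cts); do
        # pct shutdown requests a clean stop; pct stop would kill the CT.
        pct shutdown "$ctid" --timeout 60 \
            || echo "CT ${ctid} did not shut down cleanly" >&2
    done
    shutdown -h now
}
```

Nothing runs until graceful_host_shutdown is called, so the parsing half (running_cts) can be checked safely against real pct list output first.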

C. Defense against the specific bug we hit

  • /dev/dri/card* renumbering after kernel upgrade. Pin by-path or by-id instead of card name. In CT 101's conf, replace dev1: /dev/dri/card0,gid=44 with dev1: /dev/dri/by-path/pci-0000:00:02.0-card,gid=44 (or whatever stable symlink applies to MS-01's iGPU). Verify with ls -la /dev/dri/by-path/. This stops the Plex GPU breakage from happening again on the next kernel bump.
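To find which by-path symlink belongs to the iGPU without guessing the PCI address, something like the following sketch can help. stable_dri_path is a hypothetical helper; it assumes the usual udev layout in which /dev/dri/by-path holds symlinks ending in -card.

```shell
#!/usr/bin/env bash
# Sketch: find the stable by-path symlink that points at a given card node.
# stable_dri_path is a hypothetical helper name.

# Usage: stable_dri_path CARD_NAME [BY_PATH_DIR]
# Prints the first by-path symlink whose target is CARD_NAME (e.g. card0).
stable_dri_path() {
    local card=$1 dir=${2:-/dev/dri/by-path} link
    for link in "$dir"/*-card; do
        [ -L "$link" ] || continue
        if [ "$(basename "$(readlink -f "$link")")" = "$card" ]; then
            echo "$link"
            return 0
        fi
    done
    return 1
}
```

On Finn, stable_dri_path card0 should print the path to use on the dev1 line. Whether the container config accepts that path still needs to be validated by restarting CT 101 and running a transcode.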

Open Worklist for the Worker

Pick up here. Each item is independently shippable.

Phase 1 — verify what we already restarted (~10 min)

  1. End-to-end ping each Telegram bot: send a test message to @forge_inbox_capture_bot, @forge_lifeos_coordinator_bot, @forge_notify_outbound_bot, @forge_remote_bridge_bot. Confirm reactions + replies.
  2. ssh n8n "docker ps" — confirm n8n is up; check https://n8n.justinkrystal.com (or whatever the tunnel hostname is) responds.
  3. ssh frigate "docker ps; curl -s localhost:5000/api/stats | jq .cameras | head" — confirm cameras streaming.
  4. ssh plex "curl -s localhost:32400/identity" — confirm Plex API live; trigger a transcode test if possible to validate GPU patch.
  5. Open https://mkdocs.justinsforge.com, https://dashboard.justinsforge.com, https://usage.justinsforge.com — confirm 200.
  6. systemctl restart forge-context-prefetch forge-habits-instantiate on Console; check journal for the underlying error if they fail again.
  7. Re-spawn forge-tmux-anchor session: systemctl restart forge-tmux-anchor then tmux ls to confirm.

Phase 2 — independent forensic witnesses (~10 min)

  1. Pull HA Eight Sleep + Ring + any energy entity timestamps around 12:10 to confirm whole-network power loss vs Finn-only.
  2. AdGuard query log: any DNS gap between 12:10 and 12:40?
  3. journalctl -k -b -1 on Finn for any thermal / MCE / hardware events in the 24h before the cut (dmesg -T only shows the current boot). Also take a sensors baseline.
  4. smartctl -a /dev/nvme0n1 and any other drives.
  5. Check whether a UPS is physically attached (USB devices: lsusb, ls /dev/usb/hiddev*). If yes, identify model and grab event log.
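For the kernel-log check, the messages from before the cut live in the journal for boot -1, since the dmesg ring buffer only holds the current boot. A minimal sketch follows; the pattern list is a starting point chosen for illustration, not an exhaustive catalogue, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: scan the previous boot's kernel log for hardware distress signals.
# The patterns are illustrative; extend them as needed.

# Filter kernel log lines on stdin down to thermal / MCE / voltage / disk hints.
filter_hw_events() {
    grep -Ei 'thermal|throttl|mce:|machine check|hardware error|under-?voltage|nvme.*(timeout|reset)|i/o error'
}

# Kernel messages from the boot that ended at 12:10:29 (boot -1).
prior_boot_hw_events() {
    journalctl -k -b -1 --no-pager | filter_hw_events
}
```

An empty result from prior_boot_hw_events supports the power-cut hypothesis; any hit is worth reading in full context with journalctl -k -b -1.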

Phase 3 — wire auto-recovery (~30 min)

  1. Sweep /etc/pve/lxc/*.conf: confirm onboot: 1 on all production CTs; add startup: order=N,up=15 per the tiering above.
  2. On Console: systemctl list-unit-files 'forge-*' and check is-enabled per unit. Enable any that are runtime-only.
  3. Audit each forge-* service unit file for Restart=on-failure + RestartSec + StartLimit*. Fix in place; commit.
  4. Build forge-boot-notify.service on both Finn and Console. Use forge_notify (already in MEMORY.md). WantedBy=multi-user.target, runs near the end of boot.
  5. Build forge_fleet_probe.sh + a 5-min systemd timer. Light: ICMP every CT, HTTPS every public surface, alert on 2x consecutive fail.
  6. Patch CT 101 dev1 to use /dev/dri/by-path/... symlink; reboot CT to validate; document the symlink path in reference_media_server_stack.md and MEMORY.md.
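The "alert on 2 consecutive failures" rule for forge_fleet_probe.sh needs a little state between runs. One possible shape is sketched below; STATE_DIR, record_result, and the probe helpers are assumptions made for illustration, and how an alert line reaches forge_notify is left open because its interface is not described here.

```shell
#!/usr/bin/env bash
# Sketch of the consecutive-failure rule for forge_fleet_probe.sh.
# STATE_DIR, record_result and the probe helpers are hypothetical names.

STATE_DIR=${STATE_DIR:-/var/tmp/forge-fleet-probe}

# Usage: record_result TARGET ok|fail
# Keeps a per-target failure counter in a small file and prints
# "ALERT <target>" once, when the counter reaches 2.
record_result() {
    local target=$1 result=$2 file count=0
    mkdir -p "$STATE_DIR"
    file="$STATE_DIR/$(printf '%s' "$target" | tr -c 'A-Za-z0-9._-' '_')"
    if [ "$result" = ok ]; then
        rm -f "$file"            # any success resets the streak
        return 0
    fi
    [ -f "$file" ] && count=$(cat "$file")
    count=$(( count + 1 ))
    echo "$count" > "$file"
    if [ "$count" -eq 2 ]; then
        echo "ALERT $target"
    fi
}

# Probe helpers: ICMP for CTs, HTTPS status for public surfaces.
probe_ping()  { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }
probe_https() { [ "$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")" = "200" ]; }
```

A timer-driven wrapper would call one probe per target, pass ok or fail to record_result, and forward any ALERT line to the notify bot. Because the alert fires only when the counter hits exactly 2, a long outage produces one message instead of one every 5 minutes.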

Phase 4 — prevention proposal (deliverable, not action) (~15 min)

  1. Write a follow-up handoff finn-power-resilience-proposal-2026-04-30.md with: UPS recommendation (specific model + price), NUT config sketch, SMART monitoring cron, Proxmox notification wiring. Justin will decide what to buy.

Phase 5 — checkpointing

  1. Update LESSONS.md with the kernel-upgrade-renumbers-DRI-cards lesson (Section 10.4 incident schema).
  2. Update MEMORY.md reference entry for reference_media_server_stack.md to note the DRI by-path pin.
  3. Append a final section to this handoff documenting what shipped vs deferred.

Files / Paths Touched So Far

  • /etc/pve/lxc/101.conf (Finn): card1 to card0 patch on dev1 line.
  • /etc/pve/lxc/107.conf (Finn): onboot: 1 set.
  • /etc/pve/lxc/108.conf (Finn): onboot: 1 set.


Worker handoff signature. This handoff was written by Claude Code (Opus 4.7) on Console at 2026-04-30 13:05 CDT, immediately after the recovery work in Phase 0. Justin is continuing with the worker that picks this up.