Architecture Optimization Proposals, 2026-04-21

Auditor: Claude Opus 4.7 (sec-audit session)
Companion doc: security-audit-2026-04-21.md
Fleet snapshot: Finn load 1.18, RAM 20G/31G used, workspace 45%, media 66%. UDev 72% root, 2.6G/11G RAM. All services up.


TL;DR, 5 Biggest Wins

| Rank | Proposal | Effort | Impact |
| --- | --- | --- | --- |
| 1 | Add 64GB RAM to Finn (total 96GB) | 15 min + $170 | Unclogs everything: no more 78% RAM pressure, room for Coral/PBS/new CTs |
| 2 | Offsite backup pipeline (/mnt/storage/workspace-backup → Backblaze B2) | 1 hr setup, ~$13/mo | Removes the "Finn dies = total loss" SPOF |
| 3 | Coral TPU for Frigate (M.2 or USB) | 30 min + $30–60 | Cuts Frigate CPU load by ~80%; frees cores for Plex transcoding; ready for 3-cam expansion |
| 4 | UPS on Finn (CyberPower 1500VA or APC equivalent) | 10 min + $200 | Kills brownout reboots; protects NVMe + HDD from corruption |
| 5 | Reverse-proxy consolidation (Caddy or Traefik on UDev) | 2 hr | Cleaner tunnel→service routing, metrics, one place for TLS/access config; prepares for Cloudflare Access → internal auth migration |

1. Single Points of Failure

| SPOF | Current Exposure | Fix |
| --- | --- | --- |
| Finn itself | Hardware failure = every service down: Plex, HA, n8n, cameras, all VMs | Long-term: cluster with a second mini PC (MS-01 variant) for HA migrations. Short-term: Proxmox Backup Server to external store. |
| Workspace NVMe | Single NVMe on Finn holds 2.7TB of active work + Frigate recordings. If it dies, 30 days of camera footage + active video projects are gone. | Offsite backup pipeline (see TL;DR #2 and section 4). Consider a second NVMe slot on Finn as a ZFS mirror long-term. |
| Google Fiber / home network | Single ISP; entire fleet unreachable externally during outages | Acceptable. Tailscale provides LAN-like recovery if cell-tethered. |
| AdGuard (CT 105) | Single DNS for the fleet: if it dies or gets wedged, lookups fail fleet-wide (this just happened via a Tailscale MagicDNS issue) | Add 8.8.8.8 or 1.1.1.1 as secondary nameserver in each LXC config so fallback works. Currently most containers have ONLY .75. |
| n8n (CT 106) | Sole event-trigger layer for the whole AI fleet | Redundancy not worth the cost; ensure the n8n DB is backed up nightly (it's in container LVM: PBS covers this) |
| Cloudflare Tunnel (UDev) | If UDev cloudflared dies, justinsforge.com + dashboard.justinsforge.com go down | Acceptable: tunnel auto-reconnects, and systemd restart is set. |
| Finn as NFS server | UDev I/O fully depends on Finn's NFS exports | Acceptable; they're on the same physical box. If Finn is down, UDev is down. |
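
The AdGuard fix can be verified mechanically. The following is a minimal sketch, not part of the existing monitors, that parses resolv.conf-style text and reports whether any nameserver other than the primary is configured. The address 192.0.2.75 is a documentation placeholder standing in for the AdGuard address that the audit abbreviates as ".75", and the function names are hypothetical.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return nameserver addresses in the order they appear in the file text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_fallback(resolv_conf_text: str, primary: str) -> bool:
    """True if at least one configured nameserver differs from `primary`."""
    return any(server != primary for server in nameservers(resolv_conf_text))


# Placeholder primary address (assumption, not the real AdGuard IP).
PRIMARY = "192.0.2.75"
only_primary = "search lan\nnameserver 192.0.2.75\n"
with_fallback = only_primary + "nameserver 1.1.1.1\n"

print(has_fallback(only_primary, PRIMARY))
print(has_fallback(with_fallback, PRIMARY))
```

Run against each container's /etc/resolv.conf, a check like this could feed the existing pass/fail monitor style, flagging any LXC that still lists only the AdGuard address.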

2. Resource Waste / Contention

Live reading at audit time:

| Resource | Current | Issue | Proposal |
| --- | --- | --- | --- |
| Finn RAM | 20.6G used / 31G total, 24.8G allocated across CTs+VMs = over-provisioned by 1.5GB | Adding CT 107 (Immich 4G) + CT 108 (Frigate 4G) pushed allocation >31G. Swap active. | Upgrade to 96GB DDR5 SO-DIMM (2×48GB Crucial). MS-01 supports up to 128G. ~$170. Single biggest quality-of-life improvement. |
| Frigate CPU | ~75% per doc, from CPU-based detection | No accelerator; burns 3–4 cores continuously | Coral TPU (Google Coral M.2 Accelerator or USB Accelerator). MS-01 has M.2 slots. Should drop Frigate to ~10% CPU. Also future-proofs for 3 cameras. |
| media-server RAM | 2048 MB for 5+ containers → 87% routine | Tight but stable per design | Bump to 3072 MB after Finn RAM upgrade |
| UDev root disk | 72% of 24G used | Nothing heavy lives here, but tasks/pending at ~2500 JSON files adds up | Purge task backlog (security-audit doc step 7). Raise UDev root to 40G via Proxmox resize if needed. |
| UDev RAM | 2.6G / 11G used | Plenty of headroom | Consider shrinking to 8G and giving those 3GB to Frigate or media-server, after Finn RAM upgrade |
| Plex LXC RAM | 5120 MB allocated | Generous; is that peak usage? | Measure via pct exec 101 -- free; likely reducible to 3GB |
| iGPU contention | Plex + Frigate both pass through to /dev/dri/renderD128 | Concurrent Plex transcodes + Frigate detection may throttle each other | Coral TPU removes Frigate from the iGPU path entirely, which fixes this. |
| /mnt/seagate | 7.3T drive, 75% used, undocumented | Unknown content: either wasted space or unmanaged risk | Investigate: what's on it? Staging? Old backup? Either promote to an official role (e.g., PBS datastore) or wipe and reclaim. 1.9T free right now. |

3. Network Topology Improvements

| Area | Current | Proposal |
| --- | --- | --- |
| Reverse proxy | Cloudflare tunnel routes each subdomain directly to a port on media-server/UDev. No internal proxy. | Run Caddy or Traefik on UDev. Cloudflare tunnel routes all *.justinkrystal.com → Caddy → internal service. Wins: central TLS, metrics, middleware (auth/basic/forwardAuth), one place to change routes. |
| Cloudflare Access coverage | 6 of 16 public subdomains have no Access gate | Gate homeassistant.* (physical world), requests.* (Overseerr), audiobooks/books (minor) |
| AdGuard redundancy | Single DNS, boot-order-critical | Add a secondary nameserver in each LXC config (nameserver 1.1.1.1 after .75). Two-line fix per LXC; prevents the DNS-crisis-induced wedging we just fixed. |
| Tailscale | Finn offers exit node; MagicDNS Override now ON | Audit exit-node advertisement. If not intentional, disable (tailscale set --advertise-exit-node=false). Document where any tailnet admin keys are stored. |
| Monitoring | Monitors are bash+cron: good but opaque | Add simple Prometheus + Grafana on UDev for historical metrics (disk trending, load patterns, Frigate CPU). Current monitors are pass/fail only. |
| Internal service names | Everything is IP-addressed | AdGuard supports custom DNS rewrites: add plex.lan → 192.168.86.73 etc. for better scripts/docs (optional polish) |

4. Backup Posture

| Tier | Current | Gap | Proposal |
| --- | --- | --- | --- |
| Workspace NVMe | Nightly rsync to HDD (BROKEN for 32 days; see audit doc) | Same-machine only; no offsite | Fix cron path, add an rclone step to Backblaze B2 (/mnt/storage/workspace-backup/ → b2:justin-backup/workspace/). ~$0.005/GB/month. 2.7TB = ~$13/mo. |
| Media HDD | None | 14TB of irreplaceable photos (1991–2020) + audiobooks/books | Seedbox-tier: hot copy to second HDD or tape occasionally. Critical-tier (photos only, ~500GB): offsite to B2. ~$2.50/mo. |
| VM/LXC snapshots | None | Full CT/VM loss on Finn failure | Proxmox Backup Server (PBS) on /mnt/seagate or a new dedicated disk. Free, handles dedup, offsite-replicates. Fits in the 1.9T free on seagate. |
| Forge repo | Git origin | Up-to-date | |
| n8n encrypted DB | Inside CT 106 LVM | No separate export | PBS covers. Optional: n8n "Export workflows" to git-tracked JSON dump nightly (enables git diffing of workflow changes). |
| HA config | Inside VM 100 | No export | PBS covers + HA's native backup feature to /mnt/storage/ha-backups/. |
| Cloudflare tunnel configs | Cloudflare dashboard + tokens in systemd units | Lose all ingress rules if account compromised | Export tunnel YAML via cloudflared to a git-tracked file in forge (tokens redacted) as a reference doc. |
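
The B2 cost figures in this section follow from simple arithmetic, sketched below for checking or re-running with different sizes. The rate is the ~$0.005/GB/month quoted above (worth confirming against current Backblaze pricing before budgeting), sizes use decimal units (1 TB = 1000 GB), and the function name is illustrative.

```python
# Rate quoted in this section; an assumption to re-check against current pricing.
B2_PRICE_PER_GB_MONTH = 0.005  # USD


def monthly_cost_usd(size_gb: float, price_per_gb: float = B2_PRICE_PER_GB_MONTH) -> float:
    """Estimated monthly storage cost in USD, rounded to cents."""
    return round(size_gb * price_per_gb, 2)


workspace_cost = monthly_cost_usd(2700)  # 2.7 TB workspace backup
photos_cost = monthly_cost_usd(500)      # ~500 GB critical-tier photos

print(f"workspace: ${workspace_cost:.2f}/mo")
print(f"photos:    ${photos_cost:.2f}/mo")
```

This covers storage only; download and transaction charges, if any apply, are not modeled.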

5. Task Queue / Dispatcher Architecture

Current state: the dispatcher has been in a failed state since Apr 14, with 2482 tasks backlogged (2177 of them false-positive security alerts from the port 8100 filter bug). Before restarting:

| Action | Why |
| --- | --- |
| Patch scripts/monitors/security-check.sh:54: add :8100 to KNOWN_PORTS_FILTER | Stop the false-positive firehose |
| Delete all infra-alert-*.json and security-alert-*.json in tasks/pending/ | Clean slate |
| Verify task-creator dedup logic | Prevent a future storm: suppress an alert if an identical alert was already queued in the last N min |
| systemctl enable --now forge-dispatcher | Restart dispatcher |
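
The dedup rule ("suppress an alert if an identical one was queued in the last N min") can be sketched as follows. This is an illustration of the idea, not the task-creator's actual code: the class name, the choice of fingerprinting on source plus message, and the in-memory store are all assumptions.

```python
import hashlib
from typing import Dict


class AlertDeduper:
    """Suppress alerts identical to one queued within a sliding time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps alert fingerprint -> time the alert was last actually queued.
        self._last_queued: Dict[str, float] = {}

    @staticmethod
    def fingerprint(source: str, message: str) -> str:
        """Stable identity for an alert; two alerts match if both fields match."""
        return hashlib.sha256(f"{source}\n{message}".encode("utf-8")).hexdigest()

    def should_queue(self, source: str, message: str, now: float) -> bool:
        """Return True (and record the alert) unless a duplicate is still fresh."""
        key = self.fingerprint(source, message)
        last = self._last_queued.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: suppress
        self._last_queued[key] = now
        return True


deduper = AlertDeduper(window_seconds=600)  # N = 10 min, an example value
print(deduper.should_queue("security-check", "unexpected port :8100", now=0))
print(deduper.should_queue("security-check", "unexpected port :8100", now=300))
```

Suppressed duplicates do not refresh the timestamp, so a condition that persists re-alerts once per window rather than going silent forever. A real task-creator would need to persist this state between cron runs, for example by checking the files already in tasks/pending/.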

Longer term:

  • The file-polling dispatcher pattern is robust but chatty. Consider migrating to the Agent SDK programmatic pattern documented in memory/general/ai-fleet-findings.md for the main dispatcher (keep the file bus as an audit trail).
  • Add a "dispatcher heartbeat" monitor: if the dispatcher hasn't logged in 5 min, notify. (Same monitor style already used for other services.)
  • Consider a task TTL: any task older than 24h in pending gets auto-failed and archived.
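
The heartbeat and TTL ideas are small enough to sketch together. The code below is a hedged illustration: the function names, the use of file modification time as task age, and the archive directory are assumptions, and marking a task as failed (as opposed to just moving it) is left out.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def expire_stale_tasks(
    pending: Path, archive: Path, ttl_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Move *.json tasks whose mtime is older than the TTL into `archive`."""
    now = time.time() if now is None else now
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for task in sorted(pending.glob("*.json")):
        if now - task.stat().st_mtime > ttl_seconds:
            shutil.move(str(task), str(archive / task.name))
            moved.append(task.name)
    return moved


def heartbeat_is_stale(
    last_log_time: float, max_silence_seconds: float = 300, now: Optional[float] = None
) -> bool:
    """True if the dispatcher has been silent longer than the allowed gap (5 min default)."""
    now = time.time() if now is None else now
    return now - last_log_time > max_silence_seconds
```

A cron job could call `expire_stale_tasks(Path("tasks/pending"), Path("tasks/archive"), 24 * 3600)`, where the archive path is hypothetical, and feed the dispatcher log's last-modified time into `heartbeat_is_stale` to decide whether to notify.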


6. Organizational / Operational

| Area | Observation | Proposal |
| --- | --- | --- |
| Docs drift | docs/fleet-docs/ claims the dispatcher is running; the .service has been failed since Apr 14 | Add /fleet-status to dashboard cron (already partially done) and treat doc updates as part of checkpointing discipline |
| Immich autostart missing | CT 107 has no onboot: 1 | Add it if Immich is a keeper; currently photos are only in CT memory until restart |
| Decommissioned CT 104 | Still in Proxmox; tunnel still has invoiceninja.* hostname | Destroy CT, delete hostname from tunnel |
| MEMORY.md discipline | New tools exist in .claude/skills/ but MEMORY.md hasn't been updated since the Forge platform block | Whenever .claude/skills/ gains a new dir, a MEMORY.md line is required. Enforce via a pre-commit hook or a Claude Code hook. |
| Session checkpoints | 5th-prompt hook enforces it but drift still happens | Already solved by existing hook, no change needed |
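
The MEMORY.md rule lends itself to an automated check. The sketch below lists skill directories that MEMORY.md never mentions. It assumes a plain substring match on the directory name is a good enough definition of "has a MEMORY.md line", and the function name is hypothetical.

```python
from pathlib import Path
from typing import List


def undocumented_skills(skills_dir: Path, memory_file: Path) -> List[str]:
    """Return names of skill directories that do not appear anywhere in MEMORY.md."""
    if not skills_dir.is_dir():
        return []
    memory_text = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    return sorted(
        entry.name
        for entry in skills_dir.iterdir()
        if entry.is_dir() and entry.name not in memory_text
    )
```

A pre-commit hook could call `undocumented_skills(Path(".claude/skills"), Path("MEMORY.md"))` and exit non-zero when the list is non-empty, printing the missing names so the commit author knows which lines to add.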

7. Upgrade Order (Impact × Effort)

Tier 1: Do this week (high impact, ≤2 hr each):

| # | Item | Effort | Cost |
| --- | --- | --- | --- |
| 1 | Fix workspace backup cron path | 2 min | $0 |
| 2 | Fix port 8100 false positive in security monitor | 1 min | $0 |
| 3 | Purge task backlog + restart dispatcher | 5 min | $0 |
| 4 | Add 1.1.1.1 fallback DNS to every LXC | 10 min | $0 |
| 5 | Add Cloudflare Access to homeassistant.* | 5 min | $0 |
| 6 | Investigate & document /mnt/seagate role | 15 min | $0 |
| 7 | Set up Proxmox Backup Server on /mnt/seagate | 2 hr | $0 |

Tier 2: Do this month (medium impact, hardware involved):

| # | Item | Effort | Cost |
| --- | --- | --- | --- |
| 8 | Upgrade Finn to 96GB RAM (2×48GB Crucial SO-DIMM DDR5) | 15 min + downtime | ~$170 |
| 9 | Add UPS on Finn + network gear | 30 min | ~$200 |
| 10 | Add Coral TPU (PCIe or USB) to Frigate | 30 min | ~$30–60 |
| 11 | Offsite backup pipeline: rclone → Backblaze B2 | 1 hr | $10–15/mo |
| 12 | Immich autostart + backup to /mnt/storage/ | 30 min | $0 |

Tier 3: Plan for quarter (larger architectural):

| # | Item | Effort | Cost |
| --- | --- | --- | --- |
| 13 | Caddy/Traefik reverse proxy on UDev + tunnel consolidation | 4 hr | $0 |
| 14 | Prometheus + Grafana on UDev | 3 hr | $0 |
| 15 | Dispatcher migration to Agent SDK | 8 hr | $0 |
| 16 | Second MS-01 for Proxmox cluster HA | 1 day | ~$600–900 |
| 17 | Pin all Docker images to specific tags / SHA digests | 2 hr | $0 |
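
For item 17, a first pass is to find which image references are not yet pinned. The sketch below classifies a reference string as digest-pinned, tag-pinned, or floating. It is a simplified reading of image reference syntax (a colon before the last slash is treated as a registry port), the function name is hypothetical, and the image names are illustrative rather than taken from the fleet.

```python
def pin_status(image_ref: str) -> str:
    """Classify an image reference as 'digest', 'tag', or 'floating'."""
    if "@sha256:" in image_ref:
        return "digest"  # immutable: pinned to exact content
    # Only a colon in the final path segment is a tag separator;
    # an earlier colon (e.g. localhost:5000/app) is a registry port.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        tag = last_segment.split(":", 1)[1]
        return "floating" if tag == "latest" else "tag"
    return "floating"  # no tag means the registry default, i.e. latest


for ref in ("example/app", "example/app:latest", "example/app:1.4.2"):
    print(ref, "->", pin_status(ref))
```

Tag-pinned images can still change if a publisher re-pushes the tag, so the "digest" class is the only one that guarantees reproducible pulls.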

8. Nothing to Change: Working Well

  • Three-tier storage model (Drive / Workspace NVMe / Media HDD) is clear, documented, correctly provisioned by use case
  • Gluetun VPN killswitch for Arr stack, verified working, no IP leaks
  • NFS exports locked to UDev IP, mounts persistent via _netdev fstab
  • Cloudflare tunnel (no open ports) architecture is exactly right
  • Monitor-only-when-broken task creation pattern, zero tokens on healthy days
  • Google Drive via rclone FUSE, smart use of cloud for shareable docs
  • iGPU passthrough for Plex, max perf, no separate GPU needed
  • Forge repo = shared brain, single source of truth for all agents, well-instrumented

[Claude Code], sec-audit (Opus47) session