Architecture Optimization Proposals, 2026-04-21
Auditor: Claude Opus 4.7 (sec-audit session)
Companion doc: security-audit-2026-04-21.md
Fleet snapshot: Finn load 1.18, RAM 20G/31G used, workspace 45%, media 66%. UDev 72% root, 2.6G/11G RAM. All services up.
TL;DR, 5 Biggest Wins
| Rank | Proposal | Effort | Impact |
|---|---|---|---|
| 1 | Upgrade Finn to 96GB RAM (2×48GB; +64GB net) | 15 min + $170 | Unclogs everything: no more 78% RAM pressure, room for Coral/PBS/new CTs |
| 2 | Offsite backup pipeline (/mnt/storage/workspace-backup → Backblaze B2) | 1 hr setup, $10/mo | Removes the "Finn dies = total loss" SPOF |
| 3 | Coral TPU for Frigate (M.2 or USB) | 30 min + $30–60 | Cuts Frigate CPU ~80%; frees cores for Plex transcoding; ready for 3-cam expansion |
| 4 | UPS on Finn (CyberPower 1500VA or APC equivalent) | 10 min + $200 | Kills brownout reboots; protects NVMe + HDD from corruption |
| 5 | Reverse-proxy consolidation (Caddy or Traefik on UDev) | 2 hr | Cleaner tunnel→service routing, metrics, one-place TLS/access config; prepares for Cloudflare Access → internal auth migration |
1. Single Points of Failure
| SPOF | Current Exposure | Fix |
|---|---|---|
| Finn itself | Hardware failure = every service down: Plex, HA, n8n, cameras, all VMs | Long-term: cluster with a second mini PC (MS-01 variant) for HA migrations. Short-term: Proxmox Backup Server to external store. |
| Workspace NVMe | Single NVMe on Finn holds 2.7TB of active work + Frigate recordings. If it dies, 30 days of cameras + active video projects are gone. | Offsite backup pipeline (TL;DR #2; details in section 4). Consider a second NVMe slot on Finn as a ZFS mirror long-term. |
| Google Fiber / home network | Single ISP; entire fleet unreachable externally during outages | Acceptable. Tailscale provides LAN-like recovery if cell-tethered. |
| AdGuard (CT 105) | Single DNS for the fleet; if it dies or gets wedged, lookups fail fleet-wide (just happened via the Tailscale MagicDNS issue) | Add 8.8.8.8 or 1.1.1.1 as secondary nameserver in each LXC config so fallback works. Currently most containers have ONLY .75. |
| n8n (CT 106) | Sole event-trigger layer for the whole AI fleet | Redundancy not worth the cost; ensure the n8n DB is backed up nightly (it lives in the container's LVM volume, so PBS covers it once PBS is deployed) |
| Cloudflare Tunnel (UDev) | If UDev cloudflared dies, justinsforge.com + dashboard.justinsforge.com go down | Acceptable; the tunnel auto-reconnects and a systemd restart policy is set. |
| Finn itself as NFS server | UDev I/O fully depends on Finn's NFS exports | Acceptable; they're on the same physical box. If Finn is down, UDev is down. |
2. Resource Waste / Contention
Live reading at audit time:
| Resource | Current | Issue | Proposal |
|---|---|---|---|
| Finn RAM | 20.6G used / 31G total, 24.8G allocated across CTs+VMs = over-provisioned by 1.5GB | Adding CT 107 (Immich 4G) + CT 108 (Frigate 4G) pushed allocation >31G. Swap active. | Upgrade to 96GB DDR5 SO-DIMM (2×48GB Crucial). MS-01 supports up to 128G. ~$170. Single biggest quality-of-life improvement. |
| Frigate CPU | ~75% per doc from CPU-based detection | No accelerator; burns 3–4 cores continuously | Coral TPU (Google Coral Mini PCIe Accelerator or USB). MS-01 has M.2 slots. Drops Frigate to ~10% CPU. Also future-proofs for 3 cameras. |
| media-server RAM | 2048 MB for 5+ containers → 87% routine | Tight but stable per design | Bump to 3072 MB after Finn RAM upgrade |
| UDev root disk | 72% of 24G used | Nothing heavy lives here, but tasks/pending at 2500 JSON files adds up | Purge task backlog (security-audit doc step 7). Raise UDev root to 40G via Proxmox resize if needed. |
| UDev RAM | 2.6G / 11G used | Plenty of headroom | Consider shrinking to 8G and giving those 3GB to Frigate or media-server, after Finn RAM upgrade |
| Plex LXC RAM | 5120 MB allocated | Generous; is that peak usage? | Measure via pct exec 101 -- free; likely reducible to 3GB |
| iGPU contention | Plex + Frigate both passthrough to /dev/dri/renderD128 | Concurrent Plex transcodes + Frigate detection may throttle each other | Coral TPU removes Frigate from the iGPU path entirely, which fixes this. |
| /mnt/seagate | 7.3T drive, 75% used, undocumented | Unknown content: either wasted space or unmanaged risk | Investigate: what's on it? Staging? Old backup? Either promote to official role (e.g., PBS datastore) or wipe and reclaim. 1.9T free right now. |
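The allocation arithmetic behind the Finn RAM row can be re-checked with a small helper. This is a minimal sketch, not the audit's actual tooling: the function name `ram_headroom_mb` is made up for illustration, and the example dictionary lists only the guest sizes quoted in this document (Plex 5120 MB, media-server 2048 MB, Immich 4G, Frigate 4G), not the full set of CTs and VMs.

```python
from typing import Mapping


def ram_headroom_mb(host_total_mb: int, allocations_mb: Mapping[str, int]) -> int:
    """Host RAM minus the sum of guest allocations.

    A negative result means the guests are over-provisioned relative to the host.
    """
    return host_total_mb - sum(allocations_mb.values())


# Illustrative subset only (hypothetical keys); the real list should come
# from the Proxmox CT/VM configs.
example_allocations = {
    "plex": 5120,
    "media-server": 2048,
    "immich": 4096,
    "frigate": 4096,
}
```

Feeding it the complete guest list before and after the RAM upgrade shows how much room is left for PBS or new CTs.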
3. Network Topology Improvements
| Area | Current | Proposal |
|---|---|---|
| Reverse proxy | Cloudflare tunnel routes each subdomain directly to a port on media-server/UDev. No internal proxy. | Run Caddy or Traefik on UDev. Cloudflare tunnel routes all *.justinkrystal.com → Caddy → internal service. Wins: central TLS, metrics, middleware (auth/basic/forwardAuth), one place to change routes. |
| Cloudflare Access coverage | 6 of 16 public subdomains have no Access gate | Gate homeassistant.* (physical world), requests.* (Overseerr), audiobooks/books (minor) |
| AdGuard redundancy | Single DNS, boot-order-critical | Add a secondary nameserver in each LXC config (nameserver 1.1.1.1 after .75). Two-line fix per LXC; prevents the DNS-crisis-induced wedging we just fixed. |
| Tailscale | Finn offers exit node; MagicDNS Override now ON | Audit exit-node advertisement. If not intentional, disable (tailscale set --advertise-exit-node=false). Document where any tailnet admin keys are stored. |
| Monitoring | Monitors are bash+cron: good but opaque | Add simple Prometheus + Grafana on UDev for historical metrics (disk trending, load patterns, Frigate CPU). Current monitors are pass/fail only. |
| Internal service names | Everything is IP-addressed | AdGuard supports custom DNS rewrites; add plex.lan → 192.168.86.73 etc. for better scripts/docs (optional polish) |
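The fallback-nameserver fix in the AdGuard redundancy row can be sketched as an idempotent text transform over a resolv.conf-style file. The function name `with_fallback_nameserver` is hypothetical, and the sketch only edits text. Proxmox may regenerate a container's /etc/resolv.conf from the CT's own DNS settings, so the persistent change likely belongs there (for example via the `nameserver` option of the CT config); treat that as an assumption to verify.

```python
def with_fallback_nameserver(resolv_conf: str, fallback: str = "1.1.1.1") -> str:
    """Return resolv.conf text with `fallback` added after the last nameserver line.

    The input is returned unchanged if the fallback is already listed.
    """
    lines = resolv_conf.splitlines()
    ns_indexes = [i for i, line in enumerate(lines) if line.split()[:1] == ["nameserver"]]
    servers = [lines[i].split()[1] for i in ns_indexes if len(lines[i].split()) > 1]
    if fallback in servers:
        return resolv_conf
    # Insert right after the last existing nameserver, or at the end if there is none.
    insert_at = (ns_indexes[-1] + 1) if ns_indexes else len(lines)
    lines.insert(insert_at, f"nameserver {fallback}")
    return "\n".join(lines) + "\n"
```

Because resolvers try nameservers in order, placing the fallback after the AdGuard entry keeps AdGuard as the primary and only uses 1.1.1.1 when it stops answering.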
4. Backup Posture
| Tier | Current | Gap | Proposal |
|---|---|---|---|
| Workspace NVMe | Nightly rsync to HDD (BROKEN for 32 days; see audit doc) | Same-machine only; no offsite | Fix cron path, add rclone step to Backblaze B2 (/mnt/storage/workspace-backup/ → b2:justin-backup/workspace/). ~$0.005/GB/month. 2.7TB = ~$13/mo. |
| Media HDD | None | 14TB of irreplaceable photos (1991–2020) + audiobooks/books | Seedbox-tier: hot copy to second HDD or tape occasionally. Critical-tier (photos only ~500GB): offsite to B2. ~$2.50/mo. |
| VM/LXC snapshots | None | Full CT/VM loss on Finn failure | Proxmox Backup Server (PBS) on /mnt/seagate or a new dedicated disk. Free, handles dedup, offsite-replicates. Fits on 1.9T free in seagate. |
| Forge repo | Git origin | Up-to-date | ✓ |
| n8n encrypted DB | Inside CT 106 LVM | No separate export | PBS covers. Optional: n8n "Export workflows" to git-tracked JSON dump nightly (enables git diffing of workflow changes). |
| HA config | Inside VM 100 | No export | PBS covers + HA's native backup feature to /mnt/storage/ha-backups/. |
| Cloudflare tunnel configs | Cloudflare dashboard + tokens in systemd units | Lose all ingress rules if account compromised | Export tunnel YAML via cloudflared to git-tracked file in forge (redacted tokens) as reference doc. |
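The B2 cost figures in the table follow from the quoted ~$0.005/GB/month rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and ignoring egress and transaction fees; the function name is made up for illustration, and the rate is the document's figure, not a checked current price.

```python
def monthly_storage_cost_usd(size_gb: float, usd_per_gb_month: float = 0.005) -> float:
    """Flat storage cost per month, rounded to cents."""
    return round(size_gb * usd_per_gb_month, 2)


workspace_cost = monthly_storage_cost_usd(2700)  # 2.7TB workspace backup
photos_cost = monthly_storage_cost_usd(500)      # ~500GB critical-tier photos
```

With these inputs the workspace comes to 13.5 and the photos to 2.5 dollars per month, matching the ~$13/mo and ~$2.50/mo quoted above.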
5. Task Queue / Dispatcher Architecture
Current state: the dispatcher has been in a failed state since Apr 14, with 2482 backlogged tasks (2177 of them false-positive security alerts from the port 8100 filter bug). Before restarting:
| Action | Why |
|---|---|
| Patch scripts/monitors/security-check.sh:54, add :8100 to KNOWN_PORTS_FILTER | Stop the false-positive firehose |
| Delete all infra-alert-*.json and security-alert-*.json in tasks/pending/ | Clean slate |
| Verify task-creator dedup logic | Prevent future storm: suppress alert if identical alert already queued in last N min |
| systemctl enable --now forge-dispatcher | Restart dispatcher |
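The dedup rule in the table (suppress an alert if an identical one was queued in the last N minutes) could look roughly like the sketch below. This is not the existing task-creator logic, which this audit has not inspected; `AlertDeduper`, the 10-minute default window, and the idea of a string alert key are all assumptions made for illustration.

```python
import time
from typing import Dict, Optional


class AlertDeduper:
    """Remember when each alert key was last enqueued and suppress repeats."""

    def __init__(self, window_s: float = 600.0) -> None:
        self.window_s = window_s
        self._last_enqueued: Dict[str, float] = {}

    def should_enqueue(self, alert_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_enqueued.get(alert_key)
        if last is not None and now - last < self.window_s:
            return False  # identical alert queued recently: suppress
        self._last_enqueued[alert_key] = now
        return True
```

Suppressed alerts do not refresh the timestamp, so a condition that stays broken re-alerts once per window instead of going silent forever.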
Longer term:
- The file-polling dispatcher pattern is robust but chatty. Consider migrating to the Agent SDK programmatic pattern documented in memory/general/ai-fleet-findings.md for the main dispatcher (keep file bus as audit trail).
- Add a "dispatcher heartbeat" monitor: if the dispatcher hasn't logged in 5 min, notify. (Same monitor style already used for other services.)
- Consider a task TTL: any task older than 24h in pending gets auto-failed and archived.
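The heartbeat and TTL ideas above can be sketched as two small checks. Both function names are hypothetical, the file's modification time is used as a stand-in for task age (the real task JSON may carry its own timestamp), and the sketch only reports candidates; failing and archiving them is left to the dispatcher.

```python
import os
import time
from typing import List, Optional


def expired_tasks(pending_dir: str, ttl_s: float = 24 * 3600,
                  now: Optional[float] = None) -> List[str]:
    """Return paths of *.json files in pending_dir whose mtime is older than ttl_s."""
    now = time.time() if now is None else now
    expired = []
    with os.scandir(pending_dir) as entries:
        for entry in entries:
            if (entry.is_file() and entry.name.endswith(".json")
                    and now - entry.stat().st_mtime > ttl_s):
                expired.append(entry.path)
    return sorted(expired)


def heartbeat_stale(last_log_ts: float, max_silence_s: float = 300,
                    now: Optional[float] = None) -> bool:
    """True if the dispatcher's last log entry is older than max_silence_s."""
    now = time.time() if now is None else now
    return now - last_log_ts > max_silence_s
```

Run from cron in the same style as the existing monitors, `heartbeat_stale` would have flagged the Apr 14 failure within minutes rather than after a multi-thousand-task backlog.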
6. Organizational / Operational
| Area | Observation | Proposal |
|---|---|---|
| Docs drift | docs/fleet-docs/ claim the dispatcher is running; the .service has been failed since Apr 14 | Add /fleet-status to dashboard cron (already partially done) and treat doc updates as part of checkpointing discipline |
| Immich autostart missing | CT 107 has no onboot: 1 | Add it if Immich is a keeper; currently photos are only in CT memory until restart |
| Decommissioned CT 104 | Still in Proxmox; tunnel still has invoiceninja.* hostname | Destroy CT, delete hostname from tunnel |
| MEMORY.md discipline | New tools exist in .claude/skills/ but MEMORY.md hasn't been updated since the Forge platform block | Whenever .claude/skills/ gains a new dir, a MEMORY.md line is required. Enforce via a pre-commit hook or a Claude Code hook. |
| Session checkpoints | 5th-prompt hook enforces it but drift still happens | Already solved by existing hook, no change needed |
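The MEMORY.md rule in the table could be enforced by a check along these lines, callable from a pre-commit hook. A minimal sketch under stated assumptions: `missing_skill_entries` is a made-up name, and "mentioned" is interpreted as a plain substring match on the directory name, which is crude and may need tightening.

```python
import os
from typing import List


def missing_skill_entries(skills_dir: str, memory_md_path: str) -> List[str]:
    """Return skill directory names that never appear in the MEMORY.md text."""
    with open(memory_md_path, encoding="utf-8") as f:
        memory_text = f.read()
    with os.scandir(skills_dir) as entries:
        skill_names = sorted(e.name for e in entries if e.is_dir())
    return [name for name in skill_names if name not in memory_text]
```

A hook would call it with `.claude/skills` and `MEMORY.md` and exit non-zero when the returned list is non-empty, blocking the commit until the missing line is added.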
7. Upgrade Order (Impact × Effort)
Tier 1: Do this week (high impact, <2 hr each):
| # | Item | Effort | Cost |
|---|---|---|---|
| 1 | Fix workspace backup cron path | 2 min | $0 |
| 2 | Fix port 8100 false positive in security monitor | 1 min | $0 |
| 3 | Purge task backlog + restart dispatcher | 5 min | $0 |
| 4 | Add 1.1.1.1 fallback DNS to every LXC | 10 min | $0 |
| 5 | Add Cloudflare Access to homeassistant.* | 5 min | $0 |
| 6 | Investigate & document /mnt/seagate role | 15 min | $0 |
| 7 | Set up Proxmox Backup Server on /mnt/seagate | 2 hr | $0 |
Tier 2: Do this month (medium impact, hardware involved):
| # | Item | Effort | Cost |
|---|---|---|---|
| 8 | Upgrade Finn to 96GB RAM (2×48GB Crucial SO-DIMM DDR5) | 15 min + downtime | ~$170 |
| 9 | Add UPS on Finn + network gear | 30 min | ~$200 |
| 10 | Add Coral TPU (PCIe or USB) to Frigate | 30 min | ~$30–60 |
| 11 | Offsite backup pipeline: rclone → Backblaze B2 | 1 hr | $10–15/mo |
| 12 | Immich autostart + backup to /mnt/storage/ | 30 min | $0 |
Tier 3: Plan for quarter (larger architectural):
| # | Item | Effort | Cost |
|---|---|---|---|
| 13 | Caddy/Traefik reverse proxy on UDev + tunnel consolidation | 4 hr | $0 |
| 14 | Prometheus + Grafana on UDev | 3 hr | $0 |
| 15 | Dispatcher migration to Agent SDK | 8 hr | $0 |
| 16 | Second MS-01 for Proxmox cluster HA | 1 day | ~$600–900 |
| 17 | Pin all Docker images to specific tags / SHA digests | 2 hr | $0 |
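For item 17, a first pass at finding unpinned images could scan compose files for `image:` lines that have no tag, use `latest`, or lack a digest. This is a rough sketch, not a YAML parser: `unpinned_images` is a hypothetical helper, it only looks at single-line `image:` entries, and the image names used to exercise it are invented.

```python
import re
from typing import List

# Matches lines such as `  image: repo/name:tag`, optionally quoted.
_IMAGE_LINE = re.compile(r"^\s*image:\s*[\"']?([^\s\"'#]+)")


def unpinned_images(compose_text: str) -> List[str]:
    """Return image references with no tag, a `latest` tag, and no sha256 digest."""
    unpinned = []
    for line in compose_text.splitlines():
        match = _IMAGE_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if "@sha256:" in ref:
            continue  # pinned by digest
        # Only the last path segment can carry a tag; a colon earlier in the
        # reference is a registry port (e.g. localhost:5000/app).
        last_segment = ref.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            unpinned.append(ref)
    return unpinned
```

Running it over each stack's compose file gives a worklist for the pinning pass; digest pinning is stricter than tag pinning, since tags can be re-pointed upstream.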
8. Nothing to Change: Working Well
- Three-tier storage model (Drive / Workspace NVMe / Media HDD) is clear, documented, correctly provisioned by use case
- Gluetun VPN killswitch for Arr stack, verified working, no IP leaks
- NFS exports locked to UDev IP, mounts persistent via _netdev fstab
- Cloudflare tunnel (no open ports) architecture is exactly right
- Monitor-only-when-broken task creation pattern, zero tokens on healthy days
- Google Drive via rclone FUSE, smart use of cloud for shareable docs
- iGPU passthrough for Plex, max perf, no separate GPU needed
- Forge repo = shared brain, single source of truth for all agents, well-instrumented
[Claude Code], sec-audit (Opus47) session