Architecture Optimization Proposals, 2026-04-21

Auditor: Claude Opus 4.7 (sec-audit session)
Companion doc: security-audit-2026-04-21.md
Fleet snapshot: Finn load 1.18, RAM 20G/31G used, workspace 45%, media 66%. UDev 72% root, 2.6G/11G RAM. All services up.

TL;DR, 5 Biggest Wins

Rank	Proposal	Effort	Impact
1	Upgrade Finn to 96GB RAM (2×48GB, replacing the current 32GB)	15 min + $170	Unclogs everything: relieves RAM pressure (swap is active today), room for Coral/PBS/new CTs
2	Offsite backup pipeline (`/mnt/storage/workspace-backup` → Backblaze B2)	1 hr setup, $10–15/mo	Removes the "Finn dies = total loss" SPOF
3	Coral TPU for Frigate (M.2 or USB)	30 min + $30–60	Cuts Frigate CPU ~80%; frees cores for Plex transcoding, ready for 3-cam expansion
4	UPS on Finn (CyberPower 1500VA or APC equivalent)	10 min + $200	Kills brownout reboots; protects NVMe + HDD from corruption
5	Reverse-proxy consolidation (Caddy or Traefik on UDev)	2 hr	Cleaner tunnel→service routing, metrics, one-place TLS/access config, prepares for Cloudflare Access → internal auth migration

1. Single Points of Failure

SPOF	Current Exposure	Fix
Finn itself	Hardware failure = every service down: Plex, HA, n8n, cameras, all VMs	Long-term: cluster with a second mini PC (MS-01 variant) for HA migrations. Short-term: Proxmox Backup Server to external store.
Workspace NVMe	Single NVMe on Finn holds 2.7TB of active work + Frigate recordings. If it dies, 30 days of cameras + active video projects are gone.	Offsite backup pipeline (TL;DR #2; details in section 4). Long-term, consider a second NVMe in a free slot on Finn as a ZFS mirror.
Google Fiber / home network	Single ISP; entire fleet unreachable externally during outages	Acceptable. Tailscale provides LAN-like recovery if cell-tethered.
AdGuard (CT 105)	Single DNS for the fleet; if it dies or gets wedged, lookups fail fleet-wide (just happened via a Tailscale MagicDNS issue)	Add `8.8.8.8` or `1.1.1.1` as a secondary nameserver in each LXC config so fallback works. Currently most containers have ONLY `.75`.
n8n (CT 106)	Sole event-trigger layer for the whole AI fleet	Redundancy not worth the cost; ensure the n8n DB is backed up nightly (it lives in the container's LVM volume, so PBS will cover it once deployed)
Cloudflare Tunnel (UDev)	If UDev cloudflared dies, `justinsforge.com` + `dashboard.justinsforge.com` go down	Acceptable, tunnel auto-reconnects. systemd restart is set.
Finn itself as NFS server	UDev I/O fully depends on Finn's NFS exports	Acceptable; they're on the same physical box. If Finn is down, UDev is down.

2. Resource Waste / Contention

Live reading at audit time:

Resource	Current	Issue	Proposal
Finn RAM	20.6G used / 31G total, 24.8G allocated across CTs+VMs = over-provisioned by 1.5GB	Adding CT 107 (Immich 4G) + CT 108 (Frigate 4G) pushed allocation >31G. Swap active.	Upgrade to 96GB DDR5 SO-DIMM (2×48GB Crucial). MS-01 supports up to 128G. ~$170. Single biggest quality-of-life improvement.
Frigate CPU	~75% per doc from CPU-based detection	No accelerator; burns 3–4 cores continuously	Coral TPU (Google Coral M.2 Accelerator or USB Accelerator). MS-01 has M.2 slots. Drops Frigate to ~10% CPU. Also future-proofs for 3 cameras.
media-server RAM	2048 MB for 5+ containers → 87% routine	Tight but stable per design	Bump to 3072 MB after Finn RAM upgrade
UDev root disk	72% of 24G used	Nothing heavy lives here but tasks/pending at 2500 JSON files adds up	Purge task backlog (security-audit doc step 7). Raise UDev root to 40G via Proxmox resize if needed.
UDev RAM	2.6G / 11G used	Plenty of headroom	Consider shrinking to 8G and giving those 3GB to Frigate or media-server, after Finn RAM upgrade
Plex LXC RAM	5120 MB allocated	Generous; is that peak usage?	Measure via `pct exec 101 -- free`; likely reducible to 3GB
iGPU contention	Plex + Frigate both passthrough to `/dev/dri/renderD128`	Concurrent Plex transcodes + Frigate detection may throttle each other	Coral TPU removes Frigate from iGPU path entirely, fixes this.
/mnt/seagate	7.3T drive, 75% used, undocumented	Unknown content: either wasted space or an unmanaged risk	Investigate: what's on it? Staging? Old backup? Either promote it to an official role (e.g., PBS datastore) or wipe and reclaim. 1.9T free right now.
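
The allocation figure in the Finn RAM row can be re-derived from the guest configs instead of being tracked by hand. The following is a minimal sketch, assuming each guest config prints a `memory: <MB>` line (as `pct config` and `qm config` normally do); the helper name `sum_memory_mb` is hypothetical.

```shell
# Sum the "memory: <MB>" lines found in Proxmox guest configs read on stdin.
# Hypothetical helper: guests without an explicit memory line are not counted.
sum_memory_mb() {
  awk -F': *' '$1 == "memory" { total += $2 } END { print total + 0 }'
}
```

One possible use on the Proxmox host is `for id in $(pct list | awk 'NR>1 {print $1}'); do pct config "$id"; done | sum_memory_mb`, with the same loop over `qm list` / `qm config` for the VMs. Comparing the result against physical RAM shows how far the host is over-committed.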

3. Network Topology Improvements

Area	Current	Proposal
Reverse proxy	Cloudflare tunnel routes each subdomain directly to port on media-server/UDev. No internal proxy.	Run Caddy or Traefik on UDev. Cloudflare tunnel routes all `*.justinkrystal.com` → Caddy → internal service. Wins: central TLS, metrics, middleware (auth/basic/forwardAuth), one place to change routes.
Cloudflare Access coverage	6 of 16 public subdomains have no Access gate	Gate `homeassistant.` (physical world), `requests.` (Overseerr), audiobooks/books (minor)
AdGuard redundancy	Single DNS, boot-order-critical	Add secondary nameserver in each LXC config (`nameserver 1.1.1.1` after `.75`). Two-line fix per LXC, prevents the DNS-crisis-induced wedging we just fixed.
Tailscale	Finn offers exit node; MagicDNS Override now ON	Audit exit-node advertisement. If not intentional, disable (`tailscale set --advertise-exit-node=false`). Document where any tailnet admin keys are stored.
Monitoring	Monitors are bash+cron, good but opaque	Add simple Prometheus + Grafana on UDev for historical metrics (disk trending, load patterns, Frigate CPU). Current monitors are pass/fail only.
Internal service names	Everything is IP-addressed	AdGuard supports custom DNS rewrites, add `plex.lan → 192.168.86.73` etc. for better scripts/docs (optional polish)
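
The fallback-nameserver change from the AdGuard redundancy row could be scripted roughly as follows. This is a sketch under assumptions: `.75` is taken to mean `192.168.86.75` (only Plex's `192.168.86.73` is spelled out in this document), and `pct set --nameserver` is given both resolvers as one space-separated string. The function only prints the command so it can be reviewed before anything changes.

```shell
# Print (not run) the pct command that gives one container a primary and a
# fallback resolver. 192.168.86.75 as the AdGuard address is an assumption.
PRIMARY_DNS="${PRIMARY_DNS:-192.168.86.75}"
FALLBACK_DNS="${FALLBACK_DNS:-1.1.1.1}"

dns_fallback_cmd() {
  local ctid="$1"
  printf 'pct set %s --nameserver "%s %s"\n' "$ctid" "$PRIMARY_DNS" "$FALLBACK_DNS"
}
```

For example, `pct list | awk 'NR>1 {print $1}' | while read -r id; do dns_fallback_cmd "$id"; done` prints one line per container; after review, the output can be piped to `sh` on the Proxmox host. Resolvers are tried in listed order, so AdGuard stays primary and `1.1.1.1` is only consulted when it does not answer.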

4. Backup Posture

Tier	Current	Gap	Proposal
Workspace NVMe	Nightly rsync to HDD (BROKEN for 32 days; see audit doc)	Same-machine only; no offsite	Fix cron path, add rclone step to Backblaze B2 (`/mnt/storage/workspace-backup/` → `b2:justin-backup/workspace/`). ~$0.005/GB/month. 2.7TB = ~$13/mo.
Media HDD	None	14TB of irreplaceable photos (1991–2020) + audiobooks/books	Seedbox-tier: hot copy to second HDD or tape occasionally. Critical-tier (photos only ~500GB): offsite to B2. ~$2.50/mo.
VM/LXC snapshots	None	Full CT/VM loss on Finn failure	Proxmox Backup Server (PBS) on `/mnt/seagate` or a new dedicated disk. Free, handles dedup, offsite-replicates. Fits on 1.9T free in seagate.
Forge repo	Git origin	Up-to-date	✓
n8n encrypted DB	Inside CT 106 LVM	No separate export	PBS covers. Optional: n8n "Export workflows" to git-tracked JSON dump nightly (enables git diffing of workflow changes).
HA config	Inside VM 100	No export	PBS covers + HA's native backup feature to `/mnt/storage/ha-backups/`.
Cloudflare tunnel configs	Cloudflare dashboard + tokens in systemd units	Lose all ingress rules if account compromised	Export tunnel YAML via `cloudflared` to git-tracked file in forge (redacted tokens) as reference doc.
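
The cost arithmetic and the rclone step in the Workspace NVMe row can be sketched as below. The rate is the $0.005/GB/month figure quoted in the table (current B2 pricing should be checked), the `b2:` remote is assumed to already exist in the rclone config, and both function names are hypothetical.

```shell
# Estimate monthly B2 storage cost in USD from a size in GB.
# Rate defaults to the $0.005/GB/month figure quoted in the table.
b2_monthly_cost() {
  local size_gb="$1"
  local rate="${2:-0.005}"
  awk -v gb="$size_gb" -v rate="$rate" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# Mirror the local backup directory to B2. Defaults to a dry run; pass
# "apply" as the first argument to actually transfer data.
offsite_sync() {
  local mode="${1:-dry-run}"
  local src="/mnt/storage/workspace-backup/"
  local dst="b2:justin-backup/workspace/"
  if [ "$mode" = "apply" ]; then
    rclone sync "$src" "$dst" --transfers 8
  else
    rclone sync "$src" "$dst" --transfers 8 --dry-run
  fi
}
```

`b2_monthly_cost 2700` reproduces the workspace estimate and `b2_monthly_cost 500` the photos-only estimate. Note that `rclone sync` makes the destination match the source, including deletions, so a bucket with versioning or lifecycle rules is the safer target for a backup.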

5. Task Queue / Dispatcher Architecture

Current state: dispatcher failed Apr 14, 2482 backlogged tasks (2177 false-positive security alerts from port 8100 filter bug). Before restarting:

Action	Why
Patch `scripts/monitors/security-check.sh:54`, add `:8100` to `KNOWN_PORTS_FILTER`	Stop the false positive firehose
Delete all `infra-alert-*.json` and `security-alert-*.json` in `tasks/pending/`	Clean slate
Verify task-creator dedup logic	Prevent future storm: suppress alert if identical alert already queued in last N min
`systemctl enable --now forge-dispatcher`	Restart dispatcher
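
The backlog purge in the table above could look like the following sketch. The glob patterns `infra-alert-*.json` and `security-alert-*.json` are an assumption about how the alert files are named, and `purge_alerts` is a hypothetical helper; it uses `find` rather than a shell glob so a backlog of thousands of files cannot overflow the argument list.

```shell
# Delete queued alert files in a pending-task directory and print how many
# were removed. File-name patterns are assumed, not confirmed.
purge_alerts() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f \
    \( -name 'infra-alert-*.json' -o -name 'security-alert-*.json' \) \
    -print -delete | wc -l | tr -d ' '
}
```

Running `purge_alerts tasks/pending` from the repo root prints the number of files deleted; other task types in the same directory are left alone. Dropping `-delete` from the `find` call first gives a safe preview of the count.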

Longer term:

- The file-polling dispatcher pattern is robust but chatty. Consider migrating to the Agent SDK programmatic pattern documented in memory/general/ai-fleet-findings.md for the main dispatcher (keep file bus as audit trail).
- Add a "dispatcher heartbeat" monitor: if the dispatcher hasn't logged in 5 min, notify. (Same monitor style already used for other services.)
- Consider a task TTL: any task older than 24h in pending gets auto-failed and archived.
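
The heartbeat and TTL ideas above both reduce to file-age checks. A minimal sketch, assuming GNU `stat` and `find`, a dispatcher that touches or appends to a log file, and pending tasks stored as `*.json` files; the function names and defaults are hypothetical.

```shell
# Succeed (exit 0) when FILE is missing or was last modified more than
# MAX_AGE seconds ago. The 300 s default mirrors the 5-minute heartbeat idea.
is_stale() {
  local file="$1"
  local max_age="${2:-300}"
  local now
  local mtime
  [ -e "$file" ] || return 0
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")
  [ $(( now - mtime )) -gt "$max_age" ]
}

# List pending task files older than TTL_MIN minutes (default 1440 = 24 h).
expired_tasks() {
  local dir="$1"
  local ttl_min="${2:-1440}"
  find "$dir" -maxdepth 1 -type f -name '*.json' -mmin "+$ttl_min"
}
```

A cron-driven monitor could call `is_stale <dispatcher log path>` and create a notification only when it succeeds, which keeps the existing quiet-when-healthy behavior. The output of `expired_tasks tasks/pending` can be fed to whatever step marks tasks failed and moves them to an archive directory.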

6. Organizational / Operational

Area	Observation	Proposal
Docs drift	`docs/fleet-docs/` claims the dispatcher is running; the `.service` unit has been in a failed state since Apr 14	Add `/fleet-status` to dashboard cron (already partially done) and treat doc updates as part of checkpointing discipline
Immich autostart missing	CT 107 has no `onboot: 1`	Add it if Immich is a keeper; currently photos are only in CT memory until restart
Decommissioned CT 104	Still in Proxmox, tunnel still has `invoiceninja.*` hostname	Destroy CT, delete hostname from tunnel
MEMORY.md discipline	New tools exist in `.claude/skills/` but MEMORY.md hasn't been updated since the Forge platform block	Whenever `.claude/skills/` gains a new dir, MEMORY.md line required. Enforce via a pre-commit hook or a Claude Code hook.
Session checkpoints	5th-prompt hook enforces it but drift still happens	Already solved by existing hook, no change needed
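
The MEMORY.md rule in the table above could be enforced with a check like the one below. It is a sketch only: it assumes a skill counts as documented when its directory name appears anywhere in the memory file, and `missing_skills` is a hypothetical helper that takes both paths as arguments.

```shell
# Print every skill directory whose name does not appear in the memory file.
# Matching is a plain substring search, so it is deliberately lenient.
missing_skills() {
  local skills_dir="$1"
  local memory_file="$2"
  local d
  local name
  for d in "$skills_dir"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    grep -qF -- "$name" "$memory_file" || printf '%s\n' "$name"
  done
}
```

In a pre-commit hook, something like `missing=$(missing_skills .claude/skills MEMORY.md); [ -z "$missing" ] || { echo "Undocumented skills: $missing"; exit 1; }` would block the commit until each new skill has a line. Because the match is a substring search, a short name such as `audit` would be satisfied by a mention of `sec-audit`; a stricter pattern may be worth it if skill names overlap.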

7. Upgrade Order (Impact × Effort)

Tier 1: Do this week (high impact, ≤2 hr each):

#	Item	Effort	Cost
1	Fix workspace backup cron path	2 min	$0
2	Fix port 8100 false positive in security monitor	1 min	$0
3	Purge task backlog + restart dispatcher	5 min	$0
4	Add `1.1.1.1` fallback DNS to every LXC	10 min	$0
5	Add Cloudflare Access to `homeassistant.*`	5 min	$0
6	Investigate & document `/mnt/seagate` role	15 min	$0
7	Set up Proxmox Backup Server on `/mnt/seagate`	2 hr	$0

Tier 2: Do this month (medium impact, hardware involved):

#	Item	Effort	Cost
8	Upgrade Finn to 96GB RAM (2×48GB Crucial SO-DIMM DDR5)	15 min + downtime	~$170
9	Add UPS on Finn + network gear	30 min	~$200
10	Add Coral TPU (PCIe or USB) to Frigate	30 min	~$30–60
11	Offsite backup pipeline: rclone → Backblaze B2	1 hr	$10–15/mo
12	Immich autostart + backup to `/mnt/storage/`	30 min	$0

Tier 3: Plan for quarter (larger architectural):

#	Item	Effort	Cost
13	Caddy/Traefik reverse proxy on UDev + tunnel consolidation	4 hr	$0
14	Prometheus + Grafana on UDev	3 hr	$0
15	Dispatcher migration to Agent SDK	8 hr	$0
16	Second MS-01 for Proxmox cluster HA	1 day	~$600–900
17	Pin all Docker images to specific tags / SHA digests	2 hr	$0

8. Nothing to Change: Working Well

Three-tier storage model (Drive / Workspace NVMe / Media HDD) is clear, documented, correctly provisioned by use case
Gluetun VPN killswitch for Arr stack, verified working, no IP leaks
NFS exports locked to UDev IP, mounts persistent via _netdev fstab
Cloudflare tunnel (no open ports) architecture is exactly right
Monitor-only-when-broken task creation pattern, zero tokens on healthy days
Google Drive via rclone FUSE, smart use of cloud for shareable docs
iGPU passthrough for Plex, max perf, no separate GPU needed
Forge repo = shared brain, single source of truth for all agents, well-instrumented

[Claude Code], sec-audit (Opus47) session