Research: Self-Hosted Security, Monitoring, and Observability Stack

URL: https://mkdocs.justinsforge.com/memory/research/self-host-security-monitoring-observability-2026-05-24/

Date: 2026-05-24 Depth: deep Model: sonnet

TL;DR

Core stack (Tier 1 + 2): Grafana + Prometheus + Loki + Alloy (unified telemetry) + Uptime Kuma + CrowdSec + PBS. Estimated ~2.6GB RAM total, trivial on 128GB Finn.
Backup strategy is the biggest gap: PBS covers VM/CT snapshots natively; pair it with Restic for off-site data-level backups to B2/S3. Without a verified off-site backup, a single host failure can mean permanent data loss.
CrowdSec over Fail2ban: crowdsourced community blocklist + Cloudflare bouncer integration = known attackers blocked at the Cloudflare edge before they reach Finn. Fail2ban is single-node and reactive only.
Wazuh is overkill for now: 4-8GB RAM for the indexer alone; defer unless ERPNext compliance requirements become formal. Loki + CrowdSec covers 90% of what Wazuh does for this scale.
Loki wins over Graylog/OpenSearch: Graylog requires a full OpenSearch backend (4GB+ RAM), whereas Loki indexes only labels, stores raw log chunks, and integrates natively with the Grafana instance you are already deploying for metrics.

Priority Tiers

Tier 1: Deploy this week (prevents data loss or 2 AM wake-ups)

Tool	Problem it solves
Uptime Kuma	First to know when a service is down, with Telegram alerts
Proxmox Backup Server (PBS)	CT/VM snapshot backups with deduplication and verification
Restic + cron	Off-site file-level backup to Backblaze B2 for critical data dirs
CrowdSec	Blocks brute-force and known malicious IPs via community blocklist
Grafana + Prometheus	Disk usage, RAM, CPU trends so you see saturation before it bites

Tier 2: Deploy this month (improves daily ops visibility)

Tool	Problem it solves
Loki + Grafana Alloy	Centralized log search across all LXCs from one UI
Trivy (cron)	Weekly container image CVE scan, alerts on critical findings
Prometheus SNMP exporter	Mikrotik switch/router metrics into the same Grafana stack
node_exporter on each LXC	Per-container CPU/RAM/disk/network in Grafana

Tier 3: Nice to have, not urgent

Tool	Problem it solves
step-ca	Internal PKI for service-to-service mTLS (low urgency given Cloudflare Tunnels handle external TLS)
Wazuh	Full SIEM: file integrity, compliance dashboards, audit trails for ERPNext
LibreNMS	Network topology maps, SNMP-based full network NMS for Mikrotik
ntopng	Deep flow-level traffic analysis (useful if you suspect unusual traffic patterns)

Findings

1. Centralized Logging

Loki (Grafana Labs)

What it does: Stores logs indexed only by labels (host, container, service), not full-text. Ships log chunks to local disk or object storage. LogQL query language. Natively visualized in Grafana alongside metrics.

Why it matters for Finn: Graylog and OpenSearch both require a full inverted-text-index backend. OpenSearch recommends 4-8GB RAM for the indexer alone [1]. Loki avoids this entirely by not indexing log content, only metadata labels. This trades full-text search speed for massive storage and RAM savings.
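The trade-off described above can be illustrated with a toy store. This is a minimal sketch of the label-only indexing idea, not Loki's actual implementation; the class name, host names, and log lines are invented for illustration.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: indexes streams by label set, never by log content."""

    def __init__(self):
        # Maps a frozen label set (one "stream") to its raw log lines.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Select streams by labels, then filter lines with a brute-force scan."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # cheap: label comparison only
                # expensive: substring scan over the selected streams' raw lines
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedStore()
store.push({"host": "lxc-n8n", "service": "n8n"}, "03:02:11 worker crashed: out of memory")
store.push({"host": "lxc-n8n", "service": "n8n"}, "03:02:15 worker restarted")
store.push({"host": "lxc-ha", "service": "homeassistant"}, "03:02:12 worker crashed: timeout")

crashes = store.query({"service": "n8n"}, contains="crashed")
```

The index holds only two entries (one per label set) regardless of how many lines arrive, which is where the RAM saving comes from. The cost is that a content search has to scan every line in the selected streams, so narrow label selectors matter.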

vs Graylog: Graylog is a UI layer on top of OpenSearch/Elasticsearch. You pay the OpenSearch RAM tax plus a Graylog server process. Adds GELF input support and alert rules, but doubles the resource footprint vs Loki. Not worth it at this scale.

vs OpenSearch standalone: OpenSearch wins on full-text search (grep across unstructured text) and analytics. If you need to do compliance-grade log forensics for ERPNext, OpenSearch is better. For operational observability (why did n8n crash at 3 AM?), Loki is sufficient and an order of magnitude cheaper.

Complexity: Medium (configure Alloy log scraping per LXC, set up Loki datasource in Grafana)
RAM/CPU: Loki: 300-500MB RAM; Alloy: 30-50MB per LXC, so 8 LXCs come to roughly 240-400MB total for agents.
Docker/LXC ready: Yes, official Docker images [1]

Grafana Alloy (log + metrics collector)

Grafana's OpenTelemetry Collector distribution, replacing Promtail (for Loki) and standalone node_exporter processes [2]. One binary per LXC ships logs to Loki and metrics to Prometheus, which reduces operational overhead compared with managing two separate agents.

2. Metrics and Dashboards

Grafana + Prometheus

What it does: Prometheus scrapes metrics from node_exporter (per LXC), cAdvisor (Docker containers), Proxmox exporter, and application exporters. Grafana visualizes everything with community dashboard templates (10,000+ available for Proxmox, n8n, AdGuard, etc.).

vs Netdata: Netdata claims 36% less CPU, 88% less RAM, and 97% less disk I/O than Prometheus at scale [4]. It is faster to deploy (auto-discovery, zero config) and has per-second resolution vs Prometheus default of 15s. However: Netdata uses a non-standard query language, has weaker ecosystem integration (fewer exporters for Proxmox/ERPNext/CrowdSec), and its Grafana integration is an add-on rather than native. Recommendation: Grafana + Prometheus for the ecosystem breadth. You can layer Netdata on top of specific hosts later if you want per-second granularity for troubleshooting.

Complexity: Medium (one Prometheus per stack, node_exporter on each LXC via Alloy, community dashboards import by ID)
RAM/CPU: Prometheus: 300-500MB, Grafana: 200-350MB, node_exporter: 25MB each
Docker/LXC ready: Yes, well-documented Proxmox LXC deployment pattern [1][2]

Key community dashboards:
- Proxmox VE: ID 10347 or 21737
- AdGuard Home: ID 13946
- Node Exporter Full: ID 1860
- Docker/cAdvisor: ID 193
- CrowdSec: available via CrowdSec Hub
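Beyond dashboards, the same metrics can be read programmatically through Prometheus' HTTP API, which is useful for the "see saturation before it bites" goal. The sketch below builds an instant-query URL and picks out filesystems above a usage threshold from a response. The Prometheus address, host names, and sample values are assumptions for illustration; the metric names are the standard node_exporter filesystem metrics.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; adjust to your deployment.
PROMETHEUS_URL = "http://prometheus.lan:9090"

# PromQL: percentage of filesystem space used, from node_exporter metrics.
DISK_USED_PCT = (
    '100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}'
    ' / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})'
)


def instant_query_url(base_url, promql):
    """Build the URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def series_above(response, threshold):
    """Return (instance, mountpoint, value) for every series above threshold."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    rows = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the sample value arrives as a string
        value = float(raw_value)
        if value > threshold:
            labels = series["metric"]
            rows.append((labels.get("instance"), labels.get("mountpoint"), round(value, 1)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample payload in the shape the instant-query endpoint returns (values invented).
sample = json.loads("""
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"instance": "lxc-n8n:9100", "mountpoint": "/"}, "value": [1716500000, "91.42"]},
  {"metric": {"instance": "lxc-ha:9100", "mountpoint": "/"}, "value": [1716500000, "37.05"]}
]}}
""")

full_disks = series_above(sample, threshold=85)
```

In a real deployment the payload would come from fetching `instant_query_url(PROMETHEUS_URL, DISK_USED_PCT)`. For ongoing alerting, a Prometheus alert rule or Grafana alert on the same expression is the more usual route.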

3. Uptime Monitoring

Uptime Kuma

What it does: Self-hosted uptime monitoring for HTTP/HTTPS, TCP ports, DNS, ping, Docker containers. Status page, Telegram/Slack/ntfy notifications, 90+ notification channels [3].

Why it wins: No meaningful competition in the self-hosted space for this price/complexity point. Freshping is SaaS. Statping is unmaintained. Uptime Kuma is actively maintained, Docker-native, and integrates directly with your Telegram notify stack.

Key monitors to set up: Cloudflare Tunnel endpoints, n8n webhook, Home Assistant, Plex, ERPNext, PBS web UI, AdGuard.

Complexity: Easy (single Docker container, 5-minute setup)
RAM/CPU: ~50-100MB RAM, negligible CPU
Docker/LXC ready: Yes [3]

4. Backups

Proxmox Backup Server (PBS)

What it does: Native Proxmox CT/VM backup with SHA-256 integrity verification, AES-256-GCM client-side encryption, zstd compression, incremental backups, deduplication. Written in Rust for high throughput [5]. Can be installed as an LXC on Finn itself pointing to the 24TB HDD pool.

Why it is Tier 1: PBS can verify backup data integrity (via verify jobs on the datastore) and restore individual files from a snapshot without a full restore. Built-in prune policies handle daily/weekly/monthly retention. Native Proxmox VE integration means backups can be scheduled and triggered from the PVE UI.

Limitation: Backs up CT/VM images, not individual application data. A corrupted database inside a CT will be snapshotted as-is.

Complexity: Medium (LXC install, configure datastore, set backup schedules in PVE)
RAM/CPU: 512MB-1GB RAM for PBS itself on Finn

Restic (off-site data backups)

What it does: Encrypts and deduplicates file-level backups to any backend: S3, Backblaze B2, SFTP, rclone targets [6]. restic check --read-data verifies backup integrity by reading back and decrypting all chunks.

Recommended use: One scheduled job per LXC (cron or a systemd timer) that backs up /opt/<service>/data, /etc/<service>/, and database dumps to B2, and reports to forge_notify.sh on failure.
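A scheduled backup job with failure notification could look like the following minimal sketch. The bucket name, paths, and password-file location are hypothetical, and the way forge_notify.sh takes its message (as the first argument) is an assumption, since the source does not describe its interface. Restic reads B2 credentials from the B2_ACCOUNT_ID and B2_ACCOUNT_KEY environment variables.

```python
import subprocess

# Assumptions for illustration: bucket name and paths are hypothetical.
REPO = "b2:example-bucket:finn/n8n"
PATHS = ["/opt/n8n/data", "/etc/n8n"]
NOTIFY = "forge_notify.sh"  # the notifier named in this document; arguments assumed


def restic_backup_cmd(repo, paths, password_file):
    """Build a `restic backup` invocation for the given repository and paths."""
    return ["restic", "--repo", repo, "--password-file", password_file, "backup", *paths]


def run_backup(cmd, run=subprocess.run, notify=None):
    """Run the backup; call notify(message) if restic exits non-zero."""
    result = run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        tail = result.stderr.strip()[-200:]  # keep the alert short
        if notify is not None:
            notify(f"restic backup failed ({result.returncode}): {tail}")
        return False
    return True


def telegram_notify(message):
    # Assumed calling convention: the message is passed as the first argument.
    subprocess.run([NOTIFY, message], check=False)
```

A cron entry or systemd timer would then call `run_backup(restic_backup_cmd(REPO, PATHS, "/root/.restic-pass"), notify=telegram_notify)`. The `run` parameter exists only so the failure path can be exercised without a real repository.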

vs Borgmatic: BorgBackup + Borgmatic wrapper does deduplication, encryption, and scheduling in one tool. Borg uses a different chunking algorithm and stores chunks differently; performance is comparable to Restic. Borgmatic adds hooks and notifications. The choice between them is largely preference; Restic has wider cloud backend support and simpler mental model. Either works.

Complexity: Easy-Medium (Restic binary + cron/systemd timer)
RAM/CPU: 100-200MB spike during backup, nothing at idle

Recommended strategy for Justin:
1. PBS on Finn: nightly CT/VM snapshots to 24TB HDD, 7 daily + 4 weekly + 3 monthly retention
2. Restic to B2: nightly critical data dirs (n8n data, HA config, ERPNext DB dump, Immich library index) with --read-data weekly verify + notify on failure
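To see what a 7 daily + 4 weekly + 3 monthly policy actually keeps, the sketch below applies a simplified version of it to a list of nightly snapshot times. It is an illustration only: PBS and Restic each implement their own pruning rules, which differ in detail (for example in how overlapping periods are counted), so treat the function as an approximation rather than either tool's algorithm.

```python
from datetime import datetime, timedelta


def select_keep(snapshots, daily=7, weekly=4, monthly=3):
    """Return the snapshot times to keep under a daily/weekly/monthly policy."""
    rules = [
        (daily, lambda t: t.date()),
        (weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),
    ]
    keep = set()
    for limit, bucket_of in rules:
        seen = []
        for snap in sorted(snapshots, reverse=True):  # newest first
            bucket = bucket_of(snap)
            if bucket not in seen:
                if len(seen) == limit:
                    break  # enough periods covered for this rule
                seen.append(bucket)
                keep.add(snap)  # newest snapshot of a period not yet covered
    return sorted(keep)


# 120 nightly snapshots ending 2026-05-24 at 02:00.
newest = datetime(2026, 5, 24, 2, 0)
nightly = [newest - timedelta(days=n) for n in range(120)]
kept = select_keep(nightly)
```

With this simplified rule, the newest snapshot satisfies the daily, weekly, and monthly rules at once, so the number of snapshots kept is smaller than 7 + 4 + 3.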

5. Intrusion Detection

CrowdSec (recommended)

What it does: Behavior-based log analysis with a crowdsourced IP blocklist (community CTI). Modular: Log Processor detects attacks, Bouncers enforce blocks at iptables, nginx, Cloudflare, Traefik level [7]. The Cloudflare Bouncer blocks malicious IPs at the CDN edge before traffic reaches Finn.
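Behavior-based detection of this kind is commonly modelled as a leaky bucket, which is also the model CrowdSec's scenario documentation uses. The toy below shows the idea only; it is not CrowdSec's implementation, and the capacity, leak rate, and IP addresses (documentation ranges) are invented for illustration.

```python
from collections import defaultdict


class LeakyBucket:
    """One bucket per source IP: events pour in, the level drains at a fixed rate."""

    def __init__(self, capacity, leak_seconds):
        self.capacity = capacity          # events the bucket holds before overflowing
        self.leak_seconds = leak_seconds  # seconds for one event to drain out
        self.level = 0.0
        self.last_seen = None

    def pour(self, timestamp):
        """Add one event; return True if the bucket overflows."""
        if self.last_seen is not None:
            drained = (timestamp - self.last_seen) / self.leak_seconds
            self.level = max(0.0, self.level - drained)
        self.last_seen = timestamp
        self.level += 1
        return self.level > self.capacity


def detect(events, capacity=5, leak_seconds=10):
    """events: iterable of (unix_time, ip) failed-login records. Returns offending IPs."""
    buckets = defaultdict(lambda: LeakyBucket(capacity, leak_seconds))
    offenders = set()
    for timestamp, ip in sorted(events):
        if buckets[ip].pour(timestamp):
            offenders.add(ip)
    return offenders


# One burst of failures and one slow trickle.
burst = [(1000 + i, "203.0.113.7") for i in range(8)]          # 8 failures in 8 seconds
trickle = [(1000 + 60 * i, "198.51.100.4") for i in range(8)]  # 1 failure per minute
banned = detect(burst + trickle)
```

The burst overflows its bucket and is flagged, while the same number of failures spread over several minutes drains away between events. A bouncer would then act on the flagged IPs.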

vs Fail2ban: Fail2ban is regex + iptables, single-node, no community intelligence. CrowdSec adds the crowdsourced blocklist (IPs attacking other CrowdSec users worldwide get pre-blocked on yours), Cloudflare-level blocking, WAF capability, and a proper local API. Resource overhead is comparable: both are lightweight. CrowdSec wins on every axis except simplicity.

Cloudflare integration: CrowdSec has a Cloudflare Bouncer that syncs blocked IPs to your Cloudflare IP firewall rules automatically. Given you use Cloudflare Tunnels, this is a natural fit.

Complexity: Medium (install crowdsec + bouncer, configure log sources per service)
RAM/CPU: ~50-100MB RAM for the security engine + local API
Docker/LXC ready: Yes, Debian package or Docker [7]

Wazuh (Tier 3)

What it does: Full XDR/SIEM: file integrity monitoring, vulnerability detection, regulatory compliance dashboards (PCI DSS, GDPR, NIST 800-53), audit trails, log aggregation. Three components: indexer (OpenSearch-based), server, dashboard [8].

Why it is Tier 3: The indexer alone requires 4GB RAM minimum (OpenSearch) and 8GB recommended. For a 10-service homelab, this is disproportionate unless ERPNext compliance is a formal business requirement. The compliance dashboards are genuinely useful for business auditing (GDPR for Wiebelhaus Enterprises) but can be deferred.

Complexity: Hard (three-component stack, agent on every endpoint)
RAM/CPU: 6-10GB RAM total for the full stack

6. Vulnerability Scanning

Trivy (Aqua Security)

What it does: Scans container images, filesystems, and git repos for CVEs against multiple vulnerability databases (NVD, GitHub Advisory, etc.). No persistent daemon; runs as a CLI command or scheduled job [9].

Recommended use: Weekly cron on Finn that scans each running container image, outputs SARIF/JSON, pipes critical/high findings to forge_notify.sh via Telegram. No persistent RAM cost.
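The filtering step of such a job might look like this minimal sketch, which reads a Trivy JSON report and keeps only critical/high findings. The image name, CVE IDs, and versions in the sample are invented; the field names follow Trivy's JSON report format, but check them against the version you install.

```python
import json

ALERT_SEVERITIES = {"CRITICAL", "HIGH"}


def trivy_cmd(image):
    """Command line for a JSON-format image scan."""
    return ["trivy", "image", "--quiet", "--format", "json", image]


def urgent_findings(report, severities=ALERT_SEVERITIES):
    """Flatten a Trivy JSON report into one dict per matching vulnerability."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" is missing when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                findings.append({
                    "target": result.get("Target"),
                    "id": vuln.get("VulnerabilityID"),
                    "package": vuln.get("PkgName"),
                    "installed": vuln.get("InstalledVersion"),
                    "fixed": vuln.get("FixedVersion", "no fix yet"),
                    "severity": vuln["Severity"],
                })
    return findings


def format_alert(image, findings):
    """Render findings as a short plain-text message for a notifier."""
    lines = [f"{image}: {len(findings)} critical/high CVEs"]
    for item in findings:
        lines.append(
            f"  {item['severity']} {item['id']} {item['package']} "
            f"{item['installed']} -> {item['fixed']}"
        )
    return "\n".join(lines)


# Trimmed sample in the shape of a Trivy JSON report (IDs and versions invented).
sample = json.loads("""
{"Results": [{"Target": "example/app:1.0 (debian 12)", "Vulnerabilities": [
  {"VulnerabilityID": "CVE-0000-0001", "PkgName": "openssl", "InstalledVersion": "3.0.1",
   "FixedVersion": "3.0.2", "Severity": "CRITICAL"},
  {"VulnerabilityID": "CVE-0000-0002", "PkgName": "zlib", "InstalledVersion": "1.2.0",
   "Severity": "LOW"}
]}, {"Target": "app/requirements.txt"}]}
""")

alerts = urgent_findings(sample)
message = format_alert("example/app:1.0", alerts)
```

The weekly job would run `trivy_cmd(image)` for each running image, parse its stdout with `json.loads`, and hand `message` to the notifier only when `alerts` is non-empty.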

vs OpenVAS: OpenVAS is a full network vulnerability scanner with a PostgreSQL backend. 2-4GB RAM permanently. Better for network-level scanning (open ports, misconfigs) but overkill for container CVE tracking. Trivy handles the container surface; OpenVAS is Tier 3.

Complexity: Easy (binary install, cron job)
RAM/CPU: ~100-200MB spike during scan, zero at idle
Docker/LXC ready: Yes, Docker image or static binary

7. Certificate Management

step-ca (Smallstep)

What it does: ACME-compatible internal CA that issues X.509 certificates for TLS, mTLS, and document signing. Automates issuance and renewal via standard ACME protocol [10]. Any service using certbot or acme.sh can get certs from step-ca without going to Let's Encrypt.

Relevance for Justin's setup: Most external-facing services use Cloudflare Tunnels (Cloudflare handles TLS termination). Internal service-to-service communication (e.g., n8n calling a local FastAPI, Prometheus scraping endpoints) currently uses HTTP or self-signed certs. step-ca would clean up the "click past certificate warning" problem for internal admin UIs and enable mTLS between services.

Complexity: Medium (configure CA, distribute root cert to browsers/services, configure ACME renewal)
RAM/CPU: ~50MB RAM
Verdict: Tier 3. Solve backups and monitoring first.

8. Audit Logging

ERPNext built-in: ERPNext has native audit trails for financial transactions (doctype version history, activity logs). For a 5-company structure this is the first line of audit compliance.

Loki for operational audit: Structured log ingestion from ERPNext + n8n + HA into Loki provides a searchable audit trail for "who did what when" at the application layer. Sufficient for most small-business compliance needs.
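Structured ingestion works best when services emit one JSON object per line, so that "who did what when" becomes queryable fields rather than free text. The sketch below shows one way a custom Python service (for example, a small integration script) could do that with the standard logging module. The service name, actor, and field names are hypothetical, and whether ERPNext itself emits JSON logs is unverified.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"audit": {...}} are merged into the event.
        event.update(getattr(record, "audit", {}))
        return json.dumps(event, sort_keys=True)


def audit_logger(service, stream=None):
    """Return a logger that writes JSON lines to stream (stderr if None)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for the log file a collector would tail
log = audit_logger("erpnext-bridge", stream=buffer)
log.info(
    "invoice submitted",
    extra={"audit": {"actor": "alice", "action": "submit", "doctype": "Sales Invoice"}},
)
event = json.loads(buffer.getvalue())
```

On the Loki side, lines like this can be parsed at query time with LogQL's `| json` stage and filtered on the extracted fields, which keeps high-cardinality values such as the actor out of the label index.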

Wazuh for formal compliance: If GDPR or financial audit requirements ever become formal for Wiebelhaus Enterprises, Wazuh's compliance dashboards (GDPR, NIST 800-53) are the right tool. Tag this as Tier 3 and revisit when a compliance framework is actually required.

9. Network Monitoring

Prometheus SNMP Exporter (recommended for Mikrotik)

What it does: Scrapes SNMP OIDs from the Mikrotik CRS328 (and any other SNMP-capable device) and exposes them as Prometheus metrics. Visualized in Grafana alongside host metrics. Mikrotik has full SNMP v2c/v3 support.

Why this first: Adds Mikrotik interface counters, bandwidth, error rates, CPU/RAM into your existing Grafana stack with minimal overhead. Same stack, zero new tools.
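Interface counters are cumulative octet totals, so bandwidth is derived from the difference between two polls. The sketch below shows that arithmetic, including a single counter wrap; the sample values are invented. In practice PromQL's `rate()` does the equivalent over a time window and treats any decrease as a counter reset, so this is for understanding the numbers rather than for production use.

```python
def counter_rate_bps(prev, curr, counter_bits=64):
    """Bits per second between two (unix_time, octet_count) samples of one interface.

    Handles a single counter wrap, which 32-bit octet counters hit quickly on fast links.
    """
    (t0, octets0), (t1, octets1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = octets1 - octets0
    if delta < 0:  # counter passed its maximum and restarted near zero
        delta += 2 ** counter_bits
    return delta * 8 / elapsed


# Two polls 15 seconds apart.
rate = counter_rate_bps((0, 1_000_000), (15, 1_000_000 + 18_750_000))

# A 32-bit counter that wrapped between polls.
wrapped = counter_rate_bps((0, 2**32 - 1000), (15, 500), counter_bits=32)
```

Here `rate` works out to 10 Mbit/s (18.75 MB transferred in 15 seconds), and the wrapped case still yields a small positive rate instead of a huge negative one.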

Complexity: Easy-Medium (enable SNMP on Mikrotik, configure SNMP exporter scrape job)
RAM/CPU: ~30MB RAM

LibreNMS (Tier 3)

Full network monitoring platform with SNMP auto-discovery, topology maps, alerting, Mikrotik MIBs. Heavier (~1-2GB RAM) but gives visual network topology and per-device health history. Useful once the Mikrotik environment grows.

ntopng (Tier 3)

Flow-based deep traffic analysis. Requires NetFlow/IPFIX export from Mikrotik (supported). Useful for bandwidth hogs and suspicious traffic patterns. Community edition is free but limited; the features that matter (historical flow analysis) require paid license [11].

Resource Budget

Recommended Tier 1 + 2 stack, running as LXCs on Finn (128GB RAM available):

Component	RAM estimate
Grafana	300MB
Prometheus	400MB
Loki (single-node, local disk)	400MB
Grafana Alloy (8 LXCs x 40MB)	320MB
Uptime Kuma	80MB
PBS LXC	768MB
CrowdSec	100MB
Prometheus SNMP exporter	30MB
node_exporter (8 LXCs x 25MB)	200MB
Total	~2.6GB

Trivy is cron-only (no idle RAM). Wazuh would add 6-10GB if deployed (Tier 3).

2.6GB out of 128GB is about 2% of total RAM. No resource constraint blocks any of these deployments.
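The total and percentage can be rechecked from the table rows with a few lines of arithmetic. The figures are the estimates from the table above; only the dictionary and variable names are introduced here.

```python
# RAM estimates in MB, copied from the resource budget table.
BUDGET_MB = {
    "Grafana": 300,
    "Prometheus": 400,
    "Loki": 400,
    "Grafana Alloy (8 x 40)": 8 * 40,
    "Uptime Kuma": 80,
    "PBS LXC": 768,
    "CrowdSec": 100,
    "Prometheus SNMP exporter": 30,
    "node_exporter (8 x 25)": 8 * 25,
}
HOST_RAM_GB = 128

total_mb = sum(BUDGET_MB.values())
total_gb = total_mb / 1000
share_pct = 100 * total_gb / HOST_RAM_GB

print(f"total: {total_mb} MB (~{total_gb:.1f} GB), {share_pct:.1f}% of {HOST_RAM_GB} GB")
```

Adding a Tier 3 component to the estimate is a matter of adding one more entry, for example Wazuh at 6000-10000 MB.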

Disagreements / Open Questions

Loki vs OpenSearch: Grafana's own comparison blog [1] (biased source) argues for Loki; the counterargument is that Loki's label-only indexing makes ad-hoc grep across unstructured legacy logs painful. If ERPNext's raw Python (Frappe) logs are important to search, consider a lightweight OpenSearch single node (2GB RAM, manageable) over Loki's LogQL. Unverified: whether ERPNext emits structured JSON logs by default.
PBS on Finn vs separate machine: PBS on the same Proxmox host means a catastrophic Finn failure (disk controller, power, fire) takes PBS and primary data simultaneously. Ideal is PBS on a separate machine or at minimum a separate ZFS pool with a different disk controller. Off-site Restic to B2 mitigates this but does not replace a true secondary PBS target.
CrowdSec data sharing: CrowdSec's crowdsourced model depends on sharing your attack telemetry with their central API. This is acceptable for most homelab/business use but worth noting: the attack patterns observed on your hosts are shared externally. If privacy is a concern, disable the central API connection by removing or commenting out the api.server.online_client section in CrowdSec's config.yaml; you lose the community blocklist but keep local detection.
Netdata vs Prometheus: Netdata's benchmark of 88% less RAM [4] is at 4.6M metrics/second, a scale far above this homelab. At 10 services / ~50K metrics, the difference is likely 200-400MB vs 300-500MB. Not material on 128GB. Prometheus wins on ecosystem.
step-ca vs Caddy as internal CA proxy: Caddy's built-in ACME server can issue certs for internal services trivially with zero PKI knowledge required. For the specific use case of eliminating self-signed cert warnings on internal admin UIs (Proxmox, PBS, etc.), Caddy in front of those services is faster to deploy than a full step-ca setup.

Sources

1. Grafana Loki Overview, official Loki architecture and design rationale
2. Grafana Alloy Introduction, unified telemetry collector replacing Promtail + agent; OpenTelemetry distribution
3. Uptime Kuma GitHub Wiki, features, notification methods, Docker install
4. Netdata Open Source, Netdata vs Prometheus+Grafana comparison (vendor source; treat as directionally accurate, not precise)
5. Proxmox Backup Server Introduction, PBS features: deduplication, SHA-256 verification, AES-256-GCM encryption, zstd compression
6. Restic Introduction, restic backup tool; restic check --read-data for verification
7. CrowdSec Security Engine Docs, architecture: log processor + bouncers + community CTI
8. Wazuh Components, Wazuh indexer (OpenSearch), server, dashboard, agent architecture; XDR+SIEM features
9. Trivy GitHub, Aqua Security container + filesystem vulnerability scanner
10. step-ca Smallstep Docs, ACME-compatible internal CA; X.509 issuance + renewal automation
11. ntopng ntop, network traffic analysis, flow-based monitoring

Search trail

Fetched: grafana.com/docs/loki/latest (Loki overview, architecture)
Fetched: uptime.kuma.pet + github.com/louislam/uptime-kuma/wiki (features, notification channels)
Fetched: docs.crowdsec.net/docs/intro (CrowdSec architecture, bouncers, community CTI)
Fetched: pbs.proxmox.com/docs/introduction.html (PBS features, deduplication, verification)
Fetched: restic.readthedocs.io (Restic quickstart, check command)
Fetched: borgbackup.readthedocs.io (Borg quickstart, encryption, remote repos)
Fetched: github.com/aquasecurity/trivy (Trivy scanner overview)
Fetched: smallstep.com/docs/step-ca (step-ca PKI, ACME, X.509)
Fetched: documentation.wazuh.com/current/getting-started/components (Wazuh components, RAM requirements)
Fetched: grafana.com/docs/alloy/latest (Alloy unified collector, migration from Promtail)
Fetched: netdata.cloud/open-source (Netdata vs Prometheus+Grafana comparison)
Fetched: ntop.org/products/traffic-analysis/ntopng (ntopng network monitoring)
Attempted (404): crowdsec.net/blog/crowdsec-vs-fail2ban (moved; content synthesized from docs.crowdsec.net)
Attempted (failed): docs.wazuh.com use-cases (network error; synthesized from components page)