Research: Self-Hosted Security, Monitoring, and Observability Stack
URL: https://mkdocs.justinsforge.com/memory/research/self-host-security-monitoring-observability-2026-05-24/
Date: 2026-05-24 Depth: deep Model: sonnet
TL;DR
- Core stack (Tiers 1 and 2): Grafana + Prometheus + Loki + Alloy (unified telemetry) + Uptime Kuma + CrowdSec + PBS. Estimated ~2.6GB RAM total, trivial on Finn's 128GB.
- Backup strategy is the biggest gap: PBS covers VM/CT snapshots natively; pair with Restic for off-site data-level backups to B2/S3. No verified off-site backup = guaranteed data loss eventually.
- CrowdSec over Fail2ban: crowdsourced community blocklist + Cloudflare bouncer integration = blocks known attackers before they even connect. Fail2ban is single-node reactive only.
- Wazuh is overkill for now: 4-8GB RAM for the indexer alone; defer unless ERPNext compliance requirements become formal. Loki + CrowdSec covers most of what Wazuh would do at this scale.
- Loki wins over Graylog/OpenSearch: Graylog requires a full OpenSearch backend (4GB+ RAM), whereas Loki indexes only labels, stores raw log chunks, and integrates natively with the Grafana instance you are already deploying for metrics.
Priority Tiers
Tier 1: Deploy this week (prevents data loss or 2 AM wake-ups)
| Tool | Problem it solves |
|---|---|
| Uptime Kuma | First to know when a service is down, with Telegram alerts |
| Proxmox Backup Server (PBS) | CT/VM snapshot backups with deduplication and verification |
| Restic + cron | Off-site file-level backup to Backblaze B2 for critical data dirs |
| CrowdSec | Blocks brute-force and known malicious IPs via community blocklist |
| Grafana + Prometheus | Disk usage, RAM, CPU trends so you see saturation before it bites |
Tier 2: Deploy this month (improves daily ops visibility)
| Tool | Problem it solves |
|---|---|
| Loki + Grafana Alloy | Centralized log search across all LXCs from one UI |
| Trivy (cron) | Weekly container image CVE scan, alerts on critical findings |
| Prometheus SNMP exporter | Mikrotik switch/router metrics into the same Grafana stack |
| node_exporter on each LXC | Per-container CPU/RAM/disk/network in Grafana |
Tier 3: Nice to have, not urgent
| Tool | Problem it solves |
|---|---|
| step-ca | Internal PKI for service-to-service mTLS (low urgency given Cloudflare Tunnels handle external TLS) |
| Wazuh | Full SIEM: file integrity, compliance dashboards, audit trails for ERPNext |
| LibreNMS | Network topology maps, SNMP-based full network NMS for Mikrotik |
| ntopng | Deep flow-level traffic analysis (useful if you suspect unusual traffic patterns) |
Findings
1. Centralized Logging
Loki (Grafana Labs)
What it does: Stores logs indexed only by labels (host, container, service), not full-text. Ships log chunks to local disk or object storage. LogQL query language. Natively visualized in Grafana alongside metrics.
Why it matters for Finn: Graylog and OpenSearch both require a full inverted-text-index backend. OpenSearch recommends 4-8GB RAM for the indexer alone [1]. Loki avoids this entirely by not indexing log content, only metadata labels. This trades full-text search speed for massive storage and RAM savings.
vs Graylog: Graylog is a UI layer on top of OpenSearch/Elasticsearch. You pay the OpenSearch RAM tax plus a Graylog server process. Adds GELF input support and alert rules, but doubles the resource footprint vs Loki. Not worth it at this scale.
vs OpenSearch standalone: OpenSearch wins on full-text search (grep across unstructured text) and analytics. If you need to do compliance-grade log forensics for ERPNext, OpenSearch is better. For operational observability (why did n8n crash at 3 AM?), Loki is sufficient and an order of magnitude cheaper.
Complexity: Medium (configure Alloy log scraping per LXC, set up Loki datasource in Grafana)
RAM/CPU: Loki: 300-500MB RAM; Alloy: 30-50MB per LXC, so 8 LXCs = ~240-400MB total for agents.
Docker/LXC ready: Yes, official Docker images [1]
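As a rough illustration of the label-first query model described above, the sketch below builds a LogQL query and the matching request URL for Loki's query_range HTTP endpoint. The Loki address, the host label value, and the helper names are assumptions made for this example, not part of the deployment described in this note.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical Loki address; replace with the real LXC hostname or IP.
LOKI_URL = "http://loki.example.internal:3100"


def build_logql(labels, contains=""):
    """Build a LogQL query: a label selector plus an optional line filter."""
    # Labels are what Loki indexes; the line filter only scans matching streams.
    # Values are assumed not to contain double quotes (no escaping is done here).
    selector = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains:
        query += f' |= "{contains}"'
    return query


def query_range_url(query, start, end, limit=200):
    """Return the URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": query,
        # Loki accepts Unix epoch timestamps in nanoseconds.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
        "direction": "backward",  # newest lines first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


# Example: n8n error lines in the hour around 3 AM UTC.
crash_time = datetime(2026, 5, 24, 3, 0, tzinfo=timezone.utc)
query = build_logql({"host": "n8n"}, contains="error")
url = query_range_url(
    query,
    crash_time - timedelta(minutes=30),
    crash_time + timedelta(minutes=30),
)
```

The narrower the label selector, the fewer chunks Loki has to scan for the line filter, which is the practical consequence of indexing labels only.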
Grafana Alloy (log + metrics collector)
Unified OpenTelemetry collector replacing Promtail (for Loki) and standalone node-exporter runs [2]. One binary per LXC, ships logs to Loki AND metrics to Prometheus. Reduces operational overhead vs managing two separate agents.
2. Metrics and Dashboards
Grafana + Prometheus
What it does: Prometheus scrapes metrics from node_exporter (per LXC), cAdvisor (Docker containers), Proxmox exporter, and application exporters. Grafana visualizes everything with community dashboard templates (10,000+ available for Proxmox, n8n, AdGuard, etc.).
vs Netdata: Netdata claims 36% less CPU, 88% less RAM, and 97% less disk I/O than Prometheus at scale [4]. It is faster to deploy (auto-discovery, zero config) and has per-second resolution vs Prometheus default of 15s. However: Netdata uses a non-standard query language, has weaker ecosystem integration (fewer exporters for Proxmox/ERPNext/CrowdSec), and its Grafana integration is an add-on rather than native. Recommendation: Grafana + Prometheus for the ecosystem breadth. You can layer Netdata on top of specific hosts later if you want per-second granularity for troubleshooting.
Complexity: Medium (one Prometheus per stack, node_exporter on each LXC via Alloy, community dashboards import by ID)
RAM/CPU: Prometheus: 300-500MB, Grafana: 200-350MB, node_exporter: 25MB each
Docker/LXC ready: Yes, well-documented Proxmox LXC deployment pattern [1][2]
Key community dashboards:
- Proxmox VE: ID 10347 or 21737
- AdGuard Home: ID 13946
- Node Exporter Full: ID 1860
- Docker/cAdvisor: ID 193
- CrowdSec: available via CrowdSec Hub
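One way to keep the per-LXC scrape targets manageable is Prometheus file-based service discovery. The sketch below generates a file_sd target list from a small inventory; the LXC names, IP addresses, and the lxc label are placeholders invented for illustration, and 9100 is node_exporter's default port.

```python
import json

# Hypothetical LXC names and addresses; replace with the real inventory.
LXC_HOSTS = {
    "n8n": "10.0.0.11",
    "homeassistant": "10.0.0.12",
    "erpnext": "10.0.0.13",
}


def file_sd_targets(hosts, port=9100):
    """Build Prometheus file_sd target groups, one per LXC.

    The "lxc" label is an illustrative choice that lets dashboards
    filter by container name instead of by IP address.
    """
    return [
        {"targets": [f"{address}:{port}"], "labels": {"lxc": name}}
        for name, address in sorted(hosts.items())
    ]


targets = file_sd_targets(LXC_HOSTS)
# JSON text ready to be written to a file listed under file_sd_configs.
targets_json = json.dumps(targets, indent=2)
```

Writing targets_json to a file referenced by a scrape job's file_sd_configs means adding an LXC is a one-line inventory change; Prometheus picks up changes to that file without a restart.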
3. Uptime Monitoring
Uptime Kuma
What it does: Self-hosted uptime monitoring for HTTP/HTTPS, TCP ports, DNS, ping, Docker containers. Status page, Telegram/Slack/ntfy notifications, 90+ notification channels [3].
Why it wins: No meaningful competition in the self-hosted space for this price/complexity point. Freshping is SaaS. Statping is unmaintained. Uptime Kuma is actively maintained, Docker-native, and integrates directly with your Telegram notify stack.
Key monitors to set up: Cloudflare Tunnel endpoints, n8n webhook, Home Assistant, Plex, ERPNext, PBS web UI, AdGuard.
Complexity: Easy (single Docker container, 5-minute setup)
RAM/CPU: ~50-100MB RAM, negligible CPU
Docker/LXC ready: Yes [3]
4. Backups
Proxmox Backup Server (PBS)
What it does: Native Proxmox CT/VM backup with SHA-256 integrity verification, AES-256-GCM client-side encryption, zstd compression, incremental backups, deduplication. Written in Rust for high throughput [5]. Can be installed as an LXC on Finn itself pointing to the 24TB HDD pool.
Why it is Tier 1: PBS can verify backup data integrity (scheduled verify jobs, or proxmox-backup-manager verify <datastore> on the server) and restore individual files from a snapshot without a full restore. Built-in prune policies (daily/weekly/monthly retention). Native Proxmox VE integration means one-click backup from the PVE UI.
Limitation: Backs up CT/VM images, not individual application data. A corrupted database inside a CT will be snapshotted as-is.
Complexity: Medium (LXC install, configure datastore, set backup schedules in PVE)
RAM/CPU: 512MB-1GB RAM for PBS itself on Finn
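To make the daily/weekly/monthly prune idea concrete, here is a simplified model of how such a policy selects snapshots to keep, using 7 daily, 4 weekly, and 3 monthly as example values. This is a sketch of the general technique under the assumption of one rule per period type applied newest-first; it is not PBS's actual prune implementation, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def select_kept(snapshots, keep_daily, keep_weekly, keep_monthly):
    """Return the set of snapshot timestamps a daily/weekly/monthly policy keeps."""
    ordered = sorted(snapshots, reverse=True)  # newest first
    kept = set()
    rules = [
        (keep_daily, lambda t: t.date()),
        (keep_weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda t: (t.year, t.month)),
    ]
    for limit, period_of in rules:
        # Periods already represented by a kept snapshot do not use up
        # this rule's budget.
        covered = {period_of(t) for t in kept}
        chosen = set()
        for t in ordered:
            period = period_of(t)
            if t in kept or period in covered or period in chosen:
                continue
            if len(chosen) >= limit:
                break
            chosen.add(period)
            kept.add(t)  # the newest snapshot in this period
    return kept


# One nightly snapshot at 02:00 for a year, ending 2026-05-24.
newest = datetime(2026, 5, 24, 2, 0)
snapshots = [newest - timedelta(days=i) for i in range(365)]
kept = select_kept(snapshots, keep_daily=7, keep_weekly=4, keep_monthly=3)
```

With a year of nightly snapshots this keeps 14 restore points: the last 7 days, the newest snapshot of each of the 4 preceding weeks, and the newest snapshot of each of the 3 months before those.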
Restic (off-site data backups)
What it does: Encrypts and deduplicates file-level backups to any backend: S3, Backblaze B2, SFTP, rclone targets [6]. restic check --read-data verifies backup integrity by reading back and decrypting all chunks.
Recommended use: A scheduled job per LXC (cron or a systemd timer) backs up /opt/<service>/data, /etc/<service>/, and database dumps to B2, and reports to forge_notify.sh on failure.
vs Borgmatic: BorgBackup + Borgmatic wrapper does deduplication, encryption, and scheduling in one tool. Borg uses a different chunking algorithm and stores chunks differently; performance is comparable to Restic. Borgmatic adds hooks and notifications. The choice between them is largely preference; Restic has wider cloud backend support and simpler mental model. Either works.
Complexity: Easy-Medium (Restic binary + cron/systemd timer)
RAM/CPU: 100-200MB spike during backup, nothing at idle
Recommended strategy for Justin:
1. PBS on Finn: nightly CT/VM snapshots to 24TB HDD, 7 daily + 4 weekly + 3 monthly retention
2. Restic to B2: nightly critical data dirs (n8n data, HA config, ERPNext DB dump, Immich library index) with --read-data weekly verify + notify on failure
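The Restic half of this strategy can be sketched as a small wrapper that runs a nightly backup, does the full --read-data verification once a week, and calls the notify script on failure. The bucket name, paths, script location, and the choice of Sunday are placeholders, and the way forge_notify.sh takes its message (as a first argument) is an assumption; restic is expected to read B2_ACCOUNT_ID, B2_ACCOUNT_KEY, and RESTIC_PASSWORD_FILE from the environment.

```python
import subprocess
from datetime import date

# Hypothetical values: bucket name, backup paths, and notify script location.
REPO = "b2:example-bucket:finn/n8n"
PATHS = ["/opt/n8n/data", "/etc/n8n"]
NOTIFY_SCRIPT = "/usr/local/bin/forge_notify.sh"


def backup_cmd(repo, paths):
    """restic backup of the given paths into the repository."""
    return ["restic", "-r", repo, "backup", *paths]


def check_cmd(repo, read_data):
    """restic check; --read-data additionally reads back every pack file."""
    cmd = ["restic", "-r", repo, "check"]
    if read_data:
        cmd.append("--read-data")
    return cmd


def nightly_commands(today, repo=REPO, paths=PATHS):
    """Backup every night; do the expensive full read-back only on Sundays."""
    is_sunday = today.weekday() == 6
    return [backup_cmd(repo, paths), check_cmd(repo, read_data=is_sunday)]


def run_all(commands, notify_script=NOTIFY_SCRIPT):
    """Run commands in order; notify and stop at the first failure."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Assumption: forge_notify.sh takes the message as its first argument.
            message = f"restic failed: {' '.join(cmd)}"
            subprocess.run([notify_script, message], check=False)
            return False
    return True
```

A cron entry or systemd timer would then call run_all(nightly_commands(date.today())). Keeping the command construction separate from execution makes the schedule logic testable without touching a real repository.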
5. Intrusion Detection
CrowdSec (recommended)
What it does: Behavior-based log analysis with a crowdsourced IP blocklist (community CTI). Modular: Log Processor detects attacks, Bouncers enforce blocks at iptables, nginx, Cloudflare, Traefik level [7]. The Cloudflare Bouncer blocks malicious IPs at the CDN edge before traffic reaches Finn.
vs Fail2ban: Fail2ban is regex + iptables, single-node, no community intelligence. CrowdSec adds the crowdsourced blocklist (IPs attacking other CrowdSec users worldwide get pre-blocked on yours), Cloudflare-level blocking, WAF capability, and a proper local API. Resource overhead is comparable: both are lightweight. CrowdSec wins on every axis except simplicity.
Cloudflare integration: CrowdSec has a Cloudflare Bouncer that syncs blocked IPs to your Cloudflare IP firewall rules automatically. Given you use Cloudflare Tunnels, this is a natural fit.
Complexity: Medium (install crowdsec + bouncer, configure log sources per service)
RAM/CPU: ~50-100MB RAM for the security engine + local API
Docker/LXC ready: Yes, Debian package or Docker [7]
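Behavior-based detection of this kind is often modeled as a leaky bucket, which is also how CrowdSec's documentation describes its scenarios: each suspicious event from an IP pours into a bucket that drains over time, and an overflow triggers a decision. The sketch below is a toy model of that idea with made-up capacity and leak values, not CrowdSec's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Toy leaky bucket: overflows when events arrive faster than they drain."""

    capacity: int           # events the bucket can hold before overflowing
    leak_interval: float    # seconds it takes for one event to drain
    level: float = 0.0
    last_ts: Optional[float] = None

    def pour(self, ts: float) -> bool:
        """Record one event at time ts; return True if the bucket overflows."""
        if self.last_ts is not None:
            leaked = (ts - self.last_ts) / self.leak_interval
            self.level = max(0.0, self.level - leaked)
        self.last_ts = ts
        self.level += 1
        return self.level > self.capacity


# Burst: six failed logins one second apart overflow a capacity-5 bucket.
burst = LeakyBucket(capacity=5, leak_interval=10.0)
burst_results = [burst.pour(float(t)) for t in range(6)]

# Slow trickle: one failure every 20 seconds drains before it can accumulate.
slow = LeakyBucket(capacity=5, leak_interval=10.0)
slow_results = [slow.pour(float(t)) for t in range(0, 200, 20)]
```

The burst trips the bucket on the sixth event while the slow trickle never does. A plain "N failures" regex counter cannot make that distinction without an additional time window.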
Wazuh (Tier 3)
What it does: Full XDR/SIEM: file integrity monitoring, vulnerability detection, regulatory compliance dashboards (PCI DSS, GDPR, NIST 800-53), audit trails, log aggregation. Three components: indexer (OpenSearch-based), server, dashboard [8].
Why it is Tier 3: The indexer alone requires 4GB RAM minimum (OpenSearch) and 8GB recommended. For a 10-service homelab, this is disproportionate unless ERPNext compliance is a formal business requirement. The compliance dashboards are genuinely useful for business auditing (GDPR for Wiebelhaus Enterprises) but can be deferred.
Complexity: Hard (three-component stack, agent on every endpoint)
RAM/CPU: 6-10GB RAM total for the full stack
6. Vulnerability Scanning
Trivy (Aqua Security)
What it does: Scans container images, filesystems, and git repos for CVEs against multiple vulnerability databases (NVD, GitHub Advisory, etc.). No persistent daemon; runs as a CLI command or scheduled job [9].
Recommended use: Weekly cron on Finn that scans each running container image, outputs SARIF/JSON, pipes critical/high findings to forge_notify.sh via Telegram. No persistent RAM cost.
vs OpenVAS: OpenVAS is a full network vulnerability scanner with a PostgreSQL backend that holds 2-4GB RAM permanently. It is better for network-level scanning (open ports, misconfigs) but overkill for container CVE tracking. Trivy handles the container surface; OpenVAS is Tier 3.
Complexity: Easy (binary install, cron job)
RAM/CPU: ~100-200MB spike during scan, zero at idle
Docker/LXC ready: Yes, Docker image or static binary
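The filtering step of the weekly scan could look roughly like the sketch below, which pulls CRITICAL and HIGH findings out of a Trivy JSON report. The sample report is fabricated for illustration (image name, package names, and CVE IDs are placeholders), and the field names follow the JSON layout Trivy emits with --format json.

```python
# Made-up sample shaped like a Trivy JSON report; all values are placeholders.
SAMPLE_REPORT = {
    "Results": [
        {
            "Target": "example/app:1.0 (debian 12)",
            "Vulnerabilities": [
                {
                    "VulnerabilityID": "CVE-0000-0001",
                    "PkgName": "libexample",
                    "InstalledVersion": "1.0",
                    "FixedVersion": "1.1",
                    "Severity": "CRITICAL",
                },
                {
                    "VulnerabilityID": "CVE-0000-0002",
                    "PkgName": "libother",
                    "InstalledVersion": "2.0",
                    "Severity": "LOW",
                },
            ],
        },
        # Targets with no findings may omit the Vulnerabilities key entirely.
        {"Target": "app/requirements.txt"},
    ]
}


def severe_findings(report, severities=("CRITICAL", "HIGH")):
    """Return one summary dict per finding at the given severities."""
    findings = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                findings.append(
                    {
                        "target": result.get("Target", ""),
                        "id": vuln.get("VulnerabilityID", ""),
                        "package": vuln.get("PkgName", ""),
                        "fixed_in": vuln.get("FixedVersion", "no fix yet"),
                        "severity": vuln["Severity"],
                    }
                )
    return findings


def format_alert(findings):
    """One line per finding, suitable as a notification message body."""
    return "\n".join(
        f"{f['severity']} {f['id']} in {f['package']} ({f['target']}), fix: {f['fixed_in']}"
        for f in findings
    )


alerts = severe_findings(SAMPLE_REPORT)
```

In a cron job the report would come from something like trivy image --format json --output report.json <image>, loaded with json.load, with format_alert(alerts) passed to the notify script only when alerts is non-empty.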
7. Certificate Management
step-ca (Smallstep)
What it does: ACME-compatible internal CA that issues X.509 certificates for TLS, mTLS, and document signing. Automates issuance and renewal via standard ACME protocol [10]. Any service using certbot or acme.sh can get certs from step-ca without going to Let's Encrypt.
Relevance for Justin's setup: Most external-facing services use Cloudflare Tunnels (Cloudflare handles TLS termination). Internal service-to-service communication (e.g., n8n calling a local FastAPI, Prometheus scraping endpoints) currently uses HTTP or self-signed certs. step-ca would clean up the "click past certificate warning" problem for internal admin UIs and enable mTLS between services.
Complexity: Medium (configure CA, distribute root cert to browsers/services, configure ACME renewal)
RAM/CPU: ~50MB RAM
Verdict: Tier 3. Solve backups and monitoring first.
8. Audit Logging
ERPNext built-in: ERPNext has native audit trails for financial transactions (doctype version history, activity logs). For a 5-company structure this is the first line of audit compliance.
Loki for operational audit: Structured log ingestion from ERPNext + n8n + HA into Loki provides a searchable audit trail for "who did what when" at the application layer. Sufficient for most small-business compliance needs.
Wazuh for formal compliance: If GDPR or financial audit requirements ever become formal for Wiebelhaus Enterprises, Wazuh's compliance dashboards (GDPR, NIST 800-53) are the right tool. Tag this as Tier 3 and revisit when a compliance framework is actually required.
9. Network Monitoring
Prometheus SNMP Exporter (recommended for Mikrotik)
What it does: Scrapes SNMP OIDs from the Mikrotik CRS328 (and any other SNMP-capable device) and exposes them as Prometheus metrics. Visualized in Grafana alongside host metrics. Mikrotik has full SNMP v2c/v3 support.
Why this first: Adds Mikrotik interface counters, bandwidth, error rates, CPU/RAM into your existing Grafana stack with minimal overhead. Same stack, zero new tools.
Complexity: Easy-Medium (enable SNMP on Mikrotik, configure SNMP exporter scrape job)
RAM/CPU: ~30MB RAM
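SNMP interface counters such as IF-MIB's ifHCInOctets are monotonically increasing byte totals, so bandwidth is derived from the difference between two samples. The sketch below shows that arithmetic, including the reset-to-zero assumption that Prometheus's rate() makes when a counter goes backwards; the function name and sample values are illustrative.

```python
def bits_per_second(prev_octets, curr_octets, interval_s):
    """Average throughput between two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # Counter went backwards (e.g. device reboot): assume it restarted
        # from zero, so everything in curr_octets is new traffic.
        delta = curr_octets
    return delta * 8 / interval_s


# 750,000 bytes in 60 seconds is 100 kbit/s.
normal = bits_per_second(1_000_000, 1_750_000, 60)

# After a reboot the counter is smaller than the previous sample.
after_reset = bits_per_second(5_000_000, 150_000, 60)
```

In Grafana the same calculation is usually written as a rate() over the octet counter multiplied by 8, so this code is only meant to show what that query computes.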
LibreNMS (Tier 3)
Full network monitoring platform with SNMP auto-discovery, topology maps, alerting, Mikrotik MIBs. Heavier (~1-2GB RAM) but gives visual network topology and per-device health history. Useful once the Mikrotik environment grows.
ntopng (Tier 3)
Flow-based deep traffic analysis. Requires NetFlow/IPFIX export from Mikrotik (supported). Useful for bandwidth hogs and suspicious traffic patterns. Community edition is free but limited; the features that matter (historical flow analysis) require a paid license [11].
Resource Budget
Recommended Tier 1 + 2 stack, running as LXCs on Finn (128GB RAM available):
| Component | RAM estimate |
|---|---|
| Grafana | 300MB |
| Prometheus | 400MB |
| Loki (single-node, local disk) | 400MB |
| Grafana Alloy (8 LXCs x 40MB) | 320MB |
| Uptime Kuma | 80MB |
| PBS LXC | 768MB |
| CrowdSec | 100MB |
| Prometheus SNMP exporter | 30MB |
| node_exporter (8 LXCs x 25MB) | 200MB |
| Total | ~2.6GB |
Trivy is cron-only (no idle RAM). Wazuh would add 6-10GB if deployed (Tier 3).
2.6GB out of 128GB available is 2% of total RAM. There is no resource constraint blocking any of these deployments.
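The total and percentage above can be re-derived from the table with a few lines. This sketch assumes decimal units (1GB = 1000MB), which is what makes the rows sum to roughly 2.6GB; the variable names are illustrative.

```python
# RAM estimates in MB, copied from the resource budget table.
BUDGET_MB = {
    "Grafana": 300,
    "Prometheus": 400,
    "Loki (single-node, local disk)": 400,
    "Grafana Alloy": 8 * 40,       # 8 LXCs x 40MB
    "Uptime Kuma": 80,
    "PBS LXC": 768,
    "CrowdSec": 100,
    "Prometheus SNMP exporter": 30,
    "node_exporter": 8 * 25,       # 8 LXCs x 25MB
}

HOST_RAM_MB = 128 * 1000  # Finn's 128GB, in decimal MB

total_mb = sum(BUDGET_MB.values())
total_gb = total_mb / 1000
share_pct = 100 * total_mb / HOST_RAM_MB
```

Adding Wazuh's 6-10GB to the same dictionary would raise the share to roughly 7-10% of host RAM, which is why it is a complexity decision rather than a capacity one.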
Disagreements / Open Questions
- Loki vs OpenSearch: Grafana's own comparison blog [1] (biased source) argues for Loki; the counterargument is that Loki's label-only indexing makes ad-hoc grep across unstructured legacy logs painful. If ERPNext's raw Python (Frappe) logs are important to search, consider a lightweight OpenSearch single-node (2GB RAM, manageable) over Loki's LogQL. Unverified: whether ERPNext emits structured JSON logs by default.
- PBS on Finn vs separate machine: PBS on the same Proxmox host means a catastrophic Finn failure (disk controller, power, fire) takes PBS and primary data simultaneously. Ideal is PBS on a separate machine or at minimum a separate ZFS pool with a different disk controller. Off-site Restic to B2 mitigates this but does not replace a true secondary PBS target.
- CrowdSec data sharing: CrowdSec's crowdsourced model relies on sharing your attack telemetry with their central API. This is acceptable for most homelab/business use but worth noting: your Cloudflare Tunnel IPs and attack patterns are shared externally. If privacy is a concern, disable it by commenting out the api.server.online_client section in /etc/crowdsec/config.yaml; you lose the community blocklist but keep local detection.
- Netdata vs Prometheus: Netdata's benchmark of 88% less RAM [4] is at 4.6M metrics/second, a scale far above this homelab. At 10 services / ~50K metrics, the difference is likely 200-400MB vs 300-500MB. Not material on 128GB. Prometheus wins on ecosystem.
- step-ca vs Caddy as internal CA proxy: Caddy's built-in ACME server can issue certs for internal services trivially with zero PKI knowledge required. For the specific use case of eliminating self-signed cert warnings on internal admin UIs (Proxmox, PBS, etc.), Caddy in front of those services is faster to deploy than a full step-ca setup.
Sources
1. Grafana Loki Overview, official Loki architecture and design rationale
2. Grafana Alloy Introduction, unified telemetry collector replacing Promtail + agent; OpenTelemetry distribution
3. Uptime Kuma GitHub Wiki, features, notification methods, Docker install
4. Netdata Open Source, Netdata vs Prometheus+Grafana comparison (vendor source; treat as directionally accurate, not precise)
5. Proxmox Backup Server Introduction, PBS features: deduplication, SHA-256 verification, AES-256-GCM encryption, zstd compression
6. Restic Introduction, restic backup tool; restic check --read-data for verification
7. CrowdSec Security Engine Docs, architecture: log processor + bouncers + community CTI
8. Wazuh Components, Wazuh indexer (OpenSearch), server, dashboard, agent architecture; XDR+SIEM features
9. Trivy GitHub, Aqua Security container + filesystem vulnerability scanner
10. step-ca Smallstep Docs, ACME-compatible internal CA; X.509 issuance + renewal automation
11. ntopng ntop, network traffic analysis, flow-based monitoring
Search trail
- Fetched: grafana.com/docs/loki/latest (Loki overview, architecture)
- Fetched: uptime.kuma.pet + github.com/louislam/uptime-kuma/wiki (features, notification channels)
- Fetched: docs.crowdsec.net/docs/intro (CrowdSec architecture, bouncers, community CTI)
- Fetched: pbs.proxmox.com/docs/introduction.html (PBS features, deduplication, verification)
- Fetched: restic.readthedocs.io (Restic quickstart, check command)
- Fetched: borgbackup.readthedocs.io (Borg quickstart, encryption, remote repos)
- Fetched: github.com/aquasecurity/trivy (Trivy scanner overview)
- Fetched: smallstep.com/docs/step-ca (step-ca PKI, ACME, X.509)
- Fetched: documentation.wazuh.com/current/getting-started/components (Wazuh components, RAM requirements)
- Fetched: grafana.com/docs/alloy/latest (Alloy unified collector, migration from Promtail)
- Fetched: netdata.cloud/open-source (Netdata vs Prometheus+Grafana comparison)
- Fetched: ntop.org/products/traffic-analysis/ntopng (ntopng network monitoring)
- Attempted (404): crowdsec.net/blog/crowdsec-vs-fail2ban (moved; content synthesized from docs.crowdsec.net)
- Attempted (failed): docs.wazuh.com use-cases (network error; synthesized from components page)