Adding Monitoring: Part 5 of "Building a Resilient Home Server" Series
*Part 5 of "Building a Resilient Home Server" series*
## Where We Left Off
In [Part 4](https://blog.ppb1701.com/evolving-the-server-adding-services-and-hardening-part-4-of-building-a-bulletproof-home-server-serie), I had a working NixOS server running AdGuard Home, Syncthing, Nginx, and Tailscale. Everything was humming along nicely, secured behind SSH keys and a properly configured firewall. The server was doing its job.
But here's the thing: I had absolutely no visibility into what was actually happening on that server.
Was CPU usage spiking during Syncthing syncs? Was disk space slowly filling up? Were services crashing and restarting without me knowing? I had no idea. The server could have been on fire (metaphorically speaking), and I wouldn't know until something stopped working.
More importantly, as I looked ahead to expanding this server with services like **Nextcloud** and potentially building out a **cybersecurity test environment**, I realized that monitoring wasn't just nice to have - it was essential. You can't manage what you can't measure, and you definitely can't troubleshoot what you can't see.
So before adding more complexity, I decided to add a proper monitoring stack.
## Why Monitoring Matters (More Than You Think)
When I first started this project, monitoring felt like overkill. "It's just a home server," I thought. "I'll know if something breaks."
But here's what changed my mind:
### The Silent Failures
Services can fail gracefully, restart automatically, and leave you none the wiser. Without monitoring, you might not notice that AdGuard Home is restarting every few hours due to memory pressure, or that Syncthing is struggling with large file syncs. These aren't catastrophic failures - they're the slow degradation that eventually leads to bigger problems.
### The Slow Degradation
Disk space doesn't run out overnight. Memory leaks don't crash systems immediately. These problems creep up slowly, and by the time you notice, you're in crisis mode instead of prevention mode. With monitoring, you get early warnings - "Hey, you're at 80% disk usage" - giving you time to fix things before they become emergencies.
### The Future-Proofing
Every service I add to this server increases complexity. Nextcloud will add database queries, file operations, and user sessions to monitor. A cybersecurity lab will need resource isolation verification and performance tracking. Setting up monitoring infrastructure *now* means I can easily extend it as I grow, rather than trying to retrofit monitoring into an already complex system.
## Choosing the Right Stack
I researched several monitoring solutions, and the industry-standard stack kept coming up: **Prometheus + Grafana + Alertmanager**. Here's why this combination made sense:
**Prometheus** handles metrics collection and storage. It's a time-series database specifically designed for monitoring, with a powerful query language (PromQL) for analyzing data. It works by "scraping" metrics from configured endpoints at regular intervals - think of it as periodically asking "Hey, what's your CPU usage right now?" and storing the answers.
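To make "regular intervals" concrete, here is a minimal sketch of how the scrape interval and data retention could be set through the NixOS Prometheus module. The values `15s` and `30d` are illustrative assumptions, not settings taken from the configuration in this post; when they are left unset, Prometheus uses its own defaults.

```nix
{
  services.prometheus = {
    # How often Prometheus asks each target for fresh metrics.
    # "15s" is an illustrative value (assumption).
    globalConfig.scrape_interval = "15s";

    # How long samples stay on disk before they are dropped.
    # "30d" is an illustrative value (assumption).
    retentionTime = "30d";
  };
}
```

A shorter interval gives finer-grained graphs at the cost of more disk usage, so on a small home server it is worth tuning the two values together.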
**Grafana** provides visualization. Raw metrics are useful, but seeing CPU usage as a graph over time, or disk space trending toward zero, makes patterns immediately obvious. Grafana's dashboard system is incredibly flexible - you can create custom views for exactly what you need to see.
**Alertmanager** handles notifications. It's one thing to have metrics; it's another to be *alerted* when something goes wrong. Alertmanager can route alerts based on severity, deduplicate notifications (so you don't get spammed with 100 alerts for the same problem), and integrate with various notification systems.
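As a rough illustration of severity-based routing, the `route` block under `services.prometheus.alertmanager.configuration` could carry a child route like the one below. This is a sketch under assumptions: the `severity` label and the `email-only` receiver are hypothetical and would have to exist in your alert rules and `receivers` list, while `email-and-ntfy` is the receiver name used later in this post.

```nix
{
  route = {
    # Default: anything not matched below only sends an email (hypothetical receiver)
    receiver = "email-only";

    # Alerts sharing an alertname are batched into one notification,
    # which is what keeps 100 firing alerts from becoming 100 messages
    group_by = [ "alertname" ];

    routes = [
      {
        # Alerts labelled severity="critical" also go to the phone via ntfy
        matchers = [ ''severity="critical"'' ];
        receiver = "email-and-ntfy";
        repeat_interval = "1h"; # illustrative value (assumption)
      }
    ];
  };
}
```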
**Node Exporter** exposes system metrics (CPU, memory, disk, network) in a format Prometheus can scrape. It's the bridge between your system and your monitoring stack - it translates "what's happening on this machine" into metrics Prometheus can understand.
**ntfy** delivers push notifications to my phone. When Alertmanager detects a problem, ntfy ensures I know about it immediately, even when I'm away from my desk. No need to constantly check dashboards - the server will tell me when something needs attention.
This stack is well-supported in NixOS, widely used in production environments, and gives me everything I need for both current services and future expansion.
## The NixOS Configuration
One of the beautiful things about NixOS is how cleanly monitoring integrates into the declarative configuration. Here's what I added to my `modules/services.nix`:
```nix
# Prometheus - metrics collection and storage
services.prometheus = {
  enable = true;
  port = 9090;

  # Node Exporter - exposes system metrics
  exporters = {
    node = {
      enable = true;
      enabledCollectors = [ "systemd" ];
      port = 9100;
    };
  };

  # Configure what to scrape
  scrapeConfigs = [
    {
      job_name = "node";
      static_configs = [{
        targets = [ "127.0.0.1:9100" ];
      }];
    }
  ];
};

# Grafana - visualization and dashboards
services.grafana = {
  enable = true;
  settings = {
    server = {
      http_addr = "127.0.0.1";
      http_port = 3000;
      domain = "grafana.local";
    };
    security = {
      admin_user = "admin";
      admin_password = (import /etc/nixos/private/secrets.nix).grafanaPassword;
    };
  };

  # Automatically configure Prometheus as a data source
  provision = {
    enable = true;
    datasources.settings.datasources = [{
      name = "Prometheus";
      type = "prometheus";
      url = "http://127.0.0.1:9090";
      isDefault = true;
    }];
  };
};

# Alertmanager - notification routing
services.prometheus.alertmanager = {
  enable = true;
  port = 9093;

  # Load SMTP credentials from environment file
  environmentFile = "/etc/nixos/private/alertmanager.env";

  configuration = {
    route = {
      receiver = "email-and-ntfy";
      group_by = [ "alertname" ];
      group_wait = "30s";
      group_interval = "5m";
      repeat_interval = "4h";
    };
    receivers = [
      {
        name = "email-and-ntfy";

        # Email notifications via SMTP
        email_configs = [{
          to = "$EMAIL_TO";
          from = "$SMTP_USERNAME";
          smarthost = "smtp.fastmail.com:587";
          auth_username = "$SMTP_USERNAME";
          auth_password = "$SMTP_PASSWORD";
          headers = {
            Subject = "Alert: {{ .GroupLabels.alertname }}";
          };
        }];

        # Push notifications via ntfy
        webhook_configs = [{
          url = "http://127.0.0.1:8080";
          send_resolved = true;
        }];
      }
    ];
  };
};

# ntfy - push notification delivery
services.ntfy-sh = {
  enable = true;
  settings = {
    base-url = "http://127.0.0.1:8080";
    listen-http = ":8080";
    behind-proxy = true;
  };
};

# Nginx reverse proxy for Grafana
services.nginx.virtualHosts."grafana.local" = {
  locations."/" = {
    proxyPass = "http://127.0.0.1:3000";
    proxyWebsockets = true;
  };
};
```
That's it. **Create the required secret files** (`/etc/nixos/private/secrets.nix` and `/etc/nixos/private/alertmanager.env`), then run `sudo nixos-rebuild switch`, and the entire monitoring stack comes up. No manual service configuration, no hunting for config files scattered across the filesystem, no wondering if you forgot to enable something at boot. It's all declared right here.
**Note:** The repository includes template files for these secrets:
- `private/secrets.nix.template` - Copy and fill in your Grafana admin password
- `private/alertmanager.env.template` - Copy and fill in your SMTP credentials (optional for email alerts)
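The two files fail in different ways. Because `secrets.nix` is pulled in with `import`, a missing file is an evaluation error, so `nixos-rebuild` stops before changing anything. A missing `alertmanager.env` only affects Alertmanager: that one service won't start, while the rest of the monitoring stack (Prometheus, node-exporter, Grafana, ntfy) will work fine.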
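For orientation, a filled-in `secrets.nix` could look like the sketch below. The only thing the configuration above reads from it is the `grafanaPassword` attribute; the value shown is a placeholder, and the real template in the repository may contain more than this.

```nix
# /etc/nixos/private/secrets.nix
# Sketch of a filled-in copy of secrets.nix.template (contents assumed).
{
  # Read by services.grafana.settings.security.admin_password
  grafanaPassword = "replace-with-a-long-random-password";
}
```

The `alertmanager.env` file is a plain list of `KEY=value` lines that defines the three variables referenced in the Alertmanager block: `EMAIL_TO`, `SMTP_USERNAME`, and `SMTP_PASSWORD`. One caveat with the `import` approach is that the password ends up in the generated Grafana configuration inside the world-readable Nix store. If that matters for your setup, Grafana's `$__file{...}` syntax for `admin_password` is one way to read the value from a file at runtime instead.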
## Testing the Installation
After rebuilding the system, I verified all services were running:
```bash
for service in prometheus grafana alertmanager prometheus-node-exporter ntfy-sh; do
  echo "=== $service ==="
  systemctl is-active "$service"
done
```
**Expected output:**
- prometheus: **active** ✅
- grafana: **active** ✅
- alertmanager: **failed** ⚠️ (expected until SMTP credentials are configured)
- prometheus-node-exporter: **active** ✅
- ntfy-sh: **active** ✅
**Note:** Alertmanager will fail to start until you provide real SMTP credentials in `/etc/nixos/private/alertmanager.env`. This is intentional - the dummy credentials in the template won't work. All other monitoring services function independently and don't require email configuration. The ntfy server itself keeps running as well, but because alerts reach it through Alertmanager's webhook, push notifications only start arriving once Alertmanager is up.
## What I Can Monitor Now
With this setup running, I have comprehensive visibility into:
**System Resources:**
- CPU usage, per core and overall
- Memory consumption and swap usage
- Disk space across all partitions
- Network bandwidth and packet rates
- System load averages over time
**Service Health:**
- Uptime tracking for all systemd services
- Restart counts and failure rates
- Service-specific metrics (where available)
- Process resource consumption
**System Events:**
- Service starts, stops, and restarts
- Configuration changes (via NixOS generations)
- System reboots and uptime tracking
All of this is visualized in Grafana dashboards. I can see at a glance if something is trending in the wrong direction - disk space filling up, memory usage climbing, CPU consistently pegged. And with alerts configured, I don't even need to check the dashboards regularly - the server will tell me when something needs attention.
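One caveat on the service health metrics: as far as I know, the node exporter's systemd collector reports unit state out of the box, but restart counts and start times sit behind extra flags that are off by default. A minimal sketch of enabling them, assuming the flag names of current node_exporter releases:

```nix
{
  services.prometheus.exporters.node.extraFlags = [
    # Adds node_systemd_service_restart_total (restart count per service)
    "--collector.systemd.enable-restarts-metrics"

    # Adds node_systemd_unit_start_time_seconds (when each unit last started)
    "--collector.systemd.enable-start-time-metrics"
  ];
}
```

This sits alongside the existing `enabledCollectors = [ "systemd" ];` line rather than replacing it. If a flag is rejected, the exporter refuses to start, which shows up in `systemctl status prometheus-node-exporter`.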
## Setting Up Alerts
The real power of this monitoring stack comes from proactive alerting. Here are some of the alerts I configured:
**Disk Space Alert:** Triggers when any partition exceeds 80% usage. This gives me plenty of time to clean up or expand storage before running out of space completely. No more surprise "disk full" errors.
**Memory Pressure Alert:** Notifies me when available memory drops below 10%. This could indicate memory leaks, runaway processes, or just that I need to add more RAM. Either way, I know about it before the system starts swapping heavily and performance tanks.
**Service Down Alert:** Fires immediately if any critical service (AdGuard Home, Syncthing, Nginx) stops running. If my ad blocking goes down, I want to know right away, not when someone in the house complains that ads are back.
**High CPU Alert:** Triggers if CPU usage stays above 80% for more than 5 minutes. Brief spikes are normal, but sustained high CPU usage indicates something is wrong - a runaway process, a misconfiguration, or resource contention that needs investigation.
These alerts give me peace of mind. I know I'll be notified if something goes wrong, even if I'm not actively watching the dashboards. The server is now self-aware enough to call for help when it needs it.
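The four alerts above could be expressed roughly as follows. This is a sketch under assumptions rather than the exact rules behind this post: the alert names, the group name, the `severity` labels, the filesystem filter, and the systemd unit names are all illustrative. It also wires Prometheus to the local Alertmanager through `alertmanagers`, which is needed for firing alerts to reach it.

```nix
{
  services.prometheus = {
    # Send firing alerts to the local Alertmanager (port 9093, as configured earlier)
    alertmanagers = [{
      static_configs = [{ targets = [ "127.0.0.1:9093" ]; }];
    }];

    # JSON is also valid YAML, so builtins.toJSON spares hand-indenting a rules file
    rules = [
      (builtins.toJSON {
        groups = [{
          name = "home-server"; # hypothetical group name
          rules = [
            {
              alert = "DiskSpaceHigh";
              # Used space above 80% on any non-memory filesystem
              expr = ''(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}) * 100 > 80'';
              labels.severity = "warning";
              annotations.summary = "Disk usage above 80% on {{ $labels.mountpoint }}";
            }
            {
              alert = "MemoryPressure";
              # Less than 10% of total memory still available
              expr = "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10";
              labels.severity = "warning";
              annotations.summary = "Available memory below 10%";
            }
            {
              alert = "ServiceDown";
              # Unit names are assumptions; check `systemctl list-units` for yours
              expr = ''node_systemd_unit_state{state="active", name=~"adguardhome.service|syncthing.service|nginx.service"} == 0'';
              labels.severity = "critical";
              annotations.summary = "{{ $labels.name }} is not active";
            }
            {
              alert = "HighCpuUsage";
              # Non-idle CPU above 80%, sustained for 5 minutes
              expr = ''100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'';
              for = "5m";
              labels.severity = "warning";
              annotations.summary = "CPU usage above 80% for 5 minutes";
            }
          ];
        }];
      })
    ];
  };
}
```

The disk and memory rules have no `for` clause, so they fire on the first evaluation that crosses the threshold. Adding a short `for` to each is a reasonable tweak if they turn out to flap.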
## Lessons Learned
**Start with monitoring early:** It's much easier to add monitoring before you have a complex service stack. Trying to retrofit monitoring into an existing system is harder and more error-prone. You're trying to understand what "normal" looks like while also setting up the monitoring infrastructure.
**NixOS makes it declarative:** The entire monitoring stack is defined in configuration files. If I need to rebuild this server from scratch, the monitoring comes back automatically with everything else. No manual setup, no forgotten steps, no "oh right, I need to configure that thing again."
**Industry-standard tools work great:** Prometheus and Grafana are well-supported in NixOS, with excellent module options and extensive documentation. I didn't have to fight with the system to get these working - they just integrated cleanly.
**Plan for expansion:** Setting up monitoring infrastructure now makes adding services like Nextcloud easier later. I can immediately see the impact of new services on system resources. No guessing about whether the server can handle the additional load - I'll have the data.
**Alerting is crucial:** Having metrics is good; being notified when something goes wrong is essential. The combination of email and push notifications ensures I never miss critical alerts. I can be away from my desk, away from home even, and still know if my server needs attention.
## What's Next?
Now that I have monitoring in place, I'm ready to expand:
**Nextcloud:** Self-hosted file sync and collaboration platform. With monitoring, I'll be able to track database performance, storage usage, and user activity from day one. I'll know immediately if the database is struggling or if storage is filling up faster than expected.
**Cybersecurity Lab:** Isolated testing environment for security tools and experiments. Monitoring will help me verify resource isolation and track the performance impact of various tools. I can experiment without worrying about accidentally taking down the whole server.
**Additional Services:** Whatever else I need, with full visibility into resource usage and service health. The monitoring foundation is flexible enough to grow with the server.
The monitoring foundation is laid. Time to build on it.
- Main Server (nixos): Codeberg
- Second Server (nixos2): Codeberg
- The ISO is available here.
**Questions? Comments? Find me on Mastodon: [@ppb1701@ppb.social](https://ppb.social/@ppb1701)**