DNS Redundancy, an ISO Test, and the Inevitable XRDP Detour — Part 16 of Building a Resilient Home Server Series

Where We Left Off

Part 15 was about moving the configs off GitHub to Codeberg. The infrastructure was in a decent place. Two servers running, monitoring up, backups humming. Things were... fine.

Then nixos decided to go silent.

Power light on. Sitting there looking perfectly fine. No SSH. No DNS. No VNC. Just... nothing. Completely unresponsive on the network while apparently alive and well physically.

The Incident

Couldn't reach it on the network at all. Something in the ethernet path — the port on nixos itself, the cable, or the router — was being flaky. With the machine having just gone dark unexpectedly, I wasn't in the mood to also debug hardware, so I did the pragmatic thing: swapped the router's DNS over to straight Control D while I sorted it out.

Got nixos back up. Moved it to WiFi. Everything came back online. Rolled DNS back.

(Yes, I know. DNS on WiFi. I know. There's a cable for the job sitting upstairs and sciatica is a cruel mistress. It'll get quietly rotated back to ethernet at some point. In the meantime it works, and by the end of this post it'll be a lot more resilient regardless of which interface it's on. Backlog note added: grab another cable and properly test the ethernet port when the back cooperates.)

But the whole thing highlighted something I'd been putting off. When nixos goes down, DNS goes with it.

The DNS Problem

nixos2 was running AdGuard Home but only doing DNS rewrites — the .home domain mappings. The router had Control D as a fallback. In theory, fine. Spoiler: it was not fine. Router DNS failover is sluggish — clients hang waiting for the primary, eventually give up, and by then something has already broken. The whole incident proved it.
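To put a number on "sluggish": with glibc's resolver defaults (per resolv.conf(5), a 5-second timeout and 2 attempts across the nameserver list), a client can stall for a long time before the secondary ever helps. A back-of-the-envelope sketch:

```shell
# Worst-case client stall with glibc resolver defaults (resolv.conf(5)):
# timeout:5 seconds per query, attempts:2 passes over the nameserver list.
timeout=5
attempts=2
nameservers=2   # dead primary + the router's Control D fallback

worst=$(( timeout * attempts * nameservers ))
echo "worst-case stall with a dead primary: ${worst}s"
```

Real clients vary (systemd-resolved, browsers with their own caches and timeouts), but the shape of the problem is the same: the fallback only kicks in after the primary has already burned the user's patience.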

The right fix is a floating IP — one address that lives on whichever machine is up. Clients point at that. Either machine dies, nothing notices.

Time to actually do this properly.

Making Both AdGuards Match

First step: parity. nixos2 needed the full AdGuard setup — same blocklists, same rules, same rewrites — not just the domain mappings.

The filters had a mixed history. A subset had originally been declared in the nix config, but somewhere along the way an edit dropped them, and AdGuard — running with mutableSettings = true — just kept holding the state internally. New lists got added through the UI over time as I worked back toward my pre-generator-incident setup. The result was a working AdGuard instance with no nix declaration behind it. Pulled everything from the YAML, reconciled it, and declared it properly. Since the list is long and now identical on both machines, it lives in its own file:

```nix
services.adguardhome = {
  enable = true;
  mutableSettings = true;
  host = "0.0.0.0";
  port = 3000;

  settings = import /etc/nixos/modules/adguard-filters.nix;
  # (yes, most modules are imported via configuration.nix, but this felt
  # more like the syncthing devices config — configuration data rather than
  # a service definition, so importing it inline in services.nix felt right)
};
```

NixOS-specific note: if you're following along on any other platform, your filters and rewrites live in AdGuardHome.yaml or can be added through the UI. NixOS users get to declare it alongside everything else and have it reproduced automatically on rebuild — pull from git, rebuild, the server just knows what it should look like. For everyone else, the concepts are the same, the mechanism is just different.
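For a sense of the shape, here's a hedged sketch of what a filters file like modules/adguard-filters.nix might contain. The attribute names mirror AdGuardHome.yaml's schema (a filters list with enabled, name, url, id per entry); the specific list below is illustrative, not the author's actual file:

```nix
# Sketch only: structure follows AdGuardHome.yaml, the entry is an example.
{
  filters = [
    {
      enabled = true;
      name = "Example blocklist";
      url = "https://example.org/blocklist.txt";
      id = 1;
    }
  ];
}
```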

One gotcha that bit me on nixos2: it's running libvirt for VMs, which means dnsmasq binds to port 53 for the virtual network bridge. AdGuard can't start because the port is already taken. nixos doesn't have this problem since libvirt isn't enabled there, but on nixos2 the fix is restricting AdGuard to specific interfaces instead of 0.0.0.0:

```nix
dns.bind_hosts = [
  "192.168.50.53"
  "127.0.0.1"
  "100.110.182.6" # Tailscale IP
];
```

This fix had actually been in the original config as a comment from a previous encounter with the same problem. It just didn't survive the refactor when the AdGuard settings got split out into their own file — the new file started fresh with 0.0.0.0 and the comment stayed behind in the old one. Easy thing to miss when moving chunks of config around.

Rewrites on Both Machines

Both machines carry the full rewrite list. The domains point to whichever server hosts the service — that doesn't change based on which DNS machine answers the query. cloud.home always resolves to nixos. git.home always resolves to nixos2. Doesn't matter which AdGuard you asked.
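As a sketch (attribute shape per AdGuard Home's rewrites schema, which takes domain/answer pairs; the answers here are the machines' static IPs, not real entries pulled from the repo):

```nix
# Same list on both machines; each answer points at the host that
# actually runs the service, never at the DNS server answering.
filtering.rewrites = [
  { domain = "cloud.home"; answer = "192.168.50.154"; } # nixos
  { domain = "git.home";   answer = "192.168.50.53"; }  # nixos2
];
```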

keepalived: The Floating IP

keepalived uses VRRP to maintain a virtual IP that floats between machines. The primary handles all DNS and announces itself constantly with VRRP advertisements; if those advertisements go silent for long enough, the backup grabs the IP and starts answering. Clients talking to 192.168.50.10 never need to know which physical machine is behind it.

Normally VRRP uses multicast. Mesh WiFi systems are hit or miss with multicast — the ASUS ZenWiFi being no exception. Unicast VRRP bypasses that entirely: the machines talk directly to each other's real IPs instead of broadcasting.

nixos — PRIMARY (yes, keepalived still uses the term MASTER in its config syntax — the terminology needs updating but unless you want to fork a C project over a config keyword, you're stuck with it)

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.DNS = {
    interface = "wlo1";
    state = "MASTER";
    virtualRouterId = 51;
    priority = 100;
    unicastSrcIp = "192.168.50.154";
    unicastPeers = [ "192.168.50.53" ];
    virtualIps = [{ addr = "192.168.50.10/24"; }];
  };
};

networking.firewall.extraCommands = ''
  iptables -A INPUT -p 112 -j ACCEPT
  iptables -A INPUT -s 192.168.50.53 -j ACCEPT
  iptables -A INPUT -s 192.168.50.0/24 -p tcp --dport 3389 -j ACCEPT
  iptables -A INPUT -i tailscale0 -p tcp --dport 3389 -j ACCEPT
  iptables -A INPUT -p tcp --dport 3389 -j DROP
'';
```

nixos2 — BACKUP:

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.DNS = {
    interface = "wlp1s0";
    state = "BACKUP";
    virtualRouterId = 51;
    priority = 50;
    unicastSrcIp = "192.168.50.53";
    unicastPeers = [ "192.168.50.154" ];
    virtualIps = [{ addr = "192.168.50.10/24"; }];
  };
};

networking.firewall.extraCommands = ''
  iptables -A INPUT -p 112 -j ACCEPT
  iptables -A INPUT -s 192.168.50.154 -j ACCEPT
  iptables -A INPUT -s 192.168.50.0/24 -p tcp --dport 3389 -j ACCEPT
  iptables -A INPUT -i tailscale0 -p tcp --dport 3389 -j ACCEPT
  iptables -A INPUT -p tcp --dport 3389 -j DROP
'';
```

Note: the NixOS keepalived module uses addr not ip for virtual IPs. The error message will tell you this if you get it wrong, but now you know before you get there.

The VIP doesn't need a DHCP reservation — keepalived adds it directly to the interface, bypassing DHCP entirely. You might think "I'll just reserve it in the router" — and you'd be right to want to, but most routers including ASUS require a MAC address to make a reservation, and a floating IP doesn't have one. It just appears on the network courtesy of keepalived. What you should do is make sure your DHCP pool range excludes that IP so nothing else accidentally gets assigned it.

After rebuild, verify:

```bash
ip addr show wlo1  # Should show 192.168.50.10 on nixos
```

Router DNS pointed at 192.168.50.10. Done. Either machine can die and DNS keeps working.

For the non-NixOS homelab crowd: keepalived works identically on Ubuntu, Debian, or any Linux — config goes in /etc/keepalived/keepalived.conf directly. The unicast VRRP trick for mesh WiFi applies regardless of OS. You don't need two NixOS machines for this — two Raspberry Pis, an old laptop and a Pi, two Docker hosts running AdGuard — the pattern holds as long as both can run AdGuard and keepalived. What you don't get is the git clone and rebuild recovery, but the architecture is the same.
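For reference, here's the same primary node expressed as a plain keepalived.conf — a sketch translated from the NixOS config above, not tested on Debian/Ubuntu; the interface name and IPs are this network's, so adjust to yours:

```conf
vrrp_instance DNS {
    state MASTER
    interface wlo1
    virtual_router_id 51
    priority 100
    unicast_src_ip 192.168.50.154
    unicast_peer {
        192.168.50.53
    }
    virtual_ipaddress {
        192.168.50.10/24
    }
}
```

The backup node mirrors it with state BACKUP, a lower priority, and the source/peer IPs swapped.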

Warm Standby for Everything Else

With DNS sorted, thinking about redundancy naturally spreads to everything else. Vaultwarden in particular — DNS going down is annoying, password manager going down is a different category of problem.

nixos2 already has Vaultwarden in the config, just disabled. The data directory is already synced by Syncthing. The missing piece is the database. SQLite doesn't want two writers, so live replication is out. Instead: warm restore. nixos backs up hourly, nixos2 restores from that backup automatically, also hourly but offset by 30 minutes to give Syncthing time to propagate the backup repo first.

```nix
systemd.services.restic-restore-vaultwarden = lib.mkIf enableRestores.vaultwarden {
  description = "Restore Vaultwarden from restic backup";
  after = [ "network.target" ];
  serviceConfig = {
    Type = "oneshot";
    User = "root";
    ExecStart = ''
      ${pkgs.restic}/bin/restic \
        -r /var/local/backups/restic \
        --password-file /etc/nixos/private/restic-password \
        restore latest \
        --target / \
        --include /var/local/vaultwarden
    '';
  };
};

systemd.timers.restic-restore-vaultwarden = lib.mkIf enableRestores.vaultwarden {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    # Fires at half past every hour: a fixed 30-minute offset from the
    # hourly backup on nixos. (A RandomizedDelaySec on top of "hourly"
    # wouldn't guarantee the offset; it can fire right on the hour.)
    OnCalendar = "*-*-* *:30:00";
    Persistent = true;
  };
};
```

Same pattern for Gitea in the other direction — nixos2 is primary, nixos keeps a warm restore. Nextcloud and Linkwarden have the restore jobs defined but disabled — when needed, the SQL dump will be sitting there ready, and the data directory is already synced.

Emergency Vaultwarden failover is now: enable the service on nixos2, rebuild, and it's accessible immediately on the LAN via the web portal. Update the Tailscale DNS entry if you need external access too. Either way, under 5 minutes.

Okay, Now Let's Test the ISO

With all of that sorted, it was time to actually test the ISO builder. Quick recap of where that lives: the original VM tests were done on my dev machine, which subsequently got taken out by broken Razer firmware. Thanks Razer. The VM work got moved to nixos2, which has the overhead and storage to handle it. During those initial rounds a monitor was attached, left over from earlier VM testing sessions.

This was the first time attempting to boot and interact with the ISO entirely headless — no monitor, no fallback, just SSH and whatever remote access we could get working.

The ISO booted. Services came up. Then the reality of "now what" set in.

Attempt 1: Dummy Virtual Display

First thought: create a virtual display and map noVNC to it. Got it running. Desktop appeared. Sort of. Icons were mostly just labels. The applications button only showed up on hover. The whole desktop was in this half-rendered limbo state where things technically existed but weren't really there.

Not usable.

Attempt 2: RDP to LXQt

```nix
services.xrdp = {
  enable = true;
  defaultWindowManager = "dbus-launch --exit-with-session startlxqt";
  openFirewall = true;
};
```

Connected. Authentication succeeded. Immediately booted back out. No session, no error, just rejected.

Not usable either.

Attempt 3: TigerVNC

The next suggestion was TigerVNC running its own isolated display, with noVNC mapped to access it — bypassing the broken dummy display entirely.

This one actually got further. A desktop appeared. LXQt was technically running — dbus was up, you could SSH in and launch applications from the terminal and they'd appear on screen, or grab whatever happened to be sitting there as a desktop shortcut. Which sounds promising until you realize the LXQt panel was completely missing. No taskbar. No application menu. Just a desktop with wallpaper and whatever you could manually coax into existence.

That's not a remote desktop. That's a geek parlor trick. Or an oh-shit workaround when you're truly desperate. Not the goal.

At this point the correct response was to step away for five minutes.

The Regrouping

Came back, made a decision: revert. Either wait for the headless HDMI plug that was on order, or plug in a monitor and do it the old-fashioned way. But with a bit of calmer thinking, there was one more thing worth trying first.

The LXQt issues had been consistent across every single attempt. Something about LXQt's compositor and virtual displays just didn't want to cooperate. What if the answer was simply: don't use LXQt for the RDP session.

Attempt 4: XFCE — Actually Works

Both desktop environments can coexist. LXQt stays for physical sessions. XFCE handles RDP.

```nix
services.xserver.desktopManager.xfce.enable = true;
services.xserver.desktopManager.lxqt.enable = true;

services.xrdp = {
  enable = true;
  openFirewall = false;
  defaultWindowManager = let
    startScript = pkgs.writeShellScript "start-xfce-xrdp" ''
      . /etc/set-environment
      exec dbus-launch --exit-with-session xfce4-session
    '';
  in "${startScript}";
};
```

A couple of things worth unpacking:

The startScript sources /etc/set-environment explicitly. NixOS puts environment variables there but XRDP's startwm.sh doesn't source it — so anything you set in your NixOS config never reaches the RDP session. Sourcing it manually fixes that.

openFirewall = false with explicit rules in networking.nix keeps RDP locked to LAN and Tailscale only — already shown in the keepalived config snippet above.

One last gotcha: after switching from LXQt to XFCE, XRDP reconnected to the cached old session. Kill it:

```bash
sudo systemctl restart xrdp xrdp-sesman
pkill -u ppb1701 Xvnc || true
pkill -u ppb1701 startlxqt || true
```

Also delete ~/.xsession if it exists — it overrides startwm.sh and sends you right back to the broken session.
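That check is simple enough to script — the path is the standard per-user one, nothing NixOS-specific:

```shell
# ~/.xsession, if present, shadows XRDP's startwm.sh and resurrects
# the old broken session; remove it so the new default takes effect.
if [ -e "$HOME/.xsession" ]; then
  rm -v "$HOME/.xsession"
fi
```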

Full desktop. Taskbar. Application menu. Everything rendering. ISO tested, XRDP working, configs updated, committed.

Lessons Learned

Router DNS failover is too slow to be useful. A proper floating IP with keepalived is the right answer. The "secondary DNS" field on most routers is a last resort, not actual redundancy.

Mesh WiFi and VRRP multicast don't mix. Use unicast VRRP. Saved the whole setup from not working at all.

addr not ip. NixOS keepalived module. You'll find out the hard way otherwise.

Config refactors eat comments. When splitting configs into separate files, double check that fixes buried in comments made the journey too.

LXQt and XRDP have a rocky relationship. Across every approach — dummy display, direct RDP, TigerVNC — LXQt's compositor caused problems. XFCE just works. Save yourself the detour.

Always source /etc/set-environment in your XRDP start script. NixOS puts environment variables there and XRDP ignores it by default.

Warm standby beats cold recovery. A database already restored on the standby machine means failover is flipping a switch, not restoring under pressure while stressed.

Sometimes you need five minutes. The solution was there. Just needed to stop fighting the same broken thing and try something different.


Back to Series

The full configs are on Codeberg: nixos and nixos2. The ISO is available here. If you're following along on Mastodon: @ppb1701@ppb.social