Building a highly available DNS using NixOS and CoreDNS

#CoreDNS #Nix #Linux #BGP

Why?

Why not? When George Mallory was asked why he wanted to climb Mount Everest, he said "because it's there". DNS is there. Ever looming. Ever the cause of issues. I'd like to finally be at least 50% certain it's not DNS, probably.

You've got my attention, what do I need?

  • At least two machines, preferably in separate failure zones
  • A BGP-capable router
  • Knowledge of NixOS
  • Preferably a separate subnet/VLAN with at least 4 free IPs
  • A deep-seated appreciation for NixOS
  • DNS issues

The stack and architecture

For this project I (obviously) went with NixOS to manage multiple nodes simultaneously and ensure consistent state between them. The nodes run CoreDNS because it's simple, lightweight, and rock solid. For BGP announcements FRR is a solid choice.

graph TD;
    accTitle: Vertical Anycast DNS Topology
    accDescr: A vertically aligned network diagram showing the UniFi router peering with two DNS nodes which announce shared Anycast IP addresses.

    Router[UniFi Router<br/>10.2.0.254 / fd10:2::fe]

    %% BGP Peering Path
    Router --- BGP_L((BGP))
    Router --- BGP_R((BGP))
    BGP_L --- Node1
    BGP_R --- Node2

    subgraph Node1 [ns-01.svc.cows.homes @ 10.2.0.1 / fd10:2::1]
        direction TB
        FRR1[FRR Instance]
        Core1[CoreDNS: TLS, DoH, DoT, DoQ]
        FRR1 --- Core1
    end

    subgraph Node2 [ns-02.svc.cows.homes @ 10.2.0.2 / fd10:2::2]
        direction TB
        FRR2[FRR Instance]
        Core2[CoreDNS: TLS, DoH, DoT, DoQ]
        FRR2 --- Core2
    end

    %% Announcement Path
    Node1 --- Announce((Announce))
    Node2 --- Announce
    Announce --- VIPs

    subgraph VIPs [Shared Anycast IPs]
        v4[10.2.0.3/32 & 10.2.0.4/32]
        v6[fd10:2::3/128 & fd10:2::4/128]
    end
Figure 1: A diagram showing the logical setup

Things I had to solve

Suboptimal domain partitioning

I own and use the cows.homes domain for my internal network. However, I assigned the apex cows.homes. and *.cows.homes. to services hosted on K8S, while devices on my network live under *.<vlan_name>.cows.homes.
An authoritative DNS server deployed in my K8S cluster manages the apex domain and the subdomains specific to cluster services. My UniFi Gateway is also my DHCP server and knows the device FQDNs.

That leaves the question: how do I forward DNS queries for K8S services to the name servers in the cluster, keep resolution of network devices working, AND avoid loops in DNS resolution?

The compromise

I ultimately settled on forwarding cows.homes. to my K8S cluster's DNS. To keep resolving internal device FQDNs, I decided to periodically scrape my gateway's API for IPs and FQDNs. Sounds simple enough, right?

Excursion 1: The frustrating reality of using UniFi devices :(

I went to the official API documentation and immediately saw GET /v1/sites/{siteId}/clients. Sounds perfect, doesn't it? So I tried it on my local network...

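# Official integration API: list clients and show the first record.
# The call is wrapped in parentheses so it can span multiple lines.
(http get
    -k
    --headers {X-API-Key: $env.API_KEY}
    https://($env.IP)/proxy/network/integration/v1/sites/($env.SITE_ID)/clients
)
    | get data
    | first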

╭────────────────┬──────────────────────────────────────╮
│ type           │ WIRED                                │
│ id             │ b94f34b2-917d-3ddf-9e7f-6701ad2c7357 │
│ name           │ argon dd:04                          │
│ connectedAt    │ 2026-03-28T15:45:58Z                 │
│ ipAddress      │ 10.30.0.153                          │
│ macAddress     │ bc:24:11:e3:dd:04                    │
│ uplinkDeviceId │ 17662d09-0bde-3ad3-9d78-df4053f45a45 │
│                │ ╭──────┬─────────╮                   │
│ access         │ │ type │ DEFAULT │                   │
│                │ ╰──────┴─────────╯                   │
╰────────────────┴──────────────────────────────────────╯

I can't get a clean hostname, so I'd have to rely on the IP to derive the proper FQDN. I also have no access to any IPv6 information. That's... suboptimal. But there's a saving grace: the unofficial legacy API, specifically the GET /proxy/network/api/s/default/stat/alluser endpoint.

let auth = { username: $env.UNIFI_USER, password: $env.UNIFI_PASSWORD }

let login_full = ($auth |
    http post
        --full
        --insecure
        --content-type application/json
        https://($env.IP)/api/auth/login
)

let cookie = ($login_full.headers.response
    | where name == "set-cookie"
    | get value.0
    | split row ";"
    | get 0
)

let csrf = ($login_full.headers.response
    | where name == "x-csrf-token"
    | get value.0
)

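# Legacy API: list all known clients, keep those with a hostname and a fixed IP.
# The call is wrapped in parentheses so it can span multiple lines.
(http get
    -k
    --headers { "Cookie": $cookie, "X-CSRF-Token": $csrf }
    https://($env.IP)/proxy/network/api/s/default/stat/alluser
)
    | get data
    | where hostname? != null
    | where ($it.use_fixedip? == true)
    | select last_ip last_ipv6? hostname
    | first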

╭───────────┬────────────────────────╮
│ last_ip   │ 10.30.0.153            │
│           │ ╭───┬────────────────╮ │
│ last_ipv6 │ │ 0 │ fd::10:30:0:dd │ │
│           │ ╰───┴────────────────╯ │
│ hostname  │ argon                  │
╰───────────┴────────────────────────╯

That's complicated, but I can work with that! With some prioritization of existing FQDNs, VLAN information, and culling of old entries, I can generate a pretty good hosts file using a single, not overly complicated nushell script.

/var/lib/coredns/router.hosts
10.30.0.153                 argon.dmz.cows.homes
fe80::be24:11ff:fee3:dd04   argon.dmz.cows.homes
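
To make the last step concrete, here is a rough sketch of turning client records into hosts lines. It is not the actual script (that one is nushell and does the prioritization and culling mentioned above). The function name render_hosts and the whitespace-separated input format are assumptions made up for this illustration.

```sh
#!/bin/sh
# Sketch only: the input format is an assumption, not what the UniFi API returns.
# Each input line: "<ipv4> <ipv6 or -> <hostname> <vlan_name>"
render_hosts() {
    domain="$1"   # base domain, e.g. cows.homes
    awk -v domain="$domain" '
        NF != 4 { next }                      # skip malformed records
        {
            fqdn = $3 "." $4 "." domain       # <hostname>.<vlan_name>.<domain>
            printf "%-27s %s\n", $1, fqdn     # IPv4 line
            if ($2 != "-")                    # IPv6 line only when an address is known
                printf "%-27s %s\n", $2, fqdn
        }
    '
}

# Usage: one record in, two hosts lines out.
printf '%s\n' '10.30.0.153 fe80::be24:11ff:fee3:dd04 argon dmz' | render_hosts cows.homes
```

Deriving the FQDN from hostname plus VLAN name is what keeps devices inside the *.<vlan_name>.cows.homes. namespace.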

Excursion 2: I have a hosts file already, what if...

I just add blocklists to it? That shouldn't be too hard. And it isn't. I can just use a similar nushell script to scrape some blocklist sources and write them to a hosts file. For this to work properly I'm having the lease scraper write to /var/lib/coredns/router.hosts and the blocklist scraper to /var/lib/coredns/blocklist.hosts.
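
As a hedged sketch of what "scrape and write to a hosts file" can look like: blocklists tend to come either as plain domain lists or in hosts format, so both are normalized to one shape. The function name normalize_blocklist and the 0.0.0.0 sink address are assumptions for this example, and the real script is nushell, not shell.

```sh
#!/bin/sh
# Sketch only: reads a blocklist on stdin, writes deduplicated hosts lines to stdout.
normalize_blocklist() {
    tr -d '\r' | awk '
        { sub(/#.*/, "") }                          # strip full-line and inline comments
        NF == 0 { next }                            # skip lines that are now empty
        { domain = tolower((NF == 1) ? $1 : $2) }   # "domain" or "<ip> domain"
        domain == "localhost" { next }              # never sink localhost
        !seen[domain]++ { print "0.0.0.0", domain } # emit each domain once
    '
}

# Usage: mixed input formats collapse into uniform hosts lines.
printf '%s\n' '# ads' '0.0.0.0 ads.example.com' 'Ads.Example.com' 'tracker.example.net' \
    | normalize_blocklist
```

Lowercasing before deduplication matters because the same domain often appears in several lists with different casing.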

Whenever either of the two scripts runs, it updates its own file and then merges both files into a coredns.hosts file, which CoreDNS then reads via the hosts plugin.
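
The merge step is worth sketching because CoreDNS may re-read the file at any moment. The version below is an illustration, not the real nushell code: merge_hosts is a made-up name, and writing to a temporary file before renaming is my assumption about how to avoid CoreDNS seeing a half-written file.

```sh
#!/bin/sh
# merge_hosts OUT IN...: concatenate the inputs into OUT via temp file + rename.
merge_hosts() {
    out="$1"; shift
    tmp="$(mktemp "${out}.XXXXXX")" || return 1
    for f in "$@"; do
        # A missing input is skipped, so one scraper can run before the other ever has.
        if [ -f "$f" ]; then
            cat "$f" >> "$tmp" || { rm -f "$tmp"; return 1; }
        fi
    done
    chmod 644 "$tmp"    # mktemp creates the file with mode 0600
    mv "$tmp" "$out"    # rename within the same directory swaps the file in one step
}

# Usage (paths as used in this post):
# merge_hosts /var/lib/coredns/coredns.hosts \
#     /var/lib/coredns/router.hosts /var/lib/coredns/blocklist.hosts
```

Because both scripts call the same merge, it does not matter which timer fires first or more often.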

Hot take, I mean Excursion 3: Systemd

One thing you might have noticed is that these scripts don't loop. They just write the file and exit. To solve this, we can set up these scripts as systemd services and use timers to run them every so often.

For instance, the blocklist updater runs once a day, roughly.

jobs.nix Lines 2 - 26
  systemd.services.coredns-blocklist-update = {
    description = "Update CoreDNS blocklist";
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];

    serviceConfig = {
      Type = "oneshot";
      User = "coredns";
      Group = "coredns";
      Environment = "STATIC_HOSTS_FILE=${staticHosts}";
      StateDirectory = "coredns";
      ReadWritePaths = [ "-/var/lib/coredns" ];
      ExecStart = "${pkgs.nushell}/bin/nu ${dnsScripts}/blocklist-update.nu";
    };
  };

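  systemd.timers.coredns-blocklist-update = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      # First run two minutes after boot, then one day after each activation.
      OnBootSec = "2m";
      OnUnitActiveSec = "1d";
      # Spread runs by up to 30 minutes; this is the "roughly" in "once a day".
      RandomizedDelaySec = "30m";
      # Note: systemd only honors Persistent for OnCalendar timers, so it is a no-op here.
      Persistent = true;
    };
  };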

Now back to the nix of it all

With all the information I've gathered so far, I can start creating the dns nodes.

I use my own nix-scaffold flake module to keep clutter out of my Nix configuration, so that my repo contains only the actual implementation. For the DNS I created a new template that is shared by both nodes. Think of templates as very dumb NixOS modules: you assign one per machine, and it brings in everything that a DNS, Workstation, or Server might need. Kind of like a profile, or a "type".

The template for the DNS nodes has three distinct areas:

  • Static files that are necessary, like the scripts and static hosts
  • CoreDNS configuration
  • FRR configuration

Static files

I use a static.hosts file that maps nameserver IPs to hostnames; I also define the hostname for my K8S cluster's VIP there. This file is simply prepended to the coredns.hosts file generated by the scripts from the excursions.

# NS
10.2.0.1          ns-01.svc.cows.homes
fd10:2::1         ns-01.svc.cows.homes
10.2.0.2          ns-02.svc.cows.homes
fd10:2::2         ns-02.svc.cows.homes
10.2.0.3          ns.svc.cows.homes
fd10:2::3         ns.svc.cows.homes
10.2.0.4          ns.svc.cows.homes
fd10:2::4         ns.svc.cows.homes

I attempted to let Windows automatically detect that my DNS is DoH-capable via Discovery of Designated Resolvers (DDR). I'm not sure whether I configured the zone file wrong or Windows is just weird, but it ultimately didn't work.

$ORIGIN resolver.arpa.
@ 3600 IN SOA ns-01.svc.cows.homes. admin.cows.homes. ( 2026042401 7200 3600 1209600 3600 )
@ 3600 IN SOA ns-02.svc.cows.homes. admin.cows.homes. ( 2026042401 7200 3600 1209600 3600 )
@ 3600 IN SOA ns.svc.cows.homes.    admin.cows.homes. ( 2026042401 7200 3600 1209600 3600 )

_dns 3600 IN SVCB 1 ns-01.svc.cows.homes. alpn="h2" port=443 dohpath="/dns-query" ipv4hint=10.2.0.1 ipv6hint=fd10:2::1
_dns 3600 IN SVCB 1 ns-02.svc.cows.homes. alpn="h2" port=443 dohpath="/dns-query" ipv4hint=10.2.0.2 ipv6hint=fd10:2::2
_dns 3600 IN SVCB 1 ns.svc.cows.homes.    alpn="h2" port=443 dohpath="/dns-query" ipv4hint=10.2.0.3 ipv6hint=fd10:2::3

_dns 3600 IN SVCB 2 ns-01.svc.cows.homes. alpn="dot" port=853 ipv4hint=10.2.0.1 ipv6hint=fd10:2::1
_dns 3600 IN SVCB 2 ns-02.svc.cows.homes. alpn="dot" port=853 ipv4hint=10.2.0.2 ipv6hint=fd10:2::2
_dns 3600 IN SVCB 2 ns.svc.cows.homes.    alpn="dot" port=853 ipv4hint=10.2.0.3 ipv6hint=fd10:2::3

_dns 3600 IN SVCB 3 ns-01.svc.cows.homes. alpn="doq" port=853 ipv4hint=10.2.0.1 ipv6hint=fd10:2::1
_dns 3600 IN SVCB 3 ns-02.svc.cows.homes. alpn="doq" port=853 ipv4hint=10.2.0.2 ipv6hint=fd10:2::2
_dns 3600 IN SVCB 3 ns.svc.cows.homes.    alpn="doq" port=853 ipv4hint=10.2.0.3 ipv6hint=fd10:2::3

If anyone has an idea why this doesn't work, please contact me and let me know :)

CoreDNS

The CoreDNS configuration is relatively straightforward.

core.nix Lines 27 - 56
  services.coredns = {
    enable = true;
    config = ''
      tls://.:853 https://.:443 quic://.:853 .:53 {
        tls /var/lib/acme/svc.cows.homes/fullchain.pem /var/lib/acme/svc.cows.homes/key.pem

        health 127.0.0.1:8080

        prometheus 0.0.0.0:9153

        file ${ddrFile} resolver.arpa {
          reload 5s
        }

        hosts ${hostsPath} {
          fallthrough
        }

        forward cows.homes ${k8sIPsStr}

        forward . ${upstreamIPsStr} {
          tls_servername cloudflare-dns.com
          health_check 5s
        }

        cache 3600
        errors
      }
    '';
  };

This configuration enables DoT, DoH, DoQ, and plain DNS on port 53 (TCP and UDP), then sets up the health endpoint (important for FRR later) and a Prometheus metrics endpoint, and includes the forwards we went over in "Things I had to solve".
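
To show how this answers the forwarding question from earlier, here is a simplified model of which stage ends up answering a name. It is a sketch under assumptions: route_query is a made-up helper, it ignores the cache and the resolver.arpa zone, and it relies on the default CoreDNS build running the hosts plugin before forward (plugin order is fixed at build time, not by the order in the Corefile).

```sh
#!/bin/sh
# route_query NAME HOSTS_FILE: print which stage would answer NAME in this model.
route_query() {
    name="${1%.}"          # tolerate a trailing dot
    hosts_file="$2"
    # Stage 1: hosts plugin. Any matching name in the merged hosts file wins.
    if awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts_file"; then
        echo hosts
        return
    fi
    # Stage 2 and 3: fallthrough hands the query to the forwards.
    case "$name" in
        cows.homes|*.cows.homes) echo k8s ;;       # forward cows.homes
        *)                       echo upstream ;;  # forward .
    esac
}

# Usage:
# route_query argon.dmz.cows.homes /var/lib/coredns/coredns.hosts
```

Device names are answered locally from the scraped file, and everything else under cows.homes goes to the cluster, which never has to ask these nodes back. That is what keeps the setup loop-free.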

You may notice that I use Cloudflare DoT as my upstream resolver. I'm not sure yet whether that's a good idea; the latency is definitely noticeable. Maybe I'll change it, maybe not. I'm also omitting the ACME configuration here, since it's fairly standard and simple.

FRR

Now the magic sauce: ✨✨ BGP ✨✨. The point of using BGP is twofold:

  • I want to keep the logical IP(s) for the nameserver(s) separate from the actual hosts
  • I want to be able to reboot or even pull the plug on one host without (much of a) disruption

To accomplish this, the nodes peer with my router using very tight BGP timers (3 second keepalive, 9 second hold time):

default.nix Lines 32 - 35
        
        ! Define upstream neighbors (Assumes gateways are the upstream routers)
        neighbor 10.2.0.254 remote-as 64513
        neighbor 10.2.0.254 timers 3 9

I also created a job that continuously monitors the aforementioned CoreDNS health endpoint and withdraws the announced routes within a second or two should CoreDNS have issues on a node.

jobs.nix Lines 12 - 42
      # Initial state assumption
      STATE="up"
      
      while true; do
        if curl -sf --max-time 1 http://127.0.0.1:8080/health > /dev/null; then
          if [ "$STATE" = "down" ]; then
            vtysh -c 'conf t' -c 'router bgp 65053' \
                  -c 'address-family ipv4 unicast' \
                  -c 'network 10.2.0.3/32' \
                  -c 'network 10.2.0.4/32' \
                  -c 'exit-address-family' \
                  -c 'address-family ipv6 unicast' \
                  -c 'network fd10:2::3/128' \
                  -c 'network fd10:2::4/128'
            STATE="up"
          fi
        else
          if [ "$STATE" = "up" ]; then
            vtysh -c 'conf t' -c 'router bgp 65053' \
                  -c 'address-family ipv4 unicast' \
                  -c 'no network 10.2.0.3/32' \
                  -c 'no network 10.2.0.4/32' \
                  -c 'exit-address-family' \
                  -c 'address-family ipv6 unicast' \
                  -c 'no network fd10:2::3/128' \
                  -c 'no network fd10:2::4/128'
            STATE="down"
          fi
        fi
        sleep 1
      done
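
The important property of this loop is that it only touches FRR when the health state changes, not on every tick. The decision can be isolated into a tiny function, sketched below with made-up names (next_action, and the yes/no probe values) so the transitions are easy to check without a running FRR.

```sh
#!/bin/sh
# next_action STATE HEALTHY: print what the monitor should do on this tick.
#   STATE   = "up" | "down"   (what is currently announced)
#   HEALTHY = "yes" | "no"    (result of the current health probe)
next_action() {
    case "$1:$2" in
        down:yes) echo announce ;;   # recovered: re-add the network statements
        up:no)    echo withdraw ;;   # failed: remove them
        *)        echo none ;;       # no change, leave FRR alone
    esac
}

# Replay a sequence of probe results and show the resulting actions.
state=up
for healthy in yes no no yes yes; do
    action="$(next_action "$state" "$healthy")"
    case "$action" in
        announce) state=up ;;
        withdraw) state=down ;;
    esac
    echo "probe=$healthy action=$action state=$state"
done
```

Note the two failure paths: if only CoreDNS is unhealthy, this job withdraws the routes; if the whole node disappears, nothing runs there anymore and the router drops the routes once the 9 second hold time expires.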

It might seem redundant to announce two IPs per address family, but it gives clients a secondary DNS server that is still this setup. Otherwise, whatever OS you're using might end up with a "bad" secondary DNS, which can lead to issues resolving internal domains.

Conclusion time

I have had this setup running for a few weeks now and did some fault tolerance testing. I have restarted nodes freely, killed VMs abruptly, and pulled the plug on the other host, all without even noticing it.

I feel confident that I will continue using this or a similar architecture for my DNS for the foreseeable future, and will probably expand this concept to other core infrastructure services where it might seem useful :)