Cache Consul DNS TTL With Dnsmasq

While testing Consul DNS caching, I noticed that Dnsmasq does not cache Consul responses. If you also use Dnsmasq to forward DNS queries to Consul, you may run into the same issue. This guide walks through the steps required to make Dnsmasq cache Consul DNS responses according to their TTL.

TL;DR

Problem: Dnsmasq won’t cache responses from non-recursive name servers. Without a recursor configured, Consul answers as a non-recursive server.

Solution: Configure a fake recursor in Consul by adding "recursors": [ "127.0.0.2" ] to /etc/consul/consul.json.

Detailed

Enable Consul DNS Caching

Edit /etc/consul/consul.json and add the config below. It sets a TTL of 1 minute for all services and nodes, and 5 minutes for the test service. Replace test with any service you prefer.

{
  "dns_config": {
    "service_ttl": {
      "*": "60s",
      "test": "5m"
    },
    "node_ttl": "60s"
  }
}

Reload Consul with consul reload.

Verify that Consul returns DNS responses with a TTL value. Note the Consul DNS port, 8600.

$ dig +nocmd +noall +answer test.service.consul @127.0.0.1 -p 8600
test.service.consul.	300	IN	A	10.135.205.232
test.service.consul.	300	IN	A	10.135.174.59

Consul is now returning TTLs: the second column shows 300 seconds for the test service.
Now that Consul responds with TTL values, we need to cache them at the client level. That’s where Dnsmasq comes in.

Enable Dnsmasq Caching

We use Dnsmasq to forward client DNS requests to Consul; it also forwards non-Consul requests to other DNS servers. To cache Consul replies for the duration of their TTL, caching must be enabled in Dnsmasq.

Edit the Dnsmasq config file /etc/dnsmasq.conf and add or uncomment the cache-size parameter (150 is enough for this example; change it to the size you need):

cache-size=150

Also, enable logging so we can check which records Dnsmasq cached:

log-facility=/var/log/dnsmasq.log
log-queries

Restart Dnsmasq service:

$ systemctl restart dnsmasq

Follow Dnsmasq logs:

$ tail -f /var/log/dnsmasq.log | grep cached

Verify that Dnsmasq caches requested domains. Note the Dnsmasq DNS port, 53.
The first response should contain the IP address, with the TTL value in the second column.

$ dig +nocmd +noall +answer google.com @127.0.0.1 -p 53
google.com.		118	IN	A	216.58.206.78

The second response should show a lower TTL if the cache is working:

$ dig +nocmd +noall +answer google.com @127.0.0.1 -p 53
google.com.		114	IN	A	216.58.206.78

After the second command, the Dnsmasq log should also show that a cached response was returned for google.com:

Nov 16 20:49:04 dnsmasq[22925]: cached google.com is 216.58.206.78
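The TTL comparison above can also be scripted. The following is a minimal Python sketch, not part of the original setup: the function names parse_ttls, served_from_cache and dig_answer are hypothetical helpers, and the script assumes dig is installed and that its answer lines have the same column layout as the output shown above. A lower TTL on the second query is only a hint, because an upstream caching resolver can also return decremented TTLs; the "cached" line in the Dnsmasq log remains the more reliable check.

```python
import subprocess


def parse_ttls(dig_answer_text: str) -> list[int]:
    """Extract the TTL (second column) from each record in `dig +answer` output."""
    ttls = []
    for line in dig_answer_text.splitlines():
        fields = line.split()
        # Answer lines look like: name TTL class type rdata
        if len(fields) >= 5 and fields[1].isdigit():
            ttls.append(int(fields[1]))
    return ttls


def served_from_cache(first: str, second: str) -> bool:
    """Heuristic: a cached answer comes back with a lower TTL than the first one."""
    first_ttls, second_ttls = parse_ttls(first), parse_ttls(second)
    if not first_ttls or not second_ttls:
        return False
    return max(second_ttls) < max(first_ttls)


def dig_answer(name: str, server: str = "127.0.0.1", port: int = 53) -> str:
    """Run the same dig command used in this guide and return its output."""
    cmd = ["dig", "+nocmd", "+noall", "+answer", name, f"@{server}", "-p", str(port)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

To use it, call dig_answer("google.com") twice with a pause of a second or more in between, then pass both outputs to served_from_cache.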

Problem

Dnsmasq doesn’t cache Consul domain records. No matter how many times I look up the address, the TTL never decreases, and the Dnsmasq log shows no cached responses for the .consul domain.

After some head banging I found the answer. It turns out Dnsmasq won’t cache responses from non-recursive name servers, and Consul answers as a non-recursive server unless a recursor is configured. From the Dnsmasq man page:

Dnsmasq accepts DNS queries and either answers them from a small, local, cache or forwards them to a real, recursive, DNS server.

Here is the relevant caching code from the Dnsmasq source. The HB4_RA check is the one that requires the "recursion available" flag in the reply:

/* Don't put stuff from a truncated packet into the cache.
   Don't cache replies from non-recursive nameservers, since we may get a
   reply containing a CNAME but not its target, even though the target
   does exist. */
if (!(header->hb3 & HB3_TC) &&
    !(header->hb4 & HB4_CD) &&
    (header->hb4 & HB4_RA) &&
    !no_cache_dnssec)
  cache_end_insert();

return 0;
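To make the condition above easier to reason about, here is a small Python sketch that applies the same three flag checks to the 16-bit flags field of a DNS header. It is an illustration, not Dnsmasq code: the function names are hypothetical, the bit positions are taken from the standard DNS header layout (RFC 1035, with the CD bit from the DNSSEC RFCs), and the no_cache_dnssec variable from the C snippet is deliberately left out.

```python
# Bit masks within the 16-bit flags field of a DNS header.
TC = 0x0200  # response was truncated
RA = 0x0080  # recursion available
CD = 0x0010  # checking disabled (DNSSEC)


def dnsmasq_would_cache(flags: int) -> bool:
    """Mirror the quoted condition, ignoring the no_cache_dnssec variable."""
    return not (flags & TC) and not (flags & CD) and bool(flags & RA)


def flags_from_packet(packet: bytes) -> int:
    """Read the flags field: bytes 2-3 of the 12-byte DNS header, big-endian."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    return int.from_bytes(packet[2:4], "big")
```

For example, dnsmasq_would_cache(0x8180) is true for a reply with QR, RD and RA set, while dnsmasq_would_cache(0x8500) is false for an authoritative reply (QR, AA, RD) that lacks RA. Both flag values are illustrative samples, not captured traffic.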

Solution

Configure a dummy recursor in Consul.
Edit /etc/consul/consul.json and add a top-level recursors entry pointing at an address that is dead in your environment (nothing answers DNS there):

"recursors": [ "127.0.0.2" ]
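Since recursors is a top-level key that sits next to dns_config rather than inside it, a short Python sketch may help show the resulting shape of the file. The helper name ensure_recursor is hypothetical, and the sketch works on an in-memory dict rather than editing /etc/consul/consul.json directly. It assumes that a config which already lists real recursors needs no placeholder, so it leaves such a config unchanged.

```python
import json


def ensure_recursor(config: dict, placeholder: str = "127.0.0.2") -> dict:
    """Return a copy of a Consul config dict that has a top-level recursors list.

    An existing non-empty recursors list is kept as-is (assumption: real
    recursors make the placeholder unnecessary).
    """
    updated = dict(config)
    if not updated.get("recursors"):
        updated["recursors"] = [placeholder]
    return updated


# The dns_config from earlier in this guide, as it would look once loaded.
current = {
    "dns_config": {
        "service_ttl": {"*": "60s", "test": "5m"},
        "node_ttl": "60s",
    }
}

# Print the merged config to compare against /etc/consul/consul.json.
print(json.dumps(ensure_recursor(current), indent=2))
```

The printed JSON contains both the dns_config block and the recursors list at the same level, which is the layout the file should end up with.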

After a consul reload, Dnsmasq should cache responses from the .consul domain.

First query:

$ dig +nocmd +noall +answer test.service.consul @127.0.0.1 -p 53
test.service.consul.	300	IN	A	10.135.205.232
test.service.consul.	300	IN	A	10.135.174.59

Second query:

$ dig +nocmd +noall +answer test.service.consul @127.0.0.1 -p 53
test.service.consul.	297	IN	A	10.135.174.59
test.service.consul.	297	IN	A	10.135.205.232

Dnsmasq log:

Nov 16 20:49:04 dnsmasq[22925]: cached google.com is 216.58.206.78
Nov 16 21:53:49 dnsmasq[22925]: cached test.service.consul is 10.135.174.59
Nov 16 21:53:49 dnsmasq[22925]: cached test.service.consul is 10.135.205.232
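As an alternative to watching the log with grep, the cached entries can be grouped per name with a few lines of Python. This is a sketch under the assumption that log lines follow the format shown above ("dnsmasq[pid]: cached NAME is VALUE"); the function name cached_answers is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Nov 16 21:53:49 dnsmasq[22925]: cached test.service.consul is 10.135.174.59
CACHED_LINE = re.compile(r"dnsmasq\[\d+\]: cached (\S+) is (\S+)")


def cached_answers(log_text: str) -> dict[str, list[str]]:
    """Group the values Dnsmasq served from its cache by queried name."""
    hits = defaultdict(list)
    for line in log_text.splitlines():
        match = CACHED_LINE.search(line)
        if match:
            name, value = match.groups()
            hits[name].append(value)
    return dict(hits)
```

Feeding it the contents of /var/log/dnsmasq.log should return an entry for test.service.consul once the fake recursor is in place.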

Ending Notes

It sounds awful to fake something in production. There is an issue open in the hashicorp/consul repository, but it has no responses (as of 2020-11-17).
Another option is to contribute to the Consul source code and fix this properly, or at least to address it in the documentation.
