[How2Tip] Varnish : dynamic backend DNS resolution in a Docker swarm context

Published on

Aug 21, 2019

how2tips

Nicolas is a Symfony & React developer and trainer at KNPLabs, currently working on i24news.tv as a DevOps engineer. He shares how his team solved an issue they had when using varnish with a backend defined as a named host.

To give more context: we were using docker swarm, with one varnish service (scaled to one replica) and a backend service (scaled to two replicas, running nginx).

During deploys, we got HTTP 503 errors from varnish, which was unable to reach the backend. This was due to the way varnish resolves the IP of the backend: it is done once and for all, when the VCL is loaded on startup. If the backend IP changes afterwards, varnish keeps using the old one, which makes the use of a dynamic backend impossible.

Here's what our backend.vcl file looked like so far:

backend default {
  # the host is the name of the docker service of the backend
  .host = "app_nginx";
  .port = "80";
}

Dynamic DNS resolution

We're using varnish 6.2. You may have read that a goto VMOD [1] (or a named VMOD [2]) exists, but both were implemented for varnish 4. Additionally, the named VMOD does not exist anymore, and the goto VMOD is now only available for varnish cache plus, which is a paid version of varnish with extra features [3].

There is still the built-in directors VMOD [4], but this VMOD does not offer a DNS resolution TTL mechanism (i.e. it never performs a new DNS lookup to resolve the backend IP again). So we can't use dynamic backend IPs with this VMOD.

Luckily, varnish maintains a list of community-maintained VMODs [5], in which we can spot the dynamic VMOD [6].

This VMOD reproduces the behavior of the named VMOD for recent varnish versions. It provides the DNS resolution we needed for backends with dynamic IPs. So let's use it :)

Dockerfile

As we're using docker, we declared a Dockerfile for our varnish service. In this Dockerfile, we'll install the dynamic VMOD.

For the base image, we're using the cooptilleuls/varnish:6.2.0-alpine image [7]. This image is built with the bare minimum: the varnish cache server [8], so it's perfect for us.

As for installing the dynamic VMOD, its documentation says it can be installed as an rpm or a deb package. As we're using an alpine base image, neither of these installation methods suits us. Moreover, the links to the package repos seem to be outdated. So we'll have to compile the VMOD ourselves.

Regarding the VMOD releases, the latest one is v0.4. However, this release targets varnish 6.0, and we're using 6.2, so we have to use the 6.2 branch of the VMOD instead. There is no release for that branch yet, so we'll pick the latest commit on this branch: b8731c42f73075a112d4b3475c1da08a5e85fcec.

When looking at the dynamic VMOD's README, we can see that it requires two environment variables for the compilation:

export PKG_CONFIG_PATH=${PREFIX}/lib/pkgconfig
export ACLOCAL_PATH=${PREFIX}/share/aclocal

where ${PREFIX} is the directory that was used as the prefix when configuring the varnish compilation.

But how can we know the value of this $PREFIX variable? There is no indication about it in the base image's Dockerfile. To figure it out, let's spin up a container from the base image and search for one of the directories mentioned in the env vars:

$ docker run --rm -it cooptilleuls/varnish:6.2.0-alpine \
    find / -type d -name pkgconfig
/usr/local/lib/pkgconfig
/usr/lib/pkgconfig

Nice! Now let's find out what's in these directories:

$ docker run --rm -it cooptilleuls/varnish:6.2.0-alpine \
    ls /usr/lib/pkgconfig /usr/local/lib/pkgconfig
/usr/lib/pkgconfig:
libgcj-6.pc
/usr/local/lib/pkgconfig:
varnishapi.pc

Something interesting here :). It looks like our $PREFIX value is /usr/local. Let's confirm it by printing the content of the varnishapi.pc file:

$ docker run --rm -it cooptilleuls/varnish:6.2.0-alpine \
    cat /usr/local/lib/pkgconfig/varnishapi.pc
prefix=/usr/local
exec_prefix=${prefix}
bindir=${exec_prefix}/bin
sbindir=${exec_prefix}/sbin
libdir=${exec_prefix}/lib
sysconfdir=${prefix}/etc
pkgsysconfdir=${sysconfdir}/varnish
includedir=${prefix}/include
pkgincludedir=${includedir}/varnish
datarootdir=${prefix}/share
datadir=${datarootdir}
pkgdatadir=${datadir}/varnish
vcldir=${pkgdatadir}/vcl
vmoddir=${libdir}/varnish/vmods
vmodtool=${pkgdatadir}/vmodtool.py
vsctool=${pkgdatadir}/vsctool.py
Name: VarnishAPI
Description: Varnish API
Version: 6.2.0
Cflags: -I${includedir}/varnish
Libs: -L${libdir} -lvarnishapi

Good shot 8). We can set the required env vars for the VMOD compilation, and tell which prefix to use when running the configure script for the VMOD, prior to the compilation.
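If you want to double-check this derivation without reading the file by eye, here is a minimal Python sketch of the same reasoning. The function names (read_pc_variables, build_env) and the shortened PC_SAMPLE text are ours, made up for illustration; they are not part of varnish or of the VMOD.

```python
import re
from typing import Dict

# Shortened copy of the varnishapi.pc content printed above (illustration only).
PC_SAMPLE = """prefix=/usr/local
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
Name: VarnishAPI
Version: 6.2.0
"""


def read_pc_variables(pc_text: str) -> Dict[str, str]:
    """Collect the `name=value` variable lines of a pkg-config file."""
    variables: Dict[str, str] = {}
    for line in pc_text.splitlines():
        # Keyword lines such as "Name: VarnishAPI" use a colon and are skipped.
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$", line.strip())
        if match:
            variables[match.group(1)] = match.group(2)
    return variables


def build_env(prefix: str) -> Dict[str, str]:
    """Derive the two variables the VMOD README asks for from the prefix."""
    return {
        "PKG_CONFIG_PATH": f"{prefix}/lib/pkgconfig",
        "ACLOCAL_PATH": f"{prefix}/share/aclocal",
    }


env = build_env(read_pc_variables(PC_SAMPLE)["prefix"])
for name, value in env.items():
    print(f"export {name}={value}")
```

With the /usr/local prefix found above, this points PKG_CONFIG_PATH at the directory where we found varnishapi.pc.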

This leads us to the following Dockerfile:

Just build the image and we're ready to go :).

Note that the build dependencies were determined by multiple build attempts.

Backend VCL

Now that we have our dynamic VMOD installed, we can update our backend.vcl file to use the VMOD. There are examples of configuration of this VMOD in its doc [9].

# Import the `dynamic` VMOD
import dynamic;
# Varnish still requires a backend to start
backend default {
  # The `.host` parameter is not very relevant here as we'll use the dynamic
  # director as a DNS resolver for that domain.
  .host = "";
  .port = "80";
}
# @see https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.vcc#L237
sub vcl_init {
  new ddir = dynamic.director(
    port = "80",
    # The DNS resolution is done in the background,
    # see https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.vcc#L48
    ttl = 10s,
  );
}
sub vcl_recv {
  set req.backend_hint = ddir.backend("app_nginx");
}

You can see that we use a 10s TTL for our resolved backend IP address. Once it expires, the address is resolved again by a background thread. When a request reaches varnish, we indicate which backend to use by setting the req.backend_hint property on the request.
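To make the TTL idea more concrete, here is a minimal Python sketch of a resolver that caches a DNS answer and looks it up again once the TTL has expired. It is not how the VMOD is implemented: the VMOD refreshes in a background thread, whereas this sketch refreshes lazily on the next call. The TtlResolver class, its parameters and the sample IPs are hypothetical names chosen for illustration.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_lookup(host: str) -> List[str]:
    """Resolve a host name to its IPv4 addresses with the system resolver."""
    return socket.gethostbyname_ex(host)[2]


class TtlResolver:
    """Cache a DNS answer and look it up again once the TTL has expired."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]] = system_lookup,
        ttl: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        # host -> (time of the last lookup, addresses returned by that lookup)
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # answer is still fresh, no DNS traffic
        addresses = self._lookup(host)  # first call, or TTL expired
        self._cache[host] = (now, addresses)
        return addresses


# Simulated deploy: the DNS answer changes between two lookups.
answers = iter([["10.0.0.2"], ["10.0.0.7"]])
fake_now = [0.0]
resolver = TtlResolver(lambda host: next(answers), ttl=10.0,
                       clock=lambda: fake_now[0])

first = resolver.resolve("app_nginx")
fake_now[0] = 5.0
still_cached = resolver.resolve("app_nginx")
fake_now[0] = 10.0
refreshed = resolver.resolve("app_nginx")
print(first, still_cached, refreshed)
```

A plain backend definition behaves like this resolver with an infinite TTL: the first answer is kept forever, which is exactly the problem we had during deploys.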

And voilà! As simple as that :). We now have a DNS resolution for our dynamic backend.

This solution to the dynamic backend DNS resolution can be used in any context. It's not related to this particular docker swarm context.

However, we want to go into more detail about our use case and how this solution integrates with docker swarm, which leads us to the next section.

Combining with Swarm's virtual IP and load balancing

If you have ever used docker swarm, you may have noticed the endpoint_mode setting that you can apply to a service you deploy [10]. This setting controls how the DNS discovery of the service works. By default, the value is vip, which means that a DNS request for that service always resolves to a virtual IP (VIP), managed by docker, which load balances (LB) between the replicas of the service.

This way, the replicas are never exposed directly, and docker manages the LB between replicas by exposing a single IP: the virtual IP.

In our use case, it's this VIP that varnish gets when it resolves the nginx backend through DNS. So varnish does not talk directly to a backend replica, but to docker's VIP for the backend service. This allows us to keep the load balancing between backend replicas, as docker handles it :). Pretty neat.

Now, regarding the TTL value that we set on our dynamic director in the previous section, we have to be careful. Setting it too low creates extra load on the DNS resolver, which has to resolve the VIP too often. Conversely, setting it too high puts us back in our original position, where varnish could send requests to an expired VIP.

With swarm, you can specify how long to wait between each replica replacement during a deploy. Knowing this, you can easily compute your deploy rollup duration:

rollup_duration = (replicas_count - 1) * wait_time

For example, with two replicas and a 15s wait time, the rollup_duration is 15s: the first replica is deployed at 0s, and the second at 15s.

To be sure varnish still uses a valid backend VIP, we have to set the DNS TTL lower than the rollup duration, so that a new DNS resolution happens during the rollup window. This DNS request will resolve the VIP of the newly deployed service.

That's why our TTL value is set to 10s, as our rollup_duration is 15s.
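The rule of thumb above can be written down as a small Python sketch. The function names are ours, and the formula is the simplified one from this article: replicas are replaced one at a time, and the time each replica needs to start is ignored.

```python
def rollup_duration(replicas_count: int, wait_time: float) -> float:
    """Time between the first and the last replica replacement, in seconds."""
    if replicas_count < 1:
        raise ValueError("a service needs at least one replica")
    return (replicas_count - 1) * wait_time


def ttl_fits_rollup(ttl: float, replicas_count: int, wait_time: float) -> bool:
    """True when the DNS TTL is strictly lower than the rollup duration."""
    return ttl < rollup_duration(replicas_count, wait_time)


# Our setup: two replicas, 15s between each replacement, 10s DNS TTL.
print(rollup_duration(2, 15))      # 15
print(ttl_fits_rollup(10, 2, 15))  # True
print(ttl_fits_rollup(30, 2, 15))  # False
```

Note that with a single replica the rollup duration is 0s, so no TTL can satisfy the rule.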

Final note:

If you're not using the VIP system (e.g. you're not using docker swarm, or you deployed your backend with the dnsrr endpoint_mode), the VMOD can handle the load balancing by itself. You still have to set an appropriate DNS TTL, and the VMOD will load balance the traffic between the IPs that are resolved for your backend's domain.
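To picture what balancing between the resolved IPs means, here is a minimal Python sketch that rotates over a list of addresses. Round-robin is an assumption picked for illustration, not a statement about the VMOD's exact strategy, and the RoundRobinBackends class and the sample IPs are hypothetical.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinBackends:
    """Hand out the resolved backend addresses one after the other."""

    def __init__(self, addresses: Iterable[str]) -> None:
        address_list = list(addresses)
        if not address_list:
            raise ValueError("DNS returned no address for the backend")
        self._cycle = cycle(address_list)

    def pick(self) -> str:
        """Return the address that should serve the next request."""
        return next(self._cycle)


# With dnsrr, a lookup of the service name returns one IP per replica.
backends = RoundRobinBackends(["10.0.0.2", "10.0.0.3"])
picked = [backends.pick() for _ in range(4)]
print(picked)
```

After each new DNS resolution, the rotation would be rebuilt from the fresh list of IPs, so replaced replicas drop out once the TTL has expired.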

Thanks for reading ;)

If you have questions => https://twitter.com/KNPLabs