High Availability

High Availability is one of Traefik Enterprise's core features. Since Traefik Enterprise runs on top of diverse infrastructures and orchestration engines, some of them impose limitations on the SLA Traefik Enterprise can achieve.

Kubernetes

In Kubernetes, health checks are performed through the kubelet. While the pods behind a service are health checked every few seconds by the kubelet, it can take up to 45 seconds for Kubernetes to realize that a kubelet itself is down. Since Kubernetes does not react to a node failure before it knows that the kubelet is down, this delay might have an impact on your SLA.
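For illustration only: the detection window is largely governed by how often the kubelet reports its node status and by the controller manager's node monitor settings. Below is a minimal, hypothetical sketch of a tightened KubeletConfiguration; the value is illustrative, not a recommendation, and the matching kube-controller-manager flags --node-monitor-period and --node-monitor-grace-period would usually be lowered accordingly.

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Default is 10s; a lower value lets the control plane notice a dead node
    # sooner, at the cost of additional API server traffic.
    nodeStatusUpdateFrequency: 4s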

However, there are some possible workarounds, depending on the current setup:

  • Modifying the Kubernetes configuration and/or adding tooling, such as node-problem-detector, to improve Kubernetes' internal health monitoring
  • Customizing the Traefik Enterprise installation by exposing the proxy directly, without using a LoadBalancer service. This allows a direct connection to Traefik Enterprise, whose availability can then be checked faster by an external load balancer, as it bypasses Kubernetes' internal route update mechanism for the Traefik Enterprise nodes (see the sketch after this list).
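As a rough sketch of the second option, assuming the proxy pods carry a hypothetical app: traefikee-proxy label, a NodePort Service exposes them on every node so an external load balancer can health check each node's port directly instead of waiting for the LoadBalancer route update:

    apiVersion: v1
    kind: Service
    metadata:
      name: traefikee-proxy-direct
      namespace: traefikee          # assumption: adjust to your installation's namespace
    spec:
      type: NodePort
      selector:
        app: traefikee-proxy        # assumption: match the labels of your proxy pods
      ports:
        - name: https
          port: 443
          targetPort: 443
          nodePort: 30443           # the external load balancer targets <node-ip>:30443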

Swarm Mode

In Swarm, health checks for the swarm nodes themselves are performed by the manager. Typically, Swarm recognizes that a worker is down within 15 seconds. However, Docker Swarm recognizes a missing replica within about 5 seconds, which is much faster.

In the default Swarm setup, high availability is achieved differently: the Traefik Enterprise data plane is published via the Ingress Routing Mesh, which handles incoming requests. This routing mesh is built on IPTables and IPVS, which provide health check endpoints and TCP connection retries. If a swarm node holding a Traefik Enterprise data plane container dies, all in-flight requests to that node fail as well. Fortunately, the internal IPVS picks up the failing requests and attempts to redeliver them to a different node. This means that while a replica is missing or failing, some requests may be retried internally; as long as at least one data plane node is available, it will answer the retried requests and no requests are dropped from the user's perspective.
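For reference, this default behaviour corresponds to publishing the proxy ports in ingress mode. The compose snippet below is purely illustrative (the service name, image reference, ports, and replica count are assumptions, not the actual Traefik Enterprise deployment files):

    version: "3.7"
    services:
      proxy:
        image: traefik/traefikee:latest   # illustrative; use your Traefik Enterprise proxy image
        ports:
          # ingress mode: the routing mesh accepts connections on every swarm
          # node and load balances them across the proxy replicas.
          - target: 443
            published: 443
            protocol: tcp
            mode: ingress
        deploy:
          replicas: 3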

Alternatively, it is also possible to publish the proxies' ports with a direct host bind. This allows your external load balancer to communicate directly with the nodes and provides better performance. However, you lose benefits such as the port being available on every node.
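A minimal sketch of that alternative, again with illustrative values: publishing in host mode binds the port only on the nodes that actually run a proxy task and bypasses the routing mesh, so the external load balancer must target exactly those nodes.

    version: "3.7"
    services:
      proxy:
        image: traefik/traefikee:latest   # illustrative; use your Traefik Enterprise proxy image
        ports:
          # host mode: the port is bound directly on the node running the
          # proxy task; the routing mesh is bypassed entirely.
          - target: 443
            published: 443
            protocol: tcp
            mode: host
        deploy:
          mode: global                    # one proxy per node keeps the host bind predictable

Running the proxies as a global service is a common companion to host-mode publishing, since two replicas on the same node could not bind the same published port.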