Disaster Recovery

In a multi-controller deployment, a single controller failure can be enough to make a cluster unavailable. A cluster becomes unavailable when the controllers can no longer reach a quorum, which prevents a new cluster leader from being elected. The number of controllers needed to reach a quorum is (N/2)+1, where N is the initial number of controllers and N/2 is rounded down. For example, if there are 3 controllers initially, at least 2 controllers must remain available to form a quorum.
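
The quorum arithmetic above can be checked with a short shell sketch. The function names `quorum` and `tolerated_failures` are hypothetical helpers written for this illustration and are not part of teectl or traefikee; the sketch relies only on the shell's integer division, which rounds N/2 down.

```shell
#!/bin/sh
# quorum N: print the number of controllers needed to reach a quorum,
# that is (N/2)+1 with integer (rounded-down) division.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: print how many of the N initial controllers
# can fail while the remaining ones still form a quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
    echo "controllers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# prints:
# controllers=1 quorum=1 tolerated_failures=0
# controllers=3 quorum=2 tolerated_failures=1
# controllers=5 quorum=3 tolerated_failures=2
```

With 3 controllers the cluster survives one failure; a second failure leaves a single controller, which is below the quorum of 2.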

While a cluster is unavailable, the proxies will continue to operate with their last known configuration.

Important

In a single-controller deployment, if the state of the controller is lost during a controller failure, it must be restored from a backup.

Learn more about Backup and Restore.

Recovering a Cluster

For the cluster to be recoverable, at least one controller must remain with its state intact. If this is not the case, see Recovering from State Loss.

The cluster can be recovered by using teectl:

teectl recover 

Important

If teectl cannot connect due to a loss of credentials, the recover command can be run from traefikee directly.

This connects to the remaining controller and performs the recovery operation. Watch the controller logs to follow the progress of the recovery. The following log message indicates that the recovery completed successfully:

"Node is recovered and ready"

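Instead of reading the logs by eye, the completion message can be matched automatically. The following is a minimal sketch: `recovery_completed` is a hypothetical helper that reads log lines on standard input, and how the controller logs are streamed to it depends on the platform and is not covered here.

```shell
#!/bin/sh
# recovery_completed: read log lines on stdin and succeed as soon as one
# contains the completion message; fail if the stream ends without it.
recovery_completed() {
    grep -qF "Node is recovered and ready"
}

# Demonstration with a captured excerpt. The first line is a placeholder,
# not a real controller log line.
printf '%s\n' '(earlier log output)' 'Node is recovered and ready' \
    | recovery_completed && echo "recovery complete"
```

On Kubernetes, for example, the input could come from `kubectl logs -f` on the remaining controller pod. This assumes a Kubernetes deployment; use the equivalent log-following command on other platforms.
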
Once the recovery is complete, list the nodes and delete the controllers that are reported as Down:

teectl get nodes
ID                         NAME                           STATUS  ROLE
7022qjifgp2srv3tq6rxyjhdm  default-proxy-d55569575-n8mmb  Ready   Proxy
muaxom14euley813tv40t9xxs  default-controller-1           Down    Controller
q01yagt5suwv2tooy8amm248k  default-controller-0           Ready   Controller (Leader)
s4eg90svby3aty5r6o9xkcpeh  default-proxy-d55569575-kzn82  Ready   Proxy
w9aggc8l9lmwni7l5y2l1z4ww  default-controller-2           Down    Controller
teectl delete node --id="muaxom14euley813tv40t9xxs"
teectl delete node --id="w9aggc8l9lmwni7l5y2l1z4ww"
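
When several controllers are down, the IDs can be extracted from the node listing rather than copied by hand. The sketch below assumes the four-column layout shown above (ID, NAME, STATUS, ROLE); `down_controller_ids` is a hypothetical helper, and only the `teectl get nodes` and `teectl delete node --id` commands already shown are used.

```shell
#!/bin/sh
# down_controller_ids: read `teectl get nodes` output on stdin and print the
# ID of every node whose STATUS is Down and whose ROLE is Controller.
# The header line (NR == 1) is skipped.
down_controller_ids() {
    awk 'NR > 1 && $3 == "Down" && $4 == "Controller" { print $1 }'
}

# Demonstration on the listing from this page.
down_controller_ids <<'EOF'
ID                         NAME                           STATUS  ROLE
7022qjifgp2srv3tq6rxyjhdm  default-proxy-d55569575-n8mmb  Ready   Proxy
muaxom14euley813tv40t9xxs  default-controller-1           Down    Controller
q01yagt5suwv2tooy8amm248k  default-controller-0           Ready   Controller (Leader)
s4eg90svby3aty5r6o9xkcpeh  default-proxy-d55569575-kzn82  Ready   Proxy
w9aggc8l9lmwni7l5y2l1z4ww  default-controller-2           Down    Controller
EOF
# prints:
# muaxom14euley813tv40t9xxs
# w9aggc8l9lmwni7l5y2l1z4ww

# Against a live cluster, review the printed IDs first, then delete them:
# teectl get nodes | down_controller_ids | while read -r id; do
#     teectl delete node --id="$id"
# done
```

The deletion loop is left commented out so that the IDs can be checked before anything is removed.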

The deployment is now acting as a single-controller deployment. Additional controllers must now be started to bring the cluster back to high availability.

Congratulations! Your Traefik Enterprise cluster is recovered.

Recovering from State Loss

If all controllers and their state were lost, it is not possible to use the recovery procedure above to recover the cluster. At this point the cluster must be restored from a backup.

For more information on restoring from a backup, refer to Backup and Restore.