What to do when the Quorum is Lost
This only applies to multi control node clusters
Single control node clusters have no quorum to maintain and can always recover from temporary failures (for example, a node restart) as long as the control node state directory remains reachable, so this guide only applies to multi control node clusters.
Hardware failures or other unexpected events can cause most of your cluster's control nodes to fail, temporarily or permanently, at the same time.
As long as no more than (N-1)/2 control nodes (N being the total number of control nodes in your cluster) are affected at the same time, the cluster keeps its quorum and will not experience any service interruption.
However, if more than (N-1)/2 control nodes in your cluster are down at the same time, the quorum will be lost, even if the affected control nodes restart successfully after that point.
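The (N-1)/2 threshold can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a majority-based quorum (more than half of the control nodes must be up); the function names are illustrative and are not part of any product API.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of control nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one control node")
    return total_nodes // 2 + 1


def max_tolerated_failures(total_nodes: int) -> int:
    """Largest number of simultaneous failures that still leaves a majority.

    This equals (N-1)/2 rounded down.
    """
    return total_nodes - quorum_size(total_nodes)


def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """Return True if the surviving control nodes still form a majority."""
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return total_nodes - failed_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(
            f"N={n}: quorum={quorum_size(n)}, "
            f"tolerated failures={max_tolerated_failures(n)}"
        )
```

Under this assumption, a cluster with an even number of control nodes tolerates no more failures than the next smaller odd-sized cluster: both 3 and 4 control nodes survive the loss of only one.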
This page will give you tips on how to limit the potential damage of such an outage as much as possible.
Keep in mind that the data plane is not directly affected by such outages and will keep delivering traffic. However, it will no longer be dynamically updated, which means it might use outdated routing configurations.
Preventing Data Loss
To prevent data loss, back up your cluster as often as possible. Backups can be automated in a number of ways, and we recommend setting up an automated backup schedule so that a recent backup is always available to restore in the event of a failure.
Recovering a Lost Cluster
Once the quorum is lost, the only way to recover your cluster is to use the restore feature to create a new, identical cluster that takes its place.