What is the recommended way to achieve HA with gravity, and if the master node goes down, what can we expect the customer to be able to do? Should we always ask customers for a multi-master setup?
HA with gravity doesn’t necessarily have a universal answer. It depends somewhat on the application running within gravity, and on how easily the application state can be restored. I’ll try to generalize into the two main approaches I’ve seen.
In the first approach, if the application state can be rebuilt or restored easily, we do see gravity deployed on a single node. If there is a problem, you just destroy the cluster, spin up a new one, and restore the application state on top of the new gravity cluster. Keep in mind that if the failure of the single master node is unrecoverable, the gravity cluster will be unrecoverable as well.
The second approach is to build an HA cluster, which requires a minimum of 3 nodes for the control plane. If nothing is specified in the application manifest, gravity will automatically assign the first 3 nodes in the cluster as control plane nodes and the rest as workers.
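To make that default concrete, here is a minimal Python sketch of the rule as described above. It is only an illustration, not gravity’s actual code: the function name `assign_default_roles`, the `max_masters` parameter, and the node names are all hypothetical.

```python
def assign_default_roles(nodes, max_masters=3):
    """Illustrative model only: the first `max_masters` nodes, in join order,
    become masters (control plane) and every later node becomes a worker."""
    return {
        name: ("master" if index < max_masters else "worker")
        for index, name in enumerate(nodes)
    }


# Hypothetical 5 node cluster, listed in the order the nodes joined.
roles = assign_default_roles(["node-1", "node-2", "node-3", "node-4", "node-5"])
for name, role in roles.items():
    print(name, role)
```

With five nodes, the first three come out as masters and the last two as workers. With a single node, that node is the only master.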
Explicit control over which nodes get assigned as masters is exposed through installation flavors and node profiles in the application manifest (https://gravitational.com/gravity/docs/pack/#application-manifest) used to build the gravity application. Commonly the flavors are defined as a single node cluster and an HA cluster, but they can also be defined as small, medium, large, etc. A single node cluster can be used for labs, trials, etc. that don’t require redundancy, and an HA cluster for production deployments.
Within an installation flavor, you then specify how many nodes of each type the cluster has. So depending on the app, a small cluster might require 1 master and 1 DB server, whereas a large cluster needs 3 masters, 6 workers, and 2 DB servers. The installer will enforce that the minimum required hardware is provided before it runs the installation.
The definition of each node type is configured as a node profile, which can be referenced from multiple flavors. It’s in the node profile that you identify whether a node should be a master or a worker node from the kubernetes/gravity perspective (https://gravitational.com/gravity/docs/cluster/#node-roles).
When specifying an HA cluster, you need to provide at least 3 master nodes. The reason is that the underlying etcd database is based on a voting model, where a majority wins. With 2 masters, losing one leaves 1/2 online, which is only 50% and not enough to win a majority, whereas with 3 masters, 2/3 nodes being online can still form a majority and a functioning cluster.
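The majority arithmetic can be checked with a few lines of Python. This is just a sketch of the general majority rule (more than half of the members must be online). The helper names `quorum` and `failures_tolerated` are made up for illustration and are not part of gravity or etcd.

```python
def quorum(members: int) -> int:
    """Smallest number of members that is a strict majority of the cluster."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be offline while a majority is still online."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} masters: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

A 2 master cluster tolerates no failures, the same as a single master, which is why 3 is the smallest count that survives the loss of one master.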
With this in mind, a common deployment model we see for on-prem clusters is one where the application is configured to run on the master nodes, so a 3 node HA cluster can consist only of masters, with no workers defined. For small clusters this works well, but if the application is disk IO heavy and shares a disk with etcd, there can be stability issues. For HA deployments, we generally recommend assigning a dedicated disk to etcd (https://gravitational.com/gravity/docs/requirements/#etcd-disk).
The procedure for a customer or your support team to replace a failed node is covered here: https://gravitational.com/gravity/docs/cluster/#recovering-a-node
I hope that explanation helps, but without insight into your application and customers it’s difficult to give a more specific recommendation. This is the type of thing we generally assist with through our enterprise offering and professional services, so please feel free to reach out if you need deeper assistance.