Teleport nodes disconnected after replacing all master nodes

Description

After all master nodes are replaced, the Teleport service on regular worker nodes can no longer connect. The disconnected Teleport nodes also disappear from the cluster control panel UI. The cluster is otherwise functional.

The cause is that Teleport on each worker node is configured with the addresses of the auth servers (master nodes) that were available when the node joined. If all of the original master nodes are removed, workers that list only those nodes as auth servers can no longer connect.

Workaround

Gravity 5.5.50 includes a CLI command for changing the auth server configuration of Teleport nodes.

The commands shown below can be executed on a Gravity cluster of any version, as long as they’re run using the 5.5.50 gravity binary. For example, you can download the 5.5.50 binary to a 5.5.38 or a 7.0.10 cluster and use it to update the Teleport node configuration.

To see the current Teleport node configuration, including auth servers, the following command is available:

gravity system teleport show-config --package=node

To reconnect Teleport nodes, update auth servers on each regular node with addresses of new master nodes using the following command (this needs to be done only on regular nodes):

gravity system teleport set-auth-servers --auth-server=<ip1> --auth-server=<ip2> ...

Note that this command appends the provided auth servers to the existing configuration rather than replacing it. The original auth servers therefore remain in the configuration, but they do not affect the node’s ability to connect to the new ones.
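When there are several new master nodes, the repeated --auth-server flags can be assembled from a list of addresses. The following is a minimal sketch, not part of Gravity: the function name build_set_auth_servers_cmd is hypothetical and the 192.0.2.x addresses are placeholders. It only prints the command so that it can be reviewed before being run on a regular node.

```shell
# Hypothetical helper (not shipped with Gravity): prints the
# set-auth-servers command for the master node IPs given as arguments.
build_set_auth_servers_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "usage: build_set_auth_servers_cmd <ip1> [<ip2> ...]" >&2
    return 1
  fi
  cmd="gravity system teleport set-auth-servers"
  for ip in "$@"; do
    # One --auth-server flag per new master node address.
    cmd="$cmd --auth-server=$ip"
  done
  printf '%s\n' "$cmd"
}

# Placeholder addresses; substitute the IPs of the new master nodes.
build_set_auth_servers_cmd 192.0.2.11 192.0.2.12 192.0.2.13
```

Review the printed command, then run it with the 5.5.50 gravity binary on each regular node.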

After updating the auth servers, restart the Teleport node service:

sudo systemctl restart '*teleport*'

Long-term solution

A longer-term fix that prevents nodes from disconnecting when all original master nodes are replaced is tracked in a Gravity GitHub issue and will be addressed in a future release.