Recover Teleport nodes failing to join due to bad token

Description

Gravity version 5.5.40 contains a regression: on nodes added to a cluster via gravity join, Teleport is unable to join the cluster even though the “gravity join” operation itself completes successfully.

Impact

  • Nodes joined to a 5.5.40 cluster are not accessible via Teleport.
  • Future upgrade attempts fail because Teleport is used to deploy upgrade agents.

Confirmation

On affected nodes, the Teleport systemd unit repeatedly logs the following line:

INFO [PROC:1]    Node failed attempt connecting to auth server: "<name>" [<id>] can not join the cluster with role Node, the token is not valid.
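A quick way to check a node for this message is to filter the unit's journal. The sketch below reads log text on standard input, so it does not assume a particular unit name; count_bad_token_lines is a hypothetical helper name, not part of Gravity.

```shell
# Count log lines that report the invalid-token join failure.
# Reads log text on stdin, so it works with journalctl output or a saved file.
count_bad_token_lines() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the helper's exit status clean in that case.
    grep -c 'the token is not valid' || true
}

# Example (the unit name is a placeholder, substitute your own):
#   journalctl -u <teleport-unit-name> --no-pager | count_bad_token_lines
```

A non-zero count that keeps growing between runs suggests the node is still being rejected.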

An attempt to launch an upgrade on such a cluster will result in an error similar to:

[ERROR]: Teleport is unavailable on the following cluster nodes: node-1, node-2. Please make sure that the Teleport service is running and try again.
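The node names in this error are the ones that need attention. As a convenience, the sketch below pulls them out of the message, one per line. It assumes the message has exactly the wording shown above; unavailable_nodes is a hypothetical helper name.

```shell
# Print one node name per line from the "Teleport is unavailable" error on stdin.
unavailable_nodes() {
    sed -n 's/.*Teleport is unavailable on the following cluster nodes: \(.*\)\. Please make sure.*/\1/p' \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//'
}

# Example:
#   unavailable_nodes < upgrade-error.txt
```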

Workaround

There are two possible workarounds for the issue.

Removing affected nodes

  • Drain/cordon each affected node and uninstall the Gravity software from it using gravity system uninstall.
  • Remove each affected node from the cluster using gravity remove <node> --force.
  • After a node is removed, the cluster may enter degraded status with the “overlay network” checker failing. To work around this, remove the nethealth pods using kubectl -nmonitoring delete pods -lk8s-app=nethealth.
  • Once the cluster has become active, upgrade it to version 5.5.41 or later.
  • Rejoin the nodes (remember to run the join command with a gravity binary of the same version, 5.5.41 or later).
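The removal steps above can be laid out per node as follows. This sketch only prints the commands so they can be reviewed and run by hand. It assumes that kubectl drain with --ignore-daemonsets is an acceptable way to drain the node (additional drain flags may be needed in your cluster) and that the same node identifier works for both kubectl and gravity. print_removal_plan is a hypothetical helper name.

```shell
# Print, in order, the removal commands for one node.
# Nothing is executed; copy the lines you need.
print_removal_plan() {
    local node="$1"
    cat <<EOF
# on a healthy cluster node:
kubectl drain ${node} --ignore-daemonsets
# on ${node} itself:
gravity system uninstall
# back on a healthy cluster node:
gravity remove ${node} --force
# only if the "overlay network" checker starts failing:
kubectl -nmonitoring delete pods -lk8s-app=nethealth
EOF
}

# Example:
#   print_removal_plan node-1
```

The upgrade and rejoin steps are left out on purpose, since they depend on the installer and join parameters of the particular cluster.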

Recovering affected nodes

Starting with version 5.5.41, Gravity includes commands for directly manipulating the Teleport configuration.

Note that to execute the commands described here, you need a gravity binary of version 5.5.41 or later. It can be downloaded directly from the default Gravitational distribution portal:

curl https://get.gravitational.io/telekube/bin/5.5.41/linux/x86_64/gravity -o gravity && chmod +x gravity
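Before going further, it is worth confirming that the binary you are about to use meets the 5.5.41 minimum. The sketch below compares two version strings. It relies on the -V (version sort) option of GNU sort and expects you to supply the version yourself, for example as read from the output of ./gravity version. version_ge is a hypothetical helper name.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU sort -V, which orders dotted version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Example:
#   version_ge 5.5.41 5.5.41 && echo "binary is new enough"
```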

Obtain a valid join token by running the gravity status command:

$ sudo gravity status
...
Join token:		3cd602b15238
...
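If you prefer not to copy the token by hand, it can be extracted from the status output. The sketch below assumes the "Join token:" line looks like the one shown above; extract_join_token is a hypothetical helper name.

```shell
# Print the value of the first "Join token:" line found on stdin.
extract_join_token() {
    awk -F':' '/Join token:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Example:
#   token=$(sudo gravity status | extract_join_token)
```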

Use the following command with the token from above to update the Teleport configuration. It has to be executed on every node where the Teleport node can’t join the cluster due to a bad token.

$ gravity system teleport set-node-token --package=node --token=<token>

After that, restart the Teleport systemd service for the changes to take effect.
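The two steps above can be combined into one helper to run on each affected node. This is a sketch under stated assumptions: the Teleport unit name is passed in by the operator (it is not assumed here), gravity resolves to the 5.5.41 or later binary, and DRY_RUN is a hypothetical switch that prints the commands instead of running them. run and fix_node_token are hypothetical helper names.

```shell
# Echo the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Update the node token, then restart the Teleport unit.
#   $1: join token taken from "gravity status"
#   $2: name of the Teleport systemd unit on this node
fix_node_token() {
    local token="$1" unit="$2"
    if [ -z "$token" ] || [ -z "$unit" ]; then
        echo "usage: fix_node_token <token> <teleport-unit-name>" >&2
        return 1
    fi
    # Stop here if the token update fails, so the unit is not restarted needlessly.
    run gravity system teleport set-node-token --package=node --token="$token" || return 1
    run systemctl restart "$unit"
}

# Example (prints the commands without running them):
#   DRY_RUN=1 fix_node_token <token> <teleport-unit-name>
```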

If the Teleport nodes still complain about an invalid token, the Teleport auth servers may be using tokens other than the join token for authentication. This may happen, for example, if you added a master node to your 5.5.40 cluster.

You can confirm this by using the following command to display the Teleport auth server configuration on master nodes and looking for a token in the auth_service.tokens section:

$ gravity system teleport show-config --package=master
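Instead of reading the configuration by eye, you can test whether the join token appears in it at all. The sketch below makes no assumption about the layout of the show-config output: it only searches for the token as a fixed string. config_has_token is a hypothetical helper name.

```shell
# Succeed if the given token appears anywhere in the text on stdin.
config_has_token() {
    local token="$1"
    # -F: treat the token as a literal string, -q: report via exit status only.
    [ -n "$token" ] && grep -qF -- "$token"
}

# Example:
#   gravity system teleport show-config --package=master \
#       | config_has_token "<token>" || echo "auth servers use a different token"
```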

In that case, also update the auth server tokens by executing the following command on the master nodes so that they use the same join token obtained above:

$ gravity system teleport set-master-tokens --package=master --token=<token>

After that, restart the gravity-site pods (which act as the Teleport auth servers that Teleport nodes connect to):

$ kubectl -nkube-system delete pods -lapp=gravity-site

After the pods have restarted, Teleport nodes should be able to join the cluster.
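To see how the restarted pods are doing, list them and flag any that are crash-looping. The sketch below assumes the default column layout of kubectl get pods --no-headers (NAME READY STATUS RESTARTS AGE); crashlooping_pods is a hypothetical helper name.

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Expects "kubectl get pods --no-headers" output on stdin.
crashlooping_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example:
#   kubectl -nkube-system get pods -lapp=gravity-site --no-headers | crashlooping_pods
```

Empty output means none of the listed pods is currently in CrashLoopBackOff.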

If the gravity-site pods are in CrashLoopBackOff state after the restart and their logs show a “permission denied” error, the “set-master-tokens” command was likely executed as root, so the gravity-site pods, which run as the “planet” user, are not able to read the package with the Teleport configuration. The permissions can be fixed by executing chown -R planet:planet /var/lib/gravity/local/packages on the master nodes.
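Before or after running chown, you can list the files whose owner is wrong. The sketch below uses find's -user test and takes the directory and expected owner as arguments; not_owned_by is a hypothetical helper name.

```shell
# Print every path under directory $1 that is not owned by user $2.
not_owned_by() {
    local dir="$1" user="$2"
    find "$dir" ! -user "$user" -print
}

# Example:
#   not_owned_by /var/lib/gravity/local/packages planet
```

Empty output after the chown indicates that every file under the directory now belongs to the expected user.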

You can confirm that a node has joined successfully by checking the Teleport systemd unit logs again (using journalctl -u <teleport-unit-name> --no-pager). They should say Node has successfully registered with the cluster.
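Because the journal still contains the earlier failures, what matters is which message appears last. The sketch below reports the most recent state found in the log text on standard input. It assumes the two message texts quoted in this article; last_join_state is a hypothetical helper name.

```shell
# Report the most recent join state seen in Teleport log text on stdin:
# "registered", "bad-token", or "unknown" if neither message is present.
last_join_state() {
    awk '
        /the token is not valid/ { state = "bad-token" }
        /Node has successfully registered with the cluster/ { state = "registered" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Example (the unit name is a placeholder, substitute your own):
#   journalctl -u <teleport-unit-name> --no-pager | last_join_state
```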