Recovering 3-node cluster (2 masters/1 worker)

What happened:

The Gravitational team was called in to help recover a cluster that was in the following state:

  • It was a 3-node cluster: 2 masters / 1 worker.
  • One master node and the worker node were down; “gravity leave --force” had been run on them earlier, so they weren’t decommissioned gracefully.
  • As a result, the cluster was degraded and the remaining master node wasn’t working either, because etcd was down (with 1 of the 2 etcd members gone, etcd had lost quorum).
  • Consequently, all nodes in the gravity status output were unhealthy.
  • One of the nodes where “leave --force” had been run was also not cleaned up fully, and further “leave --force” runs failed with an error about being unable to stop the planet service.
    It also seems there was some internal maintenance going on on these machines that day: we saw SELinux being turned back on automatically by someone, as well as issues with mounted NFS volumes.

What we did to recover:

1. Since etcd wasn’t working at all, we had to recover it manually by forcing it to form a new single-node cluster on the remaining online node (a rough sketch of the commands follows the list):

  • Stop etcd: systemctl stop etcd
  • Backup: etcdctl backup --data-dir /ext/etcd --backup-dir /ext/etcd-backup
  • Edit the etcd systemd unit file to add Environment="ETCD_FORCE_NEW_CLUSTER=true"
  • Reload systemd: systemctl daemon-reload
  • Start etcd: systemctl start etcd
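
  For reference, a rough sketch of what that looked like, assuming a plain etcd systemd unit and a drop-in file for the extra environment variable (in Gravity the etcd unit actually lives inside the planet environment, so the exact unit name and paths may differ):

    # Stop etcd and back up the v2 data (paths as in the steps above)
    systemctl stop etcd
    etcdctl backup --data-dir /ext/etcd --backup-dir /ext/etcd-backup

    # Add the flag via a systemd drop-in (editing the unit file directly works too)
    mkdir -p /etc/systemd/system/etcd.service.d
    printf '[Service]\nEnvironment="ETCD_FORCE_NEW_CLUSTER=true"\n' \
        > /etc/systemd/system/etcd.service.d/force-new-cluster.conf

    # Reload systemd and start etcd; it should come up as a single-member cluster
    systemctl daemon-reload
    systemctl start etcd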

2. Etcd started fine and became a healthy single-node cluster, so kubectl was working again and gravity status reported this node as healthy and the other two as offline.
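
  For completeness, these are the kinds of checks we used to confirm that state (etcdctl here is the v2 tooling; the commands may need to be run inside planet via “gravity enter”):

    # etcd should report a single healthy member
    etcdctl cluster-health
    etcdctl member list

    # Kubernetes and Gravity views of the cluster
    kubectl get nodes
    gravity status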

3. Next we went to clean up the node where leave was failing with the planet service error. We disabled the planet service manually with systemctl disable, and after that gravity leave --force worked fine there.
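
  Roughly what that looked like; the planet unit name is installation-specific, so the name below is just a placeholder:

    # Find the planet unit on the node (the exact name varies per install)
    systemctl list-units --all | grep -i planet

    # Disable it manually (placeholder unit name)
    systemctl disable <planet-unit>.service

    # After that the forced leave went through
    gravity leave --force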

4. Then we removed the two offline nodes from the remaining master node by running gravity remove --force. They were removed cleanly and evicted from Kubernetes as well (see the sketch below).

  • We also cleaned up past unfinished shrink operations using the attached ops.sh script.
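
  A sketch of this step, with hypothetical node identifiers:

    # Confirm which peers are offline
    gravity status

    # Force-remove each offline node (node-2 / node-3 are placeholders)
    gravity remove node-2 --force
    gravity remove node-3 --force

    # Verify they are gone from Kubernetes as well
    kubectl get nodes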

5. We also restarted the “serf” and “planet-agent” services inside planet so the old members would be evicted from their membership lists as well.
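
  Roughly, assuming the units inside planet are named after the “serf” and “planet-agent” services mentioned above:

    # Enter the planet environment (drops into a shell inside planet)
    gravity enter

    # Restart the membership services
    systemctl restart serf planet-agent

    # The stale members should drop out of the serf membership list
    serf members   # may need the serf RPC address flag, depending on the setup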

6. At this point we basically had a single-node healthy cluster again.

  • Edited the etcd service unit file again to remove the Environment="ETCD_FORCE_NEW_CLUSTER=true" setting we had added before.
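
  This is just the rollback of the change from step 1, assuming the drop-in approach sketched there (otherwise delete the Environment= line from the unit file):

    # Remove the drop-in added earlier
    rm /etc/systemd/system/etcd.service.d/force-new-cluster.conf

    # Reload systemd and restart etcd without the force-new-cluster flag
    systemctl daemon-reload
    systemctl restart etcd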

7. Tried to expand the cluster by running the curl join command from the UI, but the new node wasn’t appearing in the UI, so the operation couldn’t be started.

8. Eventually we found out (from telekube-system.log) that the agent couldn’t collect system information due to an “NFS stale file handle” error and hence wasn’t proceeding to connect to the cluster.

9. To fix the NFS issue we had to find the offending NFS volume (there are a lot of them on these machines) and unmount it. Hint: “ls” on a stale mount also produces a “stale file handle” error, so it can help identify the offending volume (see the sketch below).
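
  One way to hunt it down with standard Linux tooling (the mount point in the last command is a placeholder):

    # List the NFS mounts on the machine
    grep nfs /proc/mounts

    # Probe each NFS mount point; a stale one fails with "Stale file handle"
    for m in $(awk '$3 ~ /^nfs/ {print $2}' /proc/mounts); do
        ls "$m" >/dev/null 2>&1 || echo "stale: $m"
    done

    # Unmount the offending volume (lazy/forced unmount if a plain umount refuses)
    umount -f -l /path/to/stale/mount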

10. After that the agent joined successfully and the cluster expansion completed successfully as well.

Notes:

  • gravity leave --force should only be used as a last resort. While it will try to remove the node gracefully first, in an unhealthy cluster it’s basically the equivalent of “rm -rf” for all gravity data/services on the node. I think the fact that it was run on 1 of the 2 masters exacerbated the issue, which is why we had to recover manually.
  • A cluster with 2 master nodes is not HA: etcd needs a quorum of members (2 out of 2 in this case), so when one master goes down, the whole cluster goes down. We recommend 3 master nodes for HA.

Prepared by: @r0mant

ops.sh (308 Bytes)

@abdu where can we get the attached script, just in case the shrink operation fails?

@maaz I’ve now attached the script
