Improving diagnostics for failed gravity installs


#1

We have encountered several cases where failed gravity installs were hard to diagnose, and we are looking for ways to improve the error reporting.

One way to improve this is to run gravity status at the end of a failed install and provide combined logs from the failed pods, jobs, and units.
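For context, here is roughly what we collect by hand today when an install fails; a sketch only, assuming kubectl is reachable from inside the planet environment (via gravity shell) and that you substitute the pod, namespace, and unit names relevant to your cluster:

    # Cluster-level health summary.
    gravity status

    # From inside the planet environment, look for pods/jobs that never became healthy.
    gravity shell
    kubectl get pods --all-namespaces     # eyeball Pending/CrashLoopBackOff pods
    kubectl get jobs --all-namespaces     # jobs that never completed

    # Logs from a failed pod (namespace/pod names are placeholders).
    kubectl -n <namespace> logs <failed-pod> --previous

    # On the host, systemd unit logs around the failure window
    # (unit names vary by version; use the ones present on your nodes).
    journalctl -u <gravity-related-unit> --since "1 hour ago"

The idea would be to automate exactly this kind of collection so that a failed install leaves a single combined report behind.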

Any other ideas or suggestions are appreciated here in the comments. What are the most common install failures you have encountered so far? What are your use cases and install targets? How did you find the root cause? Please post here.


#2

I’m assessing gravity on AWS right now, and I ran into three problems installing a new cluster:

  • When I was adding a second master node, I lost etcd quorum. I ran gravity shell, saw that etcd was failing, and recovered it using etcd’s force-new-cluster option (see the sketch after this list).
  • I had problems with the Mattermost quickstart because it was using devicemapper instead of overlay2; I found an argument to gravity install that let me choose a different storage option for Docker.
  • The installation/join of a new node fails when you don’t have the KubernetesCluster tag set to the proper name of the cluster.
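For reference, this is roughly what the etcd investigation looked like; a rough sketch, assuming etcd runs inside the planet environment, with the --endpoints/certificate flags your cluster’s TLS setup requires added as needed (the etcdctl commands shown are the v2-style ones; newer clusters use etcdctl endpoint health):

    # Enter the planet environment on the affected master.
    sudo gravity shell

    # Check etcd membership and health.
    etcdctl member list
    etcdctl cluster-health

    # Recovery used here: restart the surviving member with etcd's --force-new-cluster
    # flag so it re-forms a single-member cluster, then re-join the other master.
    # How the flag is passed (drop-in unit, environment file) depends on how the
    # etcd service is configured on the node.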

It’s easy to troubleshoot some problems because you can always run gravity shell and understand the problem.


#3

First of all, thanks for giving gravity a try!

Here are some thoughts:

  • When I was adding a second master node, I lost etcd quorum. I ran gravity shell, saw that etcd was failing, and recovered it using etcd’s force-new-cluster option.

We should make it clearer in the CLI/docs that it’s better to go from 1 node to 3 than from 1 to 2, for the reasons you’ve mentioned.

  • I had problems with the Mattermost quickstart because it was using devicemapper instead of overlay2; I found an argument to gravity install that let me choose a different storage option for Docker.

We’ve updated all versions starting with 5.2 to use overlay2 as the default driver, so this should be resolved.
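If you want to double-check which driver a node actually ended up with, Docker reports it directly; run this wherever the cluster’s dockerd is reachable (e.g. from inside gravity shell):

    # Print the storage driver in use; per the above, 5.2+ should report overlay2.
    docker info --format '{{.Driver}}'

    # Or show the full storage section for more detail.
    docker info | grep -i -A 2 'storage driver'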

  • The installation/join of a new node fails when you don’t have the KubernetesCluster tag set to the proper name of the cluster.

What error did you see that helped you understand the missing tag problem?


#4

I don’t think we actually have a path to go directly from a 1-node cluster to a 3-node cluster in the current version. If two nodes are joined, the cluster will first expand to a two-node cluster, and when that’s completed, the third node will begin joining to form a three-node cluster.

Outside of this topic around troubleshooting, I would be curious about more information on the etcd issue you encountered, what version was used, etc. If it failed in the 1 -> 2 master case, it’s definitely a software bug we need to address. That should be posted in a separate topic, though, since this one is meant to be about improving our troubleshooting tools so that the root cause of failures can be identified more easily.

Detecting the AWS integrations is a hard one. For the simple presence of the tag, we can probably add a satellite check that flags this when the AWS integrations are turned on. But it does require node IAM credentials that allow describing the node’s tags, which might not always be enabled.
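To make the IAM requirement concrete, the check would essentially have to do something like the following on each node; a sketch only, since the satellite probe itself doesn’t exist yet and the expected cluster name is a placeholder:

    # Look up this instance's ID from the EC2 metadata service.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    # Read the KubernetesCluster tag; this call is what needs ec2:DescribeTags
    # on the node's IAM role.
    aws ec2 describe-tags \
      --filters "Name=resource-id,Values=${INSTANCE_ID}" "Name=key,Values=KubernetesCluster" \
      --query 'Tags[0].Value' --output text

    # The check would compare the returned value against the expected cluster name
    # (<cluster-name>, a placeholder) and flag the node if it is missing or different.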


#5

It wasn’t anything clear. I’ve had this kind of problem with a pure Kubernetes installation as well, and the cause wasn’t obvious. The tag was the only thing that was different, so I gave it a second try using the correct cluster name.

Great product by the way!