We have several Gravity clusters (test clusters, etc.) that we deploy, and invariably they all end up in a degraded state for no discernible reason and don’t seem to correct themselves.
In this case, here is the gravity status output for a 6-node cluster:
$ sudo gravity status
Cluster name:       bravemestorf2902
Cluster status:     degraded (one or more of cluster nodes are not healthy)
Application:        ...
Gravity version:    6.1.39 (client) / 6.1.39 (server)
Join token:         ...
Last completed operation:
    * Remove node ip-10-1-10-74.us-west-2.compute.internal (10.1.10.74)
      ID:           9affc744-d94f-46b1-a85a-a2853084a07d
      Started:      Thu Oct 15 23:14 UTC (1 hour ago)
      Completed:    Thu Oct 15 23:14 UTC (1 hour ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.1.10.13:32009
    * Cluster management URL:
        - https://10.1.10.13:32009
Cluster nodes:
    Masters:
        * ip-10-1-10-13.us-west-2.compute.internal / 10.1.10.13 / master
            Status:         healthy
            Remote access:  online
    Nodes:
        * ip-10-1-10-66.us-west-2.compute.internal / 10.1.10.66 / worker
            Status:         healthy
            Remote access:  online
        * ip-10-1-10-64.us-west-2.compute.internal / 10.1.10.64 / worker
            Status:         healthy
            Remote access:  online
        * ip-10-1-10-152.us-west-2.compute.internal / 10.1.10.152 / worker
            Status:         healthy
            Remote access:  online
        * ip-10-1-10-199.us-west-2.compute.internal / 10.1.10.199 / worker
            Status:         healthy
            Remote access:  online
        * ip-10-1-10-68.us-west-2.compute.internal / 10.1.10.68 / worker
            Status:         healthy
            Remote access:  online
I can’t tell from the logs or the audit log why the cluster went into a degraded state, and all of the nodes report as healthy. I end up having to run a status-reset to restore it so I can grow the cluster.
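For reference, this is roughly the loop I go through each time (a sketch of my workaround, not a fix; the journalctl grep is just me fishing for anything the health agent logged, since I don’t know the exact unit name to filter on):

$ sudo gravity status                    # cluster shows degraded, every node healthy
$ sudo journalctl --since "2 hours ago" | grep -iE 'degraded|satellite'
                                         # nothing obvious explaining the state change
$ sudo gravity status-reset              # clear the degraded flag
$ sudo gravity status                    # cluster reports healthy again, expand can proceed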
Is there something I should be looking at to figure out what’s going on? This has happened with different versions, including 6.1.12.