Unable to install Gravity: Planet health check fails

I am unable to install Gravity on an AWS node. It seems to be failing at the Planet health check, but I am not sure what is going wrong. Is there any way to debug Planet? I am getting the following error:

Sat Sep 21 00:49:29 UTC Executing "/health" locally
Sat Sep 21 00:49:29 UTC Waiting for the planet to start
Sat Sep 21 00:49:29 UTC Wait for cluster to pass health checks
Sat Sep 21 00:49:39 UTC Still waiting for the planet to start (10 seconds elapsed)
Sat Sep 21 00:50:29 UTC Still waiting for the planet to start (1 minute elapsed)
Sat Sep 21 00:51:29 UTC Still waiting for the planet to start (2 minutes elapsed)
Sat Sep 21 00:55:49 UTC Still waiting for the planet to start (6 minutes elapsed)
Sat Sep 21 00:57:49 UTC Still waiting for the planet to start (8 minutes elapsed)
Sat Sep 21 00:57:51 UTC Executing operation finished in 12 minutes
Sat Sep 21 00:57:51 UTC Saving debug report to /home/ubuntu/crashreport.tgz
[ERROR]: not all planets have come up yet: &{degraded []}, failed to execute phase "/health"

Gravity Plan

Phase                     Description                                                                   State         Node              Requires                   Updated
-----                     -----------                                                                   -----         ----              --------                   -------
✓ checks                  Execute preflight checks                                                      Completed     -                 -                          Fri Sep 20 17:01 UTC
✓ configure               Configure packages for all nodes                                              Completed     -                 -                          Fri Sep 20 17:01 UTC
✓ bootstrap               Bootstrap all nodes                                                           Completed     -                 -                          Fri Sep 20 17:01 UTC
  ✓ ip-10-151-20-200      Bootstrap master node ip-10-151-20-200                                        Completed     10.151.20.200     -                          Fri Sep 20 17:01 UTC
✓ pull                    Pull configured packages                                                      Completed     -                 /configure,/bootstrap      Fri Sep 20 17:02 UTC
  ✓ ip-10-151-20-200      Pull packages on master node ip-10-151-20-200                                 Completed     10.151.20.200     /configure,/bootstrap      Fri Sep 20 17:02 UTC
✓ masters                 Install system software on master nodes                                       Completed     -                 /pull                      Fri Sep 20 17:02 UTC
  ✓ ip-10-151-20-200      Install system software on master node ip-10-151-20-200                       Completed     -                 /pull/ip-10-151-20-200     Fri Sep 20 17:02 UTC
    ✓ teleport            Install system package teleport:3.2.7 on master node ip-10-151-20-200         Completed     10.151.20.200     /pull/ip-10-151-20-200     Fri Sep 20 17:02 UTC
    ✓ planet              Install system package planet:6.0.6-11402 on master node ip-10-151-20-200     Completed     10.151.20.200     /pull/ip-10-151-20-200     Fri Sep 20 17:02 UTC
✓ wait                    Wait for Kubernetes to become available                                       Completed     -                 /masters                   Fri Sep 20 17:03 UTC
✓ rbac                    Bootstrap Kubernetes roles and PSPs                                           Completed     -                 /wait                      Fri Sep 20 17:03 UTC
✓ coredns                 Configure CoreDNS                                                             Completed     -                 /wait                      Fri Sep 20 17:03 UTC
✓ resources               Create user-supplied Kubernetes resources                                     Completed     -                 /rbac                      Fri Sep 20 17:03 UTC
✓ export                  Export applications layers to Docker registries                               Completed     -                 /wait                      Fri Sep 20 17:04 UTC
  ✓ ip-10-151-20-200      Populate Docker registry on master node ip-10-151-20-200                      Completed     10.151.20.200     /wait                      Fri Sep 20 17:04 UTC
× health                  Wait for cluster to pass health checks                                        Failed        -                 /export                    Fri Sep 20 17:13 UTC
* runtime                 Install system applications                                                   Unstarted     -                 /rbac                      -
  * dns-app               Install system application dns-app:0.3.0                                      Unstarted     -                 /rbac                      -
  * logging-app           Install system application logging-app:6.0.2                                  Unstarted     -                 /rbac                      -
  * monitoring-app        Install system application monitoring-app:6.0.4                               Unstarted     -                 /rbac                      -
  * tiller-app            Install system application tiller-app:6.0.0                                   Unstarted     -                 /rbac                      -
  * site                  Install system application site:6.0.1                                         Unstarted     -                 /rbac                      -
  * kubernetes            Install system application kubernetes:6.0.1                                   Unstarted     -                 /rbac                      -
* app                     Install user application                                                      Unstarted     -                 /runtime                   -
  * test-appliance     Install application test-appliance:0.0.1                                         Unstarted     -                 /runtime                   -
* connect-installer       Connect to installer                                                          Unstarted     -                 /runtime                   -
* election                Enable cluster leader elections                                               Unstarted     -                 /app                       -
The /health phase ("Wait for cluster to pass health checks") has failed
  not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}

Planet status

{"nodes":[{"name":"10_151_19_194.awesomeleakey7861","member_status":{"name":"10_151_19_194.awesomeleakey7861","addr":"10.151.19.194:7496","status":"alive","tags":{"publicip":"10.151.19.194","role":"master"}}},{"name":"10_151_20_200.youthfulshannon8034","member_status":{"name":"10_151_20_200.youthfulshannon8034","addr":"10.151.20.200:7496","status":"alive","tags":{"publicip":"10.151.20.200","role":"master"}},"status":"degraded","probes":[{"checker":"br-netfilter","status":"running"},{"checker":"docker","status":"running"},{"checker":"ip-forward","status":"running"},{"checker":"disk-space","detail":"disk utilization on /var/lib/gravity is below 80 percent (55 GB is available out of 83 GB)","status":"running","checker_data":"eyJoaWdoX3dhdGVybWFyayI6ODAsInBhdGgiOiIvdmFyL2xpYi9ncmF2aXR5IiwidG90YWxfYnl0ZXMiOjgzMjA0MTQxMDU2LCJhdmFpbGFibGVfYnl0ZXMiOjU0Nzg3MzEzNjY0fQ=="},{"checker":"etcd-healthz","status":"running"},{"checker":"dns","status":"running"},{"checker":"ping-checker","status":"running"},{"checker":"kube-apiserver","status":"running"},{"checker":"nodestatus","status":"running"},{"checker":"docker-registry","status":"running"},{"checker":"system-version","detail":"Linux ip-10-151-20-200 4.4.0-1094-aws #105-Ubuntu SMP Mon Sep 16 13:08:01 UTC 2019 x86_64 GNU/Linux\n","status":"running"},{"checker":"systemd-version","detail":"systemd 241 (241)\n+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid\n","status":"running"},{"checker":"docker-version","detail":"Containers: 0\n Running: 0\n Paused: 0\n Stopped: 0\nImages: 2\nServer Version: 18.09.5\nStorage Driver: overlay2\n Backing Filesystem: extfs\n Supports d_type: true\n Native Overlay Diff: true\nLogging Driver: json-file\nCgroup Driver: cgroupfs\nPlugins:\n Volume: local\n Network: bridge host macvlan null overlay\n Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog\nSwarm: inactive\nRuntimes: runc\nDefault Runtime: runc\nInit Binary: docker-init\ncontainerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84\nrunc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30\ninit version: fec3683\nSecurity Options:\n seccomp\n  Profile: default\nKernel Version: 4.4.0-1094-aws\nOperating System: Debian GNU/Linux 9 (stretch)\nOSType: linux\nArchitecture: x86_64\nCPUs: 4\nTotal Memory: 15.67GiB\nName: ip-10-151-20-200\nID: CHHW:54CN:XNYI:A6XL:UP62:DVJW:VDC3:WUIV:FHS3:RYXT:4ZYM:OXKG\nDocker Root Dir: /ext/docker\nDebug Mode (client): false\nDebug Mode (server): false\nNo Proxy: 0.0.0.0/0,.local\nRegistry: https://index.docker.io/v1/\nLabels:\nExperimental: false\nInsecure Registries:\n 127.0.0.0/8\nLive Restore Enabled: false\nProduct License: Community Engine\n\nWARNING: No swap limit support\n","status":"running"},{"checker":"etcd-version","detail":"etcd Version: 3.3.12\nGit SHA: d57e8b8\nGo Version: go1.10.8\nGo OS/Arch: linux/amd64\n","status":"running"},{"checker":"kubelet-version","detail":"Kubernetes v1.14.2\n","status":"running"},{"checker":"coredns-version","detail":"CoreDNS-1.3.1\nlinux/amd64, go1.11.4, 6b56a9c\n","status":"running"},{"checker":"dbus-version","detail":"D-Bus Message Bus Daemon 1.10.28\nCopyright (C) 2002, 2003 Red Hat, Inc., CodeFactory AB, and others\nThis is free software; see the source for copying conditions.\nThere is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n","status":"running"},{"checker":"serf-version","detail":"Serf 
v0.8.0\nAgent Protocol: 4 (Understands back to: 2)\n","status":"running"},{"checker":"flanneld-version","detail":"0.5.3+git\n","status":"running"},{"checker":"registry-version","detail":"/usr/bin/registry planet/docker/distribution v2.7.1-gravitational\n","status":"running"}]}],"timestamp":"2019-09-20T17:12:32.703532082Z"}[ERROR]: status degraded

In Gravity system logs

2019-09-20T17:12:58Z DEBU             Unsuccessful attempt 99/100: not all planets have come up yet: &{degraded []}, retry in 5s. utils/logginghook.go:56
2019-09-20T17:13:03Z DEBU             Unsuccessful attempt 100/100: not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}, retry in 5s. utils/logginghook.go:56
2019-09-20T17:13:04Z DEBU [KEYGEN]    generated user key for [root] with expiry on (1569035584) 2019-09-21 03:13:04.859572377 +0000 UTC m=+36724.079232598 utils/logginghook.go:56
2019-09-20T17:13:04Z INFO [CA]        Generating TLS certificate {0x6065a68 0xc0001b8260 CN=opscenter@gravitational.io,O=@teleadmin+O=default-implicit-role,L=root 2019-09-21 03:13:04.863799398 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2019-09-21 03:13:04.863799398 +0000 UTC org:[@teleadmin default-implicit-role] org_unit:[] utils/logginghook.go:56
2019-09-20T17:13:04Z DEBU [TELEPROXY] Renewed certificate for opscenter@gravitational.io. utils/logginghook.go:56
2019-09-20T17:13:08Z WARN             All attempts failed. error:[
ERROR REPORT:
Original Error: *trace.BadParameterError not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}
Stack Trace:
	/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:177 github.com/gravitational/gravity/lib/install/phases.(*healthExecutor).Execute.func1
	/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:88 github.com/gravitational/gravity/lib/utils.Retry
	/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:168 github.com/gravitational/gravity/lib/install/phases.(*healthExecutor).Execute
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:453 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:385 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:345 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:206 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:163 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan
	/gopath/src/github.com/gravitational/gravity/lib/install/operation.go:81 github.com/gravitational/gravity/lib/install.(*Installer).ExecuteOperation
	/gopath/src/github.com/gravitational/gravity/lib/install/engine/cli/cli.go:111 github.com/gravitational/gravity/lib/install/engine/cli.(*Engine).execute
	/gopath/src/github.com/gravitational/gravity/lib/install/engine/cli/cli.go:80 github.com/gravitational/gravity/lib/install/engine/cli.(*Engine).Execute
	/gopath/src/github.com/gravitational/gravity/lib/install/install.go:263 github.com/gravitational/gravity/lib/install.(*Installer).execute
	/gopath/src/github.com/gravitational/gravity/lib/install/install.go:204 github.com/gravitational/gravity/lib/install.(*Installer).startExecuteLoop.func1
	/go/src/runtime/asm_amd64.s:1333 runtime.goexit
User Message: not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}
] utils/logginghook.go:56
2019-09-20T17:13:08Z ERRO             Phase execution failed: not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}. phase:/health utils/logginghook.go:56
2019-09-20T17:13:08Z DEBU [FSM:INSTA] Applied StateChange(Phase=/health, State=failed, Error=not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}). opid:77aede92-9d19-4c66-822c-1ff6869f32c7 utils/logginghook.go:56
2019-09-20T17:13:08Z WARN [INSTALLER] Failed to execute operation plan. error:[
ERROR REPORT:
Original Error: *trace.BadParameterError not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}
Stack Trace:
	/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:177 github.com/gravitational/gravity/lib/install/phases.(*healthExecutor).Execute.func1
	/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:88 github.com/gravitational/gravity/lib/utils.Retry
	/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:168 github.com/gravitational/gravity/lib/install/phases.(*healthExecutor).Execute
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:453 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:385 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:345 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:206 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase
	/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:163 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan
	/gopath/src/github.com/gravitational/gravity/lib/install/operation.go:81 github.com/gravitational/gravity/lib/install.(*Installer).ExecuteOperation
	/gopath/src/github.com/gravitational/gravity/lib/install/engine/cli/cli.go:111 github.com/gravitational/gravity/lib/install/engine/cli.(*Engine).execute
	/gopath/src/github.com/gravitational/gravity/lib/install/engine/cli/cli.go:80 github.com/gravitational/gravity/lib/install/engine/cli.(*Engine).Execute
	/gopath/src/github.com/gravitational/gravity/lib/install/install.go:263 github.com/gravitational/gravity/lib/install.(*Installer).execute
	/gopath/src/github.com/gravitational/gravity/lib/install/install.go:204 github.com/gravitational/gravity/lib/install.(*Installer).startExecuteLoop.func1
	/go/src/runtime/asm_amd64.s:1333 runtime.goexit
User Message: not all planets have come up yet: &{unknown [{ 10.151.19.194 master  offline []} { 10.151.20.200 master  degraded []}]}, failed to execute phase "/health"
] utils/logginghook.go:56

Gravity provides a way to obtain a shell inside the Planet container: sudo gravity shell

Once inside the Planet container, you can inspect all the systemd units running inside it:

  • planet status --pretty to see the collected cluster health data
  • systemctl status to look for any failed units
  • journalctl to inspect unit logs
  • gravity status to see the overall status of the cluster and the application running on it
  • The install logs may also report something more specific

gravity report will also collect the cluster diagnostics into an archive; see the example sequence below.
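
A typical debugging pass would look roughly like this (the unit name below is a placeholder; pick whichever unit systemctl reports as failed):

  sudo gravity shell                    # on the host: enter the Planet container
  planet status --pretty                # inside Planet: collected cluster health data
  systemctl --failed                    # list only the failed systemd units
  journalctl -u <unit-name> --no-pager  # inspect the logs of a suspect unit
  gravity status                        # overall cluster/application status
  exit
  sudo gravity report                   # back on the host: collect a diagnostics archive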

I tried all the commands; nothing gives any clue except journalctl, which reports a bad TLS certificate. Where did this certificate/IP come from?
Planet status

ip-10-151-20-200:/$ planet status --pretty
{
   "status": "degraded",
   "timestamp": "2019-09-23T22:23:21.920323422Z",
   "summary": "no status received from nodes (10_151_20_200.cleverardinghelli5621,10_151_19_194.awesomeleakey7861,)"
}[ERROR]: status degraded

System Status

ip-10-151-20-200:/$ systemctl status
● ip-10-151-20-200
    State: running
     Jobs: 0 queued
   Failed: 0 units

Gravity status

ip-10-151-20-200:/$ gravity status
Cluster status:	degraded
Cluster endpoints:
    * Authentication gateway:
    * Cluster management URL:
Cluster nodes:	<unknown>
Failed to collect system status from nodes

Install log

Mon Sep 23 21:56:45 UTC [INFO] [ip-10-151-20-200] Executing phase: /health.
Mon Sep 23 21:56:45 UTC [INFO] [ip-10-151-20-200] Waiting for the planet to start.
Mon Sep 23 22:05:06 UTC [ERROR] [ip-10-151-20-200] Phase execution failed: not all planets have come up yet: &{degraded []}.

journalctl

Sep 23 21:56:44 ip-10-151-20-200 registry[104]: 127.0.0.1 - - [23/Sep/2019:21:56:44 +0000] "PUT /v2/ubuntu/manifests/16.04 HTTP/1.1" 201 0 "" "Go-http-client/1.1"
Sep 23 21:56:46 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:46 http: TLS handshake error from 10.151.19.194:45772: remote error: tls: bad certificate
Sep 23 21:56:47 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:47 http: TLS handshake error from 10.151.19.194:45784: remote error: tls: bad certificate
Sep 23 21:56:47 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:47 http: TLS handshake error from 10.151.19.194:45794: remote error: tls: bad certificate
Sep 23 21:56:51 ip-10-151-20-200 /usr/bin/planet[827]: ERRO [TIME-DRIF] rpc error: code = DeadlineExceeded desc = context deadline exceeded monitoring/timedrift.go:130
Sep 23 21:56:51 ip-10-151-20-200 /usr/bin/planet[827]: WARN             Timed out collecting test results: context deadline exceeded. agent/agent.go:338
Sep 23 21:56:51 ip-10-151-20-200 /usr/bin/planet[827]: WARN             Timed out collecting node statuses: context deadline exceeded. agent/agent.go:481
Sep 23 21:56:52 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:52 http: TLS handshake error from 10.151.19.194:45830: remote error: tls: bad certificate
Sep 23 21:56:53 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:53 http: TLS handshake error from 10.151.19.194:45844: remote error: tls: bad certificate
Sep 23 21:56:56 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:56 http: TLS handshake error from 10.151.19.194:45866: remote error: tls: bad certificate
Sep 23 21:56:59 ip-10-151-20-200 planet[827]: 2019/09/23 21:56:59 http: TLS handshake error from 10.151.19.194:46172: remote error: tls: bad certificate
Sep 23 21:57:03 ip-10-151-20-200 planet[827]: 2019/09/23 21:57:03 http: TLS handshake error from 10.151.19.194:46202: remote error: tls: bad certificate

Sep 23 22:00:34 ip-10-151-20-200 planet[827]: 2019/09/23 22:00:34 http: TLS handshake error from 10.151.19.194:51150: remote error: tls: bad certificate
Sep 23 22:00:36 ip-10-151-20-200 planet[2979]: WARN             Failed to run. error:[
                                               ERROR REPORT:
                                               Original Error: *errors.errorString status degraded
                                               Stack Trace:
                                                       /gopath/src/github.com/gravitational/planet/tool/planet/main.go:484 main.run
                                                       /gopath/src/github.com/gravitational/planet/tool/planet/main.go:61 main.main
                                                       /opt/go/src/runtime/proc.go:198 runtime.main
                                                       /opt/go/src/runtime/asm_amd64.s:2361 runtime.goexit
                                               User Message: status degraded
                                               ] planet/main.go:736
Sep 23 22:00:38 ip-10-151-20-200 planet[827]: 2019/09/23 22:00:38 http: TLS handshake error from 10.151.19.194:51464: remote error: tls: bad certificate
Sep 23 22:00:43 ip-10-151-20-200 planet[827]: 2019/09/23 22:00:43 http: TLS handshake error from 10.151.19.194:51506: remote error: tls: bad certificate
Sep 23 22:00:46 ip-10-151-20-200 planet[827]: 2019/09/23 22:00:46 http: TLS handshake error from 10.151.19.194:51546: remote error: tls: bad certificate
Sep 23 22:00:47 ip-10-151-20-200 planet[827]: 2019/09/23 22:00:47 http: TLS handshake error from 10.151.19.194:51552: remote error: tls: bad certificate
Sep 23 22:00:51 ip-10-151-20-200 /usr/bin/planet[827]: WARN             Timed out collecting test results: context deadline exceeded. agent/agent.go:338
Sep 23 22:00:51 ip-10-151-20-200 /usr/bin/planet[827]: WARN             Failed to query node 10_151_19_194.awesomeleakey7861(10.151.19.194) status: rpc error: code = DeadlineExceeded desc = context deadline exceeded. agent/agent.go:475
Sep 23 22:00:51 ip-10-151-20-200 /usr/bin/planet[827]: ERRO [TIME-DRIF] rpc error: code = DeadlineExceeded desc = context deadline exceeded monitoring/timedrift.go:130
Sep 23 22:00:51 ip-10-151-20-200 /usr/bin/planet[827]: WARN             Timed out collecting node statuses: context deadline exceeded. agent/agent.go:481
Sep 23 22:05:06 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:06 http: TLS handshake error from 10.151.19.194:57788: remote error: tls: bad certificate
Sep 23 22:05:06 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:06 http: TLS handshake error from 10.151.19.194:57790: remote error: tls: bad certificate
Sep 23 22:05:06 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:06 http: TLS handshake error from 10.151.19.194:57792: remote error: tls: bad certificate
Sep 23 22:05:06 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:06 http: TLS handshake error from 10.151.19.194:57794: remote error: tls: bad certificate
Sep 23 22:05:08 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:08 http: TLS handshake error from 10.151.19.194:57812: remote error: tls: bad certificate
Sep 23 22:05:08 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:08 http: TLS handshake error from 10.151.19.194:57816: remote error: tls: bad certificate
Sep 23 22:05:08 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:08 http: TLS handshake error from 10.151.19.194:57820: remote error: tls: bad certificate
Sep 23 22:05:08 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:08 http: TLS handshake error from 10.151.19.194:57822: remote error: tls: bad certificate
Sep 23 22:05:09 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:09 http: TLS handshake error from 10.151.19.194:57824: remote error: tls: bad certificate
Sep 23 22:05:09 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:09 http: TLS handshake error from 10.151.19.194:57828: remote error: tls: bad certificate
Sep 23 22:05:09 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:09 http: TLS handshake error from 10.151.19.194:57832: remote error: tls: bad certificate
Sep 23 22:05:10 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:10 http: TLS handshake error from 10.151.19.194:57840: remote error: tls: bad certificate
Sep 23 22:05:13 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:13 http: TLS handshake error from 10.151.19.194:57868: remote error: tls: bad certificate
Sep 23 22:05:13 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:13 http: TLS handshake error from 10.151.19.194:57870: remote error: tls: bad certificate
Sep 23 22:05:13 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:13 http: TLS handshake error from 10.151.19.194:57882: remote error: tls: bad certificate
Sep 23 22:05:14 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:14 http: TLS handshake error from 10.151.19.194:57884: remote error: tls: bad certificate
Sep 23 22:05:18 ip-10-151-20-200 /usr/bin/planet[4659]: WARN             Failed to run. error:[
                                                        ERROR REPORT:
                                                        Original Error: *errors.errorString status degraded
                                                        Stack Trace:
                                                                /gopath/src/github.com/gravitational/planet/tool/planet/main.go:484 main.run
                                                                /gopath/src/github.com/gravitational/planet/tool/planet/main.go:61 main.main
                                                                /opt/go/src/runtime/proc.go:198 runtime.main
                                                                /opt/go/src/runtime/asm_amd64.s:2361 runtime.goexit
                                                        User Message: status degraded
                                                        ] planet/main.go:736
Sep 23 22:05:18 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:18 http: TLS handshake error from 10.151.19.194:58052: remote error: tls: bad certificate
Sep 23 22:05:18 ip-10-151-20-200 systemd[1]: var-lib-gravity-local-packages-unpacked-gravitational.io-planet-6.0.6\x2d11402-rootfs-tmp-journal.mount: Succeeded.
Sep 23 22:05:18 ip-10-151-20-200 systemd[1]: tmp-journal.mount: Succeeded.
Sep 23 22:05:19 ip-10-151-20-200 planet[827]: 2019/09/23 22:05:19 http: TLS handshake error from 10.151.19.194:58108: remote error: tls: bad certificate

I think the following logs are quite interesting; the time-drift check is relatively new. The timeouts in agent/agent.go while collecting test results and node statuses will, I think, be reflected in the cluster never being seen as healthy by the installer.
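
If you want to rule out real clock skew between the two machines (just a suggestion; the deadline errors may simply be the status RPC timing out), comparing the clocks on both hosts would settle it:

  date -u        # compare wall-clock time across the two hosts
  timedatectl    # shows whether the system clock is NTP-synchronized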

I'm not sure about the tls: bad certificate errors offhand, and I don't know what 10.151.19.194 might be. Sometimes we get noise in the logs from port pings by load balancers, internal tests, and things like that; sometimes it is real, if that IP belongs to a node in the cluster.

A couple of questions:

  • What version of gravity are you using?
  • Are you using custom planet images at all?

@knisbet @abdu I am using Gravity version 6.0.1 and I am not using custom Planet images. I am using the following command to install:

sudo ./gravity install --advertise-addr=10.151.20.200 --token=secret123 --cloud-provider=generic

The interesting thing is that 10.151.19.194 is another node where I installed Gravity some time back, and searching the logs for 10.151.19.194 shows this:

root@ip-10-151-20-200:/home/ubuntu# grep -Rn 10.151.19.194 /var/log/gravity-*
/var/log/gravity-system.log:2811:2019-09-23T22:00:56Z DEBU             Unsuccessful attempt 51/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2812:2019-09-23T22:01:01Z DEBU             Unsuccessful attempt 52/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2813:2019-09-23T22:01:06Z DEBU             Unsuccessful attempt 53/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2814:2019-09-23T22:01:11Z DEBU             Unsuccessful attempt 54/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2815:2019-09-23T22:01:16Z DEBU             Unsuccessful attempt 55/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2816:2019-09-23T22:01:21Z DEBU             Unsuccessful attempt 56/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2820:2019-09-23T22:01:26Z DEBU             Unsuccessful attempt 57/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2821:2019-09-23T22:01:31Z DEBU             Unsuccessful attempt 58/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2822:2019-09-23T22:01:36Z DEBU             Unsuccessful attempt 59/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2823:2019-09-23T22:01:41Z DEBU             Unsuccessful attempt 60/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2824:2019-09-23T22:01:46Z DEBU             Unsuccessful attempt 61/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56
/var/log/gravity-system.log:2825:2019-09-23T22:01:51Z DEBU             Unsuccessful attempt 62/100: planet is not running yet: &{degraded [{ 10.151.19.194 master  offline []}]}, retry in 5s. utils/logginghook.go:56

Did it somehow auto-discover this node and try to connect to it?

Running planet status on 10.151.19.194 gave the following output.

During installation:

ip-10-151-19-194:/$ planet status --pretty
{
   "status": "degraded",
   "nodes": [
      {
         "name": "10_151_20_200.clevertorvalds8077",
         "member_status": {
            "name": "10_151_20_200.clevertorvalds8077",
            "addr": "10.151.20.200:7496",
            "status": "failed",
            "tags": {
               "publicip": "10.151.20.200",
               "role": "master"
            }
         }
      },
      {
         "name": "10_151_20_200.elegantwiles3321",
         "member_status": {
            "name": "10_151_20_200.elegantwiles3321",
            "addr": "10.151.20.200:7496",
            "status": "failed",
            "tags": {
               "publicip": "10.151.20.200",
               "role": "master"
            }
         }
      }
   ],
   "timestamp": "2019-09-24T00:00:03.884241916Z",
   "summary": "no status received from nodes (10_151_20_200.romanticblackwell5789,10_151_20_200.trustingcori6282,10_151_20_200.cleverardinghelli5621,10_151_19_194.awesomeleakey7861,10_151_20_200.ferventgoodall605,10_151_20_200.test,)"
}[ERROR]: status degraded

After the installation failed on 10.151.20.200:

ip-10-151-19-194:/$ planet status --pretty
    {
       "status": "degraded",
       "timestamp": "2019-09-24T00:07:03.889703245Z",
       "summary": "no status received from nodes (10_151_20_200.trustingcori6282,10_151_20_200.elegantwiles3321,10_151_20_200.cleverardinghelli5621,10_151_20_200.clevertorvalds8077,10_151_19_194.awesomeleakey7861,10_151_20_200.ferventgoodall605,10_151_20_200.test,10_151_20_200.romanticblackwell5789,)"

gravity status

ip-10-151-19-194:/$ gravity status
Cluster status:		degraded
Application:		test-appliance, version 0.0.1
Join token:		E17Xxl7pru
Periodic updates:	Not Configured
Remote support:		Not Configured
Last completed operation:
    * operation_install (c7c4da69-9787-409f-bf29-69b779b57bea)
      started:		Wed Sep 18 05:40 UTC (5 days ago)
      completed:	Wed Sep 18 05:40 UTC (5 days ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.151.19.194:32009
    * Cluster management URL:
        - https://10.151.19.194:32009
Cluster nodes:	awesomeleakey7861
    Nodes:
        * ip-10-151-19-194 / 10.151.19.194
            Status:	offline

@mtariq, after discussing this a bit with @eldios, what we suspect is happening is this: we use HashiCorp Serf (https://www.serf.io) for cluster discovery of the monitoring stack, so that health status continues to be tested and exchanged regardless of the health of the cluster. I.e., if etcd is down, you can still get the health of etcd from gravity/planet status.

The old node 10.151.19.194 presumably has some state on it that points to one of the nodes of the new cluster being installed, and that is allowing it to join the serf cluster. The serf port itself is unprotected, but the new cluster uses a new TLS root, so the actual exchange of health information doesn't work because the TLS certificates are untrusted.
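
If you want to see exactly which members the old node still thinks belong to its serf cluster, the serf binary ships inside Planet, so something like this on 10.151.19.194 should show it (the exact invocation may differ slightly by Planet version):

  sudo gravity shell
  serf members    # list serf members and their state (alive/failed/left)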

So I would make sure the new nodes you’re installing onto are fully removed from the old cluster.

Cool, thanks @knisbet. Is there a way we can remove the state from the old cluster? I tried gravity remove <node> but it did not work.

ip-10-151-19-194:/$ gravity remove 10_151_20_200.test --force
[ERROR]: could not find server matching [10_151_20_200.test] among registered cluster nodes
ip-10-151-19-194:/$ gravity remove 10.151.20.200 --force
[ERROR]: could not find server matching [10.151.20.200] among registered cluster nodes

If the node was offline when you tried to run the remove from the cluster, the cluster may not have been able to trigger the uninstall process remotely.

The node can be uninstalled locally from the server itself by using the gravity leave command. See https://gravitational.com/gravity/docs/cluster/#removing-a-node for more info.
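
For example, on the node you want to wipe (10.151.20.200 in your case); --force skips coordination with the rest of the cluster, so use it only when the cluster can't be reached:

  sudo gravity leave --force    # tear down the local Gravity/Planet state on this node only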

@knisbet, this did not work. I have tried uninstalling multiple times on node 10.151.20.200. I do not want to uninstall on 10.151.19.194 since it is a different cluster (I assume that could fix it too, but I would lose all data). I thought I would join this node back to the old cluster and then leave, but that did not work either; see the messages below. Is there a way to clean up the state on 10.151.19.194 or see what is going on there?

root@ip-10-151-20-200:/home/ubuntu# sudo ./gravity join 10.151.19.194 --advertise-addr=10.151.20.200 --token=XXXXXXX --cloud-provider=generic
Tue Sep 24 19:12:47 UTC	Starting agent

To abort the agent and clean up the system,
press Ctrl+C two times in a row.

If you get disconnected from the terminal, you can reconnect to the installer
agent by issuing 'gravity resume' command.
See https://gravitational.com/gravity/docs/cluster/#managing-an-ongoing-operation for details.

Tue Sep 24 19:12:47 UTC	Connecting to agent
Tue Sep 24 19:12:48 UTC	Connected to agent
Tue Sep 24 19:12:48 UTC	Connecting to cluster
Tue Sep 24 19:12:48 UTC	Waiting for another operation to finish at 10.151.19.194
Tue Sep 24 19:12:48 UTC	Connecting to cluster
Tue Sep 24 19:12:48 UTC	Waiting for another operation to finish at 10.151.19.194

Any thoughts on the above error, @knisbet @eldios?

Hey @mtariq, sorry, I had to travel to an event for a couple of days last week.

It depends on the status of that existing cluster and what is currently locked. The gravity status output should show whether there is an operation underway, such as an upgrade.

There is a command to force the cluster lock back into a working state if you think it's erroneous, but in this case your mileage may vary: I don't usually recommend using it unless you are on with support, who have had an opportunity to confirm the forced reset won't leave the cluster in a worse state.

gravity status-reset will remove the cluster lock, which should allow the node to join.

Additionally, some operations can also be detected as a conflict, so if there is a join operation for what appears to be a master node, that may still prevent the join from happening. To work around that issue: gravity status will print the UUID of the currently in-progress operations, which can fail or hang from time to time. In that case, etcdctl within Planet on a master can be used to remove the tree under /gravity/local/sites/<cluster_name>/ops/<operation_uuid>. An example would be etcdctl rm --recursive /gravity/local/sites/aev523-test/ops/9ba8377f-b5e5-4489-9a15-ffb3dba4877f
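
Roughly, the sequence on a master of the old cluster would be the following; the cluster name and operation UUID are placeholders, so substitute the real values from gravity status (the ls step is just to double-check the path before deleting anything):

  sudo gravity shell
  etcdctl ls /gravity/local/sites/<cluster_name>/ops                                # list the recorded operations
  etcdctl rm --recursive /gravity/local/sites/<cluster_name>/ops/<operation_uuid>   # remove the stuck one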

One last item: although I'm under the impression we fixed it, it's possible the new node only exists as part of the cluster from the perspective of serf. In that case, serf itself has a force-leave command to force a node out of the serf cluster: https://www.serf.io/docs/commands/force-leave.html
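
For example, from inside Planet on 10.151.19.194, using one of the stale member names that planet status printed (10_151_20_200.test is just one of them):

  sudo gravity shell
  serf force-leave 10_151_20_200.test    # force the stale member out of the serf cluster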