Gravity-site failing to generate TLS certificate for joined node

Hi,

I’ve been working with Gravity for about a month now and have really enjoyed it! I’m running in AWS, but due to company restrictions I install with cloud-provider=generic. My install command is something like this:

./install --advertise-addr=<REDACTED> --token=<REDACTED> --cloud-provider=generic

We re-roll our dev clusters once every 2 weeks. The cluster I’m working with has been up for a few days. When I run “sudo gravity status” everything comes back as green and healthy.

We have some pods (Harbor, Sonatype Nexus, Gitlab, ELK) up right now w/ services & ingress brokered by an NGINX ingress controller behind an Elastic Load Balancer. This all seems to be working just fine.

However, the one issue we’ve had is connecting to gravity-site. Because we cannot connect directly to our EC2 instances over HTTP/HTTPS, I set up an ingress object for gravity-site instead.
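
For reference, the ingress looks roughly like this (hostname simplified; the backend service name and port are what I believe gravity-site exposes in kube-system, so please correct me if that assumption is wrong):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: gravity-site
  namespace: kube-system
spec:
  rules:
  - host: gravity.dev.example.internal
    http:
      paths:
      - path: /
        backend:
          serviceName: gravity-site
          servicePort: 3009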

According to the nginx logs, traffic flows to gravity-site-proxy, but nothing renders in our UI. We actually get a 400 error back. If I look at the logs for my gravity-site pod, I see the following:

2020-08-12T20:02:43Z INFO             Runtime configuration. args:[--node-name=<REDACTED> --hostname=<REDACTED> --master-ip= --public-ip=<REDACTED> --cluster-id=cockyyonath5487 --etcd-proxy=off --etcd-member-name=<REDACTED>.cockyyonath5487 --initial-cluster=<REDACTED>.cockyyonath5487:<REDACTED>,<REDACTED>.cockyyonath5487:<REDACTED>,<REDACTED>.cockyyonath5487:<REDACTED>--secrets-dir=/var/lib/gravity/secrets --etcd-initial-cluster-state=existing --volume=/var/lib/gravity/planet/etcd:/ext/etcd --volume=/var/lib/gravity/planet/registry:/ext/registry --volume=/var/lib/gravity/planet/docker:/ext/docker --volume=/var/lib/gravity/planet/share:/ext/share --volume=/var/lib/gravity/planet/state:/ext/state --volume=/var/lib/gravity/planet/kubelet:/var/lib/kubelet --volume=/var/lib/gravity/planet/log:/var/log --volume=/var/lib/gravity:/var/lib/gravity --service-uid=980665 --no-election-enabled --role=master --vxlan-port=8472 --dns-listen-addr=127.0.0.2 --dns-port=53 --docker-backend=overlay2 --docker-options=--storage-opt=overlay2.override_kernel_check=1 --node-label=gravitational.io/advertise-ip=<REDACTED> --node-label=gravitational.io/k8s-role=master --node-label=kubernetes.io/hostname=<REDACTED> --node-label=role=node --node-label=kubernetes.io/arch=amd64 --node-label=kubernetes.io/os=linux --allow-privileged --service-subnet=100.100.0.0/16 --pod-subnet=100.96.0.0/16] opsservice/configure.go:1045
2020-08-12T20:02:43Z INFO             Generate configuration package. args:[--node-name=<REDACTED> --hostname=<REDACTED> --master-ip= --public-ip=<REDACTED> --cluster-id=cockyyonath5487 --etcd-proxy=off --etcd-member-name=<REDACTED>.cockyyonath5487 --initial-cluster=<REDACTED>.cockyyonath5487:<REDACTED>,<REDACTED>.cockyyonath5487:<REDACTED>,<REDACTED>.cockyyonath5487:<REDACTED>--secrets-dir=/var/lib/gravity/secrets --etcd-initial-cluster-state=existing --volume=/var/lib/gravity/planet/etcd:/ext/etcd --volume=/var/lib/gravity/planet/registry:/ext/registry --volume=/var/lib/gravity/planet/docker:/ext/docker --volume=/var/lib/gravity/planet/share:/ext/share --volume=/var/lib/gravity/planet/state:/ext/state --volume=/var/lib/gravity/planet/kubelet:/var/lib/kubelet --volume=/var/lib/gravity/planet/log:/var/log --volume=/var/lib/gravity:/var/lib/gravity --service-uid=980665 --no-election-enabled --role=master --vxlan-port=8472 --dns-listen-addr=127.0.0.2 --dns-port=53 --docker-backend=overlay2 --docker-options=--storage-opt=overlay2.override_kernel_check=1 --node-label=gravitational.io/advertise-ip=<REDACTED> --node-label=gravitational.io/k8s-role=master --node-label=kubernetes.io/hostname=<REDACTED> --node-label=role=node --node-label=kubernetes.io/arch=amd64 --node-label=kubernetes.io/os=linux --allow-privileged --service-subnet=100.100.0.0/16 --pod-subnet=100.96.0.0/16] manifest:&pack.Manifest{Version:"0.0.1", Config:(*schema.Config)(0xc00150ac80), Commands:[]pack.Command{pack.Command{Name:"start", Description:"", Args:[]string{"rootfs/usr/bin/planet", "start"}}, pack.Command{Name:"stop", Description:"", Args:[]string{"rootfs/usr/bin/planet", "stop"}}, pack.Command{Name:"enter", Description:"", Args:[]string{"rootfs/usr/bin/planet", "enter"}}, pack.Command{Name:"exec", Description:"", Args:[]string{"rootfs/usr/bin/planet", "exec"}}, pack.Command{Name:"status", Description:"", Args:[]string{"rootfs/usr/bin/planet", "status"}}, pack.Command{Name:"local-status", Description:"", Args:[]string{"rootfs/usr/bin/planet", "status", "--local"}}, pack.Command{Name:"secrets-init", Description:"", Args:[]string{"rootfs/usr/bin/planet", "secrets", "init"}}, pack.Command{Name:"gen-cert", Description:"", Args:[]string{"rootfs/usr/bin/planet", "secrets", "gencert"}}}, Labels:[]pack.Label{pack.Label{Name:"os", Value:"linux"}, pack.Label{Name:"version-etcd", Value:"v3.4.9"}, pack.Label{Name:"version-k8s", Value:"v1.17.6"}, pack.Label{Name:"version-flannel", Value:"v0.10.1-gravitational"}, pack.Label{Name:"version-docker", Value:"18.09.9"}, pack.Label{Name:"version-helm", Value:"2.15.2"}, pack.Label{Name:"version-coredns", Value:"1.3.1"}, pack.Label{Name:"version-node-problem-detector", Value:"v0.6.4"}}, Service:(*systemservice.NewPackageServiceRequest)(0xc000733800)} package:gravitational.io/planet:7.0.30-11706 pack/utils.go:146
2020-08-12T20:02:45Z INFO             Generate configuration package. args:[--config-string=dGVsZXBvcnQ6CiAgbm9kZW5hbWU6IDEwXzE0OF85Xzc5LmNvY2t5eW9uYXRoNTQ4NwogIGRhdGFfZGlyOiAvdmFyL2xpYi9ncmF2aXR5L3RlbGVwb3J0CiAgYXV0aF90b2tlbjogTklOcEx6TDcKICBhdXRoX3NlcnZlcnM6CiAgLSAxMjcuMC4wLjE6MzAyNQogIC0gMTAuMTQ4LjkuODU6MzAyNQogIC0gMTAuMTQ4LjkuNjg6MzAyNQogIGxvZzoKICAgIHNldmVyaXR5OiBpbmZvCiAgYWR2ZXJ0aXNlX2lwOiAxMC4xNDguOS43OQogIGNhY2hlOgogICAgZW5hYmxlZDogInllcyIKICAgIHR0bDogODc2MGgwbTBzCiAgY2lwaGVyczoKICAtIGFlczEyOC1nY21Ab3BlbnNzaC5jb20KICAtIGFlczEyOC1jdHIKICAtIGFlczE5Mi1jdHIKICAtIGFlczI1Ni1jdHIKICBrZXhfYWxnb3M6CiAgLSBjdXJ2ZTI1NTE5LXNoYTI1NkBsaWJzc2gub3JnCiAgLSBlY2RoLXNoYTItbmlzdHAyNTYKICAtIGVjZGgtc2hhMi1uaXN0cDM4NAogIC0gZWNkaC1zaGEyLW5pc3RwNTIxCiAgbWFjX2FsZ29zOgogIC0gaG1hYy1zaGEyLTI1Ni1ldG1Ab3BlbnNzaC5jb20KICAtIGhtYWMtc2hhMi0yNTYKICBjYV9waW46ICIiCmF1dGhfc2VydmljZToKICBlbmFibGVkOiAibm8iCiAgc2Vzc2lvbl9yZWNvcmRpbmc6ICIiCiAgY2xpZW50X2lkbGVfdGltZW91dDogMHMKICBkaXNjb25uZWN0X2V4cGlyZWRfY2VydDogZmFsc2UKICBrZWVwX2FsaXZlX2NvdW50X21heDogMApzc2hfc2VydmljZToKICBlbmFibGVkOiAieWVzIgogIGxhYmVsczoKICAgIGFkdmVydGlzZS1pcDogMTAuMTQ4LjkuNzkKICAgIGFwcC1yb2xlOiBub2RlCiAgICBkaXNwbGF5LXJvbGU6IERlZmF1bHQgbm9kZQogICAgZnFkbjogMTBfMTQ4XzlfNzkuY29ja3l5b25hdGg1NDg3CiAgICBncmF2aXRhdGlvbmFsLmlvL2s4cy1yb2xlOiBtYXN0ZXIKICAgIGhvc3RuYW1lOiBpcC0xMC0xNDgtOS03OS5nZC1tcy51cwogICAgaW5zdGFuY2UtdHlwZTogIiIKICAgIHJvbGU6IG5vZGUKcHJveHlfc2VydmljZToKICBlbmFibGVkOiAibm8iCg==] manifest:&pack.Manifest{Version:"0.0.1", Config:(*schema.Config)(0xc00135e360), Commands:[]pack.Command{pack.Command{Name:"start", Description:"", Args:[]string{"rootfs/usr/bin/teleport", "start"}}, pack.Command{Name:"tctl", Description:"", Args:[]string{"rootfs/usr/bin/tctl"}}}, Labels:[]pack.Label{pack.Label{Name:"os", Value:"linux"}}, Service:(*systemservice.NewPackageServiceRequest)(0xc001da6c00)} package:gravitational.io/teleport:3.2.14 pack/utils.go:146
2020-08-12T20:03:01Z INFO [CA]        Generating TLS certificate {0x6d9d918 0xc001460980 CN=opscenter@gravitational.io,O=@teleadmin,POSTALCODE=null,L=root 2020-08-13 06:03:01.395258763 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2020-08-13 06:03:01.395258763 +0000 UTC org:[@teleadmin] org_unit:[] tlsca/ca.go:186
2020-08-12T20:03:01Z INFO [PROCESS]   Synced cluster cockyyonath5487 to local backend. mode:site process/process.go:630
2020-08-12T20:03:01Z INFO [PROCESS]   Synced operation bd0b0fae-f5c1-4a50-b6af-1254d346bd59/operation_expand to local backend. mode:site process/process.go:645
2020-08-12T20:03:01Z INFO [PROCESS]   Synced operation f32f5e74-85f2-4943-b1d9-6f54b08d0503/operation_expand to local backend. mode:site process/process.go:645
2020-08-12T20:03:07Z INFO [OPS]       Status checks are paused, cluster is expanding. opsservice/status.go:44
2020-08-12T20:03:22Z INFO             getPackages(cockyyonath5487) webpack/webpack.go:171
2020-08-12T20:04:01Z INFO [CA]        Generating TLS certificate {0x6d9d918 0xc001ac4a80 CN=opscenter@gravitational.io,O=@teleadmin,POSTALCODE=null,L=root 2020-08-13 06:04:01.392807203 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2020-08-13 06:04:01.392807203 +0000 UTC org:[@teleadmin] org_unit:[] tlsca/ca.go:186
2020-08-12T20:04:07Z INFO [OPS]       Status checks are paused, cluster is expanding. opsservice/status.go:44
2020-08-12T20:04:55Z INFO [OPS]       ops.SetOperationStateRequest{State:"completed", Progress:(*ops.ProgressEntry)(0xc0016ce900)} opsservice/service.go:1054
2020-08-12T20:04:55Z INFO [OPS]       AuditEvent(Event={operation.completed G0004I}, Fields=map[cluster:cockyyonath5487 hostname:<REDACTED> id:bd0b0fae-f5c1-4a50-b6af-1254d346bd59 ip:10.148.9.79 role:node type:operation_expand user:agent@cockyyonath5487]). opsservice/service.go:1425
2020-08-12T20:04:55Z INFO [AGENT-SER] PeerLeave. mode:site req:PeerLeaveRequest(addr=<REDACTED>:3012, config=RuntimeConfig(role=node, addr=, system-dev="", state-dir=, temp-dir=, token=225e5492cc4b24df52696afbd39a56c79e584dbc704feac74152d60cd49bdf01, key-values=map[], mounts=, cloud=CloudMetadata(<empty>))) server/agent.go:73
2020-08-12T20:04:55Z INFO [PROCESS]   RemovePeer. mode:site peer:<REDACTED>:3012 process:10.148.9.68 opsservice/agents.go:399
2020-08-12T20:04:55Z INFO [AGENT-GRO] Monitoring loop closing. monitored:10.148.9.79:3012 server/peers.go:172
2020-08-12T20:04:55Z INFO [AGENT-GRO] Reconnect loop closing. reconnected:peer(addr=10.148.9.79:3012) server/peers.go:244
2020-08-12T20:05:01Z INFO [CA]        Generating TLS certificate {0x6d9d918 0xc0010ced00 CN=opscenter@gravitational.io,O=@teleadmin,POSTALCODE=null,L=root 2020-08-13 06:05:01.39310893 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2020-08-13 06:05:01.39310893 +0000 UTC org:[@teleadmin] org_unit:[] tlsca/ca.go:186
2020-08-12T20:05:01Z INFO [PROCESS]   Synced operation bd0b0fae-f5c1-4a50-b6af-1254d346bd59/operation_expand to local backend. mode:site process/process.go:645
2020-08-12T20:06:01Z INFO [CA]        Generating TLS certificate {0x6d9d918 0xc00118aa60 CN=opscenter@gravitational.io,O=@teleadmin,POSTALCODE=null,L=root 2020-08-13 06:06:01.391569627 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2020-08-13 06:06:01.391569627 +0000 UTC org:[@teleadmin] org_unit:[] tlsca/ca.go:186
...
2020-08-13T11:09:07Z INFO [OPS]       Deactivating cluster cockyyonath5487 with reason "cluster_degraded". opsservice/service.go:1229
2020-08-13T11:09:07Z INFO [OPS]       AuditEvent(Event={cluster.degraded G3000W}, Fields=map[reason:cluster_degraded user:@statuschecker]). opsservice/service.go:1425
2020-08-13T11:09:07Z WARN [PROCESS]   Cluster status check failed. error:[
ERROR REPORT:
Original Error: *trace.BadParameterError cluster is not healthy: &status.Agent{SystemStatus:degraded, Nodes:[]status.ClusterServer{status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.85", Role:"master", Profile:"", Status:"healthy", SELinux:(*bool)(nil), FailedProbes:[]string(nil), WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}, status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.79", Role:"master", Profile:"", Status:"healthy", SELinux:(*bool)(nil), FailedProbes:[]string(nil), WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}, status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.68", Role:"master", Profile:"", Status:"degraded", SELinux:(*bool)(nil), FailedProbes:[]string{"Ready/KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])"}, WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}}}
Stack Trace:
	/gopath/src/github.com/gravitational/gravity/lib/ops/opsservice/status.go:100 github.com/gravitational/gravity/lib/ops/opsservice.(*site).checkPlanetStatus
	/gopath/src/github.com/gravitational/gravity/lib/ops/opsservice/status.go:49 github.com/gravitational/gravity/lib/ops/opsservice.(*Operator).CheckSiteStatus
	/gopath/src/github.com/gravitational/gravity/lib/process/process.go:798 github.com/gravitational/gravity/lib/process.(*Process).runSiteStatusChecker
	/gopath/src/github.com/gravitational/gravity/lib/process/process.go:875 github.com/gravitational/gravity/lib/process.(*Process).startServiceWithContext.func1
	/go/src/runtime/asm_amd64.s:1337 runtime.goexit
User Message: cluster is not healthy: &status.Agent{SystemStatus:degraded, Nodes:[]status.ClusterServer{status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.85", Role:"master", Profile:"", Status:"healthy", SELinux:(*bool)(nil), FailedProbes:[]string(nil), WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}, status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.79", Role:"master", Profile:"", Status:"healthy", SELinux:(*bool)(nil), FailedProbes:[]string(nil), WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}, status.ClusterServer{Hostname:"", AdvertiseIP:"10.148.9.68", Role:"master", Profile:"", Status:"degraded", SELinux:(*bool)(nil), FailedProbes:[]string{"Ready/KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])"}, WarnProbes:[]string(nil), TeleportNode:(*ops.Node)(nil)}}}] mode:site process/process.go:799
2020-08-13T11:10:01Z INFO [CA]        Generating TLS certificate {0x6d9d918 0xc001b42600 CN=opscenter@gravitational.io,O=@teleadmin,POSTALCODE=null,L=root 2020-08-13 21:10:01.398920484 +0000 UTC []}. common_name:opscenter@gravitational.io dns_names:[] locality:[root] not_after:2020-08-13 21:10:01.398920484 +0000 UTC org:[@teleadmin] org_unit:[] tlsca/ca.go:186
2020-08-13T11:10:07Z INFO [OPS]       Activating cluster cockyyonath5487. opsservice/service.go:1277

Based on this error, it seems like something is wrong with my cluster health, yet all of the other services are working just fine. The error only occurs when joining the 3rd node to the cluster; the 1st (initial install) and the 2nd (join) worked without issue. It has been looping on this process for the past 4 days and I haven’t had much success debugging it.

Any insight into what is happening or how to possibly troubleshoot this would be greatly appreciated.

Thank you!

Hey, sorry for the delay.

The error you posted (“PLEG is not healthy”) is not related to your original issue as far as I can tell. PLEG stands for Pod Lifecycle Event Generator. Basically, it’s a loop inside the kubelet that lets it check the state of all running containers managed by the container runtime (Docker, in Gravity’s case). PLEG was created to abstract specific container runtimes away from the kubelet, and its purpose is the interpretation of container states. It does this by periodically relisting the containers and emitting the corresponding lifecycle events.
Now, there’s also a time limit imposed on the duration of that relist operation which, if exceeded, turns into the “PLEG is not healthy” message. It’s a good indicator that Docker is overloaded or unresponsive, so I would check the docker/kubelet logs for more detail. You’re welcome to share the relevant bits here.
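For example, from inside the planet environment you could pull those logs with something like the following (the entry command and unit names are from memory and may differ slightly in your Gravity version):

sudo gravity shell          # or "sudo gravity enter", depending on version
journalctl -u kube-kubelet --since "2 hours ago" | grep -i pleg
journalctl -u docker --since "2 hours ago" --no-pager | tail -n 200
docker info                 # quick sanity check on the storage driver and container counts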

Regarding your original issue: it would be helpful if you could share more specific details about the nginx/ingress configuration, specifically which port on the gravity-site service it is targeting.
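
If it helps, you can check which ports the gravity-site service actually exposes with something along these lines (assuming it lives in kube-system, which I believe is the default):

kubectl -n kube-system get svc gravity-site -o wide
kubectl -n kube-system describe svc gravity-site | grep -i port

One thing worth ruling out is nginx proxying plain HTTP to a TLS-only port on the backend; that kind of mismatch can produce a 400 like the one you’re seeing, but that’s just a guess without seeing the config.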