Mattermost demo on GCP: unable to join existing cluster (timeout/iptables issue?)

I’ve successfully created the Gravity tarfile for the cluster deployment, and have installed the cluster on a fresh Debian 9 image in GCP with: sudo ./gravity install --advertise-addr=10.168.0.10 --token=secret

(I’ve confirmed that I can gravity enter and that kubectl get pods shows the active pods.)

I then spin up an identically configured GCP instance on the same VPC, in the same region/zone, with firewall rules allowing full networking between the instances, and attempt to join the existing cluster with: sudo ./gravity join 10.168.0.10 --advertise-addr=10.168.0.11 --token=secret

However, this times out or fails:

Wed Aug 7 17:44:27 UTC Still waiting for the planet to start (8 minutes elapsed)
Wed Aug 7 17:44:27 UTC Saving debug report to /home/user/mattermost/crashreport.tgz

On the second host, after sudo gravity enter, I attempt to diagnose with kubectl get pods, but that eventually produces: Unable to connect to the server: dial tcp 10.100.0.1:443: i/o timeout

Within the second host’s environment (after I run sudo gravity enter, that is), docker ps -a shows no running containers. journalctl -xe gives: kube-proxy[109]: E0807 18:59:57.805298 109 proxier.go:1402] Failed to execute iptables-restore: exit status 2 (iptables-restore v1.6.0: Couldn't load target `KUBE-MARK-DROP': No such file or directory), followed by kube-proxy[109]: Error occurred at line: 81

What could be causing all of this? Where should I look for further error messages? And is there a more failsafe or known-good way to bring up a node?

I don’t think this explains the errors you’re seeing, but on GCP the nodes need a service account that allows editing GCP routes and load balancers for the Kubernetes integrations. Alternatively, the cluster can be installed with the flag --cloud-provider=generic, which turns off the cloud integrations and runs the cluster as if it were on-prem, using VXLAN networking.
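For example, something along these lines (just a sketch, reusing the addresses and token from your commands above; as I understand it the flag is accepted on both install and join):

# on the first node
sudo ./gravity install --advertise-addr=10.168.0.10 --token=secret --cloud-provider=generic

# on each joining node
sudo ./gravity join 10.168.0.10 --advertise-addr=10.168.0.11 --token=secret --cloud-provider=generic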

I’ll try again with --cloud-provider=generic for now, although I wanted to note that both nodes were set up with full API access. I’m not sure whether that’s sufficient, or whether I need to run some gcloud commands locally first. If there are any GCP-specific guidelines I could follow, that would be helpful, so I know I’m taking the right steps.

If the nodes have full API access, I believe that should be sufficient. Sorry, I only mentioned it because that isn’t the default setting and it’s very easy to run into. I’m not aware of us having any guides around GCP.

The troubleshooting steps are what you’ve been doing: gravity enter, look for systemd units that are failing to start, and use the journal to see whether processes such as flannel, kubelet, docker, etcd, etc. are complaining about something that is preventing the node from seeing planet as started.
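For example, something like this inside planet (a sketch; the exact unit names can vary between Gravity versions, so adjust as needed):

sudo gravity enter

# list any systemd units inside planet that failed to start
systemctl --failed

# check the journals of the usual suspects for errors
journalctl -u etcd --no-pager -n 50
journalctl -u docker --no-pager -n 50
journalctl -u flanneld --no-pager -n 50
journalctl -u kube-kubelet --no-pager -n 50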

It looks like with --cloud-provider=generic the join succeeds just fine. However, I’m still wondering what else I would need to do to make use of the native GCP features where applicable.

In other words, I’m still not clear on what caused the original errors, or whether there are steps I’m missing. Would everyone attempting this on GCP without --cloud-provider=generic hit the same error? And are there any other workarounds or options for the native GCP case?

Also, when I attempted to add a third node, even with --cloud-provider=generic specified exactly as before, the join stalled again: the same i/o timeout error, kubectl get pods still hangs, and the debug report never finishes saving (I believe because it’s running information-gathering commands that never complete).

It’s unclear why this is happening, especially since --cloud-provider=generic apparently isn’t sufficient to resolve it on its own.

In that last attempt, the coredns service logs Failed to list *v1.Pod: Get https://leader.telekube.local:6443 […]: no such host.

Then, wait-for-etcd.sh gives client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused

etcd gives error validating peerURLs and member count is unequal.

The coredns.conf on the last working node (the second) and the currently non-working one (the third attempt) is identical, but on the third node there is no coredns.hosts file at all, whereas it’s populated on the second.

I’m not sure what the fundamental cause of all of these errors is.

I’m not sure based on the errors you’re getting. The coredns.hosts file gets set by planet-agent, which monitors an election process to set the leader.telekube.local hosts entry for coredns, so this process relies on etcd.

Based on the etcd errors, it looks like etcd isn’t starting on the node that’s joining or is failing in some way?
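A couple of quick checks inside planet might narrow it down (a sketch; the coredns.hosts path and the 127.0.0.2 resolver address below are my assumptions and may differ in your build, and dig may not be available in planet):

sudo gravity enter

# has planet-agent written the leader entry yet? (path assumed)
cat /etc/coredns/coredns.hosts

# does the local resolver answer for the leader name?
dig @127.0.0.2 leader.telekube.local +short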

BTW, I just remembered where we store the complete set of requirements for GCP integrations: https://gravitational.com/gravity/docs/installation/#google-compute-engine

Ok. Is there a clear “first cause” here (for example, one of the noted errors that, if encountered, would likely cause the others)?

I see indications that etcd is not coming up. For example: etcdctl[1234]: Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused

I’d like to focus on the generic case rather than GCP specifics, since I presume this to be more reproducible by others.

That said, would it be possible to try to reproduce the errors? I’m using the CLI method of installing and joining, in generic mode, with three nodes in total. The first node installs and the second joins, but the third fails. Maybe these problems affect everyone on the current versions of the respective dependencies.

Are there automated tests in place to ensure that issues like these don’t come up for the demo case? What do you suggest I do next?

Well, the first cause to identify is why etcd isn’t running. Based on the information provided, it’s not clear why that’s the case; etcdctl being unable to reach etcd on localhost doesn’t tell us why etcd itself didn’t start or is malfunctioning.

The only case I can think of offhand where I’ve seen etcd die on a join is when the first node used for installation was accidentally configured with an advertise address of 127.0.0.1; later nodes would then fail to join because they would try to join 127.0.0.1 instead of the real IP of the node. I don’t believe this matches your case, though.

We do have automated integration tests that run on each PR and nightly, covering install, expand, and upgrade.

My suggestion for what to do next would be to look at the etcd systemd service inside planet, to see why it didn’t start or isn’t binding its listening ports. All the symptoms you’ve outlined point to a failure to talk to etcd on localhost:2379, so that’s where I would look next.
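Something along these lines (a sketch; the etcdctl syntax depends on the etcd version shipped in planet, and ss may need to be swapped for netstat):

sudo gravity enter

# is the etcd unit running, and what did it last log?
systemctl status etcd
journalctl -u etcd --no-pager -n 100

# is anything actually listening on the client port?
ss -tlnp | grep 2379

# can we reach etcd from localhost? (v2 syntax; with etcdctl v3 it's 'etcdctl endpoint health')
etcdctl cluster-health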

It looks like the errors include the member count is unequal message.

I’m including the verbose output below:

Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: peerTLS: cert = /var/state/etcd.cert, key = /var/state/etcd.key, ca = , trusted-ca = /var/state/root.cert, client-cert-auth = true, crl-file =
Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: listening for peers on https://10.168.0.25:2380
Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: listening for peers on https://10.168.0.25:7001
Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: listening for client requests on 0.0.0.0:2379
Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: listening for client requests on 0.0.0.0:4001
Aug 08 17:22:08 ephemeral-gravity-instance-test-e coredns[99]: E0808 17:22:08.789250      99 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:320: Failed to list *v1.Pod: Get https://leader.telekube.local:6443/api/v1/pods?limit=500&resourceVersion=0: dial tcp: lookup leader.telekube.local on 127.0.0.2:53: no such host
Aug 08 17:22:08 ephemeral-gravity-instance-test-e coredns[99]: E0808 17:22:08.789316      99 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://leader.telekube.local:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup leader.telekube.local on 127.0.0.2:53: no such host
Aug 08 17:22:08 ephemeral-gravity-instance-test-e coredns[99]: E0808 17:22:08.789351      99 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://leader.telekube.local:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup leader.telekube.local on 127.0.0.2:53: no such host
Aug 08 17:22:08 ephemeral-gravity-instance-test-e coredns[99]: E0808 17:22:08.789387      99 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://leader.telekube.local:6443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp: lookup leader.telekube.local on 127.0.0.2:53: no such host
Aug 08 17:22:08 ephemeral-gravity-instance-test-e etcd[1234]: error validating peerURLs {ClusterID:7a5a109e3534c214 Members:[&{ID:cb4e3c308e36cd98 RaftAttributes:{PeerURLs:[https://10.168.0.20:2380]} Attributes:{Name: ClientURLs:[]}} &{ID:e040208b0e954b8a RaftAttributes:{PeerURLs:[https://10.168.0.10:2380]} Attributes:{Name:10_168_0_10.elasticwozniak3205 ClientURLs:[https://10.168.0.10:2379 https://10.168.0.10:4001]}} &{ID:f58c9596675f6820 RaftAttributes:{PeerURLs:[https://10.168.0.19:2380]} Attributes:{Name:10_168_0_19.elasticwozniak3205 ClientURLs:[https://10.168.0.19:2379 https://10.168.0.19:4001]}}] RemovedMemberIDs:[]}: member count is unequal
Aug 08 17:22:08 ephemeral-gravity-instance-test-e systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support

I’d want to have more information in that error message. For example, member count N is not equal to M, and these two values came from […].

I can say, though, that the peer ending in .20 doesn’t appear in the output of kubectl get nodes -owide when run on the master, a working node. That is, .20 was never properly registered, I believe, so I’m not sure why it’s there. Also, after five or six consecutive join attempts, I finally got a third node to join the cluster, for a total of three. There was no fundamental difference between the node that finally joined successfully and the many before it that did not; they were all created from the same template.

Would etcd be a prerequisite to coredns? It seems problematic to not have DNS working, or to not have a coredns.hosts file. If etcd is a prerequisite, then that would be the first thing to focus on.

Etcd is a prerequisite to coredns. The coredns.hosts file is based on a leader election, and that election is done through etcd.

Even though kubectl get nodes doesn’t show the .20 host, that doesn’t mean the .20 host isn’t configured as part of the cluster; these are two different things. In Gravity there are several layers, or separate pieces, of the configuration (see the commands sketched after this list):

  1. Gravity itself, when you try to add a node, will update its internal data structures to know about that node. gravity leave, if successful, will remove the node from those data structures.
  2. Etcd: etcd is its own clustering technology, so when you try to join a master to the cluster, Gravity will tell etcd about the new master so that it can join. If the node isn’t cleanly removed or otherwise cleaned up in etcd for any reason, etcd will still know about that master even if the install failed, just because the install was attempted.
  3. Kubelet will create a node object within the Kubernetes API. This happens when kubelet starts on the node and is able to contact the Kubernetes API; when it first contacts the API it self-registers, and that is what you see in kubectl get nodes. If kubelet doesn’t start, or isn’t able to connect to the API, you won’t see the node as a member of the cluster from kubectl’s perspective.
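To see what each layer thinks, something like this (a sketch; etcdctl syntax varies by version):

# layer 1: what Gravity knows about the cluster
sudo gravity status

# layer 2: what etcd knows about its members (run inside planet on a working master)
sudo gravity enter
etcdctl member list

# layer 3: which nodes have registered with the Kubernetes API
kubectl get nodes -o wide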

So this particular node’s join failure could be caused by a previous join failure that left some partial state in the cluster. It’s hard to speculate on what that might be without knowing the full history of the failures.
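If the .20 peer really is leftover state from an earlier failed join, and you’re certain that machine is gone for good, one thing you could try (carefully, inside planet on a working master) is removing the stale member from etcd, roughly like this:

sudo gravity enter

# find the stale peer: the member whose PeerURLs point at https://10.168.0.20:2380
etcdctl member list

# remove it by ID; cb4e3c308e36cd98 is the ID your log shows for the .20 peer
etcdctl member remove cb4e3c308e36cd98

I’d only do that if you’re sure the .20 node will never rejoin, since removing a member that is actually alive will break it.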