Site Post Install Hook Failing to get healthcheck

#1

Moving this over from GitHub at request:

https://github.com/gravitational/gravity/issues/353

The install repeatedly fails and hangs on the “site-app-post-install” job.

Version: 5.5.3 (the issue was not present on 5.3.5; it looks to be related to the switch to CoreDNS)
Environment: 3 RHEL 7.6 EC2 VMs on AWS

I have created a duplicate of the “site-app-post-install” job, replacing the gravity site status command so that I have access to the container.
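Roughly, the duplicate was created like this (a sketch from memory; the kube-system namespace, the -debug name, and the sleep command are my own choices, and the dumped YAML needs its status and controller metadata stripped before re-applying):

```
# Dump the hook job spec so it can be modified and re-created
kubectl -n kube-system get job site-app-post-install -o yaml > debug-job.yaml

# In debug-job.yaml: give the job a new name (e.g. site-app-post-install-debug),
# remove the status and controller metadata, and replace the container command
# with something long-running, e.g.:
#   command: ["sleep", "3600"]

kubectl apply -f debug-job.yaml

# Exec into the resulting pod to test DNS from the job's point of view
kubectl -n kube-system get pods -l job-name=site-app-post-install-debug
kubectl -n kube-system exec -it <pod-name> -- sh
```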

I have also passed the upstream DNS from AWS to --dns-zone on install.

In that container, an nslookup of the service that gravity site status calls does not resolve:

```
/ # nslookup gravity-site.kube-system.svc.cluster.local
Server:    10.100.14.135
Address 1: 10.100.14.135 kube-dns.kube-system.svc.cluster.local
Address 1: 10.100.14.135 ip-10-100-14-135.eu-central-1.compute.internal

nslookup: can't resolve 'gravity-site.kube-system.svc.cluster.local'
```

The /etc/resolv.conf of the job container is:
```
/ # cat /etc/resolv.conf
nameserver 10.100.14.135
search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal
options ndots:5
```

(Note: 10.100.14.135 is the cluster IP of the kube-dns service.)
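That can be confirmed with, e.g. (assuming the service is named kube-dns, as in a stock cluster):

```
# The CLUSTER-IP column should match the nameserver in resolv.conf
kubectl -n kube-system get svc kube-dns
```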

When I exec into the “gravity-site” container, I am able to look up gravity-site.kube-system.svc.cluster.local.

All of the CoreDNS daemonset containers are running. I enabled query logging in the CoreDNS settings; queries from services in the monitoring namespace are reaching CoreDNS, but requests from the “site-app-post-install” job never appear in the CoreDNS logs.
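For reference, I enabled query logging by adding the log plugin to the Corefile, roughly like this (the ConfigMap name and pod label here are what a stock cluster uses and may differ under Gravity):

```
# Add the "log" plugin to the server block of the Corefile, e.g.:
#   .:53 {
#       log
#       errors
#       ...
#   }
kubectl -n kube-system edit configmap coredns

# Restart the coredns pods so they pick up the new Corefile
kubectl -n kube-system delete pods -l k8s-app=coredns
```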


#2

Thanks for the information. I see a couple of possibilities for the failure:

  1. gravity-site pod resolution works differently than in the job pod. Because gravity-site runs in the host network namespace, it actually uses a separate instance of CoreDNS that is bound to localhost (127.0.0.2) and used for services running outside of Kubernetes. You can verify this with the check just below.
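To see the difference directly, compare the two resolvers from inside planet (a rough sketch; assumes nslookup in that environment accepts a server argument):

```
gravity enter

# Host-side coredns instance bound to localhost
nslookup gravity-site.kube-system.svc.cluster.local 127.0.0.2

# Cluster DNS service that the job pod uses
nslookup gravity-site.kube-system.svc.cluster.local 10.100.14.135
```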

  2. The overlay network might not be functioning
    When running inside the re-created job pod, are you able to resolve anything at all?
    It’s possible your overlay network isn’t functioning. You mentioned running inside AWS, so have you verified that the nodes have the right IAM permissions to create routes? I would check inside planet whether flannel is reporting errors:

```
gravity enter
journalctl -u flanneld.service
```

Alternatively, to run without AWS integrations, you would need to pass the --cloud-provider=generic flag to the installer, which disables them.
If this is the problem, be careful: a percentage of queries may still work. We run CoreDNS as a daemonset, which means a fraction of queries will be load-balanced to the local node and not require the overlay network.
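A rough way to see whether you’re in that situation is to repeat the lookup from the job pod; with a broken overlay but a working local coredns, only a fraction of attempts will succeed:

```
# Repeat the same lookup; intermittent failures suggest a broken overlay
for i in 1 2 3 4 5 6 7 8 9 10; do
  nslookup gravity-site.kube-system.svc.cluster.local || echo "attempt $i failed"
done
```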

  3. External DNS may be unreachable
    We’ve seen this particular hook fail when the Gravity cluster is unable to reach upstream nameservers, due to the search domain. We copy the search domain into the cluster so that name resolution inside Kubernetes works the same as it would on the host. However, because of the search domain, the first query attempted by the pod will be for gravity-site.kube-system.svc.cluster.local., and if the upstream resolvers are not reachable, it will take time to time out, which can be long enough for the post-install hook to fail.
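To check whether the search domain is what’s slowing things down, compare a relative lookup against an absolute one with a trailing dot, which bypasses the search list entirely (a sketch, run from the job pod):

```
# Relative name: with ndots:5 the search domains are tried,
# including eu-central-1.compute.internal via the upstream resolvers
nslookup gravity-site.kube-system.svc.cluster.local

# Absolute name (trailing dot): skips the search list, so it should
# answer quickly even when the upstream resolvers are unreachable
nslookup gravity-site.kube-system.svc.cluster.local.
```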