Site:6.3.6 postInstall hook failing on /healthz endpoint

Installing on bare-metal Ubuntu 18.04 LTS (fresh installation).

I always seem to get an error from the healthz check during installation, but right now I am unable to install at all due to a crash in the site:6.3.6 postInstall hook.

This error keeps printing for 7 minutes, and then a crashreport.tgz is created:

Thu Mar 12 13:27:38 UTC [INFO] [castproxy] Executing postInstall hook for site:6.3.6.
Created Pod "site-app-post-install-e86d7b-8qtfb" in namespace "kube-system".

Container "post-install-hook" created, current state is "waiting, reason PodInitializing".

Pod "site-app-post-install-e86d7b-8qtfb" in namespace "kube-system", has changed state from "Pending" to "Running".
Container "post-install-hook" changed status from "waiting, reason PodInitializing" to "running".

[ERROR]: failed connecting to https://gravity-site.kube-system.svc.cluster.local:3009/healthz
Container "post-install-hook" changed status from "running" to "terminated, exit code 255".

Container "post-install-hook" restarted, current state is "running".

[ERROR]: failed connecting to https://gravity-site.kube-system.svc.cluster.local:3009/healthz
Container "post-install-hook" changed status from "running" to "terminated, exit code 255".

Container "post-install-hook" changed status from "terminated, exit code 255" to "waiting, reason CrashLoopBackOff".

Container "post-install-hook" restarted, current state is "running".

I haven’t added any post-install hook for sites. Any idea why this would cause the installation to fail?

Thanks in advance.

David

There are a couple of reasons this tends to happen. On a multi-node install, it can occur because not all of the required ports have been opened between nodes: https://gravitational.com/gravity/docs/requirements/#cluster-ports
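
If this is a multi-node install, a quick sanity check (just a sketch; 10.0.0.2 is a placeholder for a peer node's address, and 3009 is only the gravity-site port from your log) is to confirm the ports are reachable from one node to another:

# substitute the address of each peer node; repeat for every port in the linked requirements page
nc -vz 10.0.0.2 3009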

On a single-node install, it can occur if the upstream DNS is unreachable (we currently have an open issue where queries that include search domains are sent to the upstream DNS server; if it isn't responding, the total time spent on DNS resolution can cause other connections to time out), or if the iptables rules for the gravity-site service didn't get set up for some reason (check kube-proxy inside planet). Missing iptables rules can be caused by endpoint security agents or other security configuration on the system, such as SELinux. Based on your description this seems unlikely, but I'm mentioning it just in case.
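
For the single-node case, a couple of rough checks from inside planet (this assumes curl and getent are available there; adjust to whatever tools your environment has):

# try the failing endpoint directly from inside planet
gravity exec curl -vk https://gravity-site.kube-system.svc.cluster.local:3009/healthz

# check name resolution; slow or hanging lookups point at the upstream DNS issue mentioned above
gravity exec getent hosts gravity-site.kube-system.svc.cluster.local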

Depending on the version, gravity status may give you hints about some of these potential causes.

The first things I would check are gravity status for any indication of problems or missing configuration on the node, then the iptables rule base (iptables-save) and the kube-proxy logs (gravity exec -it journalctl -u kube-proxy) for errors.
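
Concretely, something along these lines (exact output varies by version; the grep pattern is just a convenient way to find the kube-proxy rules, which are tagged with the service name):

# look for failed probes, degraded nodes, or missing configuration
gravity status

# dump the iptables rule base and check that the gravity-site service rules exist
gravity exec iptables-save | grep gravity-site

# kube-proxy logs inside planet, for errors while setting up those rules
gravity exec -it journalctl -u kube-proxy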