Oracle Linux 7.6

While Oracle Linux isn’t explicitly listed as a supported OS, we were hopeful that an install would work out since it’s based on RHEL. That has turned out not to be the case, and I wanted to see if this is a known issue and whether there are any workarounds. We’ve actually seen two failure scenarios: one with the Enterprise support version and one using an AMI provided by Oracle on AWS.

Bare metal OEL 7.6

For the first, on bare metal OEL: the installation gets to the “Wait for kubernetes to become available” step but fails repeatedly with:

2019-06-04T09:30:15-04:00 DEBU             dialing leader.telekube.local:6443 install/hook.go:56
2019-06-04T09:30:15-04:00 WARN             "Failed to dial with local resolver: \nERROR REPORT:\nOriginal Error: *net.OpError read udp 127.0.0.1:16976->127.0.0.2:53: read: connection refused\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/dns.go:45 github.com/gravitational/gravity/lib/utils.ResolveAddr\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:248 github.com/gravitational/gravity/lib/httplib.DialWithLocalResolver\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:220 github.com/gravitational/gravity/lib/httplib.DialFromEnviron.func1\n\t/go/src/net/http/transport.go:916 net/http.(*Transport).dial\n\t/go/src/net/http/transport.go:1240 net/http.(*Transport).dialConn\n\t/go/src/net/http/transport.go:999 net/http.(*Transport).getConn.func4\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to resolve leader.telekube.local:6443\n." install/hook.go:56

Until eventually failing the phase:

2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Applied StateChange(Phase=/wait, State=failed, Error=context deadline exceeded). opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:14-04:00 ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: context.deadlineExceededError context deadline exceeded\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:82 github.com/gravitational/gravity/lib/install/phases.(*waitExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/wait\"\n." install/hook.go:56
2019-06-04T09:35:14-04:00 INFO [OPS]       ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc0008a5f00)} install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"360acde5-7d55-4f65-85e4-38bb8125e36c", Created:time.Time{wall:0x34976137, ext:63695252114, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: context deadline exceeded"}. install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Marked operation complete. opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:15-04:00 INFO             Operation failed. install/hook.go:56

I initially thought this might be an issue with dnsmasq running on the node, but the same failure repeated even after we ensured no such service was running. We also had them try a different pod/service CIDR range inside 172.16.0.0/12, since they use some ranges inside 10.0.0.0/8 for their internal network.

AWS OL7.6

On the AWS side, using an OL7.6 AMI, the failure scenario is slightly different. Instead of failing to even resolve the leader, the install fails to “update a remote for tag” during the docker registry population phase:

2019-06-05T20:15:51Z ERRO             Phase execution failed: failed to update remote for tag "kubernetes-helm/tiller:v2.12.3". advertise-ip:172.31.14.26 hostname:ip-172-31-14-26.ec2.internal phase:/export/ip-172-31-14-26.ec2.internal install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Applied StateChange(Phase=/export/ip-172-31-14-26.ec2.internal, State=failed, Error=failed to update remote for tag "kubernetes-helm/tiller:v2.12.3"). opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z WARN             "Failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\": \nERROR REPORT:\nOriginal Error: *client.UnexpectedHTTPStatusError received unexpected HTTP status: 500 Internal Server Error\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:501 github.com/gravitational/gravity/lib/app/docker.(*remoteStore).updateRepo\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:206 github.com/gravitational/gravity/lib/app/docker.(*imageService).Sync\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:154 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).exportApp\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:106 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:379 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently.func1\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\"\n." install/hook.go:56
2019-06-05T20:15:51Z ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: trace.aggregate failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\"\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:65 github.com/gravitational/gravity/lib/utils.Collect\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:27 github.com/gravitational/gravity/lib/utils.CollectErrors\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:387 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:358 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:287 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/export\"\n." install/hook.go:56
2019-06-05T20:15:51Z INFO [OPS]       ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc00144eb00)} install/hook.go:56
2019-06-05T20:15:51Z DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7", Created:time.Time{wall:0x11436b0e, ext:63695362551, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\""}. install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Marked operation complete. opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z INFO             Operation failed. install/hook.go:56

I struggled to find any logs that might indicate what the 500 error actually was.

Bare metal OEL 7.6

So this does look like an issue with starting or running the CoreDNS instance we spin up for internal resolution (which should be bound to 127.0.0.2). Something like dnsmasq bound to all addresses on port 53 can block CoreDNS from binding, even just to the loopback.
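
A quick way to check whether anything on the host is already holding port 53 (dnsmasq, systemd-resolved, etc.) is, for example:

sudo ss -lntup | grep ':53 '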

To troubleshoot further, the process would be to use gravity shell to get a shell session inside the planet container we use to host kubernetes, docker, etc. That container runs systemd, so the starting point would be to look at the coredns logs in the systemd journal (journalctl -u coredns), which should give an indication of why coredns isn’t starting.
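
Concretely, something along these lines (gravity shell and the coredns unit are as described above; the extra journalctl flags are just for convenience):

# on the host
sudo gravity shell

# now inside the planet container
systemctl status coredns
journalctl -u coredns --no-pager | tail -n 50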

AWS OL7.6

So for this one, it seems to be failing where it tries to sync the internal container images to the docker registry on ip-172-31-14-26.ec2.internal, so it looks to me like the 500 error was generated by the registry running on that node.

To see what’s going on with the registry, troubleshooting would be similar to the above: use gravity shell to enter the planet container and check whether the registry logs (journalctl -u registry) give an indication of what’s happening with the registry.
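
Again as a sketch:

sudo gravity shell

# inside the planet container
journalctl -u registry --no-pager | tail -n 100

# a full disk can also cause the registry to return 500s, so it's worth checking
df -h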

Thanks so much for the response.

The first issue looks like it may be related to SELinux: CoreDNS is failing to start with status=218/CAPABILITIES. What does support for SELinux look like with Gravity? I haven’t found any mention of it in the documentation.

Update: my assumption about SELinux was incorrect; it is already disabled on the target system. They’re on kernel 4.1.12-124.27.1.el7uek.x86_64, which should have what it’s asking for. At least we have a path forward for investigation.
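
For reference, SELinux status and the kernel version can be confirmed with, e.g.:

getenforce   # prints Enforcing / Permissive / Disabled
sestatus     # more detail
uname -r     # 4.1.12-124.27.1.el7uek.x86_64 here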

The AWS issue is a bit of a face-palm: the root disk ran out of space! Thanks for the pointer on those logs.

Re: SELinux

We don’t currently support running on SELinux-enabled systems. It’s something we should offer support for, but we haven’t seen huge demand.
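
So for now the expectation is that SELinux is set to permissive or disabled before installing, e.g. (standard RHEL-family commands; the config change fully takes effect after a reboot):

sudo setenforce 0                                                  # permissive until reboot
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persistent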

I haven’t seen it in a while, but we used to see some similar failures, based around the following parameters in the /lib/systemd/system/coredns.service file.

CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
User=planet

In the cases we’ve seen, if the capability lines are removed and the user is switched to root, the coredns service will start. We’ve never committed to this though, as I believe the cases we saw were bypassed by upgrading to a later kernel (we were investigating similar issues with CentOS at the time).
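
If you want to try it, the workaround we’ve seen (again, not something we officially support) looks roughly like this, run from inside the planet container:

# edit /lib/systemd/system/coredns.service:
#   - remove or comment out the CapabilityBoundingSet and AmbientCapabilities lines
#   - change User=planet to User=root
systemctl daemon-reload
systemctl restart coredns
systemctl status coredns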

There must be something specific to the OEL setup: RHEL on the same kernel version doesn’t have the same issue, nor does the OL7.6 release available on AWS.

Changing the coredns service didn’t seem to help.

We’ve gotten further along with OL 7.6: kubernetes spins up and the registry populates. The current issue is that pods are unable to communicate with the kubernetes service through its ClusterIP, though this works fine on the host and in a gravity shell.
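
As a concrete example of that check, something like the following (10.100.0.1 is the kubernetes service ClusterIP; /version is just a convenient endpoint, and any HTTP response, even a 401/403, shows the TCP path works):

# works from the host and from a gravity shell, but not from inside a pod
curl -k https://10.100.0.1:443/version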

When the dns-app-install hook runs, it fails with this error:

Assuming changeset from the environment: dns-030
Creating new resources
ERROR: Post https://10.100.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.100.0.1:443: connect: no route to host
2019-06-27T16:25:06Z ERRO rig/main.go:31
  ERROR REPORT:
  Original Error: *url.Error Post https://10.100.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.100.0.1:443: connect: no route to host
  Stack Trace:
    /gopath/src/github.com/gravitational/rigging/changeset.go:1771 github.com/gravitational/rigging.(*Changeset).Init
    /gopath/src/github.com/gravitational/rigging/changeset.go:83 github.com/gravitational/rigging.NewChangeset
    /gopath/src/github.com/gravitational/rigging/tool/rig/main.go:266 main.upsert
    /gopath/src/github.com/gravitational/rigging/tool/rig/main.go:122 main.run
    /gopath/src/github.com/gravitational/rigging/tool/rig/main.go:31 main.main
    /go/src/runtime/proc.go:207 runtime.main
    /go/src/runtime/asm_amd64.s:2362 runtime.goexit
  User Message: 

If I run an Ubuntu shell pod with the same volumes and attempt to use kubectl, I receive a similar error:

root@debug-shell:/# kubectl get pod
Unable to connect to the server: dial tcp 10.100.0.1:443: connect: no route to host
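
For what it’s worth, a raw TCP probe from inside the pod (taking kubectl out of the picture entirely, and assuming bash is available in the image) would be something like:

# bash /dev/tcp probe; nc or curl would work equally well
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.100.0.1/443' \
  && echo "connected" || echo "connect failed"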

I haven’t found anything in the k8s- or gravity-related logs that would indicate a problem during setup. Any thoughts?

Does the system have a default route?

The “no route to host” error can occur on systems that don’t have a default route, because initial route selection happens before the iptables NAT rewrites the destination to the real IP address. I introduced a fix for this earlier this year that creates a null route for the service network, but it’s possible you’re using a release that doesn’t have it. The route doesn’t have to go anywhere; it just needs to pass initial route selection within the kernel.
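
To check, and as a rough manual equivalent of what that fix does (10.100.0.0/16 is the service network here, going by the ClusterIP above; the dummy device name is just illustrative):

# is there a default route?
ip route show default

# if not, a placeholder route for the service network is enough --
# it only needs to satisfy initial route selection; it never carries traffic
sudo ip link add svcnull type dummy
sudo ip link set svcnull up
sudo ip route add 10.100.0.0/16 dev svcnull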

I tried with release 5.5.10 and have the same issue. Looking at a fresh instance before installing gravity, there is a default route:

default via 10.0.16.1 dev eth0
10.0.16.0/20 dev eth0 proto kernel scope link src 10.0.19.48
169.254.0.0/16 dev eth0 scope link metric 1002

After running the gravity installation, the route table has a few additions:

default via 10.0.16.1 dev eth0
10.0.16.0/20 dev eth0 proto kernel scope link src 10.0.19.48
10.100.0.0/16 dev flannel.null scope link
10.244.70.0/24 dev cni0 proto kernel scope link src 10.244.70.1
169.254.0.0/16 dev eth0 scope link metric 1002
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

edit: not sure if the flanneld logs are useful, but I only see mentions of the 10.244.70.0/24 route.

-- Logs begin at Thu 2019-06-27 18:44:40 UTC. --
Jun 27 18:44:52 ip-10-0-251-230.ec2.internal systemd[1]: Starting Flannel...
Jun 27 18:44:53 ip-10-0-251-230.ec2.internal wait-for-etcd.sh[914]: /usr/bin/etcdctl: line 10: /usr/bin/etcdctl-cmd: No such file or directory
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal wait-for-etcd.sh[914]: cluster is healthy
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal etcdctl[1585]: {"Network":"10.244.0.0/16", "Backend": {"Type": "vxlan", "RouteTableFilter": ["tag:KubernetesCluster=logrocket"], "Port": 8472}}
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal systemctl[1594]: active
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal wait-for-flannel.sh[1596]: "waiting for flannel to start."
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.169859    1595 main.go:439] Searching for interface using 10.0.251.230
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.170122    1595 main.go:516] Using interface with name eth0 and address 10.0.251.230
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.170138    1595 main.go:533] Defaulting external address to interface address (10.0.251.230)
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.171007    1595 main.go:250] Created subnet manager: Etcd Local Manager with Previous Subnet: None
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.171018    1595 main.go:253] Installing signal handlers
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.180723    1595 main.go:375] Found network config - Backend type: vxlan
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.180766    1595 vxlan.go:120] VXLAN config: VNI=1 Port=8472 GBP=false DirectRouting=false
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.399172    1595 local_manager.go:234] Picking subnet in range 10.244.1.0 ... 10.244.255.0
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.400070    1595 local_manager.go:220] Allocated lease (10.244.49.0/24) to current node (10.0.251.230)
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.400616    1595 main.go:315] Wrote subnet file to /run/flannel/subnet.env
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.400633    1595 main.go:325] Finished starting backend.
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.400639    1595 main.go:326] Running backend.
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.401354    1595 vxlan_network.go:60] watching for new subnet leases
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.402524    1595 main.go:418] Waiting for 22h59m59.997016985s to renew lease
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.406093    1595 iptables.go:115] Some iptables rules are missing; deleting and recreating rules
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.406123    1595 iptables.go:137] Deleting iptables rule: -s 10.244.0.0/16 -j ACCEPT
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.406717    1595 iptables.go:115] Some iptables rules are missing; deleting and recreating rules
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.406742    1595 iptables.go:137] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.407163    1595 iptables.go:137] Deleting iptables rule: -d 10.244.0.0/16 -j ACCEPT
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.411478    1595 iptables.go:137] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.412069    1595 iptables.go:125] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.412895    1595 iptables.go:137] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.49.0/24 -j RETURN
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.414157    1595 iptables.go:125] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.414454    1595 iptables.go:137] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.417117    1595 iptables.go:125] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.419744    1595 iptables.go:125] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.434396    1595 iptables.go:125] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.49.0/24 -j RETURN
Jun 27 18:44:56 ip-10-0-251-230.ec2.internal flanneld[1595]: I0627 18:44:56.436248    1595 iptables.go:125] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
Jun 27 18:44:59 ip-10-0-251-230.ec2.internal systemd[1]: Started Flannel.

edit2: Likely irrelevant, but kubelet is logging this error every second:

Jun 27 18:57:05 ip-10-0-251-230.ec2.internal kubelet[2972]: E0627 18:57:05.152221    2972 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/runtimeclass/runtimeclass_manager.go:62: Failed to list *unstructured.Unstructured: the server could not find the requested resource

Hi! Any updates on this issue?