Oracle Linux 7.6

While Oracle Linux is not explicitly listed as a supported OS, we were hopeful that an install on it would work out, since it's based on RHEL. That has turned out not to be the case, and I wanted to see if this is a known issue and whether there are any workarounds. We've actually seen two failure scenarios: one on a bare-metal install of the enterprise-support version and one using an AMI provided by Oracle on AWS.

Bare metal OEL 7.6

For the first scenario, on bare-metal OEL, the installation reaches the “Wait for kubernetes to become available” step but fails repeatedly with:

2019-06-04T09:30:15-04:00 DEBU             dialing leader.telekube.local:6443 install/hook.go:56
2019-06-04T09:30:15-04:00 WARN             "Failed to dial with local resolver: \nERROR REPORT:\nOriginal Error: *net.OpError read udp 127.0.0.1:16976->127.0.0.2:53: read: connection refused\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/dns.go:45 github.com/gravitational/gravity/lib/utils.ResolveAddr\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:248 github.com/gravitational/gravity/lib/httplib.DialWithLocalResolver\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:220 github.com/gravitational/gravity/lib/httplib.DialFromEnviron.func1\n\t/go/src/net/http/transport.go:916 net/http.(*Transport).dial\n\t/go/src/net/http/transport.go:1240 net/http.(*Transport).dialConn\n\t/go/src/net/http/transport.go:999 net/http.(*Transport).getConn.func4\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to resolve leader.telekube.local:6443\n." install/hook.go:56

Until eventually the phase fails:

2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Applied StateChange(Phase=/wait, State=failed, Error=context deadline exceeded). opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:14-04:00 ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: context.deadlineExceededError context deadline exceeded\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:82 github.com/gravitational/gravity/lib/install/phases.(*waitExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/wait\"\n." install/hook.go:56
2019-06-04T09:35:14-04:00 INFO [OPS]       ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc0008a5f00)} install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"360acde5-7d55-4f65-85e4-38bb8125e36c", Created:time.Time{wall:0x34976137, ext:63695252114, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: context deadline exceeded"}. install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Marked operation complete. opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:15-04:00 INFO             Operation failed. install/hook.go:56

I initially thought this may have been an issue with dnsmasq running on the node, but even after ensuring no such service was running, the same failure recurred. We also had them try a different CIDR range inside 172.16.0.0/12 for pods and services, as they use some ranges inside 10.0.0.0/8 for their internal network.
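For reference, this is roughly how we verified that nothing was already listening on port 53 before retrying the install (just a sketch of standard checks; dnsmasq may not even be installed on a given node, in which case the status check simply reports it as not found):

# anything already listening on TCP/UDP port 53 on the host?
ss -lntup | grep ':53 '

# confirm dnsmasq (and systemd-resolved, if present) are not running
systemctl status dnsmasq
systemctl status systemd-resolved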

AWS OL7.6

On the AWS side, using an OL7.6 AMI, the failure scenario is slightly different. Instead of failing to even resolve the leader, the install fails to “update a remote for tag” during the Docker population phase:

2019-06-05T20:15:51Z ERRO             Phase execution failed: failed to update remote for tag "kubernetes-helm/tiller:v2.12.3". advertise-ip:172.31.14.26 hostname:ip-172-31-14-26.ec2.internal phase:/export/ip-172-31-14-26.ec2.internal install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Applied StateChange(Phase=/export/ip-172-31-14-26.ec2.internal, State=failed, Error=failed to update remote for tag "kubernetes-helm/tiller:v2.12.3"). opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z WARN             "Failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\": \nERROR REPORT:\nOriginal Error: *client.UnexpectedHTTPStatusError received unexpected HTTP status: 500 Internal Server Error\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:501 github.com/gravitational/gravity/lib/app/docker.(*remoteStore).updateRepo\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:206 github.com/gravitational/gravity/lib/app/docker.(*imageService).Sync\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:154 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).exportApp\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:106 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:379 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently.func1\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\"\n." install/hook.go:56
2019-06-05T20:15:51Z ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: trace.aggregate failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\"\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:65 github.com/gravitational/gravity/lib/utils.Collect\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:27 github.com/gravitational/gravity/lib/utils.CollectErrors\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:387 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:358 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:287 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/export\"\n." install/hook.go:56
2019-06-05T20:15:51Z INFO [OPS]       ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc00144eb00)} install/hook.go:56
2019-06-05T20:15:51Z DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7", Created:time.Time{wall:0x11436b0e, ext:63695362551, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\""}. install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Marked operation complete. opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z INFO             Operation failed. install/hook.go:56

I struggled to find any logs that might have provided information on what the 500 error actually was.

Bare metal OEL 7.6

So this does look like an issue with starting or running the CoreDNS instance we spin up for internal resolution (which should be bound to 127.0.0.2). Something like dnsmasq bound to port 53 on all interfaces can block CoreDNS from binding even just to the loopback address.

To troubleshoot further, the process would be to run gravity shell to get a shell session inside the planet container we use to host Kubernetes, Docker, etc. This container runs systemd, so the starting point would be to check the CoreDNS logs in the systemd journal (journalctl -u coredns) for an indication of why CoreDNS isn't starting.
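Roughly, the sequence would look like this (a sketch using only the commands mentioned above, run as root on the failing node):

# enter the planet container
gravity shell

# inside planet: check the CoreDNS unit and its recent journal output
systemctl status coredns
journalctl -u coredns --no-pager | tail -n 100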

AWS OL7.6

For this one, it seems to be failing where it tries to sync the internal container images to the Docker registry on ip-172-31-14-26.ec2.internal, so to me it looks like the 500 error was generated by the registry running on that node.

To see what's going on with the registry, troubleshooting would be similar to the above: use gravity shell to enter the planet container and check whether the registry logs (journalctl -u registry) give an indication of what's happening with the registry.
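As a sketch, something along these lines (the --since value is just an example window around the failed phase):

# enter the planet container on ip-172-31-14-26.ec2.internal
gravity shell

# inside planet: look for registry errors around the time of the failure
journalctl -u registry --no-pager --since "2019-06-05 20:00"
journalctl -u registry --no-pager | grep -iE 'error|500'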

Thanks so much for the response.

The first issue looks like it may be related to SELinux: CoreDNS is failing to start with status=218/CAPABILITIES. What does SELinux support look like in Gravity? I haven't found any mention of it in the documentation.

Update: my assumption about SELinux was incorrect; it is already disabled on the target system. They're on kernel 4.1.12-124.27.1.el7uek.x86_64, which should provide the capabilities being asked for. At least we have a path forward for investigation.
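For reference, these are roughly the checks run on the node (a sketch; capsh comes from the libcap package and may need to be installed):

# confirm SELinux really is disabled
getenforce

# kernel version
uname -r

# confirm CAP_NET_BIND_SERVICE is present in the bounding set
capsh --print | grep -i net_bind_service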

The AWS issue is a bit of a face-palm: the root disk ran out of space! Thanks for the pointer on those logs.

Re: SELinux

We don't currently support running on SELinux-enabled systems. It's something we should offer support for, but we haven't seen much demand for it.

I haven't seen it in a while, but we used to see some similar failures related to the following parameters in the /lib/systemd/system/coredns.service file:

CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
User=planet

In the cases we've seen, if the capability lines are removed and the user is switched to root, the coredns service will start. We've never committed to this change, though, as I believe the cases we saw were sidestepped by upgrading to a later kernel (we were investigating similar issues on CentOS at the time). A sketch of that edit is below.
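Concretely, inside the planet container the [Service] section of /lib/systemd/system/coredns.service ends up roughly like this, with the two capability lines deleted and the user changed:

[Service]
# CapabilityBoundingSet and AmbientCapabilities lines removed
User=root

followed by a reload and restart inside planet:

systemctl daemon-reload
systemctl restart coredns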

There must be something specific to the OEL setup; RHEL on the same kernel version doesn't have the same issue, nor does the OL7.6 release available on AWS.

Changing the coredns service didn’t seem to help.