We have an instance where Gravity fails to install, specifically at the dns-app stage. It doesn't seem to be related to dns-app per se; rather, dns-app is simply the first Kubernetes workload to be installed.
Environment: AWS (though it’s installed as generic)
AMI: RHEL and then Amazon Linux 2 (failed in both)
Originally the base AMI was RHEL7, and several things were failing (e.g. flannel didn't come up properly), but we traced it all back to etcdctl segfaulting immediately on startup (you couldn't even get it to print the help string). We chalked this up to RHEL7's old kernel and got the environment switched to Amazon Linux 2.
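For reference, the RHEL7 failure was as blunt as it sounds; roughly (from memory, so treat the output as approximate):

    $ etcdctl --help
    Segmentation fault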
With Amazon Linux 2 (specifically,
Linux 2ENV-GC-en41105 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux), we’re seeing a similar issue, where
runc is segfaulting. Some useful output from journalctl:
createPodSandbox for pod "dns-app-install-5f0068-5z92k_kube-system(57b0922d-3117-4027-bd87-199c40cf66d2)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "dns-app-install-5f0068-5z92k": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7a64b6f88d6a06acda1a2a870bfb41df22576ffd4dfedb97eece77eed972dd1c/log.json: no such file or directory): runc did not terminate sucessfully: unknown
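To rule out the Docker/containerd plumbing, our next step is to run the runtime binary by hand from inside the same environment dockerd runs in. A sketch; the binary name and location are assumptions (it may be runc or docker-runc depending on the Docker version, and in Gravity it lives inside the planet container):

    # Per the journal message below, even asking for the version segfaults:
    $ docker-runc --version

    # The kernel logs every segfault with the faulting address and the mapping
    # it landed in, which tells you whether it dies in the binary or a library:
    $ dmesg | grep -i segfault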
And then the log has a bunch of these:
__CURSOR=s=6838af3268a2404e8449c0bc3a7a2922;i=33a;b=217dc2c7e5454d308f9e817be2c870f2;m=5b5bdd09e;t=5b0b3c955c25e;x=fd47293f36f0420 __REALTIME_TIMESTAMP=1601661107028574 __MONOTONIC_TIMESTAMP=24523952286 _BOOT_ID=217dc2c7e5454d308f9e817be2c870f2 _MACHINE_ID=7265fe765262551a676151a24c02b7b6 _HOSTNAME=2ENV-GC-en41105 PRIORITY=6 SYSLOG_FACILITY=3 _UID=0 _GID=0 _SYSTEMD_SLICE=system.slice _CAP_EFFECTIVE=3fffffffff _TRANSPORT=stdout _SYSTEMD_CGROUP=/system.slice/docker.service _SYSTEMD_UNIT=docker.service _SYSTEMD_INVOCATION_ID=849d67f956db40f6b4219b55c158ff54 _STREAM_ID=74178066059443ea884733bf6197b197 SYSLOG_IDENTIFIER=dockerd _PID=121 _COMM=dockerd _EXE=/usr/bin/dockerd _CMDLINE=/usr/bin/dockerd --bridge=none --iptables=false --ip-masq=false --graph=/ext/docker --storage-driver=overlay2 --exec-opt native.cgroupdriver=cgroupfs --log-opt max-size=50m --log-opt max-file=9 --storage-opt=overlay2.override_kernel_check=1 MESSAGE=time="2020-10-02T17:51:47.028519964Z" level=warning msg="failed to retrieve runc version: signal: segmentation fault"
We have no idea why it might be segfaulting; the same binaries work fine on stock Amazon Linux 2 machines.
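Our plan for narrowing that down is to diff the failing host against a stock AL2 box. A rough checklist (paths are placeholders, since the exact layout depends on the Gravity install):

    # Confirm the binary really is byte-identical to one on a working host:
    $ sha256sum <path-to-runc>

    # Compare kernel build and C library between the hardened AMI and stock AL2:
    $ uname -r
    $ rpm -q glibc

    # Check how the Docker graph dir is mounted (path from the dockerd cmdline above):
    $ mount | grep /ext/docker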
The only thing we can think of is that this environment uses a hardened/modified AMI, and it's possible SELinux is enabled (journalctl suggests it's in permissive mode, but we still need to check the runtime state).
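The runtime check we have in mind is just the standard tooling (ausearch assumes auditd is running):

    $ getenforce        # Enforcing / Permissive / Disabled, right now
    $ sestatus          # loaded policy plus boot-time config
    # Recent AVC records mentioning runc or docker (logged even in permissive mode):
    $ ausearch -m AVC -ts recent 2>/dev/null | grep -Ei 'runc|docker'

Worth noting: in permissive mode operations are allowed and only logged, so if the runtime state really is permissive, SELinux shouldn't be the direct cause. Denials normally surface as EACCES/EPERM rather than SIGSEGV, though a denied mmap/mprotect can make a program crash downstream, so the AVC log is still worth checking.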
Any ideas? Is anything here expected? How would we debug this?