Error in installation

We have an instance where Gravity fails to install, specifically at the dns-app stage. It doesn’t seem to be related to dns-app per se; rather, dns-app just happens to be the first Kubernetes component that gets installed.

Gravity: 6.1.39
Environment: AWS (though it’s installed as generic)
AMI: RHEL and then Amazon Linux 2 (failed in both)

Originally, the base AMI that was being used was RHEL7, and there were several things failing (e.g. flannel didn’t come up properly), but we traced it back to etcdctl segfaulting immediately on startup (you couldn’t even get it to print the help string). We chalked this up to RHEL being on an old kernel and got the environment to switch to Amazon Linux 2.
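
(For completeness, the repro was as simple as running etcdctl by hand inside planet; a sketch, assuming gravity enter is available on the node:)

sudo gravity enter        # drop into the planet container
etcdctl --help            # segfaulted immediately, before printing anything
uname -r                  # worth comparing the kernel version against a working host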

With Amazon Linux 2 (specifically, Linux 2ENV-GC-en41105 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux), we’re seeing a similar issue, where runc is segfaulting. Some useful entries from journalctl:

createPodSandbox for pod "dns-app-install-5f0068-5z92k_kube-system(57b0922d-3117-4027-bd87-199c40cf66d2)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "dns-app-install-5f0068-5z92k": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7a64b6f88d6a06acda1a2a870bfb41df22576ffd4dfedb97eece77eed972dd1c/log.json: no such file or directory): runc did not terminate sucessfully: unknown

And then the log has a bunch of these:

__CURSOR=s=6838af3268a2404e8449c0bc3a7a2922;i=33a;b=217dc2c7e5454d308f9e817be2c870f2;m=5b5bdd09e;t=5b0b3c955c25e;x=fd47293f36f0420
__REALTIME_TIMESTAMP=1601661107028574
__MONOTONIC_TIMESTAMP=24523952286
_BOOT_ID=217dc2c7e5454d308f9e817be2c870f2
_MACHINE_ID=7265fe765262551a676151a24c02b7b6
_HOSTNAME=2ENV-GC-en41105
PRIORITY=6
SYSLOG_FACILITY=3
_UID=0
_GID=0
_SYSTEMD_SLICE=system.slice
_CAP_EFFECTIVE=3fffffffff
_TRANSPORT=stdout
_SYSTEMD_CGROUP=/system.slice/docker.service
_SYSTEMD_UNIT=docker.service
_SYSTEMD_INVOCATION_ID=849d67f956db40f6b4219b55c158ff54
_STREAM_ID=74178066059443ea884733bf6197b197
SYSLOG_IDENTIFIER=dockerd
_PID=121
_COMM=dockerd
_EXE=/usr/bin/dockerd
_CMDLINE=/usr/bin/dockerd --bridge=none --iptables=false --ip-masq=false --graph=/ext/docker --storage-driver=overlay2 --exec-opt native.cgroupdriver=cgroupfs --log-opt max-size=50m --log-opt max-file=9 --storage-opt=overlay2.override_kernel_check=1
MESSAGE=time="2020-10-02T17:51:47.028519964Z" level=warning msg="failed to retrieve runc version: signal: segmentation fault"
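
These are journald export-format records; on the host, something along these lines pulls them for the docker unit shown above (the date filter is just an example):

journalctl -u docker.service | grep -i "segmentation fault"
journalctl -u docker.service -o export --since "2020-10-02" > docker-journal.export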

We have no idea why it might be segfaulting - the same binaries work fine on other Amazon Linux 2 stock machines.

The only thing we can think of is that a hardened/modified AMI is being used in this environment, and it’s possible SELinux is enabled (journalctl seems to say it’s in permissive mode, but we still need to check the runtime state).
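
For the runtime SELinux state, a quick check on the host would be something like:

getenforce                  # prints Enforcing, Permissive, or Disabled
sestatus                    # current mode vs. the mode configured in /etc/selinux/config
ausearch -m avc -ts recent  # recent SELinux denials, if auditd is running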

Any ideas? Is anything here expected? How would we debug this?

I would try to see if the runc binary fails to execute immediately, which would provide a way to debug this. If that’s the case (e.g. /usr/bin/docker-runc segfaults on startup), you could capture a crash dump by running it inside gdb.
We don’t have gdb inside the container, but it should be possible to test this outside the container by using the path to the rootfs:

gdb /var/lib/gravity/local/packages/unpacked/gravitational.io/planet/<planet-version>/rootfs/usr/bin/docker-runc
(gdb) run
...
(gdb) gcore file.dmp

which you can then share.
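
To get a first look yourself, the dump can also be opened later against the same binary, e.g.:

gdb /var/lib/gravity/local/packages/unpacked/gravitational.io/planet/<planet-version>/rootfs/usr/bin/docker-runc file.dmp
(gdb) bt full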

@Dmitri_Shelenin thanks for the reply - sorry for the delay in getting back to you; it took us a little while to get another working session on the environment.

The good news is - we figured it out. Conclusion at the end :slight_smile:

Regarding GDB - this unfortunately was not helpful. Specifically, the executable died immediately, before it even started executing. We then ran it under strace and could see execve failing with an Exec format error.
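
Roughly what that looked like (the binary path is the planet rootfs one from earlier):

strace -f -e trace=execve ./docker-runc --version
...
execve("./docker-runc", ["./docker-runc", "--version"], ...) = -1 ENOEXEC (Exec format error)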

However, we knew the binaries were good: we checked their sha1sums against another install we have locally (in our environment, not the customer’s) on the same kernel, and everything matched.
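
The comparison itself was just checksums of the unpacked planet binaries on both hosts, along the lines of:

sha1sum /var/lib/gravity/local/packages/unpacked/gravitational.io/planet/<planet-version>/rootfs/usr/bin/docker-runc
# run the same command on the known-good host and compare the hashes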

After a bunch of headscratching and some dead ends, we ended up diffing the loaded kernel modules between the machine that was working (in our environment) and this broken environment.
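
The module diff was along these lines (the hostnames in the filenames are placeholders):

# on each machine:
lsmod | sort > /tmp/modules-$(hostname).txt
# copy both files to one place, then:
diff /tmp/modules-<working-host>.txt /tmp/modules-<broken-host>.txt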

Short story: they had a kernel module from the Cylance anti-virus product that was rejecting binaries at the kernel level (presumably hooking execve) with the unhelpful “Exec format error”, so the binary simply got refused. Once we suspected this was the problem (we confirmed it by disabling the kernel module, after which everything immediately worked), we checked journalctl on the host for the Cylance service and could see it rejecting our binary.
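
For anyone hitting the same thing, the checks were roughly as follows (the Cylance service and module names are placeholders here, since they depend on the product version installed):

lsmod                              # look for the Cylance module in the loaded-module list
journalctl | grep -i cylance       # host journal entries showing the binary being rejected
modprobe -r <cylance-module-name>  # temporarily unload it for testing only (may require stopping the Cylance service first)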

One thing that would have helped is if the gravity debug log tarball included journalctl output from the host as well, and not just from the planet container. It’s possible it’s there and I’m just blind, but I couldn’t find it.

Thank you again for the help!

Itay,
the journal logs from the host should also be part of the report tarball. The file might be named differently depending on the version, and with the defaults it may cover a different time frame. The latest changes (ca. 7.0.17) renamed the files to gravity-journal<-export>.log.gz and include both a plain-text dump and a journal export-formatted dump.
Prior to 7.0.17, there was a single file called gravity-journal.log.gz that captured all host logs; prior to 7.0.12, it was called gravity-system.log.gz.
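
To check whether the host journal made it into a given report, listing the tarball contents is enough, e.g. (assuming the default report.tar.gz name):

gravity report                     # generates the debug report tarball on the node
tar -tzf report.tar.gz | grep -i -e journal -e system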

Good to hear that you sorted this out.

I’ll chalk it up to me missing the logs in the package :slight_smile: