Gravity Degraded State after first reboot

Every time we reboot the machine after installation, the Gravity status changes to degraded, and the only fix that seems to work is a reinstall. Are there any steps or a process to recover the node? It is a single-node cluster running CentOS 7. I tried to debug it but got stuck; it looks like docker is having issues.

Masters:
        * localhost.localdomain (192.168.0.38, node)
            Status:	degraded
            [×]		docker (healthz check failed: Get http://docker/version: dial unix /var/run/docker.sock: connect: connection refused)
            [×]		br_netfilter module is either not loaded, or sysctl net.bridge.bridge-nf-call-iptables is not set, see https://www.gravitational.com/docs/faq/#bridge-driver (open /proc/sys/net/bridge/bridge-nf-call-iptables: no such file or directory)
            [×]		NodeStatusUnknown (Kubelet stopped posting node status.) (Node is not ready)
            [×]		docker-registry (healthz check failed: Get https://leader.telekube.local:5000/v2/: dial tcp 192.168.0.38:5000: connect: connection refused)
            [×]		failed to check devicemapper free space (failed to get docker info: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
localhost:/$ systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; static; vendor preset: enabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─99-docker-promisc.conf
   Active: activating (auto-restart) (Result: exit-code) since Sat 2019-06-29 23:41:05 UTC; 4s ago
     Docs: https://docs.docker.com
  Process: 12887 ExecStartPre=/sbin/ip link set dev docker0 down (code=exited, status=1/FAILURE)
  Process: 12888 ExecStartPre=/sbin/brctl delbr docker0 (code=exited, status=1/FAILURE)
  Process: 12889 ExecStartPre=/bin/rm -f /var/run/docker.pid (code=exited, status=0/SUCCESS)
  Process: 12890 ExecStartPre=/bin/rm -r /ext/docker/network (code=exited, status=1/FAILURE)
  Process: 12891 ExecStart=/usr/bin/dockerd --iptables=false --ip-masq=false --graph=/ext/docker $DOCKER_OPTS (code=exited, status=1/FAILURE)
  Process: 12915 ExecStopPost=/usr/bin/gravity system disable-promisc-mode docker0 (code=exited, status=255/EXCEPTION)
 Main PID: 12891 (code=exited, status=1/FAILURE)
journalctl -xe
-- The job identifier is 31754.
Jun 29 23:41:49 localhost.localdomain ip[13402]: Cannot find device "docker0"
Jun 29 23:41:49 localhost.localdomain brctl[13403]: bridge docker0 doesn't exist; can't delete it
Jun 29 23:41:49 localhost.localdomain rm[13405]: /bin/rm: cannot remove '/ext/docker/network': No such file or directory
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="The \"-g / --graph\" flag is deprecated. Please use \"--data-root\" instead"
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.328135243Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.328899768Z" level=info msg="libcontainerd: started new docker-containerd process" pid=13414
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.328983617Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.328992597Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc 
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.329037132Z" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///var/run/docker/containerd/docker-containerd.sock 0 <nil>}]" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.329046983Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.329093871Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420227340, CONNECTING" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="starting containerd" revision=468a545b9edcd5932818eb9de8e72413e616e86e version=v1.1.2
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /ext/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs mu
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found in director
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /ext/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must b
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /ext/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zf
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /ext/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found in directory /lib
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.content-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.diff-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.images-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.leases-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.namespaces-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg=serving... address="/var/run/docker/containerd/docker-containerd-debug.sock"
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg=serving... address="/var/run/docker/containerd/docker-containerd.sock"
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50Z" level=info msg="containerd successfully booted in 0.006092s"
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.350199363Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420227340, READY" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.352759483Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.352779127Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.352825167Z" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///var/run/docker/containerd/docker-containerd.sock 0 <nil>}]" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.352850929Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.352877889Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420186580, CONNECTING" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.353163892Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420186580, READY" module=grpc
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.354544258Z" level=error msg="'overlay' not found as a supported filesystem on this host. Please ensure kernel is new enough and has overlay support loaded." s
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: Error starting daemon: error initializing graphdriver: driver not supported
Jun 29 23:41:50 localhost.localdomain systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit docker.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 1.
Jun 29 23:41:50 localhost.localdomain gravity[13431]: [ERROR]: failed to unset promiscuous mode on "docker0": Cannot find device "docker0"
Jun 29 23:41:50 localhost.localdomain systemd[1]: docker.service: Failed with result 'exit-code'.

What version of gravity are you using? Does the OS have anything that might be resetting the kernel module loading and sysctl configuration?

br_netfilter module is either not loaded, or sysctl net.bridge.bridge-nf-call-iptables is not set, see https://www.gravitational.com/docs/faq/#bridge-driver (open /proc/sys/net/bridge/bridge-nf-call-iptables: no such file or directory)

This indicates that the br_netfilter kernel module, which is required, doesn’t appear to be loaded.

Jun 29 23:41:50 localhost.localdomain dockerd[13406]: time="2019-06-29T23:41:50.354544258Z" level=error msg="'overlay' not found as a supported filesystem on this host. Please ensure kernel is new enough and has overlay support loaded." s
Jun 29 23:41:50 localhost.localdomain dockerd[13406]: Error starting daemon: error initializing graphdriver: driver not supported

It looks to me like the overlay module isn’t loaded, which is causing docker to fail to start. So the br_netfilter and overlay modules may not be set to load at boot.
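
A minimal sketch for checking this on the node, assuming a POSIX shell. The function name missing_modules is hypothetical and not part of gravity. It reads a /proc/modules-style listing and prints any of the named modules that are absent. Modules compiled into the kernel do not appear in that listing, so a name printed here is a hint rather than proof.

```shell
# Print each named module that is absent from a /proc/modules-style listing.
# $1: path to the listing (e.g. /proc/modules); remaining args: module names.
# missing_modules is a hypothetical helper, not a gravity command.
missing_modules() {
    listing="$1"
    shift
    for mod in "$@"; do
        # The first whitespace-separated field of each line is the module name.
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
}
```

On the affected node, `missing_modules /proc/modules br_netfilter overlay` would list whichever of the two is missing. Running `modprobe br_netfilter` and `modprobe overlay` as root loads them for the current boot only; docker, which the status output shows in an auto-restart loop, should then get a chance to start.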

Historically gravity has been hesitant to make too many modifications to the underlying host, but newer versions should configure these modules to load automatically during install. I recently fixed a bug where the modules wouldn’t get configured if they were already loaded at install time. This can happen when you install gravity, uninstall it, and then re-install it. The version you’re using might not have this fix.

@knisbet thanks for looking into it. I am using 5.5.10. Those modules must have been loaded, since the installation went fine. How can I check whether they are configured to load at boot, or how can I make them load at boot? The issue only arises when I reboot the machine.

The gravity docs related to module configuration: https://gravitational.com/gravity/docs/requirements/#kernel-modules

lsmod - will list the currently loaded kernel modules.
journalctl -u systemd-modules-load.service --no-pager - will list the modules that were loaded at boot.

Example (not all of these are for gravity, just output from my test vm):

root@kevin-test1:~# journalctl -u systemd-modules-load.service --no-pager
-- Logs begin at Tue 2019-07-02 18:41:04 UTC, end at Tue 2019-07-02 18:47:59 UTC. --
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'ebtables'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'ip_tables'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'iptable_filter'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'iptable_nat'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'br_netfilter'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'overlay'
Jul 02 18:41:04 kevin-test1 systemd-modules-load[435]: Inserted module 'iscsi_tcp'
Jul 02 18:41:04 kevin-test1 systemd[1]: Started Load Kernel Modules.
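
Besides the journal, another way to check is to look for the module name in the *.conf files that systemd-modules-load reads at boot. The following is only a sketch: configured_at_boot is a hypothetical helper name, and it matches a module only when the name sits alone on a line with no surrounding whitespace or comments.

```shell
# Succeed if the module name appears as a whole line in any *.conf file
# under the given directories.
# $1: module name; remaining args: directories to search.
# configured_at_boot is a hypothetical helper, not a gravity or systemd command.
configured_at_boot() {
    mod="$1"
    shift
    for dir in "$@"; do
        for f in "$dir"/*.conf; do
            # Skip the unexpanded glob when a directory has no *.conf files.
            [ -f "$f" ] || continue
            if grep -qx -- "$mod" "$f"; then
                return 0
            fi
        done
    done
    return 1
}
```

For example, `configured_at_boot overlay /etc/modules-load.d /run/modules-load.d /usr/lib/modules-load.d && echo configured` covers the directories listed in the modules-load.d man page.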

To configure modules to load at boot, write the required module names, one per line, to a file under /etc/modules-load.d/, such as /etc/modules-load.d/gravity.conf.

On my system, which is on gravity 6, this looks like:

cat /etc/modules-load.d/gravity.conf

ebtables
ip_tables
iptable_filter
iptable_nat
br_netfilter
overlay
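
On 5.5.10, where this file may not have been created, it could be written by hand. A minimal sketch, assuming root access and a POSIX shell; write_modules_conf is a hypothetical helper name, and it overwrites the destination file rather than appending to it.

```shell
# Write one module name per line to a modules-load.d style file,
# replacing any existing contents.
# $1: destination file; remaining args: module names.
# write_modules_conf is a hypothetical helper, not a gravity command.
write_modules_conf() {
    conf="$1"
    shift
    printf '%s\n' "$@" > "$conf"
}
```

As root, `write_modules_conf /etc/modules-load.d/gravity.conf ebtables ip_tables iptable_filter iptable_nat br_netfilter overlay` would reproduce the list above. The file only takes effect at the next boot, so a reboot followed by the journalctl check shown earlier would confirm it.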

Full automatic configuration of kernel modules was added in gravity 6, which is currently a release candidate; the full release should follow soon. See https://gravitational.com/gravity/docs/changelog/#600-rc1