Upgrade of monitoring-app fails because a pod cannot be scheduled

Description

An upgrade fails during the “monitoring-app” application upgrade phase, with the upgrade hook (the “monitoring-app-update-xxx” pod) showing the following error:

2020-05-23T04:18:38Z INFO             no pod found on node 192.168.1.1 daemonset:monitoring/telegraf-node-worker rigging/utils.go:187
2020-05-23T04:18:38Z INFO             "attempt 119, result: \nERROR REPORT:\nOriginal Error: *trace.NotFoundError no pod found on node 192.168.1.1\nStack Trace:\n\t/gopath/src/github.com/gravitational/rigging/utils.go:187 github.com/gravitational/rigging.checkRunningAndReady\n\t/gopath/src/github.com/gravitational/rigging/utils.go:175 github.com/gravitational/rigging.checkRunning\n\t/gopath/src/github.com/gravitational/rigging/ds.go:189 github.com/gravitational/rigging.(*DSControl).Status\n\t/gopath/src/github.com/gravitational/rigging/changeset.go:378 github.com/gravitational/rigging.(*Changeset).statusDaemonSet\n\t/gopath/src/github.com/gravitational/rigging/changeset.go:329 github.com/gravitational/rigging.(*Changeset).status\n\t/gopath/src/github.com/gravitational/rigging/changeset.go:196 github.com/gravitational/rigging.(*Changeset).Status.func1\n\t/gopath/src/github.com/gravitational/rigging/utils.go:137 github.com/gravitational/rigging.retry\n\t/gopath/src/github.com/gravitational/rigging/changeset.go:189 github.com/gravitational/rigging.(*Changeset).Status\n\t/gopath/src/github.com/gravitational/rigging/tool/rig/main.go:292 main.status\n\t/gopath/src/github.com/gravitational/rigging/tool/rig/main.go:124 main.run\n\t/gopath/src/github.com/gravitational/rigging/tool/rig/main.go:31 main.main\n\t/go/src/runtime/proc.go:209 runtime.main\n\t/go/src/runtime/asm_amd64.s:1338 runtime.goexit\nUser Message: no pod found on node 192.168.1.1\n, retry in 1s" logrus/exported.go:127
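
To identify which pod on the node is stuck, list the pods in the monitoring namespace that are not in the Running state (a generic check; pod names will vary per cluster):

# kubectl -nmonitoring get pods -o wide | grep -vw Running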

In several cases where this was encountered, the pod that could not be scheduled was “telegraf”, and its events showed that scheduling failed due to insufficient CPU:

# kubectl -nmonitoring describe pods telegraf-node-worker-l6rx7
...
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  15s (x21 over 5m46s)  default-scheduler  0/7 nodes are available: 1 Insufficient cpu, 6 node(s) didn't match node selector.

The likely reason for the scheduling failure is that other pods have claimed nearly all of the node’s CPU requests, so the telegraf pod no longer fits. This can be confirmed by looking at the node’s allocated resources:

# kubectl describe nodes 192.168.1.1
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1980m (99%)   12400m (620%)
  memory             8720Mi (31%)  11118Mi (39%)
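
The per-pod request breakdown appears further up in the same kubectl describe nodes output (under “Non-terminated Pods”); alternatively, the pods running on that node can be listed directly (node name taken from the example above):

# kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=192.168.1.1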

Solution

To work around the issue, use kubectl to evict or reschedule some pods on the node that have large CPU/memory requests, for example as shown below.
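
A minimal illustration, assuming the pod is managed by a deployment or daemonset whose controller will recreate it (namespace and pod name are placeholders):

# kubectl -n <namespace> delete pod <pod-name>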

After that, roll back and re-execute the monitoring-app upgrade phase to verify that it now completes successfully, and then resume the upgrade operation:

sudo gravity plan rollback --phase=/runtime/monitoring-app
sudo gravity plan execute --phase=/runtime/monitoring-app
sudo gravity plan resume
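
If needed, the state of the operation plan and its phases can be checked at any point with the standard plan view:

sudo gravity plan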