Pods stuck in ContainerCreating state in a 4-node cluster

Problem:

A customer reported pods stuck in the ContainerCreating state in their 4-node cluster. Two of the affected nodes had been cordoned.

Analysis:

After investigating the logs, we suspected that some of the system’s resources were being exhausted, so we tuned the fs.inotify.max_user_watches and user.max_user_namespaces kernel parameters. In addition, one of the nodes was failing to launch containers, and docker/planet froze when we attempted to restart them, so we rebooted the host, after which the node came back online.
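Before tuning, it can help to confirm which limit is actually being hit. A quick sketch for inspecting inotify usage on a node (assumes Linux with /proc mounted; this pipeline is illustrative, not the exact commands from the original investigation):

```shell
# Show the current inotify watch limit
cat /proc/sys/fs/inotify/max_user_watches

# Count open inotify instances per PID, largest consumers first
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null |
  cut -d/ -f3 | sort | uniq -c | sort -rn | head
```

If the heaviest consumers account for most of the limit, raising fs.inotify.max_user_watches is a reasonable next step.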

Solution:

Here’s the gist of changes that were eventually made to the nodes:

1. Set the kernel parameters fs.inotify.max_user_watches = 1048576 and user.max_user_namespaces = 15000.

2. Updated /etc/sysctl.conf with the same parameters to make the changes persist across reboots.

3. Ran sysctl -p to reload the changes.

4. Restarted kubelet on the nodes.

5. Uncordoned the nodes.
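The steps above can be sketched as a sequence of commands (run as root on each affected node; the systemd unit name and node name placeholder are assumptions, not from the original case):

```shell
# 1. Apply the kernel parameters at runtime (values from the case above)
sysctl -w fs.inotify.max_user_watches=1048576
sysctl -w user.max_user_namespaces=15000

# 2. Persist the same parameters across reboots
printf '%s\n' \
  'fs.inotify.max_user_watches = 1048576' \
  'user.max_user_namespaces = 15000' >> /etc/sysctl.conf

# 3. Reload the sysctl configuration
sysctl -p

# 4. Restart kubelet (assumes a systemd-managed kubelet)
systemctl restart kubelet

# 5. Uncordon each node from a machine with cluster access
kubectl uncordon <node-name>
```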

Recommendation:

If your application launches several pods per “project”, we recommend adding these kernel settings to your application requirements to prevent resource exhaustion when many containers are running.
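One common way to ship these settings with an application is a sysctl drop-in file, which most distributions read at boot (the filename below is illustrative):

```shell
# /etc/sysctl.d/90-k8s-limits.conf  (hypothetical filename)
fs.inotify.max_user_watches = 1048576
user.max_user_namespaces = 15000
```

After installing the file, `sysctl --system` reloads all drop-in directories without a reboot.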

Prepared by: @r0mant