The customer reported pods stuck in the ContainerCreating state in their 4-node cluster. Two of the nodes exhibiting the issue were cordoned.
After investigating and reviewing the logs, we suspected that some system resources were being exhausted, so we adjusted the fs.inotify.max_user_watches and user.max_user_namespaces kernel parameters. In addition, one of the nodes had trouble launching containers, and docker/planet froze and could not be restarted, so we had to reboot the host, after which the node came back online.
Here is a summary of the changes that were eventually made to the nodes:
1. Set the kernel parameters fs.inotify.max_user_watches = 1048576 and user.max_user_namespaces = 15000.
2. Updated /etc/sysctl.conf with the same values so that the changes persist across reboots.
3. Ran sysctl -p to load the settings from /etc/sysctl.conf.
4. Restarted kubelet on the nodes.
5. Uncordoned the nodes.
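For reference, the steps above can be sketched in shell roughly as follows. This is a minimal sketch and not a record of the exact commands that were run: persist_sysctl is a hypothetical helper written for this note, and the kubelet restart and the node name depend on how the cluster is installed, so those lines are shown only as commented placeholders.

```shell
#!/bin/sh
# persist_sysctl KEY VALUE FILE
# Hypothetical helper: records "KEY = VALUE" in FILE, replacing any existing
# (uncommented) assignment for KEY so repeated runs do not add duplicates.
# The body runs in a subshell, so its variables do not leak to the caller.
persist_sysctl() (
    key=$1
    value=$2
    conf=$3
    tmp=$(mktemp) || exit 1
    touch "$conf"
    # Keep every line whose name (text before "=", whitespace removed)
    # is not exactly KEY; commented-out entries are preserved.
    awk -F= -v k="$key" '{
        name = $1
        gsub(/[ \t]/, "", name)
        if (name != k) print
    }' "$conf" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the owner and mode of FILE are unchanged.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
)

# Intended use on each affected node, as root (commented out because
# these commands modify the host):
#   sysctl -w fs.inotify.max_user_watches=1048576
#   sysctl -w user.max_user_namespaces=15000
#   persist_sysctl fs.inotify.max_user_watches 1048576 /etc/sysctl.conf
#   persist_sysctl user.max_user_namespaces 15000 /etc/sysctl.conf
#   sysctl -p
#   (restart kubelet using whatever mechanism your install provides)
#   kubectl uncordon <node-name>
```

Replacing an existing entry instead of blindly appending keeps /etc/sysctl.conf free of conflicting duplicate lines if the procedure is repeated on the same node.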
If your application launches a couple of pods per “project”, we recommend adding these kernel settings to your application requirements to prevent resource exhaustion when many containers are running.
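One way to act on this recommendation, independent of any particular installer, is a small preflight check that fails when a node's values are below the ones used here. The sketch below is an assumption on our part, not an existing tool: check_sysctl_min is a hypothetical helper, and the optional third argument (an alternate procfs root) exists only to make the function easy to test.

```shell
#!/bin/sh
# check_sysctl_min KEY MIN [ROOT]
# Hypothetical helper: reads kernel parameter KEY from the procfs tree
# (ROOT defaults to /proc/sys) and fails if it is missing or below MIN.
# The body runs in a subshell, so "exit" only ends the function call.
check_sysctl_min() (
    key=$1
    min=$2
    root=${3:-/proc/sys}
    # sysctl keys map to procfs paths by replacing dots with slashes.
    path="$root/$(printf '%s' "$key" | tr . /)"
    if [ ! -r "$path" ]; then
        echo "$key: not available on this host" >&2
        exit 1
    fi
    current=$(cat "$path")
    if [ "$current" -lt "$min" ]; then
        echo "$key = $current, expected at least $min" >&2
        exit 1
    fi
    echo "$key = $current (ok)"
)

# Intended use on each node before deploying the application:
#   check_sysctl_min fs.inotify.max_user_watches 1048576
#   check_sysctl_min user.max_user_namespaces 15000
```

Each call returns a non-zero status when the requirement is not met, so the checks can be chained with && in an install script.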
Prepared by: @r0mant