Using nvidia-docker

I am trying to use gravity to package our software, which runs on machines with Nvidia GPUs.
For that to happen, I need to replace the default Docker runtime with nvidia-docker (see https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes).

Essentially, as the link above states, I need to edit /etc/docker/daemon.json and make nvidia the default runtime.
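For reference, the configuration described in the linked README looks roughly like this (the runtime path depends on how the nvidia container runtime is installed):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}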

With gravity, this does not seem to work (a telltale check is whether, after you bash into the container, you are able to run "nvidia-smi"). I suspect that gravity is using a different Docker configuration file/daemon.

Does this make sense? Wondering if it relates to https://github.com/gravitational/gravity/issues/673 as well

@yaron We do have customers who use Nvidia GPU instances; however, we don't have our own hardware setup, so it's a bit difficult for us to replicate the full setup instructions. As I recall from when our customers went through it, there are a couple of key components:

The kubernetes stack we deploy to a machine is packaged within a system container (we call this container planet) that contains all of our runtime, including the kubernetes components, docker, etc. This container becomes the consistent unit of software that each node runs. You're accessing this container when you run gravity shell, gravity exec, gravity enter, etc.
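For example (a rough sketch; exact invocation and flags may vary between gravity versions):

sudo gravity shell       # interactive shell inside the planet container
sudo gravity exec ps aux # run a single command inside planet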

The planet container can be customized by adding layers like any other container, so the docker runtime can be replaced, static configuration changed, etc., and tele build can be told to use the alternate container. You do this by building your own system image, based off the published planet image, that includes the supplementary tools or new docker configuration you require.
https://gravitational.com/gravity/docs/pack/#custom-system-container
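In outline, such a layer looks something like this (a sketch only; the tag is an example, and the doc linked above covers how to point tele build at the resulting image):

FROM quay.io/gravitational/planet:6.3.3-11700   # example tag; must match your gravity version
# layer in whatever you need: extra tools, an alternate docker runtime,
# modified static configuration, etc.
RUN apt-get update && apt-get install -y <your-packages>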

Where this gets tricky, though, is that planet and gravity tend to be tightly coupled, so when you're updating the gravity version you need to make sure you're also updating the base planet image, or you may run into unexpected issues. We unfortunately don't currently communicate the correct version clearly; the best way is to inspect the tag in our Makefile to see which version of planet will be used: https://github.com/gravitational/gravity/blob/master/Makefile#L49
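For example, something like this should show the pinned tag (assuming it is still defined in that Makefile for the branch matching your gravity version):

curl -s https://raw.githubusercontent.com/gravitational/gravity/master/Makefile | grep -in planet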

The second component, as I recall, is that the nvidia drivers and devices for the particular install need to be mapped from the host into the planet container. In the Image Manifest there is a section under node profiles for creating device mounts from the host into planet, as well as volume mounts from the host into planet for the drivers. The mounts can be wildcarded, so, for example, all GPUs on a particular system can be added to planet, and the default nvidia driver locations can also be mounted into planet, so that nvidia-docker can find and use the correct drivers.
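A rough sketch of what that could look like in the manifest (field names and host paths are assumptions on my part; check the Image Manifest reference for your version):

nodeProfiles:
  - name: node
    requirements:
      devices:
        - path: /dev/nvidia*          # wildcard: map all GPU devices into planet
          permissions: rw
      volumes:
        - path: /usr/lib/nvidia       # host driver location (varies by distro/driver install)
          targetPath: /usr/lib/nvidia # where it shows up inside planet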

Hopefully this puts you on the right path, sorry I don’t have a more specific set of instructions.

Thank you for that.

I was able to set up the folder mapping with no issue. The problem I have is with installing nvidia-docker.
Specifically, I am customizing planet with a Dockerfile and trying to install nvidia-docker. However, nvidia-docker requires docker-ce as a dependency, and it seems like the planet docker has an unconventional install: you will not find docker when running apt list --installed.
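For example, inside the planet image (the second check is my assumption about where the binaries live, but the point is that apt does not know about them):

apt list --installed 2>/dev/null | grep -i docker   # returns nothing
which dockerd                                       # the daemon binary is nonetheless present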

The result is that when I try to run apt-get install -y nvidia-docker2 I get:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: docker-ce (>= 18.06.0~ce~3-0~debian) but it is not installable or
                           docker-ee (>= 18.06.0~ce~3-0~debian) but it is not installable or
                           docker.io (>= 18.06.0) but it is not installable

I have tried various workarounds, none of which worked:
(1) Finding the planet version of docker and running apt-get install -y nvidia-docker2=2.0.3+docker18.09.5-3 does not work.

(2) Installing docker 18 or 19 "over" the current planet docker before installing nvidia-docker. The installation of nvidia-docker then succeeds, but then tele build runai/app.yaml --overwrite --debug crashes with:

ERRO Command failed. error:[
ERROR REPORT:
Original Error: syscall.Errno operation not permitted
Stack Trace:
/gopath/src/github.com/gravitational/gravity/lib/app/docker/runtime.go:122 github.com/gravitational/gravity/lib/app/docker.TranslateRuntimeImage
/gopath/src/github.com/gravitational/gravity/lib/app/service/vendor.go:430 github.com/gravitational/gravity/lib/app/service.(*vendorer).translateRuntimeImages
/gopath/src/github.com/gravitational/gravity/lib/app/resources/resourcefiles.go:139 github.com/gravitational/gravity/lib/app/resources.(*ResourceFiles).RewriteManifest
/gopath/src/github.com/gravitational/gravity/lib/app/service/vendor.go:285 github.com/gravitational/gravity/lib/app/service.(*vendorer).VendorDir
/gopath/src/github.com/gravitational/gravity/lib/builder/builder.go:318 github.com/gravitational/gravity/lib/builder.(*Builder).Vendor
/gopath/src/github.com/gravitational/gravity/lib/builder/build.go:89 github.com/gravitational/gravity/lib/builder.Build
/gopath/src/github.com/gravitational/gravity/tool/tele/cli/build.go:67 github.com/gravitational/gravity/tool/tele/cli.build
/gopath/src/github.com/gravitational/gravity/tool/tele/cli/run.go:54 github.com/gravitational/gravity/tool/tele/cli.Run
/gopath/src/github.com/gravitational/gravity/tool/tele/main.go:44 main.run
/gopath/src/github.com/gravitational/gravity/tool/tele/main.go:35 main.main
/go/src/runtime/proc.go:200 runtime.main
/go/src/runtime/asm_amd64.s:1337 runtime.goexit
User Message: operation not permitted
] tele/main.go:36
[ERROR]: operation not permitted

(3) I tried to trick nvidia-docker into thinking that docker-ce exists. There are a couple of ways to do that (one is visible in the commented-out lines of the Dockerfile below). They do work for the install, but it does not end well at runtime.
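For reference, the equivs-based trick (commented out in the Dockerfile below) builds the dummy package from a control file along these lines (a sketch; the version only needs to satisfy nvidia-docker2's dependency):

Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: docker-ce
Version: 18.06.0~ce~3-0~debian
Description: dummy package that satisfies nvidia-docker2's docker-ce dependency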

Thanks
Yaron

Below are the Dockerfile details. They represent several attempts, but should hopefully be clear enough.

FROM quay.io/gravitational/planet:6.3.3-11700
RUN echo 'export PATH=$PATH:/usr/local/nvidia/bin:/usr/local/nvidia/lib' >> ~/.bashrc
RUN echo 'export LD_LIBRARY_PATH=/usr/local/nvidia/lib' >> ~/.bashrc
RUN /bin/bash -c "source ~/.bashrc"

RUN chmod 777 /tmp && \
    mkdir -p /var/cache/apt/archives/partial && \
    apt-get update

#TRICK TO SIMULATE DOCKER CE EXISTENCE
#RUN sudo apt-get install -y equivs
#COPY dockerce.control .
#RUN equivs-build dockerce.control && sudo dpkg -i docker-ce_18.06.0~ce~3-0~debian_all.deb

#OVERRIDE DOCKER to get docker 19
#RUN apt-get remove -y docker docker-engine docker.io containerd runc
RUN apt-get install -y apt-transport-https ca-certificates curl gnupg2 software-properties-common
RUN curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
RUN add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
RUN apt-get update
#RUN apt-get install -y docker-ce docker-ce-cli containerd.io
RUN apt-get install -y docker-ce=5:18.09.5~3-0~debian-stretch docker-ce-cli=5:18.09.5~3-0~debian-stretch containerd.io

#INSTALL NVIDIA DOCKER/CONTAINER TOOLKIT
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && \
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
RUN apt-get update && apt-get install -y nvidia-docker2
#RUN systemctl restart docker

#NVIDIA-DOCKER DEPRECATED, use this instead?
#RUN apt-get update && apt-get install -y nvidia-container-toolkit

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility