Kubernetes includes experimental support for managing NVIDIA GPUs spread across nodes. Support for NVIDIA GPUs was added in v1.6 and has gone through multiple backwards-incompatible iterations. This page describes how users can consume GPUs across different Kubernetes versions, and the current limitations.
To enable GPU support in 1.6 and 1.7, a special alpha feature gate, Accelerators, has to be set to true across the system: --feature-gates="Accelerators=true". It also requires using the Docker Engine as the container runtime.
Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers. Kubelet will not detect NVIDIA GPUs otherwise.
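Because the kubelet only detects GPUs when the NVIDIA driver is already present, it can help to check a node before starting Kubernetes components. The following is a minimal sketch, not part of Kubernetes: check_nvidia_driver is a hypothetical helper name, and it assumes that the driver installation placed the nvidia-smi utility on the PATH.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of Kubernetes).
# Assumption: installing the NVIDIA driver put nvidia-smi on the PATH.
check_nvidia_driver() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi found: NVIDIA driver appears to be installed"
  else
    echo "nvidia-smi not found: install the NVIDIA driver before starting the kubelet" >&2
    return 1
  fi
}
```

Run check_nvidia_driver on each GPU node. A nonzero exit status suggests that the driver is missing or installed somewhere this check does not look.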
When you start Kubernetes components after all the above conditions are true, Kubernetes will expose alpha.kubernetes.io/nvidia-gpu as a schedulable resource.
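One way to confirm that a node advertises the resource is to search its description for the resource name. This is a sketch under assumptions: show_alpha_gpus is a hypothetical helper name, and the node name is supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a node description that mention
# the alpha GPU resource. $1 is the node name.
show_alpha_gpus() {
  kubectl describe node "$1" | grep -F 'alpha.kubernetes.io/nvidia-gpu'
}
```

For example, show_alpha_gpus <node-name>. If nothing is printed, the node is probably not exposing any GPUs, which points back to the feature gate, container runtime, or driver prerequisites.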
You can consume these GPUs from your containers by requesting alpha.kubernetes.io/nvidia-gpu just like you request cpu or memory.
However, there are some limitations in how you specify the resource requirements when using GPUs:
- GPUs can only be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.
  - You can specify GPUs in both limits and requests, but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
When using alpha.kubernetes.io/nvidia-gpu as the resource, you also have to mount host directories containing NVIDIA libraries (libcuda.so, libnvidia.so, etc.) to the container.
Here’s an example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1 # requesting 1 GPU
      volumeMounts:
        - name: "nvidia-libraries"
          mountPath: "/usr/local/nvidia/lib64"
  volumes:
    - name: "nvidia-libraries"
      hostPath:
        path: "/usr/lib/nvidia-375"
The Accelerators feature gate and the alpha.kubernetes.io/nvidia-gpu resource work on 1.8 and 1.9 as well. They will be deprecated in 1.10 and removed in 1.11.
From 1.8 onwards, the recommended way to consume GPUs is to use device plugins.
To enable GPU support through device plugins before 1.10, the DevicePlugins feature gate has to be explicitly set to true across the system: --feature-gates="DevicePlugins=true". This is no longer required starting from 1.10.
Then you have to install NVIDIA drivers on the nodes and run an NVIDIA GPU device plugin (see below).
When the above conditions are true, Kubernetes will expose nvidia.com/gpu as a schedulable resource.
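To see which nodes advertise nvidia.com/gpu, you can print each node's allocatable count with a custom-columns query. This is a sketch rather than an official procedure: list_gpu_nodes is a hypothetical helper name and the column headers are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: list every node with its allocatable nvidia.com/gpu count.
# The dot in the resource name is escaped so that kubectl treats it as part of
# the key instead of as a path separator.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Nodes where the device plugin is not running, or that have no GPUs, should show no value in the GPUS column.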
You can consume these GPUs from your containers by requesting nvidia.com/gpu just like you request cpu or memory.
However, there are some limitations in how you specify the resource requirements when using GPUs:
- GPUs can only be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.
  - You can specify GPUs in both limits and requests, but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
Unlike with alpha.kubernetes.io/nvidia-gpu, when using nvidia.com/gpu as the resource, you don’t have to mount any special directories in your pod specs. The device plugin is expected to inject them automatically in the container.
Here’s an example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
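The example above sets only limits. The rules for GPU resources also allow setting both requests and limits, provided the two values match. The sketch below writes such a manifest to a file; write_equal_gpu_pod and the pod name cuda-vector-add-equal are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a pod manifest that sets GPU requests and limits
# to the same value. $1 is the output file.
# The pod name cuda-vector-add-equal is made up for this sketch.
write_equal_gpu_pod() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-equal
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        requests:
          nvidia.com/gpu: 1 # must equal the limit below
        limits:
          nvidia.com/gpu: 1
EOF
}
```

For example, run write_equal_gpu_pod pod.yaml and then kubectl create -f pod.yaml. Dropping the limits entry, or making the two values differ, would break the rules listed above.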
There are currently two device plugin implementations for NVIDIA GPUs:
The official NVIDIA GPU device plugin has the following requirements:
To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:
# For Kubernetes v1.8
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
Report issues with this device plugin to NVIDIA/k8s-device-plugin.
The NVIDIA GPU device plugin used by GKE/GCE doesn’t require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It’s tested on Container-Optimized OS and has experimental code for Ubuntu from 1.9 onwards.
On your 1.9 cluster, you can use the following commands to install the NVIDIA drivers and device plugin:
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/daemonset.yaml
# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml
# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
Report issues with this device plugin and installation method to GoogleCloudPlatform/container-engine-accelerators.
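After deploying either device plugin, you may want to confirm that its DaemonSet exists before scheduling GPU workloads. The sketch below avoids assuming a particular DaemonSet name or namespace and filters by the string nvidia instead; find_nvidia_daemonsets is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list DaemonSets in all namespaces whose line mentions
# "nvidia" (case-insensitive). Exits nonzero when nothing matches.
find_nvidia_daemonsets() {
  kubectl get daemonsets --all-namespaces | grep -i nvidia
}
```

If the command prints nothing, the driver installer or device plugin manifests were probably not applied to this cluster.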
If different nodes in your cluster have different types of NVIDIA GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.
For example:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
Specify the GPU type in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
This ensures that the pod is scheduled to a node that has the GPU type you specified.
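To check the result, you can list the accelerator label next to each node and then see which node the pod landed on. This is a sketch: show_gpu_placement is a hypothetical helper name, and it assumes the accelerator label key used in the example above.

```shell
#!/bin/sh
# Hypothetical helper: show the accelerator label of every node, then the
# node that the given pod was scheduled to. $1 is the pod name.
show_gpu_placement() {
  kubectl get nodes -L accelerator
  kubectl get pod "$1" -o wide
}
```

For example, show_gpu_placement cuda-vector-add. The node reported for the pod should be one of the nodes whose accelerator column matches the value in the pod's nodeSelector.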