- Create a GKE cluster with a GPU node pool:
gcloud container clusters create gpu-sharing-demo --zone us-central1-c
gcloud container node-pools create gpu --cluster gpu-sharing-demo --zone us-central1-c --num-nodes=1 --accelerator type=nvidia-tesla-p4,count=1
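- Fetch credentials for the new cluster so that the kubectl commands below target it (assuming the default kubeconfig):
gcloud container clusters get-credentials gpu-sharing-demo --zone us-central1-c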
- Apply the DaemonSet to enable GPU sharing:
kubectl apply -f https://gist.githubusercontent.com/danisla/77afbb88f215d116f1905f723d3d879d/raw/472a7c3fbdc38d821b0f85b71f2abadf65e57606/gpu-sharing-daemonset.yaml
- Apply the DaemonSet to install the NVIDIA GPU driver per the docs:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
- Wait for the driver installer pod to become ready:
kubectl -n kube-system wait pod -l k8s-app=nvidia-driver-installer --for=condition=Ready --timeout=600s
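If the wait times out, the installer pods can be inspected with the same label (just a debugging aid, not required for the repro):
kubectl -n kube-system get pods -l k8s-app=nvidia-driver-installer -o wide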
- Wait for the gpu-sharing pod to become ready:
kubectl -n kube-system wait pod -l app=gpu-sharing --for=condition=Ready --timeout=600s
- Verify GPU sharing is working:
kubectl describe node -l cloud.google.com/gke-accelerator | grep nvidia.com/gpu
Example output:
nvidia.com/gpu: 16
nvidia.com/gpu: 16
nvidia.com/gpu 0 0
NOTE: if GPU sharing is not working, try restarting the nvidia-gpu-device-plugin pods in the kube-system namespace.
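For example, deleting the pods by label lets their DaemonSet recreate them (assuming the plugin pods carry the k8s-app=nvidia-gpu-device-plugin label; the label may differ between GKE versions):
kubectl -n kube-system delete pod -l k8s-app=nvidia-gpu-device-plugin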
- Run several GPU pods on the same node:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sharing-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: gpu-sharing-test
  template:
    metadata:
      labels:
        app: gpu-sharing-test
    spec:
      containers:
      - name: my-gpu-container
        image: nvidia/cuda:10.0-runtime-ubuntu18.04
        command: ["/usr/local/nvidia/bin/nvidia-smi", "-l"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
kubectl wait pod -l app=gpu-sharing-test --for=condition=Ready
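- Optionally confirm that all 6 replicas landed on the single GPU node (only that node advertises nvidia.com/gpu):
kubectl get pods -l app=gpu-sharing-test -o wide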
The Deployment runs and I can run nvidia-smi inside my pods, but it returns "No devices were found" and I can't use the GPU.
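For completeness, the missing devices can also be checked from inside one of the pods; kubectl exec against the Deployment picks an arbitrary pod (this assumes a reasonably recent kubectl):
kubectl exec deploy/gpu-sharing-test -- sh -c 'ls /dev/nvidia*'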