Steps to reload the nvidia kernel module in K8s env

Unload the nvidia driver

NOTE: You can only do this if there are no GPU deployments actively using the GPU on the node.
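
A quick way to check (a sketch, assuming jq is available and the node is named node7 as in the commands below) is to list pods scheduled on the node that request the nvidia.com/gpu resource in their container limits:

# list pods on node7 that request nvidia.com/gpu (requires jq)
kubectl get pods -A --field-selector spec.nodeName=node7 -o json \
  | jq -r '.items[]
      | select([.spec.containers[].resources.limits // {} | has("nvidia.com/gpu")] | any)
      | .metadata.namespace + "/" + .metadata.name'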

# Cordoning the node will automatically terminate the `nvdp-nvidia-device-plugin-#####` pod on it. This does not impact already running non-GPU deployments on that node.
kubectl cordon node7

# verify the pod is gone, otherwise "kubectl delete" it
kubectl -n nvidia-device-plugin get pods -o wide | grep node7


lsmod | grep nvidia

# it should report 0 for "nvidia_uvm" and "nvidia_drm", meaning no process is actively using the GPU

# lsmod |grep -i nvidia
nvidia_uvm           1785856  0
nvidia_drm             90112  0
nvidia_modeset       1314816  1 nvidia_drm
nvidia              56827904  2 nvidia_uvm,nvidia_modeset
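
If the reference counts for nvidia_uvm or nvidia_drm are not 0, something on the host is still holding the GPU. Two optional checks that can help find it (a sketch; fuser comes from the psmisc package and may need root):

# show processes holding the nvidia device nodes
fuser -v /dev/nvidia*

# list compute processes as seen by the driver
nvidia-smi --query-compute-apps=pid,process_name --format=csv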

# unload the nvidia kernel modules in the following order

modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia

# verify there are no nvidia kernel modules loaded
lsmod | grep nvidia
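
If modprobe -r fails with "Module nvidia is in use", one common holder on driver installs is the persistence daemon; stopping it and retrying is a reasonable first step (an assumption — this only applies if nvidia-persistenced is installed as a systemd service):

systemctl stop nvidia-persistenced
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia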

Load the nvidia driver back up

modprobe nvidia
lsmod | grep nvidia
nvidia-smi
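
An optional sanity check before uncordoning (a sketch): confirm the driver initialized and reports the GPU model and driver version without errors.

nvidia-smi --query-gpu=name,driver_version --format=csv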

kubectl uncordon node7

# wait for the "nvdp-nvidia-device-plugin-####" pod to start
kubectl -n nvidia-device-plugin get pods -o wide | grep node7
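
Once the device plugin pod is Running, the node should advertise the GPU resource again; a quick check (assuming jq and the node7 name from the example):

# should print the number of allocatable GPUs, e.g. "1"
kubectl get node node7 -o json | jq '.status.allocatable["nvidia.com/gpu"]'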

kubectl -n akash-services rollout restart deployment/operator-inventory
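
Optionally, wait for the restart to finish before querying the provider:

kubectl -n akash-services rollout status deployment/operator-inventory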

provider_info2.sh provider.h100.mon.obl.akash.pub

The provider_info2.sh script is available at https://github.com/arno01/akash-tools/blob/main/provider_info2.sh


This process recovers from issues such as Xid errors, where the nvdp-nvidia-device-plugin pod marks the GPU as unhealthy.

UPDATE (9 August 2024): This does not actually recover the nvidia/GPU; subsequent attempts to run previously stable deployments on the same node will crash until the node is fully rebooted.

