Steps to reload the nvidia kernel module in K8s env

Unload the nvidia driver

NOTE: You can only do this if there are no GPU deployments actively using the GPU on the node.
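
A quick way to check (a sketch, assuming jq is available and the node is named node7 as in the commands below) is to list pods scheduled on the node that request the nvidia.com/gpu resource in their container limits:

# list pods on node7 that request nvidia.com/gpu (requires jq)
kubectl get pods -A --field-selector spec.nodeName=node7 -o json \
  | jq -r '.items[]
      | select([.spec.containers[].resources.limits // {} | has("nvidia.com/gpu")] | any)
      | .metadata.namespace + "/" + .metadata.name'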

# Cordoning the node will automatically terminate the `nvdp-nvidia-device-plugin-#####` pod on it. This does not impact already running non-GPU deployments on that node.
kubectl cordon node7

# verify the pod is gone, otherwise "kubectl delete" it
kubectl -n nvidia-device-plugin get pods -o wide | grep node7


lsmod | grep nvidia

# it should report 0 for "nvidia_uvm" and "nvidia_drm", meaning no process is actively using the GPU

# lsmod |grep -i nvidia
nvidia_uvm           1785856  0
nvidia_drm             90112  0
nvidia_modeset       1314816  1 nvidia_drm
nvidia              56827904  2 nvidia_uvm,nvidia_modeset
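
If the reference counts for nvidia_uvm or nvidia_drm are not 0, something on the host is still holding the GPU. Two optional checks that can help find it (a sketch; fuser comes from the psmisc package and may need root):

# show processes holding the nvidia device nodes
fuser -v /dev/nvidia*

# list compute processes as seen by the driver
nvidia-smi --query-compute-apps=pid,process_name --format=csv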

# unload the nvidia kernel modules in the following order

modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia

# verify there are no nvidia kernel modules loaded
lsmod | grep nvidia
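
If modprobe -r fails with "Module nvidia is in use", one common holder on driver installs is the persistence daemon; stopping it and retrying is a reasonable first step (an assumption — this only applies if nvidia-persistenced is installed as a systemd service):

systemctl stop nvidia-persistenced
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia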

Load the nvidia driver back up

modprobe nvidia
lsmod | grep nvidia
nvidia-smi
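
An optional sanity check before uncordoning (a sketch): confirm the driver initialized and reports the GPU model and driver version without errors.

nvidia-smi --query-gpu=name,driver_version --format=csv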

kubectl uncordon node7

# wait for the "nvdp-nvidia-device-plugin-####" pod to start
kubectl -n nvidia-device-plugin get pods -o wide | grep node7
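
Once the device plugin pod is Running, the node should advertise the GPU resource again; a quick check (assuming jq and the node7 name from the example):

# should print the number of allocatable GPUs, e.g. "1"
kubectl get node node7 -o json | jq '.status.allocatable["nvidia.com/gpu"]'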

kubectl -n akash-services rollout restart deployment/operator-inventory
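
Optionally, wait for the restart to finish before querying the provider:

kubectl -n akash-services rollout status deployment/operator-inventory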

provider_info2.sh provider.h100.mon.obl.akash.pub

The provider_info2.sh script is available at https://github.com/arno01/akash-tools/blob/main/provider_info2.sh


This process recovers from issues such as Xid errors, where the nvdp-nvidia-device-plugin pod marks the GPU as unhealthy.

UPDATE (9 August 2024): This does not actually recover the nvidia/GPU; subsequent attempts to run previously stable deployments on the same node will crash until the node is fully rebooted.

