NOTE: Only do this if there are no GPU deployments actively using the GPU on the node.
# Cordoning the node will automatically terminate the `nvdp-nvidia-device-plugin-#####` pod on it. This does not impact already running non-GPU deployments on that node.
kubectl cordon node7
# verify the pod is gone; otherwise delete it with "kubectl delete" (example below)
kubectl -n nvidia-device-plugin get pods -o wide | grep node7
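# if the pod is still listed, it can be removed manually; the pod name below is a
# placeholder, substitute the exact name reported by the command above
kubectl -n nvidia-device-plugin delete pod nvdp-nvidia-device-plugin-xxxxx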
lsmod | grep nvidia
# both "nvidia_uvm" and "nvidia_drm" should report a use count of 0 (third column), meaning no process is actively using the GPU
# lsmod | grep -i nvidia
nvidia_uvm            1785856  0
nvidia_drm              90112  0
nvidia_modeset        1314816  1 nvidia_drm
nvidia               56827904  2 nvidia_uvm,nvidia_modeset
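# optional extra check: list any processes still holding the GPU device files
# ("fuser" is part of the psmisc package and may need to be installed; no output means no users)
fuser -v /dev/nvidia*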
# unload the nvidia kernel modules in the order listed below
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
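# if the unload fails with a "module is in use" error, something still holds the GPU;
# on hosts running nvidia-persistenced (an assumption, it is not always installed),
# stop it and re-run the modprobe -r command above
systemctl stop nvidia-persistenced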
# verify there are no nvidia kernel modules loaded
lsmod | grep nvidia
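# reload the main nvidia kernel module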
modprobe nvidia
lsmod | grep nvidia
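# confirm the driver responds and the GPUs are listed again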
nvidia-smi
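# make the node schedulable again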
kubectl uncordon node7
# wait for the "nvdp-nvidia-device-plugin-#####" pod to start on node7
kubectl -n nvidia-device-plugin get pods -o wide | grep node7
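# optionally confirm the node advertises its GPUs again via the "nvidia.com/gpu"
# resource (the name used by the NVIDIA device plugin); the count should be non-zero
kubectl describe node node7 | grep -i nvidia.com/gpu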
kubectl -n akash-services rollout restart deployment/operator-inventory
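# optionally wait for the operator-inventory rollout to finish
kubectl -n akash-services rollout status deployment/operator-inventory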
# verify the provider reports the expected GPUs again, e.g. using the provider_info2.sh
# script (https://github.com/arno01/akash-tools/blob/main/provider_info2.sh)
provider_info2.sh provider.h100.mon.obl.akash.pub
This process recovers from issues such as Xid errors, where the nvdp-nvidia-device-plugin Pod marks the GPU as unhealthy.
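One way to confirm this class of failure (a sketch: the daemonset name and grep patterns below are assumptions, adjust them to your setup) is to grep the kernel log for Xid events and the device plugin logs for devices reported as unhealthy:

dmesg -T | grep -i xid
kubectl -n nvidia-device-plugin logs daemonset/nvdp-nvidia-device-plugin | grep -i unhealthy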
UPDATE (9 August 2024): This does not really recover the NVIDIA GPU; subsequent attempts to run previously stable deployments on the same node will still crash until the node is fully rebooted.