
Trying to figure out why some K8s Worker hosts do not recognize the GPU device.

 
DenisChoukroun
HPE Pro

Hi,

We recently deployed Ezmeral Runtime Enterprise 5.4 (GA) with an external Data Fabric 7.0 registered.
We have set up five (5) K8s worker hosts running CentOS 7.9. Each host is configured identically, with 40 CPU cores, 256 GB of RAM and 1 GPU device (Tesla P6) with NVIDIA driver version 470.103.01 / CUDA version 11.4. As per the Runtime online doc here: https://docs.containerplatform.hpe.com/54/reference/nvidia-gpu-support/nvidia-gpus.html?hl=gpu%2Csupport, the NVIDIA driver version should be 470.57.02 or later, so our driver meets that requirement.

The GPU device on each K8s worker host is visible in the Ezmeral Runtime K8S hosts installation tab. A K8s cluster has been successfully deployed on Ezmeral Runtime, and an MLOps tenant has been created on it.

As a tenant member, I then deployed 5 instances of the KDApp Jupyter-Notebook, each requesting 1 GPU device. I then checked whether the KDApp Jupyter-Notebook (image: bluedata/kd-notebook:3.1) can recognize the GPU, using the Python code below:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices()) 
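To gather a bit more detail than the device list alone, a sketch like the following could be run in each notebook. It is only a rough diagnostic under some assumptions: that the TensorFlow in the image is recent enough to provide tf.config.list_physical_devices (2.1 or later), that nvidia-smi is mounted into the container when the GPU is attached, and that the NVIDIA container runtime sets NVIDIA_VISIBLE_DEVICES. The function name collect_gpu_diagnostics is made up for illustration. Any missing piece is reported instead of raising an error.

```python
import glob
import os
import shutil
import subprocess


def collect_gpu_diagnostics():
    """Gather basic facts about GPU visibility inside the current container."""
    report = {}

    # Device nodes that are normally present when a GPU is exposed to the container.
    report["device_nodes"] = sorted(glob.glob("/dev/nvidia*"))

    # Environment variable commonly set by the NVIDIA container runtime (assumption).
    report["NVIDIA_VISIBLE_DEVICES"] = os.environ.get("NVIDIA_VISIBLE_DEVICES")

    # nvidia-smi is only on the PATH if the driver utilities were mounted in.
    smi = shutil.which("nvidia-smi")
    report["nvidia_smi_path"] = smi
    report["nvidia_smi_output"] = None
    if smi:
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            report["nvidia_smi_output"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_output"] = f"failed: {exc}"

    # TensorFlow's own view of the GPUs; any import or API problem is recorded.
    try:
        import tensorflow as tf

        report["tf_version"] = tf.__version__
        report["tf_gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except Exception as exc:
        report["tf_version"] = None
        report["tf_gpus"] = f"unavailable: {exc}"

    return report


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If /dev/nvidia* is empty and nvidia-smi is missing in a pod, that would suggest the GPU never reached the container; if both are present but TensorFlow still lists no GPU, the problem is more likely inside the image or its CUDA libraries.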

I noticed that:

  • On 3 of the 5 K8s worker hosts, the notebook does NOT recognize the GPU. Only /device:CPU:0 is listed.
  • On 2 of the 5 K8s worker hosts, the notebook recognizes the GPU. The output of the Python code above includes /device:GPU:0 with physical device description "tesla P6, compute capability 6.1".

We are struggling to figure out why some of the K8s worker hosts recognize the GPU device, whereas some other K8s worker hosts do not recognize the GPU.
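Since the hosts are supposed to be identical, one approach we are considering is to collect the same set of facts from a working and a non-working host and list only what differs. The helper below is a minimal sketch of that comparison; the name differing_keys, the host names and the example values are placeholders for illustration, not measurements from our cluster.

```python
def differing_keys(reports):
    """Return {key: {host: value}} for each key whose value is not the same on all hosts.

    `reports` maps a host name to a dict of facts collected on that host.
    A key missing on a host is treated as None for that host.
    """
    all_keys = set()
    for facts in reports.values():
        all_keys.update(facts)

    diffs = {}
    for key in sorted(all_keys):
        values = {host: facts.get(key) for host, facts in reports.items()}
        # Compare by repr so unhashable values such as lists can be compared too.
        if len({repr(v) for v in values.values()}) > 1:
            diffs[key] = values
    return diffs


if __name__ == "__main__":
    # Placeholder values only, to show the shape of the input.
    example = {
        "worker-a": {"driver": "X", "gpu_visible": True},
        "worker-b": {"driver": "X", "gpu_visible": False},
    }
    for key, values in differing_keys(example).items():
        print(key, values)
```

The facts fed into it could be anything gathered per host or per pod, for example driver version, loaded kernel modules, or the labels and allocatable resources reported for the node.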

Any idea how to troubleshoot further to identify what could be the issue on the K8s worker hosts that do not recognize the GPU?

Thanks,
Denis

I work for HPE