05-18-2022 02:58 AM - last edited on 05-18-2022 02:59 AM by support_s
Trying to figure out why some K8s Worker hosts do not recognize the GPU device.
Hi,
We recently deployed Ezmeral Runtime Enterprise 5.4 (GA) with an external Data Fabric 7.0 registered.
We have set up five (5) K8s worker hosts running CentOS 7.9. Each host is configured identically, with 40 CPU cores, 256 GB of RAM and 1 GPU device (Tesla P6) with NVIDIA driver version 470.103.01 / CUDA version 11.4. According to the Runtime online documentation here: https://docs.containerplatform.hpe.com/54/reference/nvidia-gpu-support/nvidia-gpus.html?hl=gpu%2Csupport, the NVIDIA driver version should be 470.57.02 or later, so our driver meets that requirement.
The GPU device on each K8s worker host is visible in the Ezmeral Runtime K8s hosts installation tab. A K8s cluster has been successfully deployed on Ezmeral Runtime, and an MLOps tenant has been created.
As a tenant member, I then deployed 5 instances of the KDApp Jupyter-Notebook, each requesting 1 GPU device. I then wanted to check whether the KDApp Jupyter-Notebook (image: bluedata/kd-notebook:3.1) can recognize the GPU, using the Python code below:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
I noticed that:
- 3 out of the 5 K8s worker hosts do NOT recognize the GPU. They only list /device:CPU:0.
- 2 out of the 5 worker hosts recognize the GPU. The output of the Python code above is /device:GPU:0 with physical device description "tesla P6, compute capability 6.1".
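To compare the working and non-working notebooks side by side, a slightly broader check could be run in each one. The following is only a minimal sketch: the helper name gpu_visibility_report is hypothetical, and it assumes that the NVIDIA container runtime exposes the GPU through the NVIDIA_VISIBLE_DEVICES environment variable and /dev/nvidia* device nodes, which may not match how Ezmeral Runtime actually wires the device into the pod.

```python
import glob
import os
import shutil
import socket


def gpu_visibility_report():
    """Collect basic facts about GPU visibility inside the current container.

    Hypothetical helper for comparing notebooks; not part of any HPE tooling.
    """
    report = {
        # Inside a pod this is normally the pod name, not the worker host name.
        "hostname": socket.gethostname(),
        # Assumption: these variables are set when a GPU is passed to the container.
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # None if the nvidia-smi binary is not on PATH in the container.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        # Empty list if no NVIDIA device nodes were mounted into the container.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
    }
    try:
        # Same TensorFlow call as in the check above.
        from tensorflow.python.client import device_lib
        report["tf_devices"] = [d.name for d in device_lib.list_local_devices()]
    except Exception as exc:  # TensorFlow missing or failing to initialize
        report["tf_devices"] = "unavailable: {}".format(exc)
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print("{}: {}".format(key, value))
```

If the device nodes or environment variables are missing in the three failing notebooks but present in the two working ones, that would suggest the GPU is not being passed into the container on those hosts, rather than a TensorFlow problem inside the image.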
We are struggling to figure out why some of the K8s worker hosts recognize the GPU device, whereas some other K8s worker hosts do not recognize the GPU.
Any idea how to troubleshoot further to identify what could be the issue on the K8s worker hosts that do not recognize the GPU?
Thanks,
Denis