
NVIDIA A100 driver installation failure on HPE DL385 Gen11 GPU server (4x A100) running Ub

 
SOLVED
mo3oota
Frequent Visitor


Hello everyone, I'm a beginner who recently started setting up a server. I'm having trouble with the following issue:

Environment

 

  • Server: DL385 Gen11

  • GPU: NVIDIA A100 80GB (4 cards)

  • OS: Ubuntu 24.04 LTS

  • NVIDIA Driver: 570 or 580


Issue Description

After installing the driver in the environment described above, running nvidia-smi results in an error.

The following error is displayed when checking dmesg:


 


kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:64:00.0)
kernel: nvidia 0000:64:00.0: probe with driver nvidia failed with error -1
kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)
kernel: nvidia 0000:02:00.0: probe with driver nvidia failed with error -1
kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:e3:00.0)
kernel: nvidia 0000:e3:00.0: probe with driver nvidia failed with error -1
kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:82:00.0)
kernel: nvidia 0000:82:00.0: probe with driver nvidia failed with error -1
kernel: NVRM: The NVIDIA probe routine failed for 4 device(s).
kernel: NVRM: None of the NVIDIA devices were initialized.
kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 511

Given this error, what steps should I take to diagnose and resolve the issue? I would appreciate your advice. Thank you.
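For anyone triaging a similar log, the NVRM lines can be summarized programmatically. This is a minimal Python sketch and not part of the original post: the regular expression and the function name find_unassigned_bars are illustrative assumptions based only on the line format shown above, and SAMPLE holds two of the lines from that log.

```python
import re

# Matches lines such as: "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)"
# (pattern is an assumption derived from the log excerpt in this thread)
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size_mb>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bars(log_text):
    """Return (pci_address, bar_index) pairs reported with size 0 at address 0."""
    hits = []
    for match in BAR_LINE.finditer(log_text):
        size_mb = int(match["size_mb"])
        address = int(match["addr"], 16)
        if size_mb == 0 and address == 0:
            hits.append((match["pci"], int(match["bar"])))
    return hits


# Two lines copied from the dmesg output above.
SAMPLE = """\
kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:64:00.0)
kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)
"""

for pci_address, bar_index in find_unassigned_bars(SAMPLE):
    print(f"{pci_address}: BAR{bar_index} has no address assigned")
```

A BAR reported as 0M at 0x0 suggests the firmware or kernel did not assign an address range to that device, which points toward BIOS/PCI resource settings rather than the driver package itself.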

 

4 REPLIES
Mamatha_J
HPE Pro
Solution

Re: NVIDIA A100 driver installation failure on HPE DL385 Gen11 GPU server (4x A100) runnin

Hi @mo3oota 

Please try the following:

Update the BIOS, iLO, and firmware (SPP for Gen11).

Enable Above 4G Decoding and Resizable BAR in the BIOS.

Use a validated NVIDIA driver (535-server) or a DKMS-compatible kernel.

Disable Secure Boot during driver installation.

Install the HPE NVIDIA Enablement Kit, if applicable.

Once these are done, nvidia-smi should correctly detect all 4 A100 GPUs.
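One way to check whether BIOS changes like the ones above took effect is to look at the BAR assignments Linux exposes in sysfs. This is a rough Python sketch, assuming the standard /sys/bus/pci/devices/<address>/resource file (one "start end flags" hex triple per line). The helper names and the "flags set but start is 0" heuristic are my own assumptions, and the sample values are made up for illustration.

```python
from pathlib import Path


def parse_resource(text, num_bars=6):
    """Parse sysfs 'resource' text into dicts for the first num_bars entries."""
    bars = []
    for index, line in enumerate(text.splitlines()[:num_bars]):
        start, end, flags = (int(field, 16) for field in line.split())
        # A slot with flags == 0 is unused (e.g. upper half of a 64-bit BAR).
        size = end - start + 1 if flags else 0
        bars.append({"bar": index, "start": start, "size": size, "flags": flags})
    return bars


def looks_unassigned(bar):
    # Heuristic (assumption): the slot is in use but has no start address.
    return bar["flags"] != 0 and bar["start"] == 0


def check_device(pci_address):
    """Return the BAR entries of one device that look unassigned."""
    path = Path("/sys/bus/pci/devices") / pci_address / "resource"
    return [bar for bar in parse_resource(path.read_text()) if looks_unassigned(bar)]


# Hypothetical sample: BAR0 assigned (16 MiB), BAR1 in use but without an address.
SAMPLE_RESOURCE = (
    "0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200\n"
    "0x0000000000000000 0x0000001fffffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
)

for bar in parse_resource(SAMPLE_RESOURCE):
    state = "UNASSIGNED" if looks_unassigned(bar) else "ok/unused"
    print(f"BAR{bar['bar']}: start={bar['start']:#x} size={bar['size']} {state}")
```

On the server itself, check_device("0000:64:00.0") would run the same check against one of the addresses from the dmesg output. An empty list means no BAR looks unassigned under this heuristic.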



Thanks & Regards,

Mamatha



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

Mamatha_J
HPE Pro

Re: NVIDIA A100 driver installation failure on HPE DL385 Gen11 GPU server (4x A100) runnin

Hi @mo3oota 
Please let us know if you find the solution useful. Please click on "Thumbs Up/Kudo" icon to give a "Kudo".

Thanks & Regards,

Mamatha 



mo3oota
Frequent Visitor

Re: NVIDIA A100 driver installation failure on HPE DL385 Gen11 GPU server (4x A100) runnin

Thank you to everyone who sent in replies.

The SR-IOV (Single Root I/O Virtualization) setting in the BIOS was set to "disable."

Once I changed this setting to "enable," nvidia-smi ran and displayed its output successfully.

Your help was truly appreciated. Thank you very much.
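To confirm a fix like this after a reboot, the GPU count can be checked from a script. This is a minimal Python sketch that assumes nvidia-smi is on the PATH and uses its --query-gpu and --format=csv,noheader options. The function names, the expected count of 4, and the sample text in the comments are illustrative assumptions, not output captured from this server.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"]


def parse_gpu_list(csv_text):
    """Parse 'index, name, bus_id' lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        # Index is first and bus ID is last; anything between is the name.
        gpus.append({
            "index": int(parts[0]),
            "name": ", ".join(parts[1:-1]),
            "bus_id": parts[-1],
        })
    return gpus


def all_gpus_visible(gpus, expected=4):
    """True when the number of reported GPUs matches the expected count."""
    return len(gpus) == expected


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        gpus = parse_gpu_list(result.stdout) if result.returncode == 0 else []
        for gpu in gpus:
            print(f"GPU {gpu['index']}: {gpu['name']} at {gpu['bus_id']}")
        print("all GPUs visible" if all_gpus_visible(gpus) else "GPU count mismatch")
```

If the count is lower than expected, the dmesg output is the next place to look for remaining BAR or probe errors.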