
NVIDIA AI Enterprise Implementation on VMware.

 
Praveen_M
HPE Pro


About:

NVAIE implementation on VMware: This guide covers an end-to-end installation of NVIDIA AI Enterprise (NVAIE) on VMware, from server BIOS setup to running a sample use case with the NGC library. It includes BIOS configuration, installation of ESXi and vCenter, and installation of the NVAIE host and guest drivers. Once the environment is configured, a data scientist or developer can use it to develop AI- and ML-related workloads with maximum GPU efficiency.

 

Contents

About
Documentation Information
Server Details – MindSparks LAB
Host details
Hypervisor details
vCenter Appliance Details
Guest VM-1
Server Setup
BIOS Setup
Single Root I/O Virtualization (SR-IOV) – Enabled
VT-d/IOMMU – Enabled
Hyperthreading – Enabled
Power Setting or System Profile - High Performance
Install ESXi
Install vCenter Appliance
vCenter implementation – Stage 2
Setup CPU power management policy
Install NVIDIA AI Enterprise Host Software
Preparing the VIB file for install
Installing the VIB on the ESXi host
Changing the Default Graphics Type in VMware vSphere
Change the GPU type to Shared Direct
Create an Ubuntu-based Virtual Machine
Create a VM (Ubuntu)
VM requirements
Configure MMIO settings for the VM
Enable GPU & parameters on VMs
Add a GPU to the guest VM
Change the PCI personality from vCenter
Disable Nouveau on the guest Ubuntu machines
Install the NVAIE driver on guest machines
Guest VM Licensing
Installing Docker and the Docker Utility Engine for NVIDIA GPUs
Installing the NVIDIA Container Toolkit
Configuring Docker
Rootless mode
Test the GPU function with a container
Install and set up NGC on the guest VM
Install NGC CLI on the Ubuntu Guest VM
Sample use case execution
Conclusion

Server Details – MindSparks LAB

Host details:

Host IP address: 10.25.41.12
iLO IP address: 10.25.42.12

Hypervisor details:

ESXi Host IP: 10.25.41.12
OS: ESXi 8.0

vCenter Appliance Details:

vCenter IP: https://x.x.x.x:443
vCenter Appliance version: VCSA 8.0

Guest VM-1:

VM name: nvdia-vm-1
IP: 10.25.41.15
OS: Ubuntu 22

Server Setup

BIOS Setup:

  1. Log in to the iLO of the server and make sure you are on the right host.
 


 

 

 

  2. Follow the prerequisites in the official documentation: https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/prereqs.html
  3. If using an NVIDIA A100, the following BIOS settings are required:
    1. Single Root I/O Virtualization (SR-IOV) – Enabled.
    2. VT-d/IOMMU – Enabled.
    3. Hyperthreading – Enabled.
    4. Power Setting or System Profile - High Performance.
    5. CPU Performance (if applicable) - Enterprise or High Throughput (Optional).
    6. Memory Mapped I/O above 4-GB - Enabled (if applicable) (Optional).

 

  4. Reboot the host and enter the BIOS.

Single Root I/O Virtualization (SR-IOV) – Enabled

System utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Virtualization Options > SR-IOV = Enabled > F10: Save

 

 

VT-d/IOMMU – Enabled

System utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Virtualization Options > VT-d = Enabled > F10: Save

 

 

Hyperthreading – Enabled

System Utilities > BIOS/Platform Configuration (RBSU) > Processor Options > Intel (R) Hyperthreading Options = Enabled

 

 

Power Setting or System Profile - High Performance.

System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Workload Profile = Virtualization - Max Performance

Note: This option may vary based on the ROM version / server model.

 

 

  5. Reboot the host and make sure all the above options are updated in the BIOS.

Install ESXi.

  1. Access the iLO of the GPU server.
  2. Mount the ESXi ISO image from the iLO.


 

 

 

 

  3. From the BIOS one-time boot menu, select the iLO virtual media.
  4. Initiate the ESXi installation.

 

 

 

 

 

 

  5. Press Enter to continue when you see the ESXi installer.


 

 

 

  6. Accept the End User License Agreement (EULA).


 

 

 

  7. Select a disk for the ESXi installation.

 

 

 

 

 

 

  8. Select the language.

 

 

 

  9. Enter the root password for ESXi.

 

 

 

 

  10. Initiate the ESXi installation.

 

 

 

  11. Unmount the image and boot the server normally.


 

 

 

  12. Validate that ESXi is installed correctly.

 

 

 

 

Install vCenter Appliance.

 

  1. Download the compatible VCSA image onto a Windows machine.
  2. Mount the image by right-clicking on it.


 

 
 
  3. Go to the below location and double-click the installer.

D:\vcsa-ui-installer\win32

 

 

  4. Click Install.
 

 

  5. Enter the ESXi IP, username, and password.
 

 

  6. Set up the vCenter Server VM.
 

 

  7. Select the deployment size.
 

 

  8. Select the datastore.
 

 

  9. Update the network settings.
 

 

  10. Verify the details and initiate the vCenter installation.
 

 

  11. Access the vCenter appliance with the configured IP address.
 

 

vCenter implementation – Stage 2

  1. Introduction.
 

 

  2. vCenter Server configuration.
 

 

  3. SSO configuration.
 

 

  4. Configure CEIP.
 

 

  5. Ready to complete > Finish.
 


 

 

 

  6. Completion of the appliance installation.
 

 

  7. Access the vCenter using the above-mentioned IP after the Stage 2 installation.

Setup CPU power management policy.

This setting must be changed on every host in vCenter; set each host's power management policy to High Performance.

 

 

Install NVIDIA AI Enterprise Host Software

Preparing the VIB file for install

 

Before you begin, download the archive containing the VIB file and extract its contents to a folder. The file ending in .vib is the one you must upload to the host datastore for installation. For demonstration purposes, these steps use the VMware vSphere web interface to upload the VIB to the server host.
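Alternatively, if SSH is enabled on the ESXi host, the VIB can be copied straight to the datastore with scp (a minimal sketch using this lab's host IP and the datastore path shown later in this guide):

scp NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240.vib root@10.25.41.12:/vmfs/volumes/datastore-5tb/vib/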

 

Note: Download the drivers from the NVIDIA Application Hub.

 

  1. Extract the software package - NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78
    1. Locate the NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240.vib file under the Host_Drivers folder.


 

 

 

  2. Upload the VIB in the vSphere Client.

 

 

 

 

  3. Upload the .vib file to the datastore.

 

 

 

 

Installing the VIB on the ESXi host:

  1. Place the host into maintenance mode (see the esxcli sketch below).
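This can be done from the vSphere Client, or over SSH with esxcli (a standard ESXi command, not shown in the original screenshots); the --enable false form is used to exit maintenance mode in the later step:

esxcli system maintenanceMode set --enable true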
 

 

  2. Use the esxcli command to install the NVIDIA AI Enterprise host software package:
    1. Navigate to the datastore folder where the .vib file is saved.

esxcli software vib install -v /vmfs/volumes/datastore-5tb/vib/NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240.vib
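To confirm the package registered with the host, you can list the installed VIBs (a quick generic check, not from the original guide):

esxcli software vib list | grep -i NVD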

 

 

  3. Exit maintenance mode.
 

 

  4. Reboot the ESXi host.
  5. Verify the installation of the VIB:

vmkload_mod -l | grep nvidia

 

 

  6. Verify that the NVIDIA kernel driver can successfully communicate with the NVIDIA physical GPUs in your system by running the nvidia-smi command:

nvidia-smi

 

 

Changing the Default Graphics Type in VMware vSphere

Change the GPU type to Shared Direct

  1. Log in to vCenter Server by using the vSphere Web Client.
  2. In the navigation tree, select your ESXi host and click the Configure tab.
  3. From the menu, choose Graphics and then click the Host Graphics tab.
  4. On the Host Graphics tab, click Edit.
  5. In the Edit Host Graphics Settings dialog box that opens, select Shared Direct and click OK.

After you click OK, the default graphics type changes to Shared Direct.

 

 

 

 

  6. Either restart the ESXi host, or stop and restart the Xorg service and nv-hostengine on the ESXi host. To stop and restart the Xorg service and nv-hostengine, perform these steps:

 

Stop the Xorg service.

[root@esxi:~] /etc/init.d/xorg stop

 

Stop nv-hostengine.

[root@esxi:~] nv-hostengine -t

 

 

 

  7. Wait for 1 second to allow nv-hostengine to stop.

 

Start nv-hostengine.

[root@esxi:~] nv-hostengine -d

Start the Xorg service.

[root@esxi:~] /etc/init.d/xorg start

 

 

 

  8. Check the status of the graphics card.

 

 

 

Create an Ubuntu-based Virtual Machine

Create a VM (Ubuntu)

  1. From vCenter, create a virtual machine (normal process).

VM requirements:

  1. CPU, RAM, and HDD specifications (minimum requirements).
 

 

Configure MMIO settings for the VM.

  1. Adjust the Memory Mapped I/O (MMIO) settings for the VM.
    1. Click Add Configuration Params and add the parameters below, setting the size to the MMIO space required for your GPU model.

 

pciPassthru.64bitMMIOSizeGB = 128

pciPassthru.use64bitMMIO = TRUE
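As a sizing rule of thumb from VMware's GPU passthrough guidance (not from the original post, so verify against the NVIDIA table for your GPU): take the total memory of all passthrough GPUs on the VM and round up to the next power of two. For a single 80-GB A100 that gives 128, matching the value above.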

 

 

 

   

 

 

 

Enable GPU & parameters on VMs

Add a GPU to the guest VM

  1. Open the settings of the guest VM.
  2. Go to Virtual Hardware.
  3. Add a new device.
  4. Under Other devices, select PCI device.
  5. Select the vGPU and add it.
 


 

 

 

 

 

 

 

  6. Check the status.
 

 

Change the PCI personality from vCenter

  1. Right-click on your VM and select "Edit Settings."
  2. Click on the "VM Options" tab.
  3. Select "Edit Configuration" from the "Advanced" drop-down list.
  4. Click "Add Row."
  5. Name: pciPassthru0.cfg.enable_uvm
  6. Value: 1
  7. Click "OK" to save.
  8. Click "Add Row" again.
  9. Name: pciPassthru1.cfg.enable_uvm
  10. Value: 1
  11. Click "OK" to save.

 

pciPassthru0.cfg.enable_uvm = 1

pciPassthru1.cfg.enable_uvm = 1
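For context (standard NVIDIA vGPU/passthrough behavior, not stated in the original post): these flags enable CUDA Unified Memory (UVM) for each passthrough GPU device, with one pciPassthruN.cfg.enable_uvm row per device assigned to the VM.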


 

 

 

  12. From the Summary tab, power on the VM.

 

 

 

Disable Nouveau on the guest Ubuntu machines.

Nouveau is an open-source graphics device driver for NVIDIA video cards and the Tegra family.

  1. Run the below command to verify whether Nouveau is loaded:

lsmod | grep nouveau

 

 

  2. If you see the above output, follow the below steps to disable Nouveau.

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau

options nouveau modeset=0

EOF


 

 

 

  3. Regenerate the kernel initramfs:

sudo update-initramfs -u

 

 

  4. Reboot the VM.
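After the reboot, rerunning the check should return no output if the blacklist took effect:

lsmod | grep nouveau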

Install the NVAIE driver on guest machines.

  1. Log in to the VM and check for updates.

sudo apt-get update

 

 

  2. Install the gcc compiler and the make tool in the terminal:

sudo apt-get install build-essential

  3. Download the NVIDIA AI Enterprise software driver and place it in the guest VM.
 

 

  4. Navigate to the directory containing the NVIDIA driver .run file, then add the executable permission to the driver file using the chmod command:

cd vgpu_guest_driver_2_1:510.73.08

sudo chmod +x NVIDIA-Linux-x86_64-510.73.08-grid.run

 

 

 

  5. Run the driver installer as the root user and accept the defaults:

sudo sh ./NVIDIA-Linux-x86_64-510.73.08-grid.run

 


 

 

 

  6. Reboot the guest VM.
  7. Check the GPU status using the below command on the guest VM:

 

nvidia-smi


 

 

 

  8. From the physical host side, we can see the sliced GPU status:

nvidia-smi mig -lgip

 

 

 

Guest VM Licensing.

To use an NVIDIA vGPU software licensed product, each client system to which a physical or virtual GPU is assigned must be able to obtain a license from the NVIDIA License System. A client system can be a VM that is configured with NVIDIA vGPU, a VM that is configured for GPU pass through, or a physical host to which a physical GPU is assigned in a bare-metal deployment.

  • Generating a Client Configuration Token
  • Configuring a Licensed Client on Linux

Reference: https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/first-vm.html#licensing-the-vm
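As a rough sketch of what those two referenced steps involve on an Ubuntu guest (the paths, FeatureType value, and service name follow NVIDIA's standard vGPU licensing flow rather than this guide, so verify against the linked documentation):

sudo cp client_configuration_token_*.tok /etc/nvidia/ClientConfigToken/
sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
sudo systemctl restart nvidia-gridd
nvidia-smi -q | grep -i license

In gridd.conf, FeatureType=1 selects the vGPU license before the nvidia-gridd service is restarted.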

Installing Docker and the Docker Utility Engine for NVIDIA GPUs

  1. Install Docker.
  2. Add Docker's official GPG key:

sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc


 

 

 

  3. Add the repository to APT sources:

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

 

 

 

  4. Install the Docker packages and check the status:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
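To confirm the daemon is running, a couple of generic Docker checks (not part of the original post):

sudo systemctl status docker
sudo docker run hello-world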

 

 

Installing the NVIDIA Container Toolkit

  1. Configure the production repository.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

 

 

 

 

  2. Optionally, configure the repository to use experimental packages.
 

 

  3. Update the package list from the repository:

sudo apt-get update

  4. Install the NVIDIA Container Toolkit packages:

sudo apt-get install -y nvidia-container-toolkit

 

 

 

 

 

 

Configuring Docker

  1. Configure the container runtime by using the nvidia-ctk command.

Note: The nvidia-ctk command modifies the /etc/docker/daemon.json file on the host. The file is updated so that Docker can use the NVIDIA Container Runtime.

sudo nvidia-ctk runtime configure --runtime=docker

 

 

  2. Restart the Docker daemon:

sudo systemctl restart docker

Rootless mode

Note: To configure the container runtime for Docker running in Rootless mode, follow these steps:

  1. Configure the container runtime by using the nvidia-ctk command:

 

nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json

 

  2. Restart the Rootless Docker daemon:

systemctl --user restart docker

  3. Configure /etc/nvidia-container-runtime/config.toml by using the sudo nvidia-ctk command:

sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place

Test the GPU function with a container

  1. Execute the below command to test the GPU function on the guest VM:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

 

 

Install and set up NGC on the guest VM

Install NGC CLI on the Ubuntu Guest VM

  1. Enter the NVIDIA NGC website as a guest user.
    1. https://org.ngc.nvidia.com/setup/installers/cli
  2. In the top right corner, click Welcome Guest and then select Setup from the menu.
  3. Click Downloads under Install NGC CLI from the Setup page.
  4. From the CLI Install page, click the Windows, Linux, or MacOS tab, according to the platform from which you will be running NGC Catalog CLI.
 

 

  5. Execute the below command on the guest VM:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.44.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip

  6. Make the NGC CLI binary executable and add its directory to your PATH.
  7. Execute the file from the same directory.
  8. Check the NGC version.

chmod u+x ngc-cli/ngc

 

echo "export PATH=\"\$PATH:$/root/praveen/ngc_cli/ngc-cli/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile

 

ngc --version
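Before pulling private resources, the CLI is typically authenticated with your NGC API key (standard NGC CLI usage, not shown in the original screenshots):

ngc config set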

 
 


Sample use case execution.

Note: An example container pull using the NVIDIA RAPIDS production branch is provided below. RAPIDS, part of NVIDIA CUDA-X, is an open-source suite of GPU-accelerated data science and AI libraries with APIs that match the most popular open-source data tools, accelerating performance by orders of magnitude at scale across data pipelines.

  1. Log in to the NVIDIA container registry:

docker login nvcr.io

Username: $oauthtoken

Password: <my-api-key-from-your-ngc-account>

 


 

 

 

  2. Log in to NGC and pull a container for the sample use case execution.
  3. Access the Jupyter notebook at the exposed local IP:port (see the note after the command below).
 

 

docker run --rm -it --pull always --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -e EXTRA_CONDA_PACKAGES="jq" -e EXTRA_PIP_PACKAGES="beautifulsoup4" -p 8888:8888 rapidsai/notebooks:24.04-cuda11.8-py3.11
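With the -p 8888:8888 mapping above, the notebook server should be reachable from a browser at http://<guest-vm-ip>:8888 (for the lab VM in this guide, that would be http://10.25.41.15:8888).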

 

 

Conclusion

This guide to NVAIE implementation on VMware offers a thorough, step-by-step approach to setting up a robust AI and ML environment. By detailing the process from server BIOS configuration to leveraging the NGC library, it ensures that data scientists and developers can use GPU resources efficiently. Integrating NVAIE on VMware not only enhances the performance of AI and ML workloads but also helps users harness the full potential of their hardware, driving innovation and productivity.

