# GPU Support
Shoc Platform fully supports running GPU-accelerated workloads on your Kubernetes clusters, allowing you to leverage the computational power of NVIDIA GPUs for tasks such as machine learning training, scientific simulations, and data processing.
## Utilizing `nvidia.com/gpu` Resources

To enable the use of `nvidia.com/gpu` resources within your Kubernetes cluster and allow Shoc Platform to schedule jobs requiring GPUs, you must install the NVIDIA GPU Operator. This operator automates the deployment and management of all necessary NVIDIA software components, including:
- NVIDIA drivers
- NVIDIA Container Toolkit (`nvidia-container-runtime`)
- Kubernetes device plugin for GPUs
- DCGM (Data Center GPU Manager)
- and more.
Prerequisites:
- Your Kubernetes cluster must have NVIDIA GPU hardware available on its nodes.
- Your Kubernetes cluster must be online and attached to Shoc Platform, as described in the Clusters page.
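Before installing, you can run a quick sanity check (assuming your `kubectl` context points at the attached cluster) to confirm the nodes are reachable and Ready:

```bash
# Confirm the cluster is reachable and all nodes report Ready status
kubectl get nodes -o wide
```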
### Installing the NVIDIA GPU Operator

The recommended method for installing the NVIDIA GPU Operator is via Helm. Ensure Helm is installed on the machine where you run `kubectl` commands.
1. **Add the NVIDIA Helm repository.** First, add the official NVIDIA Helm chart repository:

   ```bash
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```
2. **Create a namespace for the GPU Operator.** It's good practice to install the GPU Operator in its own namespace:

   ```bash
   kubectl create namespace gpu-operator
   ```
3. **Install the NVIDIA GPU Operator.** Now, install the GPU Operator using Helm. Replace `gpu-operator` with your desired namespace if you chose a different one:

   ```bash
   helm install --wait \
     --generate-name \
     -n gpu-operator \
     nvidia/gpu-operator
   ```

   The `--wait` flag makes Helm wait until all components are deployed and ready. This process can take several minutes, as it involves downloading large container images and installing drivers.
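Once the command returns, you can confirm the release deployed successfully. Note that `--generate-name` produces a generated release name, so yours will differ:

```bash
# The STATUS column should read "deployed" for the gpu-operator release
helm list -n gpu-operator
```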
### Verifying the Installation

After the installation completes, you can verify that the NVIDIA GPU Operator components are running correctly and that your GPUs are exposed to Kubernetes:
1. **Check pod status.** Ensure all pods in the `gpu-operator` namespace are running:

   ```bash
   kubectl get pods -n gpu-operator
   ```

   You should see various pods like `nvidia-driver-daemonset`, `nvidia-container-toolkit-daemonset`, `nvidia-device-plugin-daemonset`, etc., in a `Running` or `Completed` state.
2. **Verify GPU resources on nodes.** Check that your Kubernetes nodes are reporting `nvidia.com/gpu` resources:

   ```bash
   kubectl get nodes -o json | jq '.items[].status.allocatable."nvidia.com/gpu"'
   ```

   This command should output the number of GPUs allocatable on each node (e.g., `"1"`, `"2"`). If it shows `null` or `0`, the GPU Operator might not be fully functional or your nodes might not have detected GPUs. An alternative check that does not require `jq` is shown below this list.
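If `jq` is not available, `kubectl describe` works just as well; look for `nvidia.com/gpu` under the node's Capacity and Allocatable sections (replace `<node-name>` with one of your node names):

```bash
# Inspect a node's allocatable resources directly
kubectl describe node <node-name> | grep -A 10 Allocatable
```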
Once `nvidia.com/gpu` resources are visible on your nodes, Shoc Platform will be able to schedule jobs that request these resources.
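For a quick end-to-end smoke test (not specific to Shoc Platform), you can submit a pod that requests one GPU and runs `nvidia-smi`. The image tag below is only an example CUDA base image; substitute any image that includes `nvidia-smi`:

```bash
# Launch a throwaway pod that requests one GPU and prints driver/GPU info
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    # Example tag; any CUDA base image with nvidia-smi works
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# After the pod completes, its logs should show the nvidia-smi table
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test
```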
## Further Documentation
For the most up-to-date and comprehensive information on the NVIDIA GPU Operator, including troubleshooting and advanced configurations, please refer to the official NVIDIA documentation:
- NVIDIA GPU Operator Documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html
- NVIDIA Container Toolkit (for low-level GPU access in containers): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html