GPU Operator¶
Deployment¶
Deploy from OperatorHub
./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
./run_toolbox.py gpu_operator undeploy_from_operatorhub
Examples:
./run_toolbox.py gpu_operator deploy_from_operatorhub
Installs the latest version available
./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7
Installs v1.7.0 from the v1.7 channel
./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable
Installs v1.6.2 from the stable channel
./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic
Forces the install plan approval to be set to Automatic
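The version/channel pairing shown in the examples follows a simple convention: releases before v1.7.0 live in the single stable channel, while later releases use a per-minor-version channel. A small sketch of that mapping, assuming semantic x.y.z version strings (channel_for_version is a hypothetical helper, not part of the toolbox):

```shell
#!/bin/sh
# Hypothetical helper: derive the OperatorHub channel name from a
# GPU Operator version string. Versions before 1.7.0 all lived in the
# single "stable" channel; from 1.7.0 on, the channel tracks the minor
# version (e.g. 1.7.0 -> v1.7).
channel_for_version() {
    version=$1
    major=${version%%.*}              # "1.7.0" -> "1"
    minor=${version#*.}               # "1.7.0" -> "7.0"
    minor=${minor%%.*}                # "7.0"   -> "7"
    if [ "$major" -gt 1 ] || [ "$minor" -ge 7 ]; then
        echo "v${major}.${minor}"
    else
        echo "stable"
    fi
}

channel_for_version 1.7.0   # -> v1.7
channel_for_version 1.6.2   # -> stable
```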
Note about the GPU Operator channel:
Before v1.7.0, the GPU Operator used a single channel name (stable). Within this channel, OperatorHub would automatically upgrade the operator to the latest available version. This was an issue, as the operator does not (yet) support being upgraded in place (removing and reinstalling is the official way). OperatorHub allows setting the upgrade approval to Manual, but this is not the default behavior.
Starting with v1.7.0, the channel is set to v1.7, so that OperatorHub won't trigger an automatic upgrade.
See the OpenShift Subscriptions and channels documentation for further information.
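For reference, deploying from OperatorHub boils down to creating an OLM Subscription with the chosen channel and approval mode. A sketch of an equivalent manifest, assuming the v1.7 channel and Manual approval (the namespace and startingCSV values are illustrative and may differ on your cluster):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators        # illustrative; depends on the install mode
spec:
  channel: v1.7                         # "stable" for versions before v1.7.0
  installPlanApproval: Manual           # avoids unsupported automatic upgrades
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.7.0   # illustrative CSV name
```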
List the versions available from OperatorHub
(not 100% reliable, the connection may time out)
toolbox/gpu-operator/list_version_from_operator_hub.sh
Usage:
toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
toolbox/gpu-operator/list_version_from_operator_hub.sh --help
Default values:
package-name: gpu-operator-certified
catalog-name: certified-operators
namespace: openshift-marketplace (controlled with NAMESPACE environment variable)
Deploy from NVIDIA helm repository
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_from_helm.sh
Deploy from a custom commit
./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]
Example:
./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master
Configuration¶
Set a custom repository list to use in the GPU Operator
ClusterPolicy
Using a repo-list file
./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]
Default values:
dest-dir-in-pod: /etc/distro.repos.d
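The repo-list file uses the standard yum .repo format; a minimal illustrative example (the repository id, name, and URL below are placeholders, not real repositories):

```ini
# Placeholder repository definition, copied into the driver pod
# under /etc/distro.repos.d by set_repo_config.
[custom-driver-repo]
name=Custom driver packages (placeholder)
baseurl=http://repo.example.com/rhel8/x86_64/
enabled=1
gpgcheck=0
```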
Testing and Waiting¶
Wait for the GPU Operator deployment and validate it
./run_toolbox.py gpu_operator wait_deployment
Run GPU-burn to validate that all the GPUs of all the nodes can run workloads
./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME] (in seconds)
Default values:
gpu-burn runtime: 30
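Deciding whether a GPU-burn run passed comes down to inspecting its log output. A simplified sketch of that kind of check (check_gpu_burn_log is a hypothetical helper, and the OK/FAULTY markers are assumptions about the GPU-burn log format, not guaranteed by the toolbox):

```shell
#!/bin/sh
# Hypothetical helper: classify a captured GPU-burn log.
# Assumes GPU-burn prints "OK" for healthy GPUs and "FAULTY"
# when a GPU produced compute errors during the burn.
check_gpu_burn_log() {
    log=$1
    if grep -q "FAULTY" "$log"; then
        echo FAIL        # at least one GPU reported compute errors
    elif grep -q "OK" "$log"; then
        echo PASS        # GPUs completed the burn without errors
    else
        echo UNKNOWN     # log did not contain a recognizable result
    fi
}
```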
Troubleshooting¶
Capture possible GPU Operator issues
(entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, …)
./run_toolbox.py entitlement test_cluster
./run_toolbox.py nfd has_labels
./run_toolbox.py nfd has_gpu_nodes
./run_toolbox.py gpu_operator wait_deployment
./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
./run_toolbox.py gpu_operator capture_deployment_state
or all in one step:
toolbox/gpu-operator/diagnose.sh
or with the must-gather script:
toolbox/gpu-operator/must-gather.sh
or with the must-gather image:
oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather
Cleaning Up¶
Uninstall and clean up stalled resources
helm (in particular) fails to deploy when any resource is left
over from a previously failed deployment, e.g.:
Error: rendered manifests contain a resource that already
exists. Unable to continue with install: existing resource
conflict: namespace: , name: gpu-operator, existing_kind:
rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind:
rbac.authorization.k8s.io/v1, Kind=ClusterRole
toolbox/gpu-operator/cleanup_resources.sh