GPU Operator¶
Deployment¶
Deploy from OperatorHub
./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
./run_toolbox.py gpu_operator undeploy_from_operatorhub
Examples:
./run_toolbox.py gpu_operator deploy_from_operatorhub
Installs the latest version available
./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7
Installs v1.7.0 from the v1.7 channel
./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable
Installs v1.6.2 from the stable channel
./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic
Forces the install plan approval to be set to Automatic
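The version/channel pairing shown in the examples follows a simple convention: releases before v1.7.0 live in the single stable channel, while later releases use a per-minor-version channel. A small sketch of that mapping, assuming semantic x.y.z version strings (channel_for_version is a hypothetical helper, not part of the toolbox):

```shell
#!/bin/sh
# Hypothetical helper: derive the OperatorHub channel name from a
# GPU Operator version string. Versions before 1.7.0 all lived in the
# single "stable" channel; from 1.7.0 on, the channel tracks the minor
# version (e.g. 1.7.0 -> v1.7).
channel_for_version() {
    version=$1
    major=${version%%.*}              # "1.7.0" -> "1"
    minor=${version#*.}               # "1.7.0" -> "7.0"
    minor=${minor%%.*}                # "7.0"   -> "7"
    if [ "$major" -gt 1 ] || [ "$minor" -ge 7 ]; then
        echo "v${major}.${minor}"
    else
        echo "stable"
    fi
}

channel_for_version 1.7.0   # -> v1.7
channel_for_version 1.6.2   # -> stable
```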
Note about the GPU Operator channel:
Before v1.7.0, the GPU Operator used a single channel name (stable). Within this channel, OperatorHub would automatically upgrade the operator to the latest available version. This was an issue, as the operator does not (yet) support being upgraded in place (removing and reinstalling is the official way). OperatorHub allows setting the upgrade approval to Manual, but this is not the default behavior.
Starting with v1.7.0, the channel is set to v1.7, so that OperatorHub won't trigger an automatic upgrade.
See the OpenShift Subscriptions and channels documentation for further information.
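For reference, deploying from OperatorHub boils down to creating an OLM Subscription with the chosen channel and approval mode. A sketch of an equivalent manifest, assuming the v1.7 channel and Manual approval (the namespace and startingCSV values are illustrative and may differ on your cluster):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators        # illustrative; depends on the install mode
spec:
  channel: v1.7                         # "stable" for versions before v1.7.0
  installPlanApproval: Manual           # avoids unsupported automatic upgrades
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.7.0   # illustrative CSV name
```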
List the versions available from OperatorHub
(not 100% reliable, the connection may time out)
toolbox/gpu-operator/list_version_from_operator_hub.sh
Usage:
toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
toolbox/gpu-operator/list_version_from_operator_hub.sh --help
Default values:
package-name: gpu-operator-certified
catalog-name: certified-operators
namespace: openshift-marketplace (controlled with NAMESPACE environment variable)
Deploy from NVIDIA helm repository
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_from_helm.sh
Deploy from a custom commit
./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]
Example:
./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master
Configuration¶
Set a custom repository list to use in the GPU Operator
ClusterPolicy
Using a repo-list file
./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]
Default values:
dest-dir-in-pod: /etc/distro.repos.d
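The repo-list file uses the standard yum .repo format; a minimal illustrative example (the repository id, name, and URL below are placeholders, not real repositories):

```ini
# Placeholder repository definition, copied into the driver pod
# under /etc/distro.repos.d by set_repo_config.
[custom-driver-repo]
name=Custom driver packages (placeholder)
baseurl=http://repo.example.com/rhel8/x86_64/
enabled=1
gpgcheck=0
```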
Testing and Waiting¶
Wait for the GPU Operator deployment and validate it
./run_toolbox.py gpu_operator wait_deployment
Run GPU-burn to validate that all the GPUs of all the nodes can run workloads
./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME] (in seconds)
Default values:
gpu-burn runtime: 30
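Deciding whether a GPU-burn run passed comes down to inspecting its log output. A simplified sketch of that kind of check (check_gpu_burn_log is a hypothetical helper, and the OK/FAULTY markers are assumptions about the GPU-burn log format, not guaranteed by the toolbox):

```shell
#!/bin/sh
# Hypothetical helper: classify a captured GPU-burn log.
# Assumes GPU-burn prints "OK" for healthy GPUs and "FAULTY"
# when a GPU produced compute errors during the burn.
check_gpu_burn_log() {
    log=$1
    if grep -q "FAULTY" "$log"; then
        echo FAIL        # at least one GPU reported compute errors
    elif grep -q "OK" "$log"; then
        echo PASS        # GPUs completed the burn without errors
    else
        echo UNKNOWN     # log did not contain a recognizable result
    fi
}
```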
Troubleshooting¶
Capture possible GPU Operator issues
(entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, …)
./run_toolbox.py entitlement test_cluster
./run_toolbox.py nfd has_labels
./run_toolbox.py nfd has_gpu_nodes
./run_toolbox.py gpu_operator wait_deployment
./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
./run_toolbox.py gpu_operator capture_deployment_state
or all in one step:
toolbox/gpu-operator/diagnose.sh
or with the must-gather script:
toolbox/gpu-operator/must-gather.sh
or with the must-gather image:
oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather
Cleaning Up¶
Uninstall and clean up stalled resources
helm (in particular) fails to deploy when any resource is left
over from a previously failed deployment, e.g.:
Error: rendered manifests contain a resource that already
exists. Unable to continue with install: existing resource
conflict: namespace: , name: gpu-operator, existing_kind:
rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind:
rbac.authorization.k8s.io/v1, Kind=ClusterRole
toolbox/gpu-operator/cleanup_resources.sh