============
GPU Operator
============

Deployment
==========

* Deploy from OperatorHub

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
    ./run_toolbox.py gpu_operator undeploy_from_operatorhub

**Examples:**

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub``

  - Installs the latest version available

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7``

  - Installs ``v1.7.0`` from the ``v1.7`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable``

  - Installs ``v1.6.2`` from the ``stable`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic``

  - Forces the install plan approval to be set to ``Automatic``

**Note about the GPU Operator channel:**

- Before ``v1.7.0``, the GPU Operator used a single channel name (``stable``). Within this channel, OperatorHub would automatically upgrade the operator to the latest available version. This was an issue, as the operator does not yet support being upgraded in place (removing and reinstalling it is the official procedure). OperatorHub allows setting the upgrade approval to ``Manual``, but this is not the default behavior.

- Starting with ``v1.7.0``, the channel is set to ``v1.7``, so OperatorHub will not trigger an automatic upgrade.

- See the `OpenShift Subscriptions and channel documentation`_ for further information.

.. _OpenShift Subscriptions and channel documentation: https://docs.openshift.com/container-platform/4.7/operators/understanding/olm/olm-understanding-olm.html#olm-subscription_olm-understanding-olm

* List the versions available from OperatorHub (not 100% reliable, the connection may time out)

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh

**Usage:**

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
    toolbox/gpu-operator/list_version_from_operator_hub.sh --help

*Default values:*

.. code-block:: shell

    package-name: gpu-operator-certified
    catalog-name: certified-operators
    namespace: openshift-marketplace (controlled with the NAMESPACE environment variable)

* Deploy from the NVIDIA Helm repository

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_helm.sh
    toolbox/gpu-operator/deploy_from_helm.sh
    toolbox/gpu-operator/undeploy_from_helm.sh

* Deploy from a custom commit

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master

Configuration
=============

* Set a custom repository list to use in the GPU Operator ``ClusterPolicy``

  *Using a repo-list file*

.. code-block:: shell

    ./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]

**Default values:**

- *dest-dir-in-pod*: ``/etc/distro.repos.d``

Testing and Waiting
===================

* Wait for the GPU Operator deployment and validate it (a sketch of the equivalent manual checks follows at the end of this section)

.. code-block:: shell

    ./run_toolbox.py gpu_operator wait_deployment

* Run `GPU-burn`_ to validate that all the GPUs of all the nodes can run workloads

.. code-block:: shell

    ./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME, in seconds]

**Default values:**

.. code-block:: shell

    gpu-burn runtime: 30

.. _GPU-burn: https://github.com/openshift-psap/gpu-burn
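If you prefer to check the deployment by hand, the following is a minimal sketch of roughly what ``wait_deployment`` and ``run_gpu_burn`` automate. The namespace and DaemonSet names below assume a default GPU Operator installation and may differ depending on the operator version.

.. code-block:: shell

    # Check that all the GPU Operator operands are running
    # (the 'gpu-operator-resources' namespace assumes a default installation)
    oc get pods -n gpu-operator-resources

    # Confirm that the driver sees the GPUs on one of the GPU nodes
    # (the DaemonSet name is an assumption, verify it with 'oc get ds')
    oc exec -n gpu-operator-resources ds/nvidia-driver-daemonset -- nvidia-smi

    # Check that the GPUs are advertised as allocatable node resources
    oc get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'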
Troubleshooting
===============

* Capture possible GPU Operator issues (entitlement, NFD labelling, operator deployment, state of the resources in ``gpu-operator-resources``, ...)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster
    ./run_toolbox.py nfd has_labels
    ./run_toolbox.py nfd has_gpu_nodes
    ./run_toolbox.py gpu_operator wait_deployment
    ./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
    ./run_toolbox.py gpu_operator capture_deployment_state

or all in one step:

.. code-block:: shell

    toolbox/gpu-operator/diagnose.sh

or with the must-gather script:

.. code-block:: shell

    toolbox/gpu-operator/must-gather.sh

or with the must-gather image:

.. code-block:: shell

    oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather

Cleaning Up
===========

* Uninstall and clean up stalled resources

``helm`` (in particular) fails to deploy when any resource is left over from a previously failed deployment, e.g.:

.. code-block::

    Error: rendered manifests contain a resource that already exists. Unable to
    continue with install: existing resource conflict: namespace: , name:
    gpu-operator, existing_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole,
    new_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole

.. code-block:: shell

    toolbox/gpu-operator/cleanup_resources.sh
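For reference, the kind of leftover cluster-scoped resources the cleanup script targets can also be inspected and removed manually. This is only an illustrative sketch: the exact resource names depend on the deployment method and GPU Operator version, so the ``gpu-operator`` names below are assumptions to be verified with ``oc get`` before deleting anything.

.. code-block:: shell

    # Look for leftovers from a previous deployment
    # (resource names are assumptions, verify them before deleting)
    oc get clusterpolicy,clusterrole,clusterrolebinding 2>/dev/null | grep gpu-operator

    # Remove the ClusterPolicy, the operand namespace and the conflicting RBAC objects
    oc delete clusterpolicy --all --ignore-not-found
    oc delete namespace gpu-operator-resources --ignore-not-found
    oc delete clusterrole gpu-operator --ignore-not-found
    oc delete clusterrolebinding gpu-operator --ignore-not-found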