Creating a New Orchestration

You’re working on a new perf&scale test project, and you want to have it automated and running in the CI? Good! Do you already have your test architecture in mind? And is your toolbox ready? Perfect, so we can start building the orchestration!

Prepare the environment

To create an orchestration, go to projects/PROJECT_NAME/testing and prepare the following boilerplate code.

Mind that PROJECT_NAME should be a valid Python package name (no hyphens), to keep things simple.
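As a quick sanity check, the naming constraint can be verified before creating the directory. The helper below is a hypothetical sketch (it is not part of TOPSAIL) that relies only on the Python standard library:

```python
import keyword


def is_valid_project_name(name: str) -> bool:
    """Return True if `name` can be used as a Python package name."""
    # str.isidentifier() rejects hyphens, spaces and leading digits;
    # reserved keywords are valid identifiers but cannot be imported
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_project_name("fine_tuning"))  # True
print(is_valid_project_name("fine-tuning"))  # False
```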

Prepare the Python entrypoint file, config.yaml and command_args.yaml.j2

These files are all that is required for a configurable orchestration layer.

  • the Python entrypoint file should contain these entrypoints, for interacting with the CI:

import logging
import subprocess
import sys

import fire

# env and export come from the TOPSAIL core library


def prepare_ci():
    """
    Prepares the cluster and the namespace for running the tests
    """
    ...


def test_ci():
    """
    Runs the test from the CI
    """
    ...


def cleanup_cluster(mute=False):
    """
    Restores the cluster to its original state
    """
    # _Not_ executed in OpenShift CI cluster (running on AWS). Only required for running in bare-metal environments.
    ...


@entrypoint(ignore_secret_path=True, apply_preset_from_pr_args=False)
def generate_plots_from_pr_args():
    """
    Generates the visualization reports from the PR arguments
    """
    ...

    export.export_artifacts(env.ARTIFACT_DIR, test_step="plot")


class Entrypoint:
    """
    Commands for launching the CI tests
    """

    def __init__(self):

        self.prepare_ci = prepare_ci
        self.test_ci = test_ci
        self.cleanup_cluster_ci = cleanup_cluster
        self.export_artifacts = export_artifacts

        self.generate_plots_from_pr_args = generate_plots_from_pr_args


def main():
    # Print help rather than opening a pager
    fire.core.Display = lambda lines, out: print(*lines, file=out)

    # Expose the Entrypoint commands on the command line
    fire.Fire(Entrypoint())


if __name__ == "__main__":
    try:
        sys.exit(main())
    except subprocess.CalledProcessError as e:
        logging.error(f"Command '{e.cmd}' failed --> {e.returncode}")
        sys.exit(1)
    except KeyboardInterrupt:
        print() # empty line after ^C
        sys.exit(1)
  • config.yaml should contain:

  # name of the presets to apply, or null if no preset
  name: null
  # list of names of the presets to apply, or a single name, or null if no preset
  names: null

    clusters.create.type: single

    clusters.create.keep: true
    clusters.create.ocp.tags.Project: PSAP/Project/...
    # clusters.create.ocp.tags.TicketId:

  light_cluster: cluster_light

    extends: [light_cluster]


    name: psap-ods-secret
  # name of the file containing the properties of LDAP secrets
  s3_ldap_password_file: s3_ldap.passwords
  keep_cluster_password_file: get_cluster.password
  opensearch_instances: opensearch.yaml
  aws_credentials: .awscred
  git_credentials: git-credentials

    ...: ...
    type: single # can be: single, ocp, managed
    keep: false
    name_prefix: fine-tuning-ci
      # list of tags to apply to the machineset when creating the cluster
        # TicketId: "..."
        Project: PSAP/Project/...
        target: cluster
      version: 4.15.9
      region: us-west-2
        type: m6a.xlarge
        type: m6a.2xlarge
        count: 2

    is_metal: false
      name: null
      dedicated: true
        name: workload-pods
        type: m6i.2xlarge
        count: null
          key: only-workload-pods
          value: "yes"
          effect: NoSchedule
    is_metal: false
      dedicated: true
        name: test-pods
        count: null
        type: m6i.2xlarge
          key: only-test-pods
          value: "yes"
          effect: NoSchedule
  cleanup_on_exit: false

  preset: null
  workload: projects....visualizations...
  prom_workload: projects....visualizations....
  config_file: plots.yaml
    mode: prefer_cache
    # if true, copy the results downloaded by `matbench download` into the artifacts directory
    save_to_artifacts: false
  # directory to plot. Set by testing/common/ before launching the visualization
  test_directory: null
    generate: true
      test_name: null
        enabled: false
        enabled_on_replot: false
        fail_test_on_fail: true
      instance: smoke
      index: ...
      index_prefix: ""
      prom_index_suffix: -prom
      enabled: false
      # if the regression analyses fail, mark the test as failed
      fail_test_on_regression: false
  enabled: false
  bucket: rhoai-cpt-artifacts
  path_prefix: cpt/fine-tuning
  dest: null # will be set by the export code
  • command_args.yaml.j2 should start with:

{% set secrets_location = false | or_env(secrets.dir.env_key) %}
{% if not secrets_location %}
  {{ ("ERROR: secrets_location must be defined (env(secrets.dir.env_key=" + secrets.dir.env_key|string + ")) ") | raise_exception }}
{% endif %}
{% set s3_ldap_password_location = secrets_location + "/" + secrets.s3_ldap_password_file %}

# ---

Copy the cluster creation files

These files are necessary to be able to create clusters on OpenShift CI (/test rhoai-e2e). They shouldn’t be modified.

Now the boilerplate code is in place, and we can start building the test orchestration.

Create the test and prepare files

From this step on, the development of the test orchestration begins, and you “just” have to fill in the gaps :)

In the prepare_ci method, prepare your cluster, according to the configuration. In the test_ci method, run your test and collect its artifacts. In the cleanup_cluster_ci method, clean up your cluster, so that it can be used again for another test.

Start building your test orchestration

Once the boilerplate code is in place, we can start building the test orchestration. TOPSAIL provides some “low level” helper modules:

from projects.core.library import env, config, run, configure_logging, export

as well as libraries of common orchestration bits:

from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod
from projects.gpu_operator.library import prepare_gpu_operator
from projects.matrix_benchmarking.library import visualize

These libraries are illustrated below. They are not formally described at the moment. They come from project code blocks that were noticed to be used identically across projects, so they have been moved to library directories to make them easier to reuse.

Sharing code across projects means extending the risk of unnoticed bugs when updating the library. With this in mind, the question of code sharing vs code duplication takes another direction, as extensive testing is not easy in such a rapidly evolving project.

Core helper modules

The run module

  • helper functions to run system commands, toolbox commands, and from_config toolbox commands:

def run(command, capture_stdout=False, capture_stderr=False, check=True, protect_shell=True, cwd=None, stdin_file=None, log_command=True)

This method allows running a command, capturing or not its stdout/stderr, checking its return code, changing its working directory, protecting it with bash safety flags (set -o errexit;set -o pipefail;set -o nounset;set -o errtrace), passing a file as stdin, logging or not the command, …
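To illustrate what the bash safety flags change, here is a minimal sketch built on subprocess. The run_protected function is a hypothetical stand-in, not the TOPSAIL implementation, and it assumes that bash is available on the PATH. It only covers a few of the options listed above:

```python
import subprocess

BASH_SAFETY_FLAGS = "set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;"


def run_protected(command, capture_stdout=False, check=True, protect_shell=True, cwd=None):
    """Hypothetical stand-in for run.run(), limited to a few of its options."""
    if protect_shell:
        # prepend the safety flags so that any failing step aborts the script
        command = f"{BASH_SAFETY_FLAGS}\n{command}"

    return subprocess.run(
        ["bash", "-c", command],
        stdout=subprocess.PIPE if capture_stdout else None,
        check=check,  # raises subprocess.CalledProcessError on a non-zero exit code
        cwd=cwd,
        text=True,
    )


result = run_protected("echo hello | tr a-z A-Z", capture_stdout=True)
print(result.stdout.strip())  # HELLO

# with pipefail, a failure in the middle of a pipeline is no longer hidden
protected = run_protected("false | true", check=False)
unprotected = run_protected("false | true", check=False, protect_shell=False)
print(protected.returncode, unprotected.returncode)
```

Without the flags, the exit code of `false | true` is the one of the last command, so the failure of the first command goes unnoticed.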

def run_toolbox(group, command, artifact_dir_suffix=None, run_kwargs=None, mute_stdout=None, check=None, **kwargs)

This command allows running a toolbox command. group, command, kwargs are the CLI toolbox command arguments. run_kwargs allows passing arguments directly to the run command described above. mute_stdout allows muting (capturing) the stdout text. check allows disabling the exception on error check. artifact_dir_suffix allows appending a suffix to the toolbox directory name (eg, to distinguish two identical calls in the artifacts).

def run_toolbox_from_config(group, command, prefix=None, suffix=None, show_args=None, extra=None, artifact_dir_suffix=None, mute_stdout=False, check=True, run_kwargs=None)

This command allows running a toolbox command with the from_config helper (see the description of the command_args.yaml.j2 file). prefix and suffix allow distinguishing commands in the command_args.yaml.j2 file. extra allows passing extra arguments that override what is in the template file. show_args only displays the arguments that would be passed to the toolbox command.

  • run_and_catch is a helper function for chaining multiple functions without swallowing exceptions:

exc = None
exc = run.run_and_catch(
  exc,
  run.run_toolbox, "kserve", "capture_operators_state", run_kwargs=dict(capture_stdout=True),
)

exc = run.run_and_catch(
  exc,
  run.run_toolbox, "cluster", "capture_environment", run_kwargs=dict(capture_stdout=True),
)

if exc: raise exc
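The chaining pattern can be sketched as follows. This is a simplified, hypothetical re-implementation, written under the assumption that the helper receives the exception caught so far as its first argument. It is not the TOPSAIL code:

```python
def run_and_catch(exc, fct, *args, **kwargs):
    """Call fct and return the first exception seen so far (or None)."""
    try:
        fct(*args, **kwargs)
    except Exception as e:
        # keep the earliest exception, so that the root cause is the one reported
        return exc if exc is not None else e
    return exc


def failing_step():
    raise RuntimeError("capture failed")


calls = []
exc = None
exc = run_and_catch(exc, failing_step)
# the second step still runs, although the first one failed
exc = run_and_catch(exc, calls.append, "second step")

print(calls, repr(exc))
```

Every step gets a chance to run (useful for capture and cleanup steps), and the caller decides at the end whether to raise the recorded exception.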
  • helper context to run functions in parallel. If exit_on_exception is set, the code will exit the process when an exception is caught. Otherwise it will simply raise it. If dedicated_dir is set, a dedicated directory, based on the name parameter, will be created.

class Parallel(object):
    def __init__(self, name, exit_on_exception=True, dedicated_dir=True):
        ...


def prepare():
  with run.Parallel("prepare1") as parallel:
      ...  # parallel.delayed(...) calls

  test_settings = config.project.get_config("tests.fine_tuning.test_settings")
  with run.Parallel("prepare2") as parallel:
      parallel.delayed(prepare_namespace, test_settings)

  with run.Parallel("prepare3") as parallel:
      ...  # parallel.delayed(...) calls
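The behavior of such a context can be approximated with a thread pool. The class below is a simplified, hypothetical sketch (it ignores dedicated_dir and is not the TOPSAIL implementation): calls registered with delayed() only start when the with block ends, and the block returns once all of them have completed.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


class Parallel:
    """Hypothetical, simplified stand-in for run.Parallel."""

    def __init__(self, name, exit_on_exception=True):
        self.name = name
        self.exit_on_exception = exit_on_exception
        self.tasks = []

    def __enter__(self):
        return self

    def delayed(self, function, *args, **kwargs):
        # only record the call; nothing runs before the 'with' block ends
        self.tasks.append((function, args, kwargs))

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None or not self.tasks:
            return False  # the block itself failed, or there is nothing to run

        with ThreadPoolExecutor(max_workers=len(self.tasks)) as pool:
            futures = [pool.submit(fct, *args, **kwargs) for fct, args, kwargs in self.tasks]

        # the pool has been shut down, so every future is complete here
        for future in futures:
            exc = future.exception()
            if exc is None:
                continue
            if self.exit_on_exception:
                sys.exit(1)
            raise exc

        return False


results = []
with Parallel("prepare", exit_on_exception=False) as parallel:
    parallel.delayed(results.append, "namespace")
    parallel.delayed(results.append, "gpu")

print(sorted(results))  # ['gpu', 'namespace']
```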

The env module

  • ARTIFACT_DIR provides thread-safe access to the storage directory. Prefer this over $ARTIFACT_DIR, which isn’t thread-safe.

  • helper context to create a dedicated artifact directory. Following the OpenShift CI convention, TOPSAIL relies on the ARTIFACT_DIR environment variable to store its artifacts. Each toolbox command creates a new directory named nnn__group__command, which keeps the directories ordered and easy to follow. However, when many commands are executed, sometimes in parallel, the number of directories increases and the layout becomes hard to understand. This command allows creating subdirectories, to group things logically:


with env.NextArtifactDir("prepare_namespace"):
    ...  # the artifacts created in this block are stored in the new subdirectory
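The idea of the nested artifact directory can be sketched with a context manager. The ArtifactEnv class and its numbering scheme (starting at 000) are assumptions made for this illustration. Unlike the real env module, this sketch is not thread-safe:

```python
import contextlib
import pathlib
import tempfile


class ArtifactEnv:
    """Hypothetical stand-in for the env module (not thread-safe)."""

    def __init__(self, root):
        self.ARTIFACT_DIR = pathlib.Path(root)

    @contextlib.contextmanager
    def NextArtifactDir(self, name):
        parent = self.ARTIFACT_DIR
        # number the new directory after the directories already present
        index = sum(1 for entry in parent.iterdir() if entry.is_dir())
        dirname = parent / f"{index:03d}__{name}"
        dirname.mkdir()

        self.ARTIFACT_DIR = dirname
        try:
            yield dirname
        finally:
            # restore the parent directory, even if the block failed
            self.ARTIFACT_DIR = parent


env = ArtifactEnv(tempfile.mkdtemp())
root = env.ARTIFACT_DIR

with env.NextArtifactDir("prepare_namespace") as first:
    (env.ARTIFACT_DIR / "namespace.yaml").write_text("kind: Namespace\n")

with env.NextArtifactDir("run_test") as second:
    pass

print(first.name, second.name)  # 000__prepare_namespace 001__run_test
```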

The config module

  • the config.project.get_config(<config key>) helper command to access the configuration. Uses the inline JSON format. This object holds the main project configuration.

  • the config.project.set_config(<config key>, <value>) helper command to update the configuration. Sometimes, it is convenient to store values in the configuration (eg, coming from the command-line). Mind that this is not thread-safe (an error is raised if this command is called in a run.Parallel context). Mind that this command does not allow creating new configuration fields in the document. Only existing fields can be updated.
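The "existing fields only" rule can be illustrated with a small dictionary-backed class. This ProjectConfig is a hypothetical sketch that uses plain dotted keys. It ignores the thread-safety check and is not the TOPSAIL implementation:

```python
class ProjectConfig:
    """Hypothetical sketch of a configuration holder with dotted keys."""

    def __init__(self, config):
        self.config = config

    def _walk(self, key):
        # "clusters.create.keep" -> (dict stored at clusters.create, "keep")
        *parents, leaf = key.split(".")
        node = self.config
        for part in parents:
            node = node[part]
        return node, leaf

    def get_config(self, key):
        node, leaf = self._walk(key)
        return node[leaf]

    def set_config(self, key, value):
        node, leaf = self._walk(key)
        if leaf not in node:
            # only existing fields can be updated
            raise KeyError(f"{key} does not exist in the configuration")
        node[leaf] = value


project = ProjectConfig({"clusters": {"create": {"type": "single", "keep": False}}})
project.set_config("clusters.create.keep", True)
print(project.get_config("clusters.create.keep"))  # True
```

Refusing to create new fields catches typos in configuration keys early, instead of silently storing a value that nothing reads.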

The projects.rhods.library.prepare_rhoai library module

This library helps with the deployment of RHOAI pre-builds on OpenShift.

  • install_servicemesh() installs the ServiceMesh Operator, if not already installed in the cluster (this is a dependency of RHOAI)

  • uninstall_servicemesh(mute=True) uninstalls the ServiceMesh Operator, if it is installed

  • is_rhoai_installed() tells if RHOAI is currently installed or not.

  • install(token_file=None, force=False) installs RHOAI, if it is not already installed (unless force is passed). Mind that the current deployment code only works with the pre-builds of RHOAI, which require a Brew token_file. If the token isn’t passed, it is assumed that the cluster already has access to Brew.

The projects.gpu_operator.library.prepare_gpu_operator library module

This library helps with the deployment of the GPU stack on OpenShift.

  • prepare_gpu_operator() deploys the NFD Operator and the GPU Operator, if they are not already installed.

  • wait_ready(...) waits for the GPU Operator stack to be deployed, and optionally enables additional GPU Operator features:

    • enable_time_sharing enables the time-sharing capability of the GPU Operator (configured via the command_args.yaml.j2 file).

    • extend_metrics=True, wait_metrics=True enables extra metrics to be captured by the GPU Operator DCGM component (the “well-known” metrics set). If wait_metrics is enabled, the automation will wait for the DCGM to start reporting these metrics.

    • wait_stack_deployed allows disabling the final wait, and only enabling the components above.

  • cleanup_gpu_operator() undeploys the GPU Operator and the NFD Operator, if they are deployed.

  • add_toleration(effect, key) adds a toleration to the GPU Operator DaemonSet Pods. This allows the GPU Operator Pods to be deployed on nodes with specific taints. Mind that this command overrides any toleration previously set.

The projects.local_ci.library.prepare_user_pods library module

This library helps with the execution of multi-user TOPSAIL tests.

Multi-user tests consist of Pods running inside the cluster, all executing a TOPSAIL command. Their initialization is synchronized with a barrier, then they wait for a configurable delay before starting their script. When they terminate, their file artifacts are collected via an S3 server, and stored locally for post-processing.
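The synchronization scheme can be pictured with threads standing in for the user Pods. This is only an illustration of the barrier-then-delay pattern, with hypothetical names. The in-cluster mechanism is different, since the Pods do not share a process:

```python
import threading
import time


def user_script(user_index, barrier, start_delay, artifacts):
    # every user waits here until all of them have finished their initialization
    barrier.wait()
    # configurable delay before starting the actual script
    time.sleep(start_delay)
    artifacts[user_index] = f"user {user_index} finished"


def run_users(user_count, start_delay=0.01):
    barrier = threading.Barrier(user_count)
    artifacts = {}  # stands in for the artifacts collected at the end
    threads = [
        threading.Thread(target=user_script, args=(index, barrier, start_delay, artifacts))
        for index in range(user_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return artifacts


artifacts = run_users(3)
print(sorted(artifacts))  # [0, 1, 2]
```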

  • prepare_base_image_container(namespace) builds a TOPSAIL image in a given namespace. The image must be consistent with the commit of TOPSAIL being tested, so the BuildConfig relies on the PR number to fetch the right commit. The apply_prefer_pr function provides the helper code to update the configuration with the number of the PR being tested.

  • apply_prefer_pr(pr_number=None) inspects the environment to detect the PR number. When running locally, export TOPSAIL_LOCAL_CI=true and PULL_NUMBER=... for this function to automatically detect the PR number. Mind that this function updates the configuration file, so it cannot run inside a parallel context.

  • delete_istags(namespace) cleans up the istags used by TOPSAIL User Pods.

  • rebuild_driver_image(namespace, pr_number) helps refresh the image when running locally.

def rebuild_driver_image(pr_number):
    namespace = config.project.get_config("base_image.namespace")
    prepare_user_pods.rebuild_driver_image(namespace, pr_number)
  • cluster_scale_up(user_count) scales up the cluster with the right number of nodes (when not running in a bare-metal cluster).

  • prepare_user_pods(user_count) prepares the cluster for running a multi-user scale test. It deploys the dependency tools (minio, redis), builds the image, prepares the ServiceAccount that TOPSAIL will use, prepares the secrets that TOPSAIL will have access to …

  • cleanup_cluster() cleans up the cluster by deleting the User Pod namespace.

The projects.matrix_benchmarking.library.visualize library module

This module helps with the post-processing of TOPSAIL results.

  • prepare_matbench() is called from the ContainerFile. It installs the pip dependencies of MatrixBenchmarking.

  • download_and_generate_visualizations(results_dirname) is called from the CIs, when replotting. It downloads the test results and runs the post-processing steps against them.

  • generate_from_dir(results_dirname, generate_lts=None) is the main entrypoint of this library. It accepts a directory as argument, and runs the post-processing steps against it. The expected configuration should be further documented …