The Test Orchestrations Layer
The test orchestration layer is the crux of TOPSAIL. It binds everything else together:
- the CI job launchers
- the configuration
- the toolbox commands
- the post-mortem visualizations and automated regression analyses.
Historically, this layer has first and foremost been triggered by CI jobs, running against clean clusters with kube-admin privileges. This is still the primary target of TOPSAIL test automation. A side effect is that TOPSAIL may not seem very user-friendly when used interactively from a terminal.
In this section, we cover the different aspects that TOPSAIL binds together.
The CI job launchers
TOPSAIL test orchestrations are focused on reproducibility and end-to-end testing. These two ideas are directly linked, and in the OpenShift world, the easiest way to ensure that the tests are reproducible and automated end-to-end is to start from scratch (or from a fresh and clean cluster).
Cluster creation
In OpenShift CI, TOPSAIL has the ability to create a dedicated cluster (or even two: one for RHOAI, one for simulating users). This mode is launched with the rhoai-e2e test. It is particularly useful when launching cloud scale tests. The cluster creation is handled by the deploy-cluster subproject.
This part of TOPSAIL is old and mostly written in Bash, but it has proved to be robust and reliable, although we have not been using it much since we got access to bare-metal clusters.
By default, these clusters are destroyed after the test.
A keep flag can be set in the configuration to avoid destroying the cluster, and to create a kube-admin user with a predefined password. (Ask in PM for how to access the cluster.)
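For illustration, such a flag might look like this in the configuration (the exact field path is an assumption, not necessarily TOPSAIL's actual layout):

clusters:
  create:
    keep: true  # assumed field path; keeps the cluster alive after the test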
Cluster from pool
In OpenShift CI, TOPSAIL has a pool of pre-deployed clusters. These clusters are controlled by the Hive tool, managed by the OpenShift CI team. In the current configuration, the pool contains two single-node OpenShift systems.
These clusters are always destroyed at the end of the run. This is outside of TOPSAIL's control.
Bare-metal clusters
In the Middleware Jenkins CI, TOPSAIL can be launched against two bare-metal clusters. These clusters have long-running OpenShift deployments, and they are “never” reinstalled (at least, there is no reinstall automation in place at the moment). Hence, the test orchestrations are in charge of cleaning up the cluster before the test (to ensure that no garbage is left behind) and after it (to leave the cluster clean for the following users). So the complete test sequence is:
cleanup
prepare
test
cleanup
This is the theory, at least. In practice, the clusters are dedicated to the team, and after mutual agreement, the cleanup and prepare steps may be skipped to save time, or the test and final cleanup steps, to have a cluster ready for development.
Before launching a test, check the state of the cluster. Is RHOAI installed? Is the DSC configured as you expect? If not, make sure you tick the cleanup and prepare steps.
Is someone else’s job already running on the same cluster? If so, your job will be queued and will start only after the first job completes. Make sure you tick the cleanup and prepare steps.
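A quick way to inspect the cluster state from a terminal might be (assuming kube-admin access and the default RHOAI operator namespace):

oc get dsc  # is a DataScienceCluster configured as expected?
oc get csv -n redhat-ods-operator  # is the RHOAI operator installed, and in which version?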
Launching TOPSAIL jobs on the CI engines
See this Google doc for all the details about launching TOPSAIL jobs on the CI engines:
TOPSAIL Configuration System
The configuration system is (yet another) key element of TOPSAIL. It has been designed to be flexible, modular, and (an important point to understand some of its implementation choices) configurable from OpenShift CI and other CI engines.
A bit of history
OpenShift CI is a great tool, but one of its strong limitations is that it can only be configured statically (from the openshift/release repository). TOPSAIL had to find a way to enable dynamic configuration, without touching the source code. To make a long story short (see a small slide deck illustrating it), TOPSAIL can be configured from GitHub. (See How to launch TOPSAIL tests for all the details.)
/test rhoai-light fine_tuning ibm_40gb_models
/var tests.fine_tuning.test_settings.gpu: [2, 4]
A bit of apology
Each TOPSAIL project’s configuration is a YAML document. On one side, each project is free to define its own configuration. But on the other side, some code is shared between different projects (the library files, defined in some of the projects).
This combination (the full flexibility plus the code reuse in the libraries) makes the configuration structure hard to track. A refactoring might be envisaged to have a more strongly defined configuration format, at least for the reusable libraries (e.g., the library could say: this configuration block does not follow my model, so I refuse to process it).
How it actually works
So, a TOPSAIL project’s configuration is a YAML document, and the test orchestration reads it to alter its behavior. It’s as simple as that. For example, given this configuration block:
tests:
  capture_prom: true
  capture_state: true
the orchestration code can adjust its behavior accordingly:

capture_prom = config.project.get_config("tests.capture_prom")
if not capture_prom:
    logging.info("tests.capture_prom is disabled, skipping Prometheus DB reset")
    return
Sometimes, the test orchestration doesn’t need to handle some configuration flags, but only pass them to the toolbox layer. TOPSAIL provides a helper toolbox command for that: from_config.
Example:
rhods:
  catalog:
    image: brew.registry.redhat.io/rh-osbs/iib
    tag: 804339
    channel: fast
    version: 2.13.0
    version_name: rc1
    opendatahub: false
    managed_rhoi: true
These configuration flags should be passed directly to the rhods deploy_ods toolbox command:
def deploy_ods(self, catalog_image, tag, channel="", version="",
               disable_dsc_config=False, opendatahub=False, managed_rhoai=True):
    """
    Deploy ODS operator from its custom catalog

    Args:
      catalog_image: Container image containing the RHODS bundle.
      tag: Catalog image tag to use to deploy RHODS.
      channel: The channel to use for the deployment. Leave empty to use the default channel.
      ...
    """
A naive way to launch the RHOAI deployment would be:
run.run_toolbox("rhods", "deploy_ods"
catalog_image=config.project.get_config("rhods.catalog.image"),
tag=config.project.get_config("rhods.catalog.tag"),
channel=config.project.get_config("rhods.catalog.channel"),
...)
Instead, the orchestration can use the command_args.yaml.j2 file:
rhods deploy_ods:
  catalog_image: {{ rhods.catalog.image }}
  tag: {{ rhods.catalog.tag }}
  channel: {{ rhods.catalog.channel }}
  ...
where the template is rendered from the configuration file. And this command will trigger it:
run.run_toolbox_from_config("rhods", "deploy_ods")
or this equivalent, from the command-line:
source ./projects/fine_tuning/testing/configure.sh
./run_toolbox.py from_config rhods deploy_ods
Configuring the configuration with presets
TOPSAIL's configuration can be updated through presets. This allows storing multiple test flavors side by side, and deciding at launch time which one to execute.
The presets, stored in the configuration in the ci_presets field, define how to update the main configuration blocks before running the test.
Here is an example, which will test multiple dataset replication factors:
dgx_single_model_multi_dataset:
  extends: [dgx_single_model]
  tests.fine_tuning.matbenchmarking.enabled: true
  tests.fine_tuning.test_settings.gpu: 1
  tests.fine_tuning.test_settings.dataset_replication: [1, 2, 4, 8]
We see that three fields are “simply” updated. The extends keyword means that first of all (because it is in the first position), we need to apply the dgx_single_model preset, and only afterwards modify the three fields.
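At launch time, a preset is selected by name. With the GitHub commands shown earlier, this would look like the following (assuming the preset is defined in the fine_tuning project):

/test rhoai-light fine_tuning dgx_single_model_multi_dataset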
The presets are applied with a simple recursive algorithm (which will crash inelegantly if there is a loop in the presets ^.^). If multiple presets are defined and they touch the same values, only the last change will be visible. The same goes for the extends keyword: it is applied at its position in the dictionary.
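For illustration, here is a minimal sketch of such a recursive application. This is not TOPSAIL's actual code: apply_preset and set_config are hypothetical helpers operating on plain dictionaries.

def apply_preset(config, presets, name):
    """Apply the preset `name` onto `config`, following `extends` recursively."""
    for key, value in presets[name].items():
        if key == "extends":
            # Applied at its position in the dictionary. A loop in the
            # `extends` chains would recurse forever and crash.
            for parent in value:
                apply_preset(config, presets, parent)
        else:
            set_config(config, key, value)

def set_config(config, dotted_key, value):
    """Set a dotted-path key, refusing to create fields that do not exist."""
    *path, last = dotted_key.split(".")
    for part in path:
        config = config[part]
    if last not in config:
        raise KeyError(f"presets cannot create new fields: {dotted_key}")
    config[last] = value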
Last important point: the presets cannot create new fields. This can be worked around by having placeholders in the main configuration. E.g.:
tests:
  fine_tuning:
    test_settings:
      hyper_parameters:
        per_device_train_batch_size: null
        gradient_accumulation_steps: null
And everything is YAML. So the preset values can be YAML dictionaries (or lists).
tests.fine_tuning.test_settings.hyper_parameters: {r: 4, lora_alpha: 16}
This would work even if no placeholder had been set for r and lora_alpha, because the whole hyper_parameters block is being assigned (and everything it contained before is erased).
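After applying this preset, the relevant block of the main configuration would thus contain:

tests:
  fine_tuning:
    test_settings:
      hyper_parameters:
        r: 4
        lora_alpha: 16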
Calling the toolbox commands
The “orchestration” layer orchestrates the toolbox commands. That is, it calls them in the right order, according to the configuration flags, and with the right parameters.
The Python code can call the toolbox directly, by passing all the necessary arguments:
has_dsc = run.run("oc get dsc -oname", capture_stdout=True).stdout
run.run_toolbox(
    "rhods", "update_datasciencecluster",
    enable=["kueue", "codeflare", "trainingoperator"],
    name=None if has_dsc else "default-dsc",
)
or from the configuration:
run.run_toolbox_from_config("rhods", "deploy_ods")
But it can also have a “mix” of both, via the extra arguments of the from_config call:
extra = dict(source=source, storage_dir=storage_dir, name=source_name)
run.run_toolbox_from_config("cluster", "download_to_pvc", extra=extra)
This way, cluster download_to_pvc will have parameters received from the configuration, and extra settings (which take precedence) prepared directly in Python.
The from_config command also accepts a prefix and/or a suffix. Indeed, one command might be called with different parameters in the same workflow.
A simple example is the cluster set_scale command, which is used, in cloud environments, to control the number of nodes dedicated to a given task.
sutest/cluster set_scale:
  name: {{ clusters.sutest.compute.machineset.name }}
  instance_type: {{ clusters.sutest.compute.machineset.type }}
  scale: SET_AT_RUNTIME

driver/cluster set_scale:
  instance_type: {{ clusters.driver.compute.machineset.type }}
  name: {{ clusters.driver.compute.machineset.name }}
  scale: SET_AT_RUNTIME
This will be called with the prefix parameter:
run.run_toolbox_from_config("cluster", "set_scale", prefix="sutest", extra=dict(scale=...))
run.run_toolbox_from_config("cluster", "set_scale", prefix="driver", extra=dict(scale=...))
and the same works for the suffix:
prefix/command sub-command/suffix: ...
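For instance, a hypothetical gpu suffix (the block and field names below are illustrative, not taken from TOPSAIL's actual configuration) could distinguish a second scale-up of the system-under-test cluster:

sutest/cluster set_scale/gpu:
  name: {{ clusters.sutest.compute.machineset.gpu_name }}
  instance_type: {{ clusters.sutest.compute.machineset.gpu_type }}
  scale: SET_AT_RUNTIME

which would be triggered with:

run.run_toolbox_from_config("cluster", "set_scale", prefix="sutest", suffix="gpu", extra=dict(scale=...))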
Creating dedicated directories
The artifacts are a critical element for TOPSAIL post-mortem processing and troubleshooting. But when the orchestration starts to involve multiple commands, it gets complicated to understand what is done at which step.
So TOPSAIL provides the env.NextArtifactDir context, which creates a dedicated directory (with a nnn__ prefix to enforce the correct ordering). Inside this directory, env.ARTIFACT_DIR will be set accordingly, so that the code can write its artifact files in a dedicated directory.
with env.NextArtifactDir("multi_model_test_sequentially"):
This is mostly used in the test part, to group the multiple commands related to a test together.
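As an illustration, grouping several per-model test commands might look like this (the loop and the helper name are hypothetical):

with env.NextArtifactDir("multi_model_test_sequentially"):
    # env.ARTIFACT_DIR now points to the new nnn__multi_model_test_sequentially directory
    for model in models:
        test_one_model(model)  # hypothetical helper; its artifacts land under this directory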
Running toolbox commands in parallel
When the orchestration preparation starts to involve multiple commands, running all of them sequentially may take forever.
So TOPSAIL provides the run.Parallel context and the parallel.delayed function to allow running multiple commands in parallel:
with run.Parallel("prepare_scale") as parallel:
    parallel.delayed(prepare_kserve.prepare)
    parallel.delayed(scale_up_sutest)
    parallel.delayed(prepare_user_pods.prepare_user_pods, user_count)
    parallel.delayed(prepare_user_pods.cluster_scale_up, user_count)
This will create a dedicated directory, and at the end of the block, it will execute the four functions in dedicated threads.
Mind that the configuration cannot be updated inside a parallel region (e.g., config.project.set_config("tests.scale.model.consolidated", True)).
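A safe pattern is therefore to perform any set_config call before entering the parallel region, and only read the configuration inside it. A minimal sketch, reusing the example above:

# Update the configuration while still in the sequential part of the code.
config.project.set_config("tests.scale.model.consolidated", True)

with run.Parallel("prepare_scale") as parallel:
    # Inside the parallel region, the configuration is only read.
    parallel.delayed(prepare_kserve.prepare)
    parallel.delayed(scale_up_sutest)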