Creating a New Orchestration
You're working on a new perf & scale test project, and you want it automated and running in the CI? Good! Do you already have your test architecture in mind? Is your toolbox ready? Perfect, so we can start building the orchestration!
Prepare the environment
To create an orchestration, go to projects/PROJECT_NAME/testing and prepare the following boilerplate code. Mind that PROJECT_NAME should be a valid Python package name (no -) to keep things simple.
Prepare the test.py, config.yaml and command_args.yml.j2
These files are all that is required to have a configurable orchestration layer.
test.py should contain these entrypoints for interacting with the CI:
import sys
import subprocess
import logging

import fire

from projects.core.library import env, config, run, configure_logging, export
from projects.matrix_benchmarking.library import visualize

# `entrypoint` (the decorator that loads the configuration before running the
# command) and `common` come from the project's orchestration code.


@entrypoint()
def prepare_ci():
    """
    Prepares the cluster and the namespace for running the tests
    """
    pass


@entrypoint()
def test_ci():
    """
    Runs the test from the CI
    """
    pass


@entrypoint()
def cleanup_cluster(mute=False):
    """
    Restores the cluster to its original state
    """
    # _Not_ executed in OpenShift CI cluster (running on AWS). Only required for running in bare-metal environments.
    common.cleanup_cluster()


@entrypoint(ignore_secret_path=True, apply_preset_from_pr_args=False)
def generate_plots_from_pr_args():
    """
    Generates the visualization reports from the PR arguments
    """
    visualize.download_and_generate_visualizations()
    export.export_artifacts(env.ARTIFACT_DIR, test_step="plot")


class Entrypoint:
    """
    Commands for launching the CI tests
    """

    def __init__(self):
        self.prepare_ci = prepare_ci
        self.test_ci = test_ci
        self.cleanup_cluster_ci = cleanup_cluster
        self.export_artifacts = export_artifacts  # define a matching `export_artifacts` entrypoint alongside the ones above
        self.generate_plots_from_pr_args = generate_plots_from_pr_args


def main():
    # Print help rather than opening a pager
    fire.core.Display = lambda lines, out: print(*lines, file=out)

    fire.Fire(Entrypoint())


if __name__ == "__main__":
    try:
        sys.exit(main())
    except subprocess.CalledProcessError as e:
        logging.error(f"Command '{e.cmd}' failed --> {e.returncode}")
        sys.exit(1)
    except KeyboardInterrupt:
        print()  # empty line after ^C
        logging.error("Interrupted.")
        sys.exit(1)
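With Python Fire, each attribute of the Entrypoint class becomes a subcommand of test.py, so the CI (or you, locally) can run for instance:

python projects/PROJECT_NAME/testing/test.py prepare_ci
python projects/PROJECT_NAME/testing/test.py test_ci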
config.yaml should contain:
ci_presets:
  # name of the preset to apply, or null if no preset
  name: null
  # list of names of the presets to apply, or a single name, or null if no preset
  names: null

  single:
    clusters.create.type: single

  keep:
    clusters.create.keep: true
    clusters.create.ocp.tags.Project: PSAP/Project/...
    # clusters.create.ocp.tags.TicketId:

  light_cluster:
    clusters.create.ocp.deploy_cluster.target: cluster_light

  light:
    extends: [light_cluster]
    ...

  ...

secrets:
  dir:
    name: psap-ods-secret
    env_key: PSAP_ODS_SECRET_PATH
  # name of the file containing the properties of LDAP secrets
  s3_ldap_password_file: s3_ldap.passwords
  keep_cluster_password_file: get_cluster.password
  brew_registry_redhat_io_token_file: brew.registry.redhat.io.token
  opensearch_instances: opensearch.yaml
  aws_credentials: .awscred
  git_credentials: git-credentials

clusters:
  metal_profiles:
    ...: ...
  create:
    type: single # can be: single, ocp, managed
    keep: false
    name_prefix: fine-tuning-ci
    ocp:
      # list of tags to apply to the machineset when creating the cluster
      tags:
        # TicketId: "..."
        Project: PSAP/Project/...
      deploy_cluster:
        target: cluster
      base_domain: psap.aws.rhperfscale.org
      version: 4.15.9
      region: us-west-2
      control_plane:
        type: m6a.xlarge
      workers:
        type: m6a.2xlarge
        count: 2

  sutest:
    is_metal: false
    lab:
      name: null
    compute:
      dedicated: true
      machineset:
        name: workload-pods
        type: m6i.2xlarge
        count: null
        taint:
          key: only-workload-pods
          value: "yes"
          effect: NoSchedule

  driver:
    is_metal: false
    compute:
      dedicated: true
      machineset:
        name: test-pods
        count: null
        type: m6i.2xlarge
        taint:
          key: only-test-pods
          value: "yes"
          effect: NoSchedule

  cleanup_on_exit: false

matbench:
  preset: null
  workload: projects....visualizations...
  prom_workload: projects....visualizations....
  config_file: plots.yaml
  download:
    mode: prefer_cache
    url:
    url_file:
    # if true, copy the results downloaded by `matbench download` into the artifacts directory
    save_to_artifacts: false
  # directory to plot. Set by testing/common/visualize.py before launching the visualization
  test_directory: null
  lts:
    generate: true
    horreum:
      test_name: null
    opensearch:
      export:
        enabled: false
        enabled_on_replot: false
        fail_test_on_fail: true
      instance: smoke
      index: ...
      index_prefix: ""
      prom_index_suffix: -prom
    regression_analyses:
      enabled: false
      # if the regression analyses fail, mark the test as failed
      fail_test_on_regression: false

export_artifacts:
  enabled: false
  bucket: rhoai-cpt-artifacts
  path_prefix: cpt/fine-tuning
  dest: null # will be set by the export code
command_args.yml.j2 should start with:
{% set secrets_location = false | or_env(secrets.dir.env_key) %}
{% if not secrets_location %}
{{ ("ERROR: secrets_location must be defined (secrets.dir.name="+ secrets.dir.name|string +" or env(secrets.dir.env_key=" + secrets.dir.env_key|string + ")) ") | raise_exception }}
{% endif %}
{% set s3_ldap_password_location = secrets_location + "/" + secrets.s3_ldap_password_file %}
# ---
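Below this preamble, the file defines the arguments of each toolbox command, resolved from the configuration. As a purely illustrative sketch (the sutest/cluster set_scale block name and its fields are hypothetical; adapt them to your project's toolbox commands):

sutest/cluster set_scale:
  name: {{ clusters.sutest.compute.machineset.name }}
  instance_type: {{ clusters.sutest.compute.machineset.type }}
  scale: {{ clusters.sutest.compute.machineset.count }}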
Copy the clusters.sh and configure.sh
These files are necessary to create clusters from OpenShift CI (/test rhoai-e2e). They shouldn't be modified.
And now the boilerplate code is in place, and we can start building the test orchestration.
Create test_....py and prepare_....py
At this point, the actual development of the test orchestration starts, and you "just" have to fill in the gaps :)
In the prepare_ci method, prepare your cluster according to the configuration. In the test_ci method, run your test and collect its artifacts. In the cleanup_cluster_ci method, clean up your cluster so that it can be used again for another test.
Start building your test orchestration
Once the boilerplate code is in place, we can start building the test orchestration. TOPSAIL provides some "low level" helper modules:
from projects.core.library import env, config, run, configure_logging, export
as well as libraries of common orchestration bits:
from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod
from projects.gpu_operator.library import prepare_gpu_operator
from projects.matrix_benchmarking.library import visualize
These libraries are illustrated below. They are not formally described at the moment. They come from project code blocks that were noticed to be used identically across projects, so they have been moved to library directories to be easier to reuse.
Sharing code across projects increases the risk of unnoticed bugs when updating the library. With this in mind, the question of code sharing vs. code duplication takes another direction, as extensive testing is not easy in such a rapidly evolving project.
Core helper modules
The run module
This module provides helper functions to run system commands, toolbox commands, and from_config toolbox commands:
def run(command, capture_stdout=False, capture_stderr=False, check=True, protect_shell=True, cwd=None, stdin_file=None, log_command=True)
This method allows running a command, capturing (or not) its stdout/stderr, checking its return code, changing its working directory, protecting it with bash safety flags (set -o errexit; set -o pipefail; set -o nounset; set -o errtrace), passing a file as stdin, logging (or not) the command, etc.
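For example, to run a command and inspect its output (a minimal sketch, assuming the returned object exposes the captured output as .stdout, like a subprocess.CompletedProcess):

Example:

ret = run.run("oc get pods --no-headers", capture_stdout=True)
pod_count = len(ret.stdout.splitlines())  # count the Pods listed by the command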
def run_toolbox(group, command, artifact_dir_suffix=None, run_kwargs=None, mute_stdout=None, check=None, **kwargs)
This command allows running a toolbox command. group, command and kwargs are the CLI toolbox command arguments. run_kwargs allows passing arguments directly to the run command described above. mute_stdout allows muting (capturing) the stdout text. check allows disabling the exception raised on error. artifact_dir_suffix allows appending a suffix to the toolbox directory name (e.g., to distinguish two identical calls in the artifacts).
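Example:

run.run_toolbox("cluster", "capture_environment", mute_stdout=True)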
def run_toolbox_from_config(group, command, prefix=None, suffix=None, show_args=None, extra=None, artifact_dir_suffix=None, mute_stdout=False, check=True, run_kwargs=None)
This command allows running a toolbox command with the from_config helper (see the description of the command_args.yml.j2 file). prefix and suffix allow distinguishing commands in the command_args.yml.j2 file. extra allows passing extra arguments that override what is in the template file. show_args only displays the arguments that would be passed to run_toolbox.py.
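For example, assuming the command_args.yml.j2 file contains a sutest/cluster set_scale block like the hypothetical one sketched earlier:

Example:

run.run_toolbox_from_config("cluster", "set_scale", prefix="sutest", extra=dict(scale=4))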
run_and_catch is a helper function for chaining multiple functions without swallowing exceptions:
exc = None
exc = run.run_and_catch(
    exc,
    run.run_toolbox, "kserve", "capture_operators_state", run_kwargs=dict(capture_stdout=True),
)
exc = run.run_and_catch(
    exc,
    run.run_toolbox, "cluster", "capture_environment", run_kwargs=dict(capture_stdout=True),
)
if exc: raise exc
run.Parallel is a helper context to run functions in parallel. If exit_on_exception is set, the code will exit the process when an exception is caught; otherwise it will simply raise it. If dedicated_dir is set, a dedicated directory, based on the name parameter, will be created.

class Parallel(object):
    def __init__(self, name, exit_on_exception=True, dedicated_dir=True):
Example:
def prepare():
    with run.Parallel("prepare1") as parallel:
        parallel.delayed(prepare_rhoai)
        parallel.delayed(scale_up_sutest)

    test_settings = config.project.get_config("tests.fine_tuning.test_settings")

    with run.Parallel("prepare2") as parallel:
        parallel.delayed(prepare_gpu)
        parallel.delayed(prepare_namespace, test_settings)

    with run.Parallel("prepare3") as parallel:
        parallel.delayed(preload_image_yyy)
        parallel.delayed(preload_image_xxx)
        parallel.delayed(preload_image_zzz)
The env module

- env.ARTIFACT_DIR gives thread-safe access to the storage directory. Prefer using it over $ARTIFACT_DIR, which isn't thread-safe.
- env.NextArtifactDir is a helper context to create a dedicated artifact directory. Based on OpenShift CI, TOPSAIL relies on the ARTIFACT_DIR environment variable to store its artifacts. Each toolbox command creates a new directory named nnn__group__command, which keeps the directories ordered and easy to follow. However, when many commands are executed, sometimes in parallel, the number of directories increases and becomes hard to understand. This context allows creating subdirectories, to group things logically:
Example:
with env.NextArtifactDir("prepare_namespace"):
    set_namespace_annotations()
    download_data_sources(test_settings)
The config module

- config.project.get_config(<config key>) is the helper command to access the configuration. It uses the inline JSON format (e.g., tests.fine_tuning.test_settings). This object holds the main project configuration.
- config.project.set_config(<config key>, <value>) is the helper command to update the configuration. Sometimes it is convenient to store values in the configuration (e.g., coming from the command-line). Mind that this is not thread-safe (an error is raised if this command is called in a run.Parallel context). Mind also that this command does not allow creating new configuration fields in the document; only existing fields can be updated.
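Example:

test_settings = config.project.get_config("tests.fine_tuning.test_settings")
config.project.set_config("matbench.preset", "my-preset")  # the preset value is illustrative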
The projects.rhods.library.prepare_rhoai library module

This library helps with the deployment of RHOAI pre-builds on OpenShift.

- install_servicemesh() installs the ServiceMesh Operator, if not already installed in the cluster (this is a dependency of RHOAI).
- uninstall_servicemesh(mute=True) uninstalls the ServiceMesh Operator, if it is installed.
- is_rhoai_installed() tells if RHOAI is currently installed or not.
- install(token_file=None, force=False) installs RHOAI, if it is not already installed (unless force is passed). Mind that the current deployment code only works with the pre-builds of RHOAI, which require a Brew token_file. If the token isn't passed, it is assumed that the cluster already has access to Brew.
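For instance, a prepare step could deploy RHOAI with the Brew token. A minimal sketch, assuming the token file from the secrets configuration shown earlier:

import os
import pathlib

# locate the Brew token inside the secrets directory (names taken from config.yaml above)
token_file = pathlib.Path(os.environ["PSAP_ODS_SECRET_PATH"]) / "brew.registry.redhat.io.token"
prepare_rhoai_mod.install(token_file=token_file)  # no-op if RHOAI is already installed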
The projects.gpu_operator.library.prepare_gpu_operator library module

This library helps with the deployment of the GPU stack on OpenShift.

- prepare_gpu_operator() deploys the NFD Operator and the GPU Operator, if they are not already installed.
- wait_ready(...) waits for the GPU Operator stack to be deployed, and optionally enables additional GPU Operator features:
  - enable_time_sharing enables the time-sharing capability of the GPU Operator (configured via the command_args.yml.j2 file).
  - extend_metrics=True, wait_metrics=True enable extra metrics to be captured by the GPU Operator DCGM component (the "well-known" metrics set). If wait_metrics is enabled, the automation will wait for the DCGM to start reporting these metrics.
  - wait_stack_deployed allows disabling the final wait, and only enabling the components above.
- cleanup_gpu_operator() undeploys the GPU Operator and the NFD Operator, if they are deployed.
- add_toleration(effect, key) adds a toleration to the GPU Operator DaemonSet Pods. This allows the GPU Operator Pods to be deployed on nodes with specific taints. Mind that this command overrides any toleration previously set.
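Putting it together, a GPU preparation step could look like this sketch (the toleration key matches the sutest taint configured in config.yaml above):

def prepare_gpu():
    prepare_gpu_operator.prepare_gpu_operator()
    # let the GPU Operator Pods run on the tainted workload nodes
    prepare_gpu_operator.add_toleration(effect="NoSchedule", key="only-workload-pods")
    prepare_gpu_operator.wait_ready(enable_time_sharing=False, extend_metrics=True, wait_metrics=True)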
The projects.local_ci.library.prepare_user_pods library module

This library helps with the execution of multi-user TOPSAIL tests.

Multi-user tests consist of Pods running inside the cluster, all executing a TOPSAIL command. Their initialization is synchronized with a barrier, then they wait a configurable delay before starting their script. When they terminate, their file artifacts are collected via an S3 server, and stored locally for post-processing.

- prepare_base_image_container(namespace) builds a TOPSAIL image in a given namespace. The image must be consistent with the commit of TOPSAIL being tested, so the BuildConfig relies on the PR number to fetch the right commit. The apply_prefer_pr function provides the helper code to update the configuration with the number of the PR being tested.
- apply_prefer_pr(pr_number=None) inspects the environment to detect the PR number. When running locally, export HOMELAB_CI=true and PULL_NUMBER=... for this function to automatically detect the PR number. Mind that this function updates the configuration file, so it cannot run inside a parallel context.
- delete_istags(namespace) cleans up the istags used by TOPSAIL User Pods.
- rebuild_driver_image(namespace, pr_number) helps refreshing the image when running locally:
@entrypoint()
def rebuild_driver_image(pr_number):
    namespace = config.project.get_config("base_image.namespace")
    prepare_user_pods.rebuild_driver_image(namespace, pr_number)
- cluster_scale_up(user_count) scales up the cluster with the right number of nodes (when not running in a bare-metal cluster).
- prepare_user_pods(user_count) prepares the cluster for running a multi-user scale test. It deploys the dependency tools (minio, redis), builds the image, prepares the ServiceAccount that TOPSAIL will use, prepares the secrets that TOPSAIL will have access to, etc.
- cleanup_cluster() cleans up the cluster by deleting the User Pod namespace.
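A multi-user preparation step could chain these helpers as in this sketch (the tests.scale.user_count configuration key is hypothetical):

user_count = config.project.get_config("tests.scale.user_count")  # hypothetical key
prepare_user_pods.apply_prefer_pr()  # must not run inside a run.Parallel context
prepare_user_pods.cluster_scale_up(user_count)
prepare_user_pods.prepare_user_pods(user_count)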
The projects.matrix_benchmarking.library.visualize library module

This module helps with the post-processing of TOPSAIL results.

- prepare_matbench() is called from the ContainerFile. It installs the pip dependencies of MatrixBenchmarking.
- download_and_generate_visualizations(results_dirname) is called from the CIs, when replotting. It downloads the test results and runs the post-processing steps against them.
- generate_from_dir(results_dirname, generate_lts=None) is the main entrypoint of this library. It accepts a directory as argument, and runs the post-processing steps against it. The expected configuration should be further documented ...
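For instance, to post-process a directory of results already available locally (the path is illustrative):

visualize.generate_from_dir("/tmp/topsail-results/test_001")  # runs the post-processing steps against this directory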