This post was contributed by Ara Ghukasyan, Research Software Engineer, Santosh Kumar Radha, Head of R&D and Product, and William Cunningham, Head of HPC, at Agnostiq, with Perminder Singh, Worldwide Partner Solution Architect, and Tejas Rakshe, Sr. Global Alliance Manager, at AWS.
Complicated multi-step workflows can be challenging to deploy, especially when they use a variety of high-compute resources. Consider machine learning workflows, which often involve computationally intensive preprocessing, training, and characterization steps that could require specialized hardware. Experiments of this nature fall into the broad category of heterogeneous computing. For instance, it’s common in machine learning to use GPUs for training neural networks, while also using CPUs to handle lighter tasks. Moreover, if the model also has non-learnable hyperparameters, then multiple repetitions of the experiment will be necessary to fine-tune their values. This leads to massive workloads that require cloud resources in practical applications.
To address this common challenge, many practitioners opt to explore the hyperparameter space through concurrent runs, deploying parallel instances to handle each combination of parameters. Beyond this, however, coordinating an efficient and reproducible experiment can require time and cloud expertise that many users do not have.
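As a rough illustration of concurrent hyperparameter exploration, here is a minimal sketch that uses only the Python standard library. The `run_experiment` function, its mock score, and the grid values are hypothetical placeholders rather than part of any real workflow, and a local thread pool stands in for the parallel cloud instances a practitioner would actually deploy.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(learning_rate, batch_size):
    """Hypothetical stand-in for one train-and-evaluate run; returns a mock score."""
    penalty = abs(learning_rate - 0.01) * 10 + abs(batch_size - 64) / 1000
    return round(1.0 - penalty, 4)


# Hypothetical hyperparameter grid: every combination becomes one independent run.
grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
combos = list(product(grid["learning_rate"], grid["batch_size"]))

# Each combination runs concurrently; in practice each worker would be
# a separate compute instance rather than a local thread.
with ThreadPoolExecutor(max_workers=len(combos)) as pool:
    scores = list(pool.map(lambda combo: run_experiment(*combo), combos))

# Pick the combination with the highest mock score.
best_params, best_score = max(zip(combos, scores), key=lambda pair: pair[1])
print(f"best parameters: {best_params}, score: {best_score}")
```

Even in this toy form, the number of runs grows multiplicatively with each added hyperparameter, which is why the coordination burden described here escalates so quickly.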
Covalent is an open-source orchestration tool that streamlines the deployment of distributed workloads on AWS resources. It provides powerful abstractions that elevate the user-resource interaction, especially in the context of high-compute experiments. Covalent eliminates overhead by parsing workflows to orchestrate their sub-tasks using a serverless HPC architecture. This means that users can easily expand their computing capacity by recruiting cloud resources whenever demand spikes—a concept known as “cloud bursting”.
Covalent is suitable for a wide range of users, including machine learning practitioners, data scientists, and anyone interested in a Pythonic tool for running heterogeneous computations from their local environment.
In what follows, we use a sample problem to outline key concepts in Covalent and develop a machine learning workflow for AWS Batch in just a handful of steps. To conclude, we showcase some material benefits of using Covalent, as reflected in saved compute-hours and shorter wall times.