Choosing between AWS Batch and AWS ParallelCluster for your HPC workloads

By Amazon Web Services

March 21, 2022

It’s an understatement to say that AWS has a lot of services (more than 200 at the time of this post!). We’re usually the first to point out that there’s more than one way to solve a problem. HPC is no different in this regard, because we offer a choice: customers can run their HPC workloads using AWS ParallelCluster or AWS Batch.

Which brings us to today’s question: how should you choose between them?

We think your choice will come down to three big factors:

  1. Your environment and workspace preferences.
  2. What your application(s) assume about the runtime environment.
  3. What you use to define a complex workflow.

Workspace preferences

The first thing to consider is how to take advantage of your prior experience to implement a workflow on AWS.

If you’re currently using (or managing) shared HPC resources to submit jobs to a scheduler, you’ll feel right at home with AWS ParallelCluster. ParallelCluster is an AWS-supported, open-source tool that makes it easy for you to deploy and manage HPC clusters on AWS.

To use ParallelCluster, you define what the cluster should look like, including: what sort of compute instances to use for jobs, limits on how many to spin up, and which other capabilities you need, such as shared storage or remote visualization. ParallelCluster takes that configuration and handles the undifferentiated AWS orchestration for you. It’ll set up the networking and firewall rules, and build and configure a head node with packages, applications, and a scheduler. It’ll also stand up shared storage if it doesn’t already exist, and make it available across the cluster. Finally, it includes the automation you need to scale the cluster to the size of the work queue, expanding and shrinking the number of compute nodes based on the jobs waiting in your queue.
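To make that concrete, here’s a minimal sketch of a ParallelCluster 3 cluster configuration. The subnet IDs, key-pair name, and instance choices are hypothetical placeholders you’d replace with your own values:

```yaml
# Minimal sketch of a ParallelCluster 3 configuration.
# Subnet IDs, key name, and instance types below are placeholders.
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5n18xl
          InstanceType: c5n.18xlarge
          MinCount: 0     # scale down to zero when the queue is empty
          MaxCount: 16    # upper limit on cluster size
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
SharedStorage:
  - MountDir: /shared
    Name: shared-ebs
    StorageType: Ebs
```

You’d hand a file like this to the `pcluster create-cluster` command, and ParallelCluster takes care of the rest.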

Alternatively, if your background is more developer or DevOps oriented, you should consider implementing your analysis pipelines with AWS Batch. Batch is a container-centric, fully-managed task execution service that’s always waiting for work. Batch provides job queues with sophisticated scheduling capabilities, and compute environments that define the size and shape of worker nodes. You define what a job looks like and, when you submit work, Batch takes care of orchestrating the underlying compute fleet and placing jobs on that fleet.
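For a feel of that developer experience, here’s a minimal sketch of submitting work using the AWS SDK for Python (boto3). The queue and job definition names are hypothetical and assumed to have been created beforehand:

```python
# A minimal sketch: submitting a job to an existing AWS Batch queue.
# "my-queue" and "my-job-def" are hypothetical names you'd create beforehand.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="sample-run-001",
    jobQueue="my-queue",          # controls ordering and placement priority
    jobDefinition="my-job-def",   # predefined job "shape" (image, CPU, memory)
    containerOverrides={
        "command": ["./analyze", "--input", "s3://my-bucket/input.dat"],
    },
)
print(response["jobId"])
```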

Batch is a native AWS service, with direct support for its resources in our SDKs and in AWS CloudFormation. If you already have processes in place for developing infrastructure-as-code on AWS, creating Batch environments and integrating them into your workflows follows the same process as integrating any other AWS service or feature.
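For example, the Batch resource types can be declared directly in a CloudFormation template. This is a hedged sketch with placeholder values and most optional properties trimmed:

```yaml
# Sketch of a CloudFormation fragment using the native Batch resource types.
# Subnet, security group, and role values are placeholders.
Resources:
  MyComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        Type: EC2
        MinvCpus: 0
        MaxvCpus: 256
        InstanceTypes: [c5, r5]
        Subnets: [subnet-0123456789abcdef0]
        SecurityGroupIds: [sg-0123456789abcdef0]
        InstanceRole: arn:aws:iam::111122223333:instance-profile/ecsInstanceRole
  MyJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      Priority: 1
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref MyComputeEnvironment
```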

In contrast, while ParallelCluster also uses CloudFormation behind the scenes, there aren’t any CloudFormation resources for “HeadNode”, “SlurmScheduler”, etc. This may or may not be an important distinction for you, but it is worth mentioning here.

Batch also integrates well with other AWS services, such as AWS Identity and Access Management (IAM) for authentication and authorization, Amazon EC2 Spot Instances for cost savings, Amazon CloudWatch for triggering and monitoring analysis events, and AWS Step Functions for building new workflows. Having native integration with other AWS services can simplify the development process when you integrate Batch with other parts of your stack.

Application assumptions

Most HPC applications were written before cloud computing existed (some were written before the internet was even a thing). As a result, most applications were written with some assumptions about their runtime environment. Specifically, they often assume that the underlying servers are homogeneous, static in number, and that the application can read and write data to shared POSIX storage that is available across the cluster. Tightly-coupled codes that communicate using the Message Passing Interface (MPI) across several nodes also assume that they have priority access to a high-bandwidth, low-latency network. Finally, some applications require shared access to acceleration hardware, like GPUs, across running processes.

If your application falls into this category, ParallelCluster will allow you to port it to AWS with few (or sometimes no) changes to your existing workflow. It’s a great option for quickly getting started running your existing HPC workloads on AWS and reaping the scalability and flexibility benefits of the cloud while maintaining an environment that is very close to what these applications expect.
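For instance, a Slurm batch script you use on-premises today will typically run unmodified on a ParallelCluster head node. Here’s a hypothetical example of the kind of tightly-coupled MPI job we mean (the solver binary and module setup are placeholders):

```bash
#!/bin/bash
# Hypothetical MPI job script -- identical to what you'd submit on an
# on-premises Slurm cluster.
#SBATCH --job-name=cfd-run
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=36
#SBATCH --output=cfd-run-%j.out

module load openmpi          # assumes an environment-modules setup
srun ./my_solver input.cfg   # my_solver is a placeholder application
```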

These assumptions don’t prevent you from using AWS Batch for HPC applications, but it’s a different environment from either running an application on a local workstation or submitting to a traditional HPC scheduler.

To use an HPC application with Batch, you’ll first need a containerized version of it. That’s a straightforward procedure when the application is pleasingly parallel and doesn’t require cross-node communication using MPI. Many codes fall into this category, especially in the bioinformatics space. The BioContainers community has already packaged many bioinformatics applications into containers and made them available to the general community. If you don’t find what you need in their registry, they also provide great documentation on best practices for creating containers.
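If you do need to roll your own, the container image for a pleasingly parallel tool can be as simple as this sketch (the tool name and base image choice are placeholders, not a recommendation):

```dockerfile
# Minimal sketch: packaging a hypothetical single-node CLI tool for Batch.
FROM public.ecr.aws/amazonlinux/amazonlinux:2
COPY ./bin/my_tool /usr/local/bin/my_tool   # my_tool is a placeholder binary
ENTRYPOINT ["/usr/local/bin/my_tool"]
```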

Once the application is containerized, you then need to define the Batch resources needed to run it. As with ParallelCluster, some of these resources apply to all jobs: a job queue, which defines job ordering and placement priority, and a compute environment (CE), which defines the type of instances to use (Intel, AMD, Arm, GPUs, CPU/memory ratios, etc.) and the minimum and maximum number of concurrent nodes that can run jobs.

At the level of a job submission, you’ll need a Batch job definition that specifies the job’s “shape” (its runtime CPU and memory requirements) for each type of job you submit. A Batch job definition is analogous to what you would submit to an HPC job scheduler, except that in Batch you need to predefine the job’s shape before you can request that any instance of that job is run. Job definitions can also define storage mount points for the container to access, both for local disk volumes and for a shared Amazon Elastic File System (Amazon EFS) file system.
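Here’s a hedged sketch of those pieces in boto3. All names, ARNs, subnet and file-system IDs below are hypothetical placeholders:

```python
# A minimal sketch of the one-time Batch resources: a compute environment,
# a job queue, and a job definition. All IDs and ARNs are placeholders.
import boto3

batch = boto3.client("batch")

# Compute environment: the size and shape of the worker fleet.
ce = batch.create_compute_environment(
    computeEnvironmentName="my-ce",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 1024,                  # cap on concurrent capacity
        "instanceTypes": ["c5", "r5"],     # instance families to draw from
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::111122223333:role/AWSBatchServiceRole",
)

# Job queue: ordering and placement priority across compute environments.
# (In practice you'd wait for the CE to reach VALID status before this call.)
batch.create_job_queue(
    jobQueueName="my-queue",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": ce["computeEnvironmentArn"]},
    ],
)

# Job definition: the job's "shape", plus an EFS mount for shared data.
batch.register_job_definition(
    jobDefinitionName="my-job-def",
    type="container",
    containerProperties={
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my_tool:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},   # MiB
        ],
        "volumes": [{
            "name": "shared",
            "efsVolumeConfiguration": {"fileSystemId": "fs-0123456789abcdef0"},
        }],
        "mountPoints": [{"sourceVolume": "shared", "containerPath": "/shared"}],
    },
)
```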

What you gain from this effort is the ability to scale your workload across multiple AWS Regions. For example, we recently worked with the Max Planck Institute for Biophysical Chemistry in Germany to port GROMACS (a molecular dynamics simulator) and pmx (a free-energy calculation package) so they could analyze over 20 thousand compounds in three days across multiple AWS Regions. This gave us a lot of scope for scaling up and out, and would have been hard to do any other way. You can read more about how we did that in a blog post.

Workflow-level concerns

It’s rare that an application is run by itself. Usually, a set of applications is run in a series of steps to form a complete workflow. This can be orchestrated with a basic shell or Python script, a Makefile, or a feature-rich workflow framework such as Apache Airflow, Metaflow, or Nextflow.

Basic scripting has the advantage of being easy to implement and run. The downside is that you tend to outgrow these scripts quickly as your workflow grows in complexity. For example, if you want to restart a workflow from a certain point, you’d need to encode the logic that determines where you left off into the script yourself. Workflow frameworks have this capability built in, and take care of restarting workflows from where they last left off. This feature alone makes it a lot easier to take advantage of Amazon EC2 Spot Instances to save money running the workflow.
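To see why this gets tedious, here’s a minimal sketch of the kind of “resume” bookkeeping a basic script has to hand-roll (the step scripts are hypothetical placeholders); workflow frameworks give you this, and much more, for free:

```python
# A minimal sketch of hand-rolled restart logic in a basic pipeline script.
# The step scripts (align.sh, etc.) are hypothetical placeholders.
from pathlib import Path
import subprocess

STEPS = [
    ("align", ["./align.sh"]),
    ("sort", ["./sort.sh"]),
    ("call", ["./call_variants.sh"]),
]

for name, cmd in STEPS:
    marker = Path(f".done-{name}")
    if marker.exists():             # step completed in a previous run: skip it
        continue
    subprocess.run(cmd, check=True)  # abort the whole run if a step fails
    marker.touch()                   # record completion for future restarts
```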


Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog.

 
