Choosing between AWS Batch and AWS ParallelCluster for your HPC workloads

By Amazon Web Services

March 21, 2022

It’s an understatement to say that AWS has a lot of services (more than 200 at the time of this post!). We’re usually the first to point out that there’s more than one way to solve a problem. HPC is no different in this regard, because we offer a choice: customers can run their HPC workloads using AWS ParallelCluster or AWS Batch.

Which brings us to today’s question: how should you choose between them?

We think your choice will come down to three big factors:

  1. Your environment and workspace preferences.
  2. What your application(s) assume about the runtime environment.
  3. What you use to define a complex workflow.

Workspace preferences

The first thing to consider is how to take advantage of your prior experience to implement a workflow on AWS.

If you’re currently using (or managing) shared HPC resources to submit jobs to a scheduler, you’ll feel right at home with AWS ParallelCluster. ParallelCluster is an AWS-supported, open-source tool that makes it easy for you to deploy and manage HPC clusters on AWS.

To use ParallelCluster, you define what the cluster should look like, including: what sort of compute instances to use for jobs, limits on how many to spin up, and which other capabilities you need, such as shared storage or remote visualization. ParallelCluster takes that configuration and handles the undifferentiated AWS orchestration for you. It’ll set up the networking and firewall rules, and build and configure a head node with packages, applications, and a scheduler. It’ll also stand up shared storage if it doesn’t already exist, and make it available across the cluster. Finally, it includes the automation you need to scale the cluster to the size of the work queue, expanding and shrinking the number of compute nodes based on the jobs waiting in your queue.
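To make that concrete, here’s a minimal sketch of a ParallelCluster 3 cluster configuration. The subnet IDs, key-pair name, and instance choices are hypothetical placeholders you’d replace with your own values:

```yaml
# Minimal sketch of a ParallelCluster 3 configuration.
# Subnet IDs, key name, and instance types below are placeholders.
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5n18xl
          InstanceType: c5n.18xlarge
          MinCount: 0     # scale down to zero when the queue is empty
          MaxCount: 16    # upper limit on cluster size
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
SharedStorage:
  - MountDir: /shared
    Name: shared-ebs
    StorageType: Ebs
```

You’d hand a file like this to the `pcluster create-cluster` command, and ParallelCluster takes care of the rest.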

Alternatively, if your background is more developer or DevOps oriented, you should consider implementing your analysis pipelines with AWS Batch. Batch is a container-centric, fully-managed task execution service that’s always waiting for work. Batch provides job queues with sophisticated scheduling capabilities, and compute environments that define the size and shape of worker nodes. You define what a job looks like and, when you submit work, Batch takes care of orchestrating the underlying compute fleet and placing jobs on that fleet.
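For a feel of that developer experience, here’s a minimal sketch of submitting work using the AWS SDK for Python (boto3). The queue and job definition names are hypothetical and assumed to have been created beforehand:

```python
# A minimal sketch: submitting a job to an existing AWS Batch queue.
# "my-queue" and "my-job-def" are hypothetical names you'd create beforehand.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="sample-run-001",
    jobQueue="my-queue",          # controls ordering and placement priority
    jobDefinition="my-job-def",   # predefined job "shape" (image, CPU, memory)
    containerOverrides={
        "command": ["./analyze", "--input", "s3://my-bucket/input.dat"],
    },
)
print(response["jobId"])
```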

Batch is a native AWS service, with direct support for its resources in our SDKs and in AWS CloudFormation. If you already have processes in place for developing infrastructure-as-code on AWS, creating Batch environments and integrating them into your workflows follows the same process as integrating any other AWS service or feature.
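For example, the Batch resource types can be declared directly in a CloudFormation template. This is a hedged sketch with placeholder values and most optional properties trimmed:

```yaml
# Sketch of a CloudFormation fragment using the native Batch resource types.
# Subnet, security group, and role values are placeholders.
Resources:
  MyComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        Type: EC2
        MinvCpus: 0
        MaxvCpus: 256
        InstanceTypes: [c5, r5]
        Subnets: [subnet-0123456789abcdef0]
        SecurityGroupIds: [sg-0123456789abcdef0]
        InstanceRole: arn:aws:iam::111122223333:instance-profile/ecsInstanceRole
  MyJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      Priority: 1
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref MyComputeEnvironment
```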

In contrast, while ParallelCluster also uses CloudFormation behind the scenes, there aren’t any CloudFormation resources for “HeadNode”, “SlurmScheduler”, etc. This may or may not be an important distinction for you, but it is worth mentioning here.

Batch also integrates well with other AWS services, such as AWS Identity and Access Management (IAM) for authentication and authorization, Amazon EC2 Spot Instances for cost savings, Amazon CloudWatch for triggering and monitoring analysis events, and AWS Step Functions for building new workflows. Having native integration with other AWS services can simplify the development process when you integrate Batch with other parts of your stack.

Application assumptions

Most HPC applications were written before cloud computing existed (some were written before the internet was even a thing). As a result, most applications were written with some assumptions about their runtime environment. Specifically, they often assume that the underlying servers are homogeneous, static in number, and that the application can read and write data to shared POSIX storage that is available across the cluster. Tightly-coupled codes that communicate using the Message Passing Interface (MPI) across several nodes also assume that they have priority access to a high-bandwidth, low-latency network. Finally, some applications require shared access to acceleration hardware, like GPUs, across running processes.

If your application falls into this category, ParallelCluster will allow you to port it to AWS with few (or sometimes no) changes to your existing workflow. It’s a great option for quickly getting started running your existing HPC workloads on AWS and reaping the scalability and flexibility benefits of the cloud while maintaining an environment that is very close to what these applications expect.
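For instance, a Slurm batch script you use on-premises today will typically run unmodified on a ParallelCluster head node. Here’s a hypothetical example of the kind of tightly-coupled MPI job we mean (the solver binary and module setup are placeholders):

```bash
#!/bin/bash
# Hypothetical MPI job script -- identical to what you'd submit on an
# on-premises Slurm cluster.
#SBATCH --job-name=cfd-run
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=36
#SBATCH --output=cfd-run-%j.out

module load openmpi          # assumes an environment-modules setup
srun ./my_solver input.cfg   # my_solver is a placeholder application
```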

These assumptions don’t prevent you from using AWS Batch for HPC applications, but it’s a different environment from either running an application on a local workstation or submitting to a traditional HPC scheduler.

To use an HPC application with Batch, you’ll first need a containerized version of it. That’s a straightforward procedure when the application is pleasingly parallel and doesn’t require cross-node communication using MPI. Many codes fall into this category, especially in the bioinformatics space. The BioContainers community has already packaged many bioinformatics applications into containers and made them available to the general community. If you don’t find what you need in their registry, they also provide great documentation on best practices for creating containers.
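If you do need to roll your own, the container image for a pleasingly parallel tool can be as simple as this sketch (the tool name and base image choice are placeholders, not a recommendation):

```dockerfile
# Minimal sketch: packaging a hypothetical single-node CLI tool for Batch.
FROM public.ecr.aws/amazonlinux/amazonlinux:2
COPY ./bin/my_tool /usr/local/bin/my_tool   # my_tool is a placeholder binary
ENTRYPOINT ["/usr/local/bin/my_tool"]
```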

Once the application is containerized, you then need to define the Batch resources needed to run it. As with ParallelCluster, some of these resources apply to all jobs: a job queue, which defines job ordering and placement priority, and a compute environment (CE), which defines the type of instances to use (Intel, AMD, Arm, GPUs, CPU/memory ratios, etc.) and the minimum and maximum number of concurrent nodes that can run jobs.

At the level of a job submission, you’ll need a Batch job definition that specifies the job’s “shape” (its runtime CPU and memory requirements) for each type of job you submit. A Batch job definition is analogous to what you would submit to an HPC job scheduler, except that in Batch you need to predefine the job’s shape before you can request that any instance of that job is run. Job definitions can also define storage mount points for the container to access, both for local disk volumes and for a shared Amazon Elastic File System (Amazon EFS) file system.
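Here’s a hedged sketch of those pieces in boto3. All names, ARNs, subnet and file-system IDs below are hypothetical placeholders:

```python
# A minimal sketch of the one-time Batch resources: a compute environment,
# a job queue, and a job definition. All IDs and ARNs are placeholders.
import boto3

batch = boto3.client("batch")

# Compute environment: the size and shape of the worker fleet.
ce = batch.create_compute_environment(
    computeEnvironmentName="my-ce",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 1024,                  # cap on concurrent capacity
        "instanceTypes": ["c5", "r5"],     # instance families to draw from
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::111122223333:role/AWSBatchServiceRole",
)

# Job queue: ordering and placement priority across compute environments.
# (In practice you'd wait for the CE to reach VALID status before this call.)
batch.create_job_queue(
    jobQueueName="my-queue",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": ce["computeEnvironmentArn"]},
    ],
)

# Job definition: the job's "shape", plus an EFS mount for shared data.
batch.register_job_definition(
    jobDefinitionName="my-job-def",
    type="container",
    containerProperties={
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my_tool:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},   # MiB
        ],
        "volumes": [{
            "name": "shared",
            "efsVolumeConfiguration": {"fileSystemId": "fs-0123456789abcdef0"},
        }],
        "mountPoints": [{"sourceVolume": "shared", "containerPath": "/shared"}],
    },
)
```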

What you gain from this effort is the ability to scale your workload across multiple AWS Regions. For example, we recently worked with the Max Planck Institute for Biophysical Chemistry in Germany to port GROMACS (a molecular dynamics simulator) and pmx (a free-energy calculation package) so they could analyze over 20 thousand compounds in three days across multiple AWS Regions. This gave us a lot of scope for scaling up and out, and would have been hard to do any other way. You can read more about how we did that in a blog post.

Workflow-level concerns

It’s rare that an application is run by itself. Usually, a set of applications is run in a series of steps to form a complete workflow. This can be orchestrated with a basic shell or Python script, a Makefile, or a feature-rich workflow framework such as Apache Airflow, Metaflow, or Nextflow.

Basic scripting has the advantage of being easy to implement and run. The downside is that you tend to outgrow these scripts quickly as your workflow grows in complexity. For example, if you want to restart a workflow from a certain point, you’d need to encode the logic that determines where you left off into the script yourself. Workflow frameworks have this capability built in, and take care of restarting workflows from where they last left off. This feature alone makes it a lot easier to take advantage of Amazon EC2 Spot Instances to save money running the workflow.
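To see why this gets tedious, here’s a minimal sketch of the kind of “resume” bookkeeping a basic script has to hand-roll (the step scripts are hypothetical placeholders); workflow frameworks give you this, and much more, for free:

```python
# A minimal sketch of hand-rolled restart logic in a basic pipeline script.
# The step scripts (align.sh, etc.) are hypothetical placeholders.
from pathlib import Path
import subprocess

STEPS = [
    ("align", ["./align.sh"]),
    ("sort", ["./sort.sh"]),
    ("call", ["./call_variants.sh"]),
]

for name, cmd in STEPS:
    marker = Path(f".done-{name}")
    if marker.exists():             # step completed in a previous run: skip it
        continue
    subprocess.run(cmd, check=True)  # abort the whole run if a step fails
    marker.touch()                   # record completion for future restarts
```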


Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog.

 
