Historically, advances in fields such as meteorology, healthcare, and engineering were achieved through large investments in on-premises computing infrastructure. Upfront capital expenditure and operational complexity have been the accepted norm for large-scale HPC research. These barriers to deploying HPC technologies restricted the pace at which smaller organizations could innovate.
In recent work, OMass Therapeutics, a biotechnology company identifying medicines against highly validated target ecosystems, used YellowDog on AWS to analyze and screen 337 million compounds in 7 hours, a task that would have taken two months on an on-premises HPC cluster. YellowDog, based in Bristol in the UK, ran the drug discovery application on an extremely large, multi-Region cluster in AWS under the AWS ‘pay-as-you-go’ pricing model. The YellowDog platform provided a central, unified interface to monitor and manage AWS Region selection, compute provisioning, and job allocation and execution. To prove out the solution, YellowDog scheduled 200,000 molecular docking analyses across two Regions and completed the workload in 65 minutes, enabling scientists to begin their analysis the same day and significantly accelerating the drug discovery process.
In this post, we’ll discuss the AWS and YellowDog services we deployed, and the mechanisms used to scale to 3.2 million vCPUs in 33 minutes using multiple EC2 instance types across multiple Regions. Once analysis started, the cluster sustained a utilization rate of 95%.
Overview of solution
YellowDog democratizes HPC with a comprehensive cloud workload management solution that is available to all customers, at any scale, anywhere in the world. It’s cloud native and schedules compute nodes based on application characteristics, rather than the constraints of a fixed on-premises HPC cluster. This allows you to manage clusters using the YellowDog scheduler or third-party schedulers from one control plane, so you have a single view of cost and department consumption across applications. It can also provision resources across multiple Regions, Availability Zones, instance types, and machine sizes.
The YellowDog platform runs on Amazon EC2, using Amazon Elastic Kubernetes Service (Amazon EKS). It uses Elastic Load Balancing to manage access to the cluster. Amazon Elastic Block Store (Amazon EBS) provides persistent storage for the foundational services (Kafka, Artemis, and ZooKeeper) that require it. AWS CloudTrail and Amazon CloudWatch are used for monitoring.
In this configuration, we also used Amazon EC2 Spot Fleets in “maintain” mode to acquire and maintain Spot Instances. These fleets are configured with multiple instance overrides (and subnets) and use the “capacity-optimized prioritized” allocation strategy, which provisions from the deepest Spot capacity pools while honoring the priority order of the instance overrides on a best-effort basis. Finally, Amazon Simple Storage Service (Amazon S3) is used for data transfer and access.
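To make the fleet configuration concrete, here is a minimal sketch of how such a Spot Fleet request could be assembled with the EC2 API (for example, via boto3’s `request_spot_fleet`). The role ARN, launch template ID, subnet IDs, instance types, and priorities are illustrative assumptions, not YellowDog’s actual configuration.

```python
import json


def build_spot_fleet_request(fleet_role_arn, launch_template_id,
                             target_vcpus, overrides):
    """Build an EC2 Spot Fleet request in "maintain" mode with the
    capacityOptimizedPrioritized allocation strategy, so reclaimed
    Spot Instances are automatically replaced to hold target capacity."""
    return {
        "SpotFleetRequestConfig": {
            "IamFleetRole": fleet_role_arn,
            "Type": "maintain",  # keep replacing reclaimed instances
            "AllocationStrategy": "capacityOptimizedPrioritized",
            "TargetCapacity": target_vcpus,
            "TargetCapacityUnitType": "vcpu",  # count capacity in vCPUs
            "LaunchTemplateConfigs": [{
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                # Lower Priority values are tried first under this strategy.
                "Overrides": overrides,
            }],
        }
    }


# Hypothetical overrides: two instance types spread across two subnets.
overrides = [
    {"InstanceType": "c5.24xlarge", "SubnetId": "subnet-aaaa1111", "Priority": 1.0},
    {"InstanceType": "c5.24xlarge", "SubnetId": "subnet-bbbb2222", "Priority": 1.0},
    {"InstanceType": "m5.24xlarge", "SubnetId": "subnet-aaaa1111", "Priority": 2.0},
]

request = build_spot_fleet_request(
    fleet_role_arn="arn:aws:iam::123456789012:role/spot-fleet-role",
    launch_template_id="lt-0abc123def456",
    target_vcpus=100_000,
    overrides=overrides,
)
print(json.dumps(request, indent=2))
# With AWS credentials in place, this would be submitted as:
#   boto3.client("ec2").request_spot_fleet(**request)
```

A fleet like this is what allows provisioning to continue uninterrupted when individual Spot capacity pools are reclaimed.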
Solution
To execute the run, YellowDog launched 46,733 Spot Instances, utilizing 24 Amazon EC2 Spot Fleets and 8 different instance types. In total, we provisioned 3.2 million vCPUs and achieved over 95% utilization for the duration of our run. YellowDog scheduled 200,000 molecular docking analyses across two geographically dispersed Regions in North America and Europe. The combination of Docker containers, AutoDock Vina, and Open Babel was used to orchestrate, analyze, and score hit compounds.
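For a single docking task, the tool chain amounts to two command-line steps: Open Babel converts a ligand into the PDBQT format that AutoDock Vina expects, then Vina docks it against the receptor and scores the poses. The sketch below builds those commands; the file names, search-box coordinates, and the helper function itself are illustrative assumptions, not OMass Therapeutics’ actual pipeline.

```python
import shlex


def docking_commands(ligand_sdf, receptor_pdbqt, center, box_size):
    """Return the two CLI commands for one docking task:
    1) obabel converts the ligand to PDBQT format,
    2) vina docks it against the receptor within the given search box."""
    ligand_pdbqt = ligand_sdf.rsplit(".", 1)[0] + ".pdbqt"
    docked_out = ligand_pdbqt.replace(".pdbqt", "_docked.pdbqt")
    cx, cy, cz = center
    sx, sy, sz = box_size
    convert = f"obabel {ligand_sdf} -O {ligand_pdbqt}"
    dock = (f"vina --receptor {receptor_pdbqt} --ligand {ligand_pdbqt} "
            f"--center_x {cx} --center_y {cy} --center_z {cz} "
            f"--size_x {sx} --size_y {sy} --size_z {sz} "
            f"--out {docked_out}")
    return [shlex.split(convert), shlex.split(dock)]


# Each worker would execute these with subprocess.run(cmd, check=True)
# inside a Docker container image that bundles both tools.
cmds = docking_commands("ZINC000123.sdf", "target.pdbqt",
                        center=(12.5, 8.0, -3.2), box_size=(20, 20, 20))
```

Packaging this per-compound step in a container is what makes the workload trivially parallel: the scheduler can fan identical tasks out across tens of thousands of Spot Instances.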
Our batch ran for 65 minutes and, notably, instance utilization kept pace with provisioning. Furthermore, when Spot Instances were reclaimed, they were immediately replaced, with no disruption to workload execution.
Read the full blog to learn more about how researchers at OMass Therapeutics analyzed 337 million compounds in 7 hours.
Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog.