This post was contributed by Cristian Măgherușan-Stanciu, Sr. Specialist Solution Architect, EC2 Spot, with contributions from Cristian Kniep, Sr. Developer Advocate for HPC and AWS Batch at AWS, Carlos Manzanedo Rueda, Principal Solutions Architect, EC2 Spot at AWS, Ludvig Nordstrom, Principal Solutions Architect at AWS, Vytautas Gapsys, project group leader at the Max Planck Institute for Biophysical Chemistry, and Carsten Kutzner, staff scientist at the Max Planck Institute for Biophysical Chemistry.
This post is part of a series covering how we have been working with a team of researchers at the Max Planck Institute for Biophysical Chemistry, helping them leverage the cloud for drug research applications in the pharmaceutical industry.
In this post, we’ll focus on how the team at Max Planck obtained thousands of EC2 Spot Instances spread across multiple AWS Regions for running their compute intensive simulations in a cost-effective manner, and how their solution will be enhanced further using the new Spot Placement Score API.
Computer Aided Drug Design in the cloud
The drug research and development process usually starts with a very large number of potentially promising compounds. From this virtually infinite chemical space, the researcher’s goal is to identify potent molecules that might be life-saving. These compounds are then gradually filtered through a multi-stage selection process until eventually a small subset of them is synthesized and thoroughly tested before further approval for use.
After identifying a potential drug candidate (the “lead compound”), the aim is to further optimize this lead into an actual active molecule. Computational methods based on molecular dynamics simulations help here by efficiently reducing the search space to only a few hundred candidates. These can then be processed and tested in later stages, which are increasingly laborious – and expensive.
Computer aided drug design (CADD) is increasingly used in the early drug discovery stage, and thanks to advancements in technology, highly accurate and computationally-intensive methods can be used to select the best possible candidates. This includes a class of methods using molecular dynamics where we simulate the protein-ligand interaction at the atomic level.
These early drug discovery simulations are usually performed on premises, using large supercomputers shared by multiple research and development institutions. Building that kind of infrastructure takes years, and once built it is expensive to maintain, has limited capacity, and serves many other users, which means it can take a long time to get results.
AWS can offer massive capacity that is provisioned, and charged for, only for the duration of a simulation. Besides the lower costs and reduced time to provision capacity, it also offers increased flexibility through multiple instance types, different families, and purchasing options. This flexibility means researchers can experiment with many of the available options to find the best fit for each application, achieving the best possible trade-off between time to results and cost for each simulation.
Running GROMACS at scale on EC2 Spot Instances
EC2 Spot Instances enable AWS customers to request unused EC2 capacity at steep discounts – up to 90% compared to On-Demand prices. They’re a great fit for many stateless, fault-tolerant, and/or flexible workloads, and are especially suited to loosely coupled, computationally-intensive applications running over hundreds or thousands of instances. At that scale, Spot savings can add up to significant amounts of money and make a workload financially feasible.
Spot uses capacity pools, which are sets of unused EC2 instances with the same instance type and operating system running within an Availability Zone. When EC2 needs this capacity for another customer, instances are claimed back, with (at least) a two-minute warning.
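The two-minute warning is delivered through the EC2 instance metadata service: the path `/latest/meta-data/spot/instance-action` returns HTTP 404 until an interruption is scheduled, then starts returning a small JSON document with the action and time. As a minimal sketch, here is a helper that parses that document; the function name is ours, and a real worker would poll the endpoint every few seconds and checkpoint when it gets a 200 response.

```python
import json
from datetime import datetime

# Sketch: parse the JSON body returned by the EC2 instance metadata path
# /latest/meta-data/spot/instance-action once an interruption is scheduled.
# The body looks like: {"action": "terminate", "time": "2023-01-01T12:00:00Z"}
def parse_instance_action(body: str) -> datetime:
    """Return the scheduled interruption time as a naive UTC datetime."""
    notice = json.loads(body)
    # "time" is an ISO-8601 timestamp with a trailing "Z" (UTC).
    return datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
```

On a 200 response, the worker has until the reported time to write a checkpoint and drain gracefully.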
To be successful with Spot, it helps to be flexible – especially when it comes to your preferred instance types. Diversification across multiple Spot capacity pools means EC2 can provision new instances from other capacity pools in the event of Spot interruptions in a specific pool. Your workload can then resume on the new instances and continue on, often without any visible impact.
For most workloads, Spot diversification is achieved by using multiple instance types and tapping into all the Availability Zones within a Region; the more Availability Zones and instance types, the better the chance to get the desired Spot capacity, and the lower the frequency of Spot interruptions.
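In practice, this diversification is expressed as a list of instance-type/subnet overrides in an EC2 Fleet request. The sketch below builds such a request for boto3's `create_fleet`; the launch template name and subnet IDs are placeholders, and the `capacity-optimized` allocation strategy shown is one of several valid choices (it steers the request toward the deepest pools to lower interruption rates).

```python
# Sketch: build a diversified Spot request for boto3's create_fleet.
# Template name and subnet IDs are placeholders; the "Overrides" list is
# what spreads the request across instance types and Availability Zones.
def build_fleet_request(template_name, instance_types, subnet_ids, capacity):
    overrides = [
        {"InstanceType": itype, "SubnetId": subnet}
        for itype in instance_types
        for subnet in subnet_ids  # one subnet per Availability Zone
    ]
    return {
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": capacity,
            "DefaultTargetCapacityType": "spot",
        },
        # Prefer the deepest capacity pools to reduce interruptions.
        "SpotOptions": {"AllocationStrategy": "capacity-optimized"},
        "Type": "instant",
    }

# Usage with boto3 (assumes credentials are configured):
#   import boto3
#   request = build_fleet_request("gromacs-lt",
#       ["c5.4xlarge", "c5.9xlarge", "g4dn.xlarge", "g4dn.2xlarge"],
#       ["subnet-aaa", "subnet-bbb", "subnet-ccc"], 1000)
#   boto3.client("ec2").create_fleet(**request)
```

With 4 instance types across 3 Availability Zones, the request taps 12 distinct Spot capacity pools instead of one.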
The Max Planck research team was interested in using EC2 Spot to provision thousands of instances for running their computationally-intensive simulations. Their GROMACS workload has a few characteristics that make it a great fit for Spot:
- It’s loosely coupled and instance type flexible – it runs well on CPUs and GPUs.
- It’s Region flexible – there’s relatively little input data and output data that need to be moved from one place to the next.
- The acceptable time to get the end results is flexible – it can be measured in hours, days or even more than a week depending on the simulation. Time-flexible workloads like this often present trade-offs between cost and time-to-results.
- It can implement checkpointing – a job can resume quickly in the face of a Spot interruption. For compute-heavy workloads like molecular dynamics, a task might take hours or days to compute.
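GROMACS supports this checkpointing natively: `mdrun` writes a checkpoint file at a configurable interval (`-cpt`, in minutes) and resumes from one with `-cpi`. As a minimal sketch (the helper name and 15-minute interval are our choices, not the team's exact settings), a job wrapper can resume automatically after a Spot interruption:

```python
import os

# Sketch: build a GROMACS mdrun command that resumes from a checkpoint
# file if an interrupted run left one behind. GROMACS writes checkpoints
# periodically (-cpt, minutes) and resumes from one with -cpi.
def build_mdrun_cmd(tpr_file: str, cpt_file: str = "state.cpt") -> list:
    cmd = ["gmx", "mdrun", "-s", tpr_file, "-cpt", "15"]
    if os.path.exists(cpt_file):
        # Resume from the last checkpoint instead of starting over.
        cmd += ["-cpi", cpt_file]
    return cmd

# A wrapper script would run this with subprocess after every (re)launch,
# so a replacement Spot instance picks up where the interrupted one left off.
```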
In our previous blogs from this series, the Max Planck research team showed benchmark results across multiple instance types and found that the most cost-effective instance types for them are g4dn.xlarge, g4dn.2xlarge, and g4dn.4xlarge. They also knew they could use a larger set of instance types, with varying cost-efficiency. Their results are summarized in Table 1.
Considering the workload’s regional flexibility and its large capacity needs, we helped the team run it across multiple AWS Regions in parallel using a tool called ‘HyperBatch’. This solution, designed by an AWS Solutions Architect, runs AWS Batch across multiple AWS Regions to secure the required capacity by leveraging a large number of capacity pools.
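HyperBatch's internals aren't shown here, but the core idea can be sketched: submit the same array job to a job queue in every target Region. In the sketch below, the queue and job definition names are placeholders that would have to exist in each Region beforehand, and the helper names are ours.

```python
# Sketch of the core idea behind a multi-Region AWS Batch setup: submit
# the same array job to a job queue in each Region. Queue and job
# definition names are placeholders; HyperBatch itself does much more.
def build_submit_kwargs(region: str, queue: str, job_def: str,
                        array_size: int) -> dict:
    return {
        "jobName": f"gromacs-{region}",            # one parent job per Region
        "jobQueue": queue,
        "jobDefinition": job_def,
        "arrayProperties": {"size": array_size},   # one child task per simulation
    }

# Usage with boto3 (assumes credentials and per-Region Batch resources):
#   import boto3
#   for region in ["us-east-1", "eu-west-1", "ap-southeast-2"]:
#       batch = boto3.client("batch", region_name=region)
#       batch.submit_job(**build_submit_kwargs(
#           region, "gromacs-queue", "gromacs-jobdef", 500))
```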
Depending on the trade-off between cost and time-to-results the Max Planck research team wanted to achieve for a given simulation, they had a few options for securing their workload’s Spot capacity:
- For the lowest possible cost, they could run only on the preferred G4dn GPU instance types, but this doesn’t offer much diversification. Since G4dn instances are popular across many workloads, including HPC, deep learning, and graphics rendering, they can often be in short supply in the Spot capacity pools. That can increase the rate of interruptions and stretch the simulation time to multiple days, which is not always workable.
- For a faster time-to-results, the team could use a highly diversified mix of instance types, including a variety of EC2 compute-optimized instances. Increasing diversification taps more compute capacity overall, so the simulations can start, and finish, sooner.
For this simulation run, the team optimized for shorter time-to-results and used a mix of C5 and G4dn instance types of various sizes, as per the Spot diversification best practices.
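This kind of decision is where the Spot Placement Score API mentioned earlier comes in: given an instance-type mix and a target capacity, it scores each candidate Region from 1 to 10 by how likely the Spot request is to succeed. A minimal sketch, with illustrative instance types and Region names:

```python
# Sketch: rank candidate Regions for a diversified Spot request using
# the response from the GetSpotPlacementScores API. Scores run from
# 1 to 10; higher means the request is more likely to succeed.
def rank_regions(scores_response: dict) -> list:
    """Sort GetSpotPlacementScores results, best Region first."""
    scores = scores_response["SpotPlacementScores"]
    return sorted(scores, key=lambda s: s["Score"], reverse=True)

# Usage with boto3 (assumes credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   resp = ec2.get_spot_placement_scores(
#       InstanceTypes=["c5.4xlarge", "g4dn.xlarge", "g4dn.2xlarge"],
#       TargetCapacity=1000,
#       RegionNames=["us-east-1", "eu-west-1", "ap-southeast-2"],
#   )
#   best = rank_regions(resp)[0]["Region"]
```

A multi-Region setup like the one above could call this before each run and direct most of the capacity request to the highest-scoring Regions.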
To see results for the run, please read the full blog here.
Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel and following the AWS HPC Blog.