Amazon Elastic Compute Cloud (Amazon EC2) offers a wide-range of compute instances at different price points, all designed to match different customer’s needs. You can further optimize cost by choosing Reserved Instances (RIs) and even Spot Instances.
Amazon EC2 offers Spot Instances at up to 90% discount compared to on-demand instance prices, making them attractive to HPC customers. However, they can be reclaimed by Amazon EC2 with a two-minute warning. For an application to effectively use spot, it must have some resiliency to failure.
Checkpoint-restart is a defense mechanism for an application to save a known good (execution) state, and later resume execution from that point.
In this blog post, we will discuss the use of AWS ParallelCluster configuration for creating an HPC cluster on AWS, and Amazon EventBridge to capture the two-minute warning notification to execute checkpoint code on a Spot Instance reclaim. We’ll use Gadget-4, a cosmological N-body, SPH code for demonstrating the methodology…
Read the full blog to learn more. Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.