GPU failures are relatively rare, but when they do occur they can have severe consequences for HPC and deep learning workloads, disrupting long-running simulations and distributed training jobs. Amazon EC2 verifies GPU health before launching an instance and runs periodic status checks that can detect and mitigate many failure modes. However, these checks can miss failures that only emerge after a GPU instance has been active for some time.
With AWS ParallelCluster 3.6, you can configure NVIDIA GPU health checks that run at the start of your Slurm jobs. If the health check fails, the job is re-queued on another instance. Meanwhile, the failing instance is marked as unavailable for new work and is de-provisioned once any other jobs running on it have completed. This helps increase the reliability of GPU-based workloads (NVIDIA-based ones, at least) and helps prevent unwanted spend on jobs that would otherwise fail.
Using GPU Health Checks
To get started with GPU health checks, you’ll need ParallelCluster 3.6.0 or higher…
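As a rough illustration, here's a minimal sketch of a cluster configuration with GPU health checks enabled for a Slurm queue. The region, subnet IDs, key name, instance types, and counts are placeholders, and the `HealthChecks` / `Gpu` / `Enabled` setting shown here should be verified against the ParallelCluster 3.6 documentation for your version:

```yaml
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-01234567        # placeholder
  Ssh:
    KeyName: my-key                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      # Run the NVIDIA GPU health check at the start of every job in this queue
      HealthChecks:
        Gpu:
          Enabled: true
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-89abcdef          # placeholder
```

With a configuration like this in place, jobs submitted to the queue trigger the health check automatically when they start; no changes to the job scripts themselves should be needed.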
Read the full blog to learn more. Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel and following the AWS HPC Blog channel.