We recently launched two new Amazon EC2 instance families based on Intel’s Ice Lake processors: the C6i and M6i. These instances provide higher core counts and take advantage of generational performance improvements in Intel’s Xeon Scalable processor family.
In this post we show how GROMACS performs on these new instance families. We use a methodology similar to that of previous posts, where we characterized price-performance for CPU-only and GPU instances (Part 1, Part 2, Part 3) and provided instance recommendations for different workload sizes.
Today we’ll compare the performance of C6i with its predecessors, the c5.24xlarge and c5n.18xlarge, which were our top picks in previous posts for single-node and scale-out simulation runs, respectively. We also show a quick comparison against GPU instances. Because we’ll only be using the largest sizes in each of these families, we’ll just refer to the instance family name throughout this post, for brevity.
Our setup
The c6i.32xlarge (C6i from now on) is the instance we used for all the analyses described in this post. It comes with 64 physical cores, 256 GiB of RAM and carries a 50 Gbps network interface with the Elastic Fabric Adapter (EFA).
We’ll use the same benchmark cases from the Max Planck Institute for Biophysical Chemistry that were used in earlier posts. These cases represent three different classes of input sizes: small (benchMEM, 82k atoms), medium (benchRIB, 2M atoms), and large (benchPEP, 12M atoms) systems.
To enable the single-node and multi-node scaling runs, we used AWS ParallelCluster to set up the HPC cluster with SLURM as our job scheduler. We compiled GROMACS from source using the Intel 2020 compiler to take advantage of single-instruction-multiple-data (SIMD) AVX2 and AVX512 optimizations, and the Intel MKL FFT library. You can see the specific compiler flags used with the Intel compiler in the online workshop we reference at the end of this post.
Single-node performance and price-performance
Our earlier post showed C5 as the best choice for single-node simulations on a CPU-only instance: it was the best in both performance and price-performance across the three workloads. This changes with C6i, which is now the better instance. The main reason is its higher core count, which lets more of the simulation run in parallel on a single instance.
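To make the price-performance comparison concrete, here is a minimal sketch of one way to compute it. It assumes the metric is simulated nanoseconds per US dollar, which is a definition chosen for illustration; the function name, the instance labels, and all input numbers are hypothetical placeholders, not measured results or actual prices.

```python
def ns_per_dollar(ns_per_day: float, price_per_hour: float) -> float:
    """Simulated nanoseconds obtained per US dollar of instance time.

    Assumed definition for illustration: throughput (ns/day) divided by
    the cost of running the instance for one day (hourly price * 24).
    """
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return ns_per_day / (price_per_hour * 24.0)


# Placeholder inputs only: these are NOT benchmark results or real prices.
candidates = {
    "instance_a": {"ns_per_day": 100.0, "price_per_hour": 4.0},
    "instance_b": {"ns_per_day": 130.0, "price_per_hour": 5.0},
}

# Rank the candidates from best to worst price-performance.
ranking = sorted(
    candidates,
    key=lambda name: ns_per_dollar(**candidates[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {ns_per_dollar(**candidates[name]):.3f} ns per dollar")
```

With a metric like this, a faster instance only wins on price-performance if its throughput grows by more than its hourly price does.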
Absolute performance
Figure 1 shows the comparison of C5 and C6i in terms of performance (ns/day) across the three workloads. Notice that for the smallest workload (benchMEM), the performance on C5 is still comparable to C6i, but as the workload size increases, the C6i takes the lead. This is because the smaller workloads can’t make full use of the parallelism the larger instance provides. Still, a C6i variant with a core count comparable to C5 usually performs similarly to, or slightly better than, C5 for these smaller workloads.
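One rough way to see why the small case benefits less from the extra cores is to look at how many atoms each core gets. The sketch below uses the approximate atom counts and the 64-core figure quoted earlier in this post; the atoms-per-core ratio is only a back-of-the-envelope heuristic assumed here for illustration, not how GROMACS actually decomposes the work.

```python
# Approximate system sizes as quoted in this post (rounded values).
ATOMS = {
    "benchMEM": 82_000,
    "benchRIB": 2_000_000,
    "benchPEP": 12_000_000,
}

# Physical cores on the c6i.32xlarge, as stated in the setup section.
C6I_CORES = 64


def atoms_per_core(atoms: int, cores: int) -> float:
    """Average number of atoms assigned to each core (illustrative heuristic)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return atoms / cores


for name, atoms in ATOMS.items():
    print(f"{name}: {atoms_per_core(atoms, C6I_CORES):,.0f} atoms per core")
```

The smaller the work per core, the larger the share of time that tends to go to communication and synchronization rather than computation, which is consistent with benchMEM gaining the least from the higher core count.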
Read the full blog to learn more about GROMACS performance on Amazon EC2 with Intel Ice Lake processors.
Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.