In April 2019, AWS announced the general availability of Elastic Fabric Adapter (EFA), an Amazon EC2 network device that improves the throughput and scalability of distributed High Performance Computing (HPC) and Machine Learning (ML) workloads. Recently, we added support for configuring EFA through AWS ParallelCluster. This tutorial walks you through the configuration options that enable EFA support in AWS ParallelCluster.
EFA is a network interface for Amazon EC2 instances that enables you to run HPC applications requiring high levels of inter-instance communications (such as computational fluid dynamics, weather modeling, and reservoir simulation) at scale on AWS. It uses an industry-standard operating system bypass technique, with a new custom Scalable Reliable Datagram (SRD) Protocol to enhance the performance of inter-instance communications, which is critical to scaling HPC applications.
AWS ParallelCluster takes care of the undifferentiated heavy lifting involved in setting up an HPC cluster with EFA enabled. When you set enable_efa = compute in the cluster section of your configuration, AWS ParallelCluster adds EFA to all network-enhanced instances. Under the hood, AWS ParallelCluster performs the following steps:
- Sets InterfaceType = efa in the Launch Template.
- Ensures that the security group has rules to allow all inbound and outbound traffic to itself. Unlike traditional TCP traffic, EFA requires an inbound rule and an outbound rule that explicitly allow all traffic to its own security group ID sg-xxxxx. See Prepare an EFA-enabled Security Group for more information.
- Installs an EFA-enabled kernel, libfabric, and Open MPI 3.1.4.
- Validates the instance type, base OS, and placement group.
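To make the security group step concrete, here is a minimal Python sketch of an equivalent self-referencing rule. It is not the code AWS ParallelCluster itself runs; it only illustrates the shape of the rule. The helper names `self_referencing_rules` and `allow_efa_traffic` are hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client (for example, `boto3.client("ec2")`).

```python
def self_referencing_rules(group_id):
    """Build an IpPermissions payload that allows all traffic whose
    source (or destination) is the security group itself.

    IpProtocol "-1" means "all protocols".
    """
    return [
        {
            "IpProtocol": "-1",
            "UserIdGroupPairs": [{"GroupId": group_id}],
        }
    ]


def allow_efa_traffic(ec2_client, group_id):
    """Add one inbound and one outbound rule that reference the group's own ID.

    Assumption: ec2_client is a boto3 EC2 client. The calls fail if an
    identical rule already exists, so a real script would handle that case.
    """
    permissions = self_referencing_rules(group_id)
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=permissions
    )
    ec2_client.authorize_security_group_egress(
        GroupId=group_id, IpPermissions=permissions
    )
```

Keeping the payload builder separate from the API calls makes it easy to inspect the rule before anything is sent to AWS.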
To get started, you’ll need to have AWS ParallelCluster set up; see Getting Started with AWS ParallelCluster. For this tutorial, we’ll assume that you have AWS ParallelCluster installed and are familiar with the ~/.parallelcluster/config file.
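As a rough illustration of what an EFA-enabled cluster section might look like, the Python sketch below embeds a sample configuration and runs a few sanity checks on it. The sample values (c5n.18xlarge, alinux, DYNAMIC) and the helper `check_efa_settings` are assumptions chosen for illustration. The checks only loosely mirror the validation that AWS ParallelCluster performs and are not its actual logic.

```python
import configparser

# Illustrative config text; in practice this lives in ~/.parallelcluster/config.
SAMPLE_CONFIG = """
[global]
cluster_template = default

[cluster default]
base_os = alinux
compute_instance_type = c5n.18xlarge
enable_efa = compute
placement_group = DYNAMIC
"""

# Settings this sketch expects alongside enable_efa (an assumption, not an official list).
REQUIRED_KEYS = ("base_os", "compute_instance_type", "placement_group")


def check_efa_settings(config_text, cluster_name="default"):
    """Return a list of problems found; an empty list means the section looks plausible."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = f"cluster {cluster_name}"
    if not parser.has_section(section):
        return [f"missing section [{section}]"]
    problems = []
    if parser.get(section, "enable_efa", fallback=None) != "compute":
        problems.append("enable_efa is not set to compute")
    for key in REQUIRED_KEYS:
        if not parser.has_option(section, key):
            problems.append(f"{key} is not set")
    return problems


print(check_efa_settings(SAMPLE_CONFIG))  # prints []
```

To check your own file, you could pass the contents of ~/.parallelcluster/config to `check_efa_settings` in place of `SAMPLE_CONFIG`.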
Read the complete blog and follow along with step-by-step instructions here.