Learn how AWS created a new network interface designed for applications requiring high levels of inter-node communications at scale.
When we started designing our Elastic Fabric Adapter (EFA) several years ago, many people were skeptical that it could support customers running all the difficult-to-scale, “latency-sensitive” codes, like weather simulations and molecular dynamics.
Partly, this was due to these codes emerging from traditional on-premises HPC environments with purpose-built machine architectures. These machine designs are driven by every curve ball and every twist in some complicated algorithms. This is “form meets function”, though maybe “hand meets glove” is more evocative. It’s common in HPC to design machines to fit a code, and just as common to find that, over time, the code comes to fit the machine. Our other hesitation came from the HPC hardware community getting very focused in recent years on the impact of interconnect latency on application performance. Over time, we came to assume it was a gating factor.
But it turns out: EFA rocks at these codes! Ever since we launched EFA, we’ve been learning new lessons about whole classes of codes that are more constrained by throughput than by latency. The main lesson was that the single-packet latency measured by micro-benchmarks is a distraction when you’re trying to predict how a code will perform on an HPC cluster.
Latency isn’t as important as we thought
Latency isn’t irrelevant, though. It’s just that many of us in the community overlooked the real goal, which is for MPI ranks on different machines to exchange chunks of data quickly. Single-packet latency measures are only a good proxy for that when your network is perfect, lossless, and uncongested. Real applications send large chunks of data to each other. Real networks are busy. What governs “quickly” is whether the hundreds or thousands of packets that make up that data exchange arrive intact, and soon enough (in time for the next timestep in a simulation, say) that they’re not holding up the other ranks.
Most fabrics (like InfiniBand) and protocols (like TCP) send packets in order. That’s a design choice those transports made back in the day, making it the network’s problem to reassemble messages into contiguous blocks of data. It means a single lost packet delays the on-time arrival of every packet queued behind it (an effect called “head-of-line blocking”). On the other hand, it saved the need for a fancy PCI card (or worse, an expensive CPU) to reassemble the data and chaperone all the stragglers.
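To make that concrete, here’s a toy model of head-of-line blocking (my own sketch with invented numbers, not a description of any real fabric). A 1,000-packet message loses one packet, which is retransmitted 200 microseconds later. With in-order delivery, everything queued behind it is held up until the retransmit lands; with out-of-order reassembly, only the one packet is late.

```c
#include <stdio.h>

#define NPKTS 1000

int main(void)
{
    const double gap_us = 1.0;    /* assumed packet spacing on the wire */
    const double rto_us = 200.0;  /* assumed retransmit delay           */
    double arrival[NPKTS];

    for (int i = 0; i < NPKTS; i++)
        arrival[i] = i * gap_us;
    arrival[10] += rto_us;        /* one lost, retransmitted packet */

    int delayed_inorder = 0, delayed_ooo = 0;
    double running_max = 0.0;
    for (int i = 0; i < NPKTS; i++) {
        /* In-order delivery: a packet reaches the application only
         * after everything before it has, so its delivery time is
         * the maximum arrival time seen so far. */
        if (arrival[i] > running_max)
            running_max = arrival[i];
        if (running_max > i * gap_us)
            delayed_inorder++;

        /* Out-of-order reassembly: a packet is usable on arrival. */
        if (arrival[i] > i * gap_us)
            delayed_ooo++;
    }

    printf("packets delayed (in-order delivery):       %d\n", delayed_inorder);
    printf("packets delayed (out-of-order reassembly): %d\n", delayed_ooo);
    return 0;
}
```

In this model, the single retransmit delays 200 packets under in-order delivery, but only one packet when reassembly can happen out of order.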
You can see why single-packet latency matters for these fabrics – it’s literally going to make all the difference to how fast they can recover from a lost packet and maintain throughput.
The measure we should all pay more attention to is the p99 tail latency: the latency that 99% of all packets arrive within. It captures more of the “real network” effects, because it’s the net result of all those lost packets, retransmits, and congestion, and it’s the best predictor of the overall performance of MPI applications. That’s mainly because of collective operations (like MPI_Barrier or MPI_Allreduce), which hold up processing until every rank is synchronized before moving on to a crucial next step. It’s why you often hear experienced HPC engineers say that MPI codes are only as fast as the slowest rank.
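You can see that “slowest rank” effect in a few lines of MPI. Here’s a minimal sketch (assuming an MPI toolchain like mpicc and mpirun; the sleep times are invented), where one straggling rank, standing in for a rank delayed by retransmits, sets the pace for everyone at an MPI_Allreduce:

```c
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simulate uneven progress: most ranks finish their "work" in
     * 5 ms, but the last rank takes 50 ms (made-up numbers). */
    useconds_t work_us = (rank == size - 1) ? 50000 : 5000;
    usleep(work_us);

    double local = 1.0, sum = 0.0;
    double t0 = MPI_Wtime();
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* The fast ranks stall inside the collective waiting for the
     * straggler: the whole job runs at its pace. */
    printf("rank %d waited %.1f ms in MPI_Allreduce\n",
           rank, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}
```

Build and run it with something like mpicc demo.c -o demo && mpirun -n 8 ./demo, and each fast rank reports roughly 45 ms spent waiting: the straggler’s delay, not its own.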
This wasn’t news to protocol designers. It’s just that, in the early 2000s, we didn’t have the same low-cost, high-performance silicon options we had later in 2018 or 2021. We also didn’t have AWS.
This last point is significant, because operating infrastructure at the scale and pace of growth we do forces some different design choices. Firstly, AWS is always on. Even a national supercomputer facility goes offline for maintenance every now and again: usually it’s a window for vital repairs, file system upgrades, or refactoring the network fabric to make room for expansion. We don’t have that luxury, because customers rely on us 24×7. And because we’re so reliable, agile, and scalable, we’re experiencing a pace of growth that means literal truckloads of new servers arrive at our data centers every day. HPC customers kept telling us that they love the reliability, agility, and scalability of AWS, too, so just building a special Region full of HPC equipment wasn’t going to get us off the hook.
The lesson we drew from this was that we couldn’t build islands of specially connected CPUs, because they’d quickly be surrounded by oceans of everything else. And that “everything else” is part of what makes the cloud magical. Why restrict HPC people to a small subset of the cloud? Couldn’t we make that everything else part of the solution?
Read the full blog to learn how and why we built our own reliable datagram protocol that enables our customers to scale their tightly-coupled HPC or distributed machine learning applications to thousands of cores.