Finding areas of performance degradation in High Performance Computing (HPC) networking has been an underserved problem for as long as HPC has existed. As Amdahl’s Law reminds us, a parallel job is only as fast as its slowest component, and therefore tail latencies control overall execution times.
The main contributor to this problem is that both HPC and Artificial Intelligence (AI) workloads run calculations across tens of thousands to hundreds of thousands of compute elements today, and soon millions. The effect is subtle because an application’s latency is related to, but not solely driven by, its own synchronization patterns; whenever ranks synchronize, the slowest one sets the pace for all of them.
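To make this concrete, the minimal MPI sketch below (not part of GPCNeT; the iteration count and payload size are arbitrary placeholders) times a small allreduce on every rank and reports both the mean latency and the worst-case, or tail, latency across the job. In a bulk-synchronous step, the tail is the latency every rank actually pays.

```c
/* Minimal sketch (not GPCNeT itself): time a small allreduce in a loop and
 * report the average and worst-case (tail) iteration time across ranks.
 * When ranks synchronize every step, the slowest rank's time -- the tail --
 * is the time every rank pays. Iteration count and payload are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int iters = 1000;                  /* placeholder iteration count */
    double send = (double)rank, recv = 0.0;
    double local_sum = 0.0, local_max = 0.0;

    for (int i = 0; i < iters; i++) {
        double t0 = MPI_Wtime();
        MPI_Allreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;
        local_sum += dt;
        if (dt > local_max) local_max = dt;
    }

    /* The job-wide average hides the problem; the job-wide max exposes it. */
    double avg_all, max_all, local_avg = local_sum / iters;
    MPI_Reduce(&local_avg, &avg_all, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_max, &max_all, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("mean latency %.3f us, tail (max) latency %.3f us\n",
               1e6 * avg_all / nranks, 1e6 * max_all);

    MPI_Finalize();
    return 0;
}
```

Even a modest gap between the mean and the maximum in a run like this translates directly into wasted time on every synchronizing rank.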
These tail latency problems are not new. Every distributed system that depends on a network has fallen victim to this familiar tale, but one thing has changed: customers now expect their switch and system vendors to do something about congestion.
Both hyperscalers and HPC centers have to meet or beat Moore’s Law in delivered application performance, not in peak theoretical performance, transistor count, or any other headline metric. Reducing the effects of tail latency would dramatically improve the real efficiency of applications running on supercomputers.
Tail latency is an ongoing challenge that needs a benchmark like the Global Performance and Congestion Network Tests (GPCNeT). Standards are critical and must be put in place to address these latency concerns and to bring high-quality, demonstrable information to the acquisition and operation of interconnects. Making these benchmarks better will require assistance from everyone in the industry.
FOR INDUSTRY DISCOURSE: WHAT DOES IT TAKE TO BE A USEFUL INDUSTRY BENCHMARK TEST?
As a starting point for consideration by the community at large, this white paper proposes the following five requirements:
- It has to be straightforward to set up and run.
- It has to be able to scale as hardware does and run across interconnects of different styles, generations, and topologies.
- It has to work for a spectrum of system sizes and types. In other words, it has to run on anything from a cluster with a relatively small number of nodes and interconnect switches up to a system with a large number of nodes and several layers of switching. The latter is vitally important because HPC and AI systems generally run at scale, though scale means different things to different workloads.
- It has to be difficult to game, but easy to use, and produce results that are clear and meaningful.
- Lastly, it has to be relatively inexpensive to run, so that machines of all makes and models can run the test and show how well or poorly their adaptive routing and congestion control technologies perform across time and across architectures.
To address these industry needs, the GPCNeT benchmark was created to measure how well interconnects and their software stacks handle congestion while keeping a diverse set of applications running well at high network utilization.
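The sketch below illustrates the core idea; it is not GPCNeT’s actual source code. It splits the job into hypothetical “victim” ranks running a latency-sensitive pattern and “congestor” ranks driving heavy traffic, measures the victim latency with and without the congestors, and reports the ratio as a rough congestion impact factor. All sizes, iteration counts, and the rank split are placeholders.

```c
/* Illustrative sketch only -- not GPCNeT's actual source. It shows the idea
 * behind a congestion impact metric: measure a latency-sensitive "victim"
 * pattern in isolation, measure it again while "congestor" ranks flood the
 * network, and report the ratio. All sizes and counts are placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 500                    /* placeholder iteration count        */
#define CONGESTOR_BYTES (1 << 20)    /* placeholder 1 MiB congestor message */

/* Victim pattern: time small allreduces, return the mean latency. */
static double victim_latency(MPI_Comm comm)
{
    double x = 1.0, y, t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&x, &y, 1, MPI_DOUBLE, MPI_SUM, comm);
    return (MPI_Wtime() - t0) / ITERS;
}

/* Congestor pattern: exchange large messages with a partner rank. */
static void congestor_traffic(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int partner = rank ^ 1;                   /* pair up neighboring ranks */
    if (partner >= size) return;
    char *buf = malloc(CONGESTOR_BYTES);
    memset(buf, 0, CONGESTOR_BYTES);
    for (int i = 0; i < ITERS; i++)
        MPI_Sendrecv(buf, CONGESTOR_BYTES, MPI_BYTE, partner, 0,
                     buf, CONGESTOR_BYTES, MPI_BYTE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    free(buf);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Hypothetical split: first half of the ranks are victims, rest congestors. */
    int is_victim = rank < nranks / 2;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, is_victim, rank, &sub);

    /* Phase 1: victims measure alone while congestors idle at the barrier. */
    double isolated = is_victim ? victim_latency(sub) : 0.0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: victims measure again while congestors flood the fabric. */
    double congested = 0.0;
    if (is_victim) congested = victim_latency(sub);
    else           congestor_traffic(sub);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0 && is_victim)
        printf("congestion impact factor: %.2fx (%.1f us -> %.1f us)\n",
               congested / isolated, 1e6 * isolated, 1e6 * congested);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```

In a sketch like this, a ratio near 1 suggests the fabric isolates the victim workload well, while a large ratio suggests that neighboring traffic dominates the victim’s latency, which is exactly the behavior adaptive routing and congestion control are meant to prevent.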
The GPCNeT benchmarking tool helps build better systems, improves customers’ ROI, and gives engineers the data they need to keep improving network hardware and software in next-generation products. Read our white paper to learn more about the challenges tail latency poses and how effective use of a benchmark can bring insight and impact to your data center.