Avoiding Scientific Computing Bottlenecks in the Cloud
Yesterday, HPC in the Cloud discussed the prospect of running scientific computing applications in the cloud on Amazon’s CPU and GPU cores in EC2, particularly with regard to computational fluid dynamics. It is fitting, then, that HPC Experiment, a research initiative made up of teams of IT engineers, experts, and analysts, hosted a presentation on the subject this week.
Frank Ding, Engineering Analysis & Technical Computing Manager at Simpson Strong-Tie, discussed the advantages of utilizing the cloud for occasional scientific computing, identified the obstacles to doing so, and proposed workarounds to some of those obstacles.
“Realistic modeling is the key to getting good model fidelity,” Ding said on the goals of scientific computing in general and how to attain them. “HPC is required to get good turnaround time in simulations and also to solve large models.”
Specifically, Ding spoke of these large models as they related to running structural analysis in the Abaqus software suite. “The goal is to shorten the product model cycle and reduce the number of physical prototypes,” Ding said on what his group is using Abaqus for and the overall goals they hope to achieve through HPC. Below is an example of the type of problem on which Ding’s group worked.
Their group has to evaluate the structural integrity of materials that are ‘non-linear,’ meaning their density varies over the volume of the material. As a result, high-fidelity modeling is required to essentially map out each cubic inch of the material as it reacts to various stresses.
Ding’s HPC cluster is a 4-node, 32-core system that uses Nehalem-based Xeon processors and InfiniBand DDR. The group used that system to compute concrete anchor bolt tension capacity, a problem with 1.9 million degrees of freedom, where a degree of freedom in structural mechanics is an independent way in which a point in the model can move and which the solver must therefore account for.
In short, Ding said that, “If I have a large job, I will be limited by current capacity.”
On 32 cores, the simulation required eleven and a half hours of runtime, but his team looked to cut down that runtime by hosting some data and computations in a cloud setting. Outsourcing those capabilities to a cloud is a more reasonable financial option than simply expanding or even revamping their existing HPC cluster. Of course, various performance and latency concerns pop up when computing is moved to the cloud. For Ding, one of the more important and underrated of those obstacles is the internet bandwidth of the end user.
While this handicap was apparent to Ding’s group in particular, it is reasonable to believe they are not the only group with that potential problem. The optimal bandwidth at which to shuttle information for this kind of project is, according to Ding, around five megabits per second (Mbps) at the desktop. That figure is attainable as an average over time, so the problem lies in variability and randomness.
For example, the connection will sometimes run at ten Mbps and at other times ten times slower. That represents a variability of 70 to 80 percent, according to Ding, an unacceptable figure when important computations are being carried out.
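To put those bandwidth figures in perspective, the short sketch below estimates how long a single result file would take to move at the fast, average, and slow rates mentioned above. The 2 GB file size is a hypothetical value chosen for illustration, not a number from Ding’s presentation, and the function name is likewise an assumption.

```python
def transfer_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Return the time in seconds to move size_gb gigabytes at bandwidth_mbps.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    megabits = size_gb * 8 * 1000
    return megabits / bandwidth_mbps


if __name__ == "__main__":
    RESULT_FILE_GB = 2.0  # hypothetical size, for illustration only
    # Rates from the article: 10 Mbps peak, 5 Mbps average, 1 Mbps (ten times slower)
    for mbps in (10.0, 5.0, 1.0):
        hours = transfer_seconds(RESULT_FILE_GB, mbps) / 3600
        print(f"{mbps:>4.1f} Mbps: {hours:.2f} hours")
```

Under these assumptions the same download takes under half an hour on a good connection and well over four hours on a bad one, which is the kind of unpredictability Ding describes.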
As such, Ding noted that “end point internet bandwidth and randomness is the top barrier for good end user experience.” To combat that unreliability, Ding suggested employing a job monitoring system that recognizes when bandwidth is slower and adjusts accordingly. While adding another software layer to the existing Abaqus system in their case may not be ideal, it would help alleviate bottleneck issues that can set projects back for hours. “Some workflow details have been advised to improve end user experience, such as job monitoring,” Ding said.
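Ding did not describe how such a monitoring layer would be built. As a rough illustration only, the sketch below keeps a rolling average of recent throughput measurements and recommends deferring large transfers when the average drops below a threshold. The class name, the window size, the 2.5 Mbps threshold, and the action labels are all assumptions made for this example and have nothing to do with Abaqus or any real job scheduler.

```python
from collections import deque
from typing import Optional


class BandwidthMonitor:
    """Hypothetical monitor that tracks recent throughput samples in Mbps."""

    def __init__(self, window: int = 5, slow_threshold_mbps: float = 2.5):
        # Keep only the most recent `window` samples
        self.samples = deque(maxlen=window)
        self.slow_threshold_mbps = slow_threshold_mbps

    def record(self, mbps: float) -> None:
        """Store one throughput measurement."""
        self.samples.append(mbps)

    def average(self) -> Optional[float]:
        """Return the rolling average, or None if nothing has been recorded."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def recommended_action(self) -> str:
        """Suggest what the workflow should do with a pending large transfer."""
        avg = self.average()
        if avg is None:
            return "measure"
        if avg < self.slow_threshold_mbps:
            return "defer_large_transfers"
        return "transfer_now"


if __name__ == "__main__":
    monitor = BandwidthMonitor(window=3)
    for sample in (10.0, 9.0, 1.0, 1.0, 1.0):
        monitor.record(sample)
        print(sample, monitor.recommended_action())
```

A real system would also need a way to measure throughput and to act on the recommendation, for example by sending only small status files while the link is slow and fetching full results later.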
He also pointed to virtualization and a solution manager layer to further cut down on bottlenecks. Virtualizing throughout an institution’s HPC cluster could cut down on accumulation, meaning the clusters (the in-house one and the one in the cloud) would not have to undergo delay-inducing large data transfers. A solution manager would be key in identifying which specific cores and points are underperforming or lacking in bandwidth, reducing the loads on those individual cores when necessary.
Currently, according to Ding, 450 people are participating in the HPC Experiment across 80 teams that have been formed over three ‘rounds.’ “Each team,” Ding explained, “consists of an end user, a resource provider, a software provider, and an expert and these parties come together to resolve the problem the end user is facing in HPC in the cloud.”
Of those 80 teams, many hail from institutions that have in-house HPC systems, but those systems are not always large enough to cover the institutions’ needs for especially intensive computations. Work like this is important to ensure science’s continued advancement, which is facilitated by more and more institutions gaining access to high performance computing. Going forward, deploying some of that computing in the cloud will hopefully make such computing more accessible.