For many HPC users, needs are not evenly distributed throughout the year: some might need few – if any – resources for months, then require a very large system for a week. For those kinds of users, large on-premises systems might not make much sense, and “cloud bursting” – renting lots of cloud resources for brief, intensive computing – is a more viable alternative. In a new case study, IBM Research described how they burst to the IBM Cloud to advance their electronic design automation (EDA) workloads, reaching bursts of up to 11,000-plus cores.
The authors present an optical proximity correction (OPC) workload, wherein an engineer’s intended semiconductor design undergoes shape manipulation, and the manipulated shapes are then used to create the photomask applied during fabrication. There are trillions of possibilities for these semiconductor shapes, leading the workload to be “computationally intensive, embarrassingly parallel and a perfect example for cloud bursting.”
“By breaking up the design into millions of tiles and then running OPC on all of the tiles in parallel, one can take a slow step in the semiconductor manufacturing process and speed it up by an order of magnitude,” they wrote. “OPC run times are usually limited by the number of cores available on a farm. OPC runs that might take over a week using ~2000 [cores] can be scaled to 10,000 cores or much higher in the cloud, with a tremendous runtime improvement. Considering that a typical semiconductor device may have well over 50 mask layers (all of which need OPC), reducing the OPC run time of each layer translates directly to critical time-to-market gains[.]”
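The tile-level parallelism the authors describe can be sketched in a few lines. The snippet below is a hypothetical illustration only – the function names (`run_opc`, `correct_layout`) and the placeholder shape manipulation are assumptions, since the real per-tile work is done by the EDA tool – but it shows why the workload is embarrassingly parallel: no tile depends on any other, so throughput is limited mainly by the number of cores available.

```python
from multiprocessing import Pool

def run_opc(tile):
    """Stand-in for the per-tile OPC computation (hypothetical;
    in practice this step is performed by the EDA software)."""
    tile_id, shapes = tile
    corrected = [s + 1 for s in shapes]  # placeholder shape manipulation
    return tile_id, corrected

def correct_layout(tiles, workers=4):
    """Fan the independent tiles out across a pool of workers.
    Because tiles share no state, adding cores scales the
    throughput nearly linearly."""
    with Pool(workers) as pool:
        return dict(pool.map(run_opc, tiles))

if __name__ == "__main__":
    layout = [(i, list(range(4))) for i in range(100)]  # 100 mock tiles
    result = correct_layout(layout)
    print(len(result))  # all 100 tiles processed independently
```

In the case study the “pool” is not a few local processes but a cloud cluster of thousands of cores, with tiles dispatched by a scheduler rather than `multiprocessing` – the structure of the computation, however, is the same.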
The IBM Research team applied Synopsys’ Proteus suite across all three steps of the OPC workflow (retargeting, the OPC itself and verification). To provision the cloud resources, the researchers used Terraform and Ansible to stand up a Linux cluster spanning three datacenters, ranging from 2,000 to a peak of 11,400 physical cores. On these resources, they simulated a hybrid cloud infrastructure by linking cloud resources in their US-South region to cloud resources in Great Britain. (“The configuration is similar to what might be used in a typical customer burst scenario where the license servers in US-South represent an on-prem data center, and the Linux cluster in EU-GB represents burst compute capacity[.]”)
“The OPC portion of the recipe is expected to scale linearly throughout a broad range of core counts and is a good indication of scalable performance of cluster infrastructure,” the researchers concluded. “[This] is remarkable for a number of reasons.” First, they said, they used virtual instances across three availability zones, and despite the physical limits that inter-zone distance places on latency, they “did not observe any evidence of latency issues”; second, the shared filesystem performed similarly well under the same conditions; and third, they “do not observe indications of network latency on the application network between the head node and worker processes.”
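The linear-scaling claim can be sanity-checked with back-of-envelope arithmetic. The week-long baseline on ~2,000 cores and the 11,400-core peak come from the article; the assumption of perfectly linear scaling is the idealization:

```python
def scaled_runtime_hours(baseline_hours, baseline_cores, target_cores):
    """Ideal linear scaling: runtime shrinks in proportion to cores added."""
    return baseline_hours * baseline_cores / target_cores

# A week-long run (168 hours) on ~2,000 cores, rescaled to the 11,400-core peak.
print(scaled_runtime_hours(7 * 24, 2_000, 11_400))  # ~29.5 hours
```

Under that idealization, a run of over a week compresses to roughly a day and change, consistent with the order-of-magnitude speedup the authors describe, and multiplied across 50-plus mask layers the time-to-market effect compounds.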
“For future work, we plan to expand these scaling results even higher,” they wrote. “The primary questions are: How many clones of Michelangelo are possible? And what’s the shortest amount of time we can achieve for painting our version of the Sistine Chapel?”