It seems that everyone is experimenting with cloud computing. While terms like Cloud Bursting and Hybrid Cloud are sliding into Gartner’s trough of disillusionment, talk of Cloud-Native methods and Cloudlets is on the rise. [1] According to research from Hyperion, 70% of HPC sites run jobs in public clouds. [2] While this is impressive, the same research shows that ~90% of HPC workloads still run on-premises. In other words, while the use of cloud is widespread, it still accounts for only a small fraction of HPC workloads after more than a decade of availability. In this article, I look at common advantages and pitfalls of HPC in the cloud.
What’s old is new again
Cloud computing comes in many forms, including co-location, managed services, and the on-demand services offered by public cloud providers. At its heart, cloud is essentially an outsourcing model, similar in some respects to the timeshare service bureaus of the 1960s. At the time, computer systems were often too expensive for organizations to manage in-house – a dynamic similar to today, although complexity is presently the key driver for outsourcing rather than hardware costs alone. Companies eventually turned away from service bureaus because of issues around cost transparency, lack of control, fear of lock-in, and the availability of less expensive minicomputers.
While modern cloud providers are differentiated by their scale and self-serve automation, they are similar in some respects to the service bureaus of old. History doesn’t always repeat itself, but it often rhymes, so it’s useful to take a critical look at HPC in the cloud to avoid unexpected cost and performance issues.
Conventional thinking on cloud
Much of the interest in cloud is fueled by perceptions related to cost, convenience, and flexibility. For HPC users, however, the issues are complex. Let’s look at some common assertions about cloud from the perspective of HPC users.
Cloud computing reduces complexity – While true in some circumstances, it’s worth noting that HPC centers predominantly consume IaaS offerings. HPC workloads are often unique and customized and don’t lend themselves to more hands-off SaaS or PaaS distribution models. While administrators are relieved of the need to manage capital-intensive physical hardware, they are still in the business of managing and maintaining operating systems, networks (in the form of VPCs), application software, and the distributed software frameworks that account for much of the complexity of modern HPC. They also take on new headaches related to security, managing cloud credentials, VPNs, direct connect offerings, data synchronization, availability and disaster recovery. Cost management is a particular challenge with multiple cloud acquisition models (on-demand, reserved, spot/preemptible instances) and tiered pricing schemes that vary by cloud service. As on-premises management tools continue to improve, there’s an argument to be made that managing HPC in the cloud is every bit as complicated as managing a local data center.
Cloud computing saves money – While the cloud can be less costly for spiky or short-duration workloads, most HPC data centers don’t operate this way. Unlike commercial data centers where average utilization is often low, HPC centers tend to wring every bit of performance out of infrastructure investments, often driving utilization in the range of 80-90%. At that level of utilization, on-demand or reserved cloud instances are generally more expensive than local infrastructure on a sustained basis. Interestingly, as Moore’s Law has stalled and single-threaded performance gains have slowed, there is a case for depreciating infrastructure over longer periods, making the economics of cloud less compelling. [3]
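To make the utilization argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a simplified model in which an owned node costs a fixed amount per wall-clock hour whether busy or idle, while a cloud instance is paid for only when used. The function names and the dollar figures are hypothetical placeholders, not measured prices.

```python
def breakeven_utilization(onprem_cost_per_hour: float,
                          cloud_price_per_hour: float) -> float:
    """Utilization above which an owned node is cheaper than renting.

    onprem_cost_per_hour: fully loaded cost of an owned node per
        wall-clock hour, paid whether the node is busy or idle.
    cloud_price_per_hour: price of a comparable cloud instance,
        paid only for the hours actually used.
    """
    if cloud_price_per_hour <= 0:
        raise ValueError("cloud price must be positive")
    return onprem_cost_per_hour / cloud_price_per_hour


def cheaper_option(utilization: float,
                   onprem_cost_per_hour: float,
                   cloud_price_per_hour: float) -> str:
    """Compare the cost of one *useful* node-hour on each platform."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Idle hours are still paid for on-premises, so the cost of a
    # useful hour rises as utilization falls.
    onprem_per_useful_hour = onprem_cost_per_hour / utilization
    if onprem_per_useful_hour < cloud_price_per_hour:
        return "on-premises"
    return "cloud"


if __name__ == "__main__":
    # Hypothetical figures chosen only to illustrate the shape of the curve.
    onprem, cloud = 1.00, 2.50
    print("break-even utilization:", breakeven_utilization(onprem, cloud))
    for u in (0.20, 0.50, 0.85):
        print(f"utilization {u:.0%}:", cheaper_option(u, onprem, cloud))
```

Under these assumed prices, the owned node wins at any utilization above the break-even point, which is why a center running at 80-90% sits firmly on the on-premises side of the comparison. Real comparisons would also need to account for staffing, power, data egress, and discount programs.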
Discounted spot or preemptible instances can be price-competitive with local infrastructure, but they are only suited to workloads that can tolerate instances being revoked at runtime. Such workloads are typically embarrassingly parallel. In IBM Spectrum LSF terminology, we describe them as “re-runnable,” and the workload manager masks the fact that machine instances come and go. It’s generally not practical to run parallel MPI jobs or stateful HPC services on these low-cost instances. This is part of the reason that life sciences workloads (which tend to tolerate preemption) account for a large portion of HPC cloud spending.
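The idea behind re-runnable work can be sketched in a few lines. This is an illustration of the general requeue-on-revocation pattern only, not how IBM Spectrum LSF implements it. The `InstanceRevoked` exception and the `run_rerunnable` helper are hypothetical names invented for this example.

```python
from collections import deque


class InstanceRevoked(Exception):
    """Hypothetical signal that a preemptible instance was reclaimed."""


def run_rerunnable(tasks, attempt_task, max_attempts=5):
    """Run independent tasks, requeueing any whose instance is revoked.

    This only works because each task is independent and can be
    restarted from scratch, which is what makes a workload tolerant
    of preemption. A tightly coupled MPI job has no such property.
    """
    queue = deque((task, 1) for task in tasks)
    results = {}
    while queue:
        task, attempt = queue.popleft()
        try:
            results[task] = attempt_task(task)
        except InstanceRevoked:
            if attempt >= max_attempts:
                raise
            # Put the task back at the end of the queue and try again later.
            queue.append((task, attempt + 1))
    return results


if __name__ == "__main__":
    revoked_once = set()

    def square_on_spot(task):
        # Simulate each even-numbered task losing its instance one time.
        if task % 2 == 0 and task not in revoked_once:
            revoked_once.add(task)
            raise InstanceRevoked(f"instance running task {task} was revoked")
        return task * task

    print(run_rerunnable(range(6), square_on_spot))
```

The caller sees a complete set of results and never learns which tasks were interrupted. A job whose ranks depend on one another cannot be handled this way, because losing one instance stalls or kills the whole job.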
Enterprise agreements, prepayment, and the use of reserved instances can mitigate the high cost of the cloud, but this requires careful planning and can undermine the pay-as-you-go flexibility that makes cloud computing attractive in the first place.
Cloud delivers performance on par with or better than local infrastructure – While cloud services have come a long way, it’s not necessarily true that cloud instances are faster than local hardware – in fact, the opposite is often true. It’s important to remember that cloud vCPUs are typically not physical cores – rather, they are threads on a hyper-threaded core. A 2018 benchmark found that, to overcome this difference and obtain similar compute capacity in the cloud, users needed to provision ~27% more vCPUs. [4] While fast interconnects and parallel file systems are offered by some cloud providers, you will find these services are premium-priced. Cloud HPC clusters often fall short of the low latency and high file-system bandwidth delivered by state-of-the-art hardware solutions.
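As a rough sizing aid, the ~27% figure can be applied as a simple uplift. The sketch below assumes the overhead acts as a flat multiplier on the number of on-premises CPUs being replaced, which is my simplification; the real ratio depends on the workload and instance type. The function name `vcpus_needed` is hypothetical.

```python
def vcpus_needed(onprem_cpus: int, overhead_percent: int = 27) -> int:
    """Estimate vCPUs to provision for capacity similar to on-premises.

    overhead_percent defaults to the ~27% uplift cited in the article
    and is treated here as a flat multiplier (an assumption).
    Integer arithmetic avoids floating-point surprises when rounding up.
    """
    if onprem_cpus < 0 or overhead_percent < 0:
        raise ValueError("inputs must be non-negative")
    scaled = onprem_cpus * (100 + overhead_percent)
    return (scaled + 99) // 100  # ceiling division by 100


if __name__ == "__main__":
    for cpus in (64, 100, 1000):
        print(cpus, "on-premises CPUs ->", vcpus_needed(cpus), "vCPUs")
```

Because cloud instances are billed per vCPU, the same uplift feeds directly into the cost comparison: a price quote that looks competitive per vCPU can look less so per unit of delivered compute.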
Storing data in the cloud is more convenient than on-premises alternatives – While cloud object storage is inexpensive and a useful way to share data among collaborating HPC sites, most HPC workloads perform file-system I/O and rely on block storage, shared cloud file systems, or high-performance parallel file systems (that in turn use block storage attached to individual machine instances). Storage tiers that support file systems tend to be expensive, so cloud users need to worry about provisioning storage appropriate for their workloads, managing the secure transfer of data to and from the cloud, replicating important data across availability zones, and orchestrating services and shifting data between storage tiers to manage long-term costs. On-premises storage, by contrast, can provide high-performance access to the same data using multiple access methods, including POSIX, HDFS, and object-storage interfaces such as S3, and has become increasingly easy to deploy and manage with turnkey, appliance-like storage solutions.
Keys to getting the most from cloud
While HPC cloud spending is on the rise, there is evidence that recent growth is, in part, fueled by the need for specialized GPUs not yet available in all HPC data centers. As data centers re-tool and add GPU-capable hardware, it will be interesting to see whether present growth rates can be sustained.
Gartner warns that 80% of cloud users will overshoot IaaS spending budgets through 2020 due to a lack of internal process controls. [5] This poses a tricky dynamic for academic and research-oriented HPC centers that are funded through periodic research grants and have limited OPEX budgets.
A prudent strategy may be to have an eye on the clouds but keep both feet planted firmly on the ground. HPC hardware and software solutions such as IBM Spectrum LSF, IBM Power Systems and IBM Elastic Storage Server provide cloud-like management facilities but afford organizations high performance, flexibility, and control without upside cost surprises. Organizations can have the best of all worlds with high-performance infrastructure and software that enables users to maximize productivity and easily burst to their choice of public clouds as required.
By maximizing the use of local infrastructure and avoiding cloud-specific APIs, organizations can take advantage of cloud as warranted while managing costs, avoiding the risk of cloud lock-in, and ensuring that they have the flexibility to run workloads on the infrastructure most appropriate for their needs.
1. Gartner Research – 2019 Hype cycle for cloud computing – https://www.gartner.com/en/documents/3956097/hype-cycle-for-cloud-computing-2019
2. Hyperion Research findings presented in November 2018 – https://hyperionresearch.com/wp-content/uploads/2019/02/Hyperion-Research-SC18-Breakfast-Presentation.pdf
3. How the Cloud is Falling Short for HPC – HPCwire – March 15, 2018, by Chris Dowling – https://www.hpcwire.com/2018/03/15/how-the-cloud-is-falling-short-for-research-computing/
4. AWS vs. GCP vs. on-premise performance comparison – March 2018 – https://medium.com/infrastructure-adventures/aws-vs-gcp-vs-on-premises-cpu-performance-comparison-1cb3e91f9716
5. Gartner Research – Ten moves to lower your AWS IaaS costs – https://www.gartner.com/en/documents/3847666/how-to-identify-solutions-for-managing-costs-in-public-c0