Rethinking the Economics of Serverless Computing for HPC

By Gabor Samu, IBM Spectrum Computing

May 7, 2019

Serverless computing has been around for almost five years now and is used in applications ranging from mobile and IoT backends to stream processing. Also known as function-as-a-service (FaaS), serverless platforms include IBM Cloud Functions, Google Cloud Functions, and AWS Lambda. At first glance, the economics look compelling, and some HPC users are actively exploring serverless for a variety of embarrassingly parallel workloads. In this article, we’ll discuss the pros and cons of serverless computing for HPC and point out some not-so-obvious pitfalls.

How Serverless Computing Works

The idea behind serverless is simple – rather than deal with the hassle of managing machine instances in the cloud, developers write a function in their favorite programming language and post it to a cloud FaaS platform. The cloud provider looks after the details, including infrastructure and software management, mapping functions to an API endpoint, and transparently scaling function instances on demand. Users simply pay a small fee for each function call. In the case of IBM Cloud Functions, this fee is $0.000017 per GB-second of runtime. To put this rate in perspective, a function configured with 1 GB of memory can run for a full hour (3,600 seconds × $0.000017 ≈ $0.061) for just over six cents.
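
To make this concrete, here is a minimal sketch of what such a function looks like. IBM Cloud Functions is built on Apache OpenWhisk, where a Python action is simply a main() function that receives a dictionary of invocation parameters and returns a JSON-serializable result; the parameter n and the computation here are hypothetical placeholders.

    # A minimal Python action in the Apache OpenWhisk style used by
    # IBM Cloud Functions: the platform calls main() with the
    # invocation parameters and expects a JSON-serializable dict back.

    def main(params):
        # 'n' is a hypothetical input parameter for this sketch.
        n = int(params.get("n", 10))
        total = sum(i * i for i in range(n))
        return {"n": n, "sum_of_squares": total}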

Serverless and HPC

For applications where workloads are variable, the FaaS model makes sense. Functions can be called anytime for a small usage-based charge, and many HPC problems fit the stateless, embarrassingly parallel compute model well suited to functions. Examples include stochastic analysis, parametric sweeps, and pricing calculations in financial risk.
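
As an illustration, the sketch below shows the kind of stateless task that maps naturally onto a function: a Monte Carlo price for a European call option under geometric Brownian motion. The function name and parameters are hypothetical; the point is that each invocation depends only on its inputs, so thousands of scenarios can run as independent function calls.

    import math
    import random

    def price_european_call(spot, strike, rate, vol, maturity, n_paths=100_000):
        # Stateless Monte Carlo pricer: the result depends only on the
        # arguments, so each scenario can run as an independent call.
        drift = (rate - 0.5 * vol ** 2) * maturity
        diffusion = vol * math.sqrt(maturity)
        payoff_sum = 0.0
        for _ in range(n_paths):
            terminal = spot * math.exp(drift + diffusion * random.gauss(0.0, 1.0))
            payoff_sum += max(terminal - strike, 0.0)
        # Discount the average payoff back to present value.
        return math.exp(-rate * maturity) * payoff_sum / n_paths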

Some potential pitfalls

As with any new technology, the devil is in the details. Based on experience working with clients who have piloted serverless platforms for HPC, there are some lessons to be learned:

  • Serverless is not always cheaper – If simulations are large, long-running, or memory-hungry, on-demand cloud instances can be more cost-effective. On AWS Lambda, after the free tier is consumed, a GB-hour of compute time on a single core costs roughly six cents. By comparison, an on-demand a1.xlarge EC2 instance offers 4x the cores and 8x the memory for just 10.2 cents per hour. Factoring in price and performance, the on-demand instance is almost 2.5 times more cost-efficient, assuming you can keep both environments busy (see the back-of-envelope sketch after this list).
  • Containerized service payloads can be large and slow to start – In serverless environments, you have no control over where your container runs, so you can’t pre-stage supporting code, libraries, or data on a machine instance. As a result, all code and data must be copied from object storage across congested networks and unpacked into every function instance, dramatically slowing start-up times and performance.
  • Capacity is finite, and latency is real – FaaS providers typically limit the number of concurrent function executions to 1,000. While you can request a higher limit, doing so will expose other bottlenecks and may turn into a pricing discussion with your cloud provider. A large HPC simulation can require tens of thousands of cores, and it can take hours to start that many function instances. While “warm-started” functions respond in milliseconds, “cold-starting” a function is slow, as explained above. In one client’s application, new functions could only be brought online at a rate of 6,000 per hour – at that rate, the on-premises HPC cluster could complete the simulation before the serverless platform could even scale to the needed equivalent core count.
  • Flexibility and lock-in are serious concerns – Serverless environments impose limits on memory and code payload size. If you outgrow these limits, you may need to re-architect your application. For example, if you decide your application would benefit from a GPU-enabled library that runs 20x faster, you’ll be out of luck: you’ll need to move the application out of the serverless environment and into regular GPU cloud instances. And while serverless environments are similar across clouds, each provider presents its own API, imposes different limits, and supports different runtime languages, essentially locking you into a single provider.
  • Developer productivity is a big issue – Developing useful applications on serverless platforms can require much more effort, because developers will likely need to build from scratch functionality that mature HPC workload managers already handle. Examples include user management, session management, exception handling, results aggregation, security isolation between clients, and more. This adds time, risk, and cost to new application development.
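
The back-of-envelope comparison behind the first bullet, using the 2019 list prices quoted above and assuming both environments can be kept fully busy:

    # Assumptions: ~$0.06 per GB-hour on AWS Lambda (treated here as
    # one busy core with 1 GB of memory) and $0.102 per hour for an
    # on-demand a1.xlarge (4 vCPUs, 8 GB of memory).
    LAMBDA_PER_CORE_HOUR = 0.06
    A1_XLARGE_PER_HOUR = 0.102
    A1_XLARGE_CORES = 4

    lambda_cost_for_4_cores = A1_XLARGE_CORES * LAMBDA_PER_CORE_HOUR  # $0.24
    ratio = lambda_cost_for_4_cores / A1_XLARGE_PER_HOUR              # ~2.35
    print(f"On-demand instance is ~{ratio:.2f}x more cost-efficient")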


Cloud bursting can be a more attractive model

For HPC applications, a pay-as-you-go cost model can also be realized using cloud bursting functionality under the control of a workload manager. Cloud bursting allows an on-premises or cloud-based cluster to dynamically add or remove machine instances based on application demand, in a fashion that is transparent to users. Like the serverless model, cloud bursting is automated, and administrators don’t need to manage machine instances. This approach has several potential advantages:

  • It avoids vendor lock-in – it works on-premises and across multiple clouds.
  • It has minimal impact on existing software and operational procedures.
  • Users can run a wider variety of application types.
  • Customers can choose to run on any instance type.

Users also avoid the limitations of serverless environments – maximum code payload sizes, maximum memory sizes, and runtime enforcement – along with challenges related to error handling, job and task control, session management, and multitenancy.

Depending on the application, cloud bursting can be both more economical and faster than serverless platforms. As illustrated in the example below, running a simulation costs roughly the same on a serverless platform as on machine instances, but the traditional cluster with cloud bursting finishes 40% faster. Further savings can be realized using AWS Spot pricing or transient VM pricing in the IBM Cloud.

To run 500,000 scenario calculations, the FaaS service needs to “cold-start” additional instances beyond the default 1,000 warm-started instances. Assuming each task (function) runs for ten seconds, this provides warm-started capacity for only 6,000 function calls per minute (1,000 instances * 6 function calls per minute per instance).

In the cloud bursting scenario there is no warm-started capacity, but we can stand up a 100-node cluster of AWS m5.12xlarge instances (4,800 vCPUs) with custom AMIs in approximately six minutes. This means that six minutes into the workload, the cloud bursting environment can support 28,800 function calls per minute (4,800 * 6 tasks per minute per vCPU), whereas the serverless platform has still only started 1,600 function containers, for a capacity of 9,600 function calls per minute (1,600 * 6 tasks per minute per vCPU).
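
This ramp-up can be captured in a toy throughput model. The sketch below encodes the assumptions from this example – 1,000 warm serverless instances plus cold starts at roughly 100 containers per minute, versus a cluster that delivers 4,800 vCPUs all at once after a six-minute spin-up, with each core or container completing six ten-second tasks per minute. It is illustrative only; real scaling behavior varies by provider and workload and won’t exactly reproduce these figures.

    TASKS = 500_000
    TASKS_PER_MIN = 6  # one 10-second task at a time per core/container

    def minutes_to_finish(workers_at):
        # workers_at(minute) -> workers available during that minute
        done, minute = 0, 0
        while done < TASKS:
            done += workers_at(minute) * TASKS_PER_MIN
            minute += 1
        return minute

    # Serverless: 1,000 warm instances, plus ~100 cold-started per minute.
    serverless = minutes_to_finish(lambda m: 1_000 + 100 * m)
    # Cloud bursting: no capacity for six minutes, then 4,800 vCPUs at once.
    bursting = minutes_to_finish(lambda m: 4_800 if m >= 6 else 0)
    print(f"serverless: ~{serverless} min, cloud bursting: ~{bursting} min")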

Another critical thing to consider is flexibility. If we needed to run a different model that required double the memory per function call, the serverless cost in our example would double, since capacity is charged per GB-second. The cloud bursting price would stay the same, however, since the m5.12xlarge instance already has ample memory (4 GB available per vCPU).



Choosing the right tool for the right job

Serverless computing has its place, but it’s not a mature solution for HPC. It’s a good idea to model your workloads carefully and consider factors such as the actual rate of service scaling, expected utilization, code payload size, and software development effort. Software designed to manage HPC workloads may be a better choice.

IBM Spectrum LSF supports parametric sweeps and stochastic analysis using LSF job arrays. IBM Spectrum Symphony exposes native or container-based services via a client-side API, similar to a serverless platform, and provides many developer-friendly features. Both of these workload managers support automated, policy-based cloud bursting to your choice of public clouds, including IBM Cloud, Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
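
For example, a parametric sweep of the kind described above can be expressed as a single LSF job array. The sketch below submits one from Python via bsub; run_scenario.sh is a hypothetical worker script that would read its scenario index from the LSB_JOBINDEX environment variable LSF sets for each array element.

    import subprocess

    # Submit a 1,000-element job array named "sweep"; LSF schedules the
    # elements as independent tasks and sets LSB_JOBINDEX (1..1000) in
    # each task's environment so the script can select its scenario.
    subprocess.run(
        ["bsub", "-J", "sweep[1-1000]", "./run_scenario.sh"],
        check=True,
    )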

While serverless is worth considering for some workloads, HPC applications can pose unforeseen risks. Before taking the plunge, it’s a good idea to compare serverless platforms against cloud bursting alternatives carefully and to model how your applications are likely to perform in both environments.
