How to manage HPC jobs using a serverless API

High performance computing (HPC) helps researchers, engineers, and academic institutions run simulations that are too complex to finish within an acceptable timeframe, or simply too large for one machine, so the workload has to be distributed across many servers.

HPC systems are traditionally accessed through a command line interface (CLI) where users submit and manage their computational jobs. Depending on their experience, users who are not accustomed to the CLI can find it daunting. Fortunately, the cloud offers many other options for users to submit and manage their computational jobs.

One example is an event-driven workflow that automatically submits jobs as new data is stored in an Amazon S3 bucket. In addition to providing automation and minimizing direct user interaction, event-driven workflows provide creative ways to interact with the resources of your cluster. As a result, researchers and engineers can dedicate more time to science and less time to managing their jobs.
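As a rough illustration of the event-driven idea, the sketch below turns an S3 event notification into one `sbatch` command line per new object. It is not taken from the original solution: the function name `sbatch_commands_from_s3_event` and the job script path `/shared/jobs/process_input.sh` are hypothetical, and the only thing assumed about the event is the standard S3 notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`).

```python
import shlex
from urllib.parse import unquote_plus

# Hypothetical job script on the cluster's shared storage (an assumption,
# not a path from the original post).
JOB_SCRIPT = "/shared/jobs/process_input.sh"


def sbatch_commands_from_s3_event(event, job_script=JOB_SCRIPT):
    """Return one sbatch command line per object reported in an S3 event."""
    commands = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications
        # (for example, a space is sent as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_uri = f"s3://{bucket}/{key}"
        # Quote both arguments so unusual characters in the key are not
        # interpreted by the shell that eventually runs the command.
        commands.append(f"sbatch {shlex.quote(job_script)} {shlex.quote(s3_uri)}")
    return commands
```

The function only builds command strings. How they reach the scheduler (for example through the API described in this post) is left to the caller, which keeps the parsing step easy to test on its own.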

In this blog post we will cover how to create a serverless API to interact with an HPC system in the cloud built using AWS ParallelCluster. This API is a building block that will enable you to build event-driven workflows. We will also demonstrate how to interact with the cluster using the standard curl command. This detailed information will help you extend the described solution and design your own customized environment. To this end, we use the following AWS services:

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. APIs act as the “entry point” for applications to access data, business logic, or functionality from your backend services. We use Amazon API Gateway as the central point of access to the AWS ParallelCluster cluster.
AWS Systems Manager provides a unified user interface so you can track and resolve operational issues across your AWS applications and resources from a central place. With Systems Manager, you can automate operational tasks for Amazon EC2 instances. The solution uses AWS Systems Manager to run the scheduler commands on the AWS ParallelCluster head node.
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. Amazon API Gateway uses AWS Lambda to run the Systems Manager command on the scheduler head node and return the results.
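To make the Systems Manager step more concrete, here is a minimal sketch of running a single shell command on the head node with Run Command and waiting for its result. It is an illustration under assumptions, not the code from the original solution: the helper name `run_on_head_node`, its parameters, and the `"PollTimeout"` status label are hypothetical. The `ssm` argument is expected to be a boto3 Systems Manager client (for example `boto3.client("ssm")`), and the calls used on it are `send_command` with the `AWS-RunShellScript` document and `get_command_invocation`.

```python
import time

# Statuses that mean the command has not finished yet.
_RUNNING = ("Pending", "InProgress", "Delayed")


def run_on_head_node(ssm, instance_id, command, bucket=None,
                     timeout_s=30, poll_s=1.0):
    """Run one shell command on the head node and return (status, stdout).

    `ssm` is a boto3 SSM client; `instance_id` is the EC2 instance ID of the
    head node. Both are supplied by the caller.
    """
    request = {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": [command]},
    }
    if bucket:
        # Optionally keep the full command output in an S3 bucket.
        request["OutputS3BucketName"] = bucket
    command_id = ssm.send_command(**request)["Command"]["CommandId"]

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            result = ssm.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id
            )
        except ssm.exceptions.InvocationDoesNotExist:
            # The invocation may not be visible immediately after sending.
            result = {"Status": "Pending"}
        if result["Status"] not in _RUNNING:
            return result["Status"], result.get("StandardOutputContent", "")
        if time.monotonic() >= deadline:
            # Hypothetical label for "we stopped waiting", distinct from
            # the statuses reported by Systems Manager itself.
            return "PollTimeout", ""
        time.sleep(poll_s)
```

Passing the client in as an argument, rather than creating it inside the function, makes it possible to exercise the polling logic with a stub client before pointing it at a real cluster.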

Figure 1 shows the components of the solution. The architecture illustrates how a user interacts with the API to start an HPC job in the cluster created by AWS ParallelCluster.

*Figure 1. A high-level architecture of the solution.*

In this workflow, a user interacts with an Amazon API Gateway endpoint which is backed by an AWS Lambda function. The function interacts with the Slurm scheduler on the head node via AWS Systems Manager Run Command. The output from the command is captured and stored in Amazon S3, and the results are also returned to the user and displayed on-screen. AWS ParallelCluster manages the compute nodes that process the jobs submitted to the scheduler’s queue. Amazon S3 is also used to store the job script(s) submitted to the Slurm scheduler.
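The request path described above can be sketched as a small Lambda handler behind an API Gateway proxy integration. This is an assumption-laden illustration rather than the original implementation: the routes (`/queue`, `/nodes`, `/jobs`), the query parameters `script` and `jobid`, and the names `ROUTES`, `build_slurm_command`, and `make_handler` are all hypothetical. The `run` argument stands for any callable that executes a command on the head node (for example through Systems Manager Run Command) and returns a `(status, output)` pair.

```python
import json
import re
import shlex

# Hypothetical mapping from (HTTP method, resource path) to a Slurm command.
ROUTES = {
    ("GET", "/queue"): "squeue",
    ("GET", "/nodes"): "sinfo",
    ("POST", "/jobs"): "sbatch",
    ("DELETE", "/jobs"): "scancel",
}


def build_slurm_command(method, path, params):
    """Translate an API request into one Slurm command line, or None."""
    base = ROUTES.get((method, path))
    if base is None:
        return None
    if base == "sbatch":
        script = params.get("script", "")
        # Quote the script path so it is passed as a single argument.
        return f"sbatch {shlex.quote(script)}" if script else None
    if base == "scancel":
        job_id = params.get("jobid", "")
        # Accept only plain numeric job IDs; reject anything else.
        return f"scancel {job_id}" if re.fullmatch(r"[0-9]+", job_id) else None
    return base


def make_handler(run):
    """Build a Lambda handler; `run(command)` returns (status, output)."""
    def handler(event, context):
        # queryStringParameters is None when the request has no query string.
        params = event.get("queryStringParameters") or {}
        command = build_slurm_command(
            event.get("httpMethod"), event.get("path"), params
        )
        if command is None:
            return {
                "statusCode": 400,
                "headers": {"Content-Type": "application/json"},
                "body": json.dumps({"error": "unsupported request"}),
            }
        status, output = run(command)
        return {
            "statusCode": 200 if status == "Success" else 502,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(
                {"command": command, "status": status, "output": output}
            ),
        }
    return handler
```

Restricting requests to a fixed set of commands, and validating their arguments before anything reaches the head node, is one way to keep the API from becoming a general-purpose remote shell.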

Note: We have temporarily removed the walk-through of the solution and are working to modernize it to take advantage of Slurm’s own REST API component. Apologies for the inconvenience.

Conclusion

In this post, we show you how to deploy a complete serverless API to interact with an HPC system built using AWS ParallelCluster.

A traditional HPC system requires access through a command line interface (CLI) to interact with the underlying environment and to submit and manage jobs. This kind of interaction can be a barrier for some users.

The architecture in this post brings the simplicity of a cloud-native approach to a complex environment like an HPC scheduler through an API. This API enables you to build an event-driven workflow that automates job submissions based on new data in an Amazon S3 bucket.

The solution shown here connects several services, with Amazon API Gateway acting as the gateway to your HPC environment and to the AWS ParallelCluster cluster behind it. This solution uses a serverless architecture pattern to run commands against the Slurm scheduler on the head node and return the results.

In addition, this approach can help you improve the security of your HPC system by preventing users from directly accessing the Slurm scheduler head node through the CLI. This can be a requirement driven by organizational security policies. The solution can also be used in an event-driven workflow and automatically invoked when new data is ingested into your environment.

The solution can be extended to orchestrate the entire lifecycle of a job: copying the required data into the cluster, submitting the job, and managing the collection and storage of the generated data.

We consider this architecture an entry point to building your event-driven cluster more reliably and securely.

Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.