The HPC community has historically developed its own specialized software stack, including schedulers, filesystems, developer tools and container technologies, tuned for performance and large-scale on-premises deployments. The cloud native development and operating models that have emerged, largely built on Kubernetes (K8S) and Docker, are attracting growing interest from the HPC community, because they can open up HPC data centers to a wider ecosystem of tools and middleware. This is becoming increasingly critical with the growing need to build new applications and workflows that extend beyond modelling and simulation. For example, to incorporate IoT sensor information into simulations, provide rich interactive analytics to discover patterns in data, or train and deploy machine learning and deep learning models and link them to consuming web, mobile or edge computing applications, we have to look beyond traditional approaches.
One new approach is to use Kubernetes, a highly extensible framework that encourages customization and could be leveraged to accomplish some of these tasks. While the cloud native Kubernetes software stack has promise, Kubernetes on its own doesn’t address all of the HPC community’s requirements, and it’s not feasible for clients to rip and replace their existing HPC software stack and start over again. This leads us to explore a model of co-existence in which the strengths of both the HPC stack and Kubernetes can be exploited. Is there a shorter path to integrating HPC environments with the cloud native world to deliver value to clients?
Schedulers are an important element of HPC environments. Schedulers used in high performance computing environments support highly complex batch and interactive applications in simulation, modelling, analytics and AI. HPC schedulers such as SLURM, TORQUE/Maui and IBM Spectrum LSF provide a rich set of policy controls around job placement, prioritization, fairshare access, job dependencies and Singularity integration, and they are backed by an ecosystem of applications that have integrated with their batch APIs. In the Kubernetes world, the default scheduler provides base functionality oriented toward long-running services, while batch scheduling capabilities are still nascent. In the spirit of microservices architecture, Kubernetes allows the default scheduler to be replaced with an alternative implementation, or several schedulers to run side by side in the same cluster, with each pod naming the scheduler that should place it. Here we show how IBM Spectrum LSF has been integrated into Kubernetes. This allows Kubernetes workloads to be introduced into HPC environments, giving HPC users a non-disruptive path to cloud-native technologies.
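To illustrate how several schedulers can coexist in one cluster, the following Python sketch groups pending pods by the scheduler expected to place them. It mirrors the Kubernetes convention that a pod names its scheduler in “spec.schedulerName”, which defaults to “default-scheduler” when left unset, and that a scheduler only acts on pods not yet bound to a node. Pods are modelled as plain dictionaries rather than objects from a Kubernetes client library, and the scheduler name “lsf” is a hypothetical placeholder, not necessarily the name the Spectrum LSF integration uses.

```python
# Hypothetical sketch: routing pending pods to schedulers by name.
from collections import defaultdict

# Kubernetes fills in this value when a pod is created without spec.schedulerName.
DEFAULT_SCHEDULER = "default-scheduler"


def scheduler_name(pod: dict) -> str:
    """Return the scheduler responsible for a pod, mirroring API-server defaulting."""
    return pod.get("spec", {}).get("schedulerName") or DEFAULT_SCHEDULER


def pending_pods_by_scheduler(pods: list) -> dict:
    """Group unbound pods (no spec.nodeName yet) by the scheduler that should place them."""
    queues = defaultdict(list)
    for pod in pods:
        if pod.get("spec", {}).get("nodeName"):
            continue  # already bound to a node, so no scheduler needs to place it
        queues[scheduler_name(pod)].append(pod["metadata"]["name"])
    return dict(queues)


pods = [
    {"metadata": {"name": "web-frontend"}, "spec": {}},
    # "lsf" is a hypothetical scheduler name used only for illustration.
    {"metadata": {"name": "cfd-run"}, "spec": {"schedulerName": "lsf"}},
    {"metadata": {"name": "db"}, "spec": {"nodeName": "node-01"}},
]
print(pending_pods_by_scheduler(pods))
# {'default-scheduler': ['web-frontend'], 'lsf': ['cfd-run']}
```

Because routing is keyed only on the scheduler name, adding a second scheduler does not require any change to the default one: each scheduler simply ignores pods addressed to the other.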
To achieve this, IBM has enabled Spectrum LSF to act as the scheduler for Kubernetes pods. This provides a transparent way to apply the rich resource management capabilities at the heart of Spectrum LSF to workloads managed through the Kubernetes API. Here is how it works:
- For an existing HPC cluster managed by Spectrum LSF on bare-metal servers, there is no disruption. The cluster administrator can deploy a Kubernetes distribution such as IBM Cloud Private on the subset of nodes that will run cloud-native applications. The number of such nodes is bounded only by the scaling limits of Kubernetes itself. An additional K8S scheduler driver daemon must be installed in the LSF cluster; it acts as a bridge between Spectrum LSF and the Kubernetes API server.
- Users submit cloud native workloads to the K8S API via kubectl or Helm charts. For Spectrum LSF to schedule a pod, the pod’s “schedulerName” field must be set to the Spectrum LSF scheduler; otherwise, the pod will be scheduled by the default scheduler. Scheduler directives can be specified using annotations on the pod, such as “lsf.ibm.com/queue” or “lsf.ibm.com/fairshareGroup”, which map to Spectrum LSF policy objects.
- To track the status of pods and nodes, the Spectrum LSF scheduler relies on the K8S scheduler driver, which listens to the Kubernetes API server and translates pod requests into jobs in the Spectrum LSF scheduler.
- Once the Spectrum LSF scheduler makes a policy decision on where to schedule the pod, the driver binds the pod to the selected node.
- The kubelet then runs the pod and manages its lifecycle on the target node in the normal fashion.
- The Spectrum LSF scheduler supports traditional HPC jobs as well as containerized HPC jobs that use container technologies specific to HPC environments, such as Shifter and Singularity.
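The pod-to-job flow described in the list can be sketched in Python. The sketch builds a pod manifest carrying the “schedulerName” field and the two annotations named above, turns it into a simplified job request in the way the K8S scheduler driver is described as doing, and produces the standard Kubernetes Binding object that assigns a pod to a node. The scheduler name “lsf”, the shape of the job request, and the image, queue, group and node names are all hypothetical; LSF’s actual placement logic and the driver’s real data structures are not shown, and nothing is sent to an API server.

```python
# Hypothetical sketch of the driver flow: pod manifest -> LSF job request -> Binding.
import json
from typing import Optional

# Assumption: the scheduler name the Spectrum LSF driver answers to.
LSF_SCHEDULER = "lsf"
ANNOTATION_PREFIX = "lsf.ibm.com/"


def make_pod(name: str, image: str, queue: str, fairshare_group: str) -> dict:
    """Build a pod manifest that asks Spectrum LSF (not the default scheduler) to place it."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": "default",
            "annotations": {
                "lsf.ibm.com/queue": queue,
                "lsf.ibm.com/fairshareGroup": fairshare_group,
            },
        },
        "spec": {
            "schedulerName": LSF_SCHEDULER,
            "containers": [{"name": name, "image": image}],
        },
    }


def pod_to_job_request(pod: dict) -> Optional[dict]:
    """Translate an LSF-bound pod into a simplified, hypothetical LSF job request."""
    if pod["spec"].get("schedulerName") != LSF_SCHEDULER:
        return None  # not addressed to LSF; another scheduler handles it
    meta = pod["metadata"]
    # Strip the "lsf.ibm.com/" prefix so each annotation becomes a policy setting.
    policy = {
        key[len(ANNOTATION_PREFIX):]: value
        for key, value in meta.get("annotations", {}).items()
        if key.startswith(ANNOTATION_PREFIX)
    }
    return {"pod": f"{meta['namespace']}/{meta['name']}", "policy": policy}


def make_binding(pod: dict, node_name: str) -> dict:
    """Build the core/v1 Binding object that assigns a pod to a node."""
    meta = pod["metadata"]
    return {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": meta["name"], "namespace": meta["namespace"]},
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }


pod = make_pod("cfd-run", "example.org/cfd:1.0", queue="normal", fairshare_group="groupA")
job = pod_to_job_request(pod)
# LSF's placement decision is not modelled here; "node-07" stands in for it.
binding = make_binding(pod, "node-07")
print(json.dumps(job))
print(json.dumps(binding["target"]))
```

In a real deployment, a driver would submit the Binding through the pod’s “binding” subresource on the API server rather than printing it, after which the kubelet on the chosen node starts the pod.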
This shows that cloud native capabilities from Kubernetes can be brought into, and leveraged within, an existing HPC stack. To learn more or try out a Technical Preview of these capabilities for IBM Spectrum LSF, visit: https://github.com/IBMSpectrumComputing/lsf-kubernetes