Choosing the Right Scheduler for HPC and AI Workloads

By Andy Morris, IBM Cognitive Infrastructure

December 16, 2019


Traditionally, HPC workloads have been all about simulation. Scientists and engineers would model complex systems in software on large-scale parallel clusters to predict real-world outcomes. Financial risk management, computational chemistry, seismic modeling, and simulating car crashes in software are all good examples.

Over the past decade, however, what we consider to be an HPC workload has broadened considerably. Today, workloads are just as likely to involve collecting or filtering streaming data, using distributed analytics to discover patterns in data, or training machine learning models. As HPC applications have become more diverse, techniques for scheduling and managing workloads have evolved as well. In this article, we’ll look at two advanced workload schedulers – IBM Spectrum LSF and Kubernetes – and discuss their suitability for modern HPC, analytics, and AI workloads.

Spectrum LSF and Kubernetes – understanding the differences

One of the challenges with comparing Spectrum LSF and Kubernetes is that each solution was designed to solve different problems. Because of this heritage, the two solutions excel in different areas. Evaluators often have preferences that reflect the types of workloads with which they are most familiar.

Spectrum LSF was designed to support diverse distributed workloads. As its name (Load Sharing Facility) implies, LSF shares resources based on flexible policies. While often categorized as a batch scheduler, this understates the breadth of its capabilities. LSF supports serial and parallel batch jobs along with a variety of other application models. These include interactive workloads, parametric/array jobs, multi-step workflows, virtualized and containerized workloads, and even “long-running” distributed services such as TensorFlow, Spark, or Jupyter notebooks. (1)
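The range of application models above maps onto a handful of `bsub` submission patterns. The sketch below is illustrative, not from this article – the scripts, queue name, and slot counts are hypothetical:

```shell
# Serial batch job submitted to a (hypothetical) queue named "normal"
bsub -q normal ./run_simulation.sh

# Interactive job: the terminal stays attached to the running job
bsub -I ./explore_data.sh

# Parallel MPI batch job: 64 slots, packed 16 per host
bsub -n 64 -R "span[ptile=16]" mpirun ./solver
```

The same scheduler and policies govern all three, which is what lets LSF share one cluster across very different workload types.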

Kubernetes was developed to solve an entirely different problem – the delivery of scalable, always-on, reliable web-services in Google’s cloud. In the Kubernetes world, the focus is on scalable, long-running application services such as web-stores, databases, API services, and mobile application back-ends. Kubernetes applications are assumed to be containerized and adhere to a cloud-native design approach. Applications are composed of Pods – essentially groups of one or more Docker or OCI-compliant containers(2) that can be deployed on a cluster to provide specific functionality for an application.

Internet-scale applications frequently need to be continuously available, so Kubernetes provides features supporting continuous integration / continuous delivery (CI/CD) pipelines and modern DevOps techniques. Developers can build and roll out new functionality and automatically roll back to previous deployment versions in case of a failure. Readiness and liveness probes provide health checks that help ensure continuous service availability.
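A minimal sketch of these features – rolling updates, health probes, and rollback. The Deployment name, image, and port below are hypothetical:

```shell
# Hedged sketch: a Deployment with a rolling-update strategy and health probes
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-store
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # keep most replicas serving during a rollout
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      containers:
      - name: web
        image: example.com/web-store:1.2.0   # hypothetical image
        ports:
        - containerPort: 8080
        readinessProbe:                      # gate traffic until the Pod is ready
          httpGet: { path: /healthz, port: 8080 }
        livenessProbe:                       # restart the container if this fails
          httpGet: { path: /healthz, port: 8080 }
EOF

# Roll back to the previous revision after a failed rollout
kubectl rollout undo deployment/web-store
```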

Another differentiating feature is that Kubernetes is more than just a resource manager – it’s a complete management and runtime environment. Kubernetes includes services that applications rely on. These include DNS management, ingress controllers, virtual networking, persistent volumes, secret management, and more. Applications built for Kubernetes will only run in a Kubernetes environment.

Different approaches to scheduling

A key difference between HPC-oriented schedulers and Kubernetes is that in the HPC world, jobs and workflows typically have a beginning and an end. Runtimes may vary from seconds to weeks, but HPC jobs generally run to completion. Kubernetes is more commonly used to manage application services that run continuously. While Kubernetes supports the notion of CronJobs and has a specific Controller type for “jobs that run to completion” (aka Batch Jobs), Kubernetes has limited support for these types of workloads(3). Similarly, Spectrum LSF can launch and persist long-running services, but LSF lacks the features needed to manage long-running multi-tier applications.
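For reference, this is roughly what a run-to-completion workload looks like in Kubernetes – a batch/v1 Job wrapping a single Pod (the pi-calculation container is the familiar example from the Kubernetes documentation):

```shell
# Hedged sketch: a Kubernetes Job that runs once and completes
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-calc
spec:
  completions: 1
  backoffLimit: 3              # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: Never     # Pods in a Job must not restart in place
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
EOF
```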

These differences in workload types have resulted in different scheduling functionality. Without getting into too much detail, Kubernetes manages applications as Deployments, reflecting the desired state of the components that comprise an application. The basic unit of scheduling in Kubernetes is a Pod. Kubernetes watches for newly created Pods and assigns them to hosts by first finding candidate hosts that satisfy resource constraints (typically CPU and memory). It then scores candidate nodes considering a variety of placement policies and assigns Pods to the highest scoring host.
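The filter-then-score behavior shows up directly in how a Pod expresses constraints: resource requests and required affinity rules filter out ineligible nodes, while preferred affinity rules only adjust a node’s score. The image and node label below are hypothetical:

```shell
# Hedged sketch: hard (filtering) vs. soft (scoring) placement constraints
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: train-worker
spec:
  containers:
  - name: worker
    image: example.com/trainer:latest    # hypothetical image
    resources:
      requests:                # hard constraint: nodes without 4 CPUs / 16Gi are filtered out
        cpu: "4"
        memory: 16Gi
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50             # soft constraint: matching nodes score higher
        preference:
          matchExpressions:
          - key: disktype      # hypothetical node label
            operator: In
            values: ["ssd"]
EOF
```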

Spectrum LSF is similar, but resource selection and placement controls in LSF are more expressive and granular. Users can construct complex Boolean resource requirement expressions that consider everything from software licenses to temp space to GPU models and modes. Because Spectrum LSF was built for HPC environments, it’s also more sensitive to utilization, timeliness, and considerations such as topology and affinity that drive performance. Spectrum LSF is designed to make thousands of scheduling decisions per second and efficiently share resources among workloads having vastly different run-times. This focus on throughput and utilization is why most large-scale HPC and AI supercomputers use HPC-oriented workload managers such as Spectrum LSF(4).
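To give a feel for this expressiveness, here is a hedged sketch of an LSF resource requirement string; the memory, slot, and GPU values are illustrative rather than recommendations:

```shell
# select[] - Boolean filter over host attributes (mem is in MB)
# rusage[] - resources reserved for the lifetime of the job
# span[]   - topology: how slots are spread across hosts
bsub -n 32 \
     -R "select[(type==X86_64) && mem>16000] rusage[mem=16000] span[ptile=16]" \
     -gpu "num=2:mode=exclusive_process" \
     ./train_model.sh
```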

Read also: Introducing the world’s smartest, most powerful supercomputers

In addition to simply placing jobs considering multiple resource constraints, LSF also considers timeliness. Jobs can be submitted with goal-oriented service level agreements (SLAs) related to velocity, throughput or deadlines(5). LSF will juggle workloads to satisfy these constraints while maximizing utilization. LSF also supports scheduling features foreign to Kubernetes but often required by HPC applications such as job arrays, checkpoint/restart, advanced reservation, backfill scheduling, and more.
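Two of these features in sketch form – the service class name, scripts, and array bounds are hypothetical:

```shell
# Submit against a goal-oriented service class configured by the administrator
bsub -sla Overnight -q priority ./risk_batch.sh

# A 500-element job array; each element reads its index from $LSB_JOBINDEX
bsub -J "montecarlo[1-500]" ./simulate_path.sh
```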

What to use when

So, when should you use which solution? Unfortunately, there are no hard-and-fast rules. While the schedulers are very different, both provide rich functionality, and both run on-premises, in the cloud, or in hybrid cloud environments. The following guidelines may help you decide which scheduler best suits your requirements:

Consider Spectrum LSF when:

  • You are dealing mostly with jobs that have a finite run time or run to completion
  • You need to support a variety of workloads – batch, interactive, multi-step workflows
  • You have a mix of traditional, virtual, or containerized workloads or are using multiple container managers (Singularity, Docker, Shifter, etc.)
  • Time to result, throughput and resource utilization are critical requirements
  • You have complex resource sharing requirements and need to get the most out of expensive resources such as GPUs

Consider Kubernetes when:

  • Applications are mostly composed of long-running services
  • Workloads already run in containers and are Kubernetes friendly
  • You need services to be continually available and wish to leverage CI/CD pipelines
  • You have complex, multi-tier application deployments
  • Applications require sophisticated virtual networking, load balancers or auto-scaling

Increasingly organizations run multiple types of workloads. For example, a site may gather telemetry from remote sensors, store readings in a Cassandra database, and use a variety of analytic or HPC-oriented tools to process and analyze data. MPI parallel applications can be made to run on Kubernetes, but this usually requires extra effort.

Can I use both?

Large sites with diverse workloads may see benefits in running both schedulers to support different workloads. Kubernetes and Spectrum LSF can even coexist on the same cluster, but the practicality of running multiple schedulers will depend on your mix of workloads. Kubernetes and Spectrum LSF are both highly configurable enterprise schedulers, but both involve a significant learning curve.

Read also: Bridging HPC and Cloud Native Development with Kubernetes

 

A table highlighting some of the key differences between Spectrum LSF and Kubernetes is provided below.

Do you have thoughts on this topic? Connect with LSF users, learn new skills, and join the discussion in the IBM Spectrum LSF User Community.

Figure 1 – Comparing the capabilities of Spectrum LSF and Kubernetes


  1. While Spectrum LSF supports long-running services, IBM Spectrum Conductor can be deployed along with Spectrum LSF. IBM Spectrum Conductor provides advanced features for distributed application services such as Apache Spark, Cassandra, MongoDB and ML/DL frameworks – https://www.ibm.com/marketplace/spark-workload-management
  2. OCI refers to the Open Container Initiative – an industry effort governed by The Linux Foundation aimed at providing standards around container formats and runtimes – https://www.opencontainers.org/
  3. Jobs and CronJobs are described in the Kubernetes documentation – https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  4. See https://www.top500.org/lists/2019/11/ – The top-ranked Summit and Sierra supercomputers both run IBM Spectrum LSF
  5. Details about goal-oriented scheduling in Spectrum LSF are provided at https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_admin/goal_oriented_sla_sched.html