I Love it When a Plan Comes Together

To some people, scheduling is simple. Meeting a friend for lunch comes down to two decisions: where and when. But scheduling a meeting for 20 people across a dozen time zones quickly becomes more challenging, and the first time Outlook suggests is six months from now!

In HPC, some environments do have relatively simple requirements – run this job across all nodes for the next six weeks. Simple. However, in most environments, there could be hundreds, thousands or even millions of jobs a day to deal with, from a diverse range of users, with everyone wanting their results as soon as possible. Satisfying all these demands is a challenging scheduling problem.

At HPCXXL, the user group for large HPC installations, I gave an update on new IBM Spectrum LSF functionality. One of the topics I discussed was plan-based scheduling, which was introduced in LSF v10.1.0.5. This enhancement helps LSF make better scheduling decisions to avoid job starvation. It also supports the burst buffer capability for data staging on the Summit and Sierra systems at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory, respectively.

By default, LSF tries to keep cluster resources occupied as much as possible. If there are idle resources and some job can use them, then LSF will dispatch the job. While this strategy is great for cluster utilization, it could lead to starvation for jobs with special requirements. For example, large parallel jobs could starve in the presence of smaller jobs, since it is unlikely that sufficient resources for the large job will free up all at once.

To address starvation, LSF has multiple policies that allow priority jobs to accumulate idle resources over time. Once sufficient resources are accumulated, LSF will dispatch the job. Reserved resources may be backfilled by jobs that LSF expects will complete before the reserving job is expected to start.
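The backfill rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LSF's implementation: the `Reservation` class, the `can_backfill` function, and the use of minutes as the time unit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    """Resources held for a pending priority job (hypothetical model)."""
    slots: int             # slots accumulated so far for the reserving job
    expected_start: float  # estimated start of the reserving job, in minutes


def can_backfill(candidate_slots: int,
                 candidate_runtime: float,
                 reservation: Reservation,
                 now: float = 0.0) -> bool:
    """Return True if a candidate job may borrow the reserved slots.

    The candidate must fit in the reserved slots and must be expected
    to finish before the reserving job is expected to start.
    """
    fits = candidate_slots <= reservation.slots
    finishes_in_time = now + candidate_runtime <= reservation.expected_start
    return fits and finishes_in_time
```

With 8 slots reserved and the reserving job expected to start in 60 minutes, a 4-slot job with a 30-minute run time estimate can backfill, while a 4-slot job estimated at 90 minutes cannot. The check is only as good as `expected_start`, which is why poor start time estimates lead to poor backfill decisions.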

This approach works well in many cases, but jobs can still starve in more complex scheduler configurations or when they have special requirements (e.g. network topology requirements, or special resource requirements). In particular:

It is difficult to estimate start times for reserving jobs. Start time estimations are needed to make good backfill decisions.
Jobs might not reserve the resources which would allow them to start earliest.

Rather than making decisions based on current resource availability, plan-based scheduling looks into the future, trying to place each job where it will start earliest. The planned allocation for a job will be the complete allocation for the job, subject to all job resource requirements, resource availability, and configured scheduling policies.
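To make "place each job where it will start earliest" concrete, here is a deliberately simplified sketch. It assumes whole-node allocations, a single resource type, and a known time at which each node becomes free; the function `plan_job` and the `free_at` mapping are hypothetical names for this example, and a real planner also has to honor the resource requirements and scheduling policies mentioned above.

```python
from typing import Dict, List, Optional, Tuple


def plan_job(free_at: Dict[str, float],
             nodes_needed: int,
             runtime_estimate: float) -> Optional[Tuple[float, List[str]]]:
    """Plan a job on the nodes that let it start earliest.

    free_at maps node name -> time at which the node becomes free.
    Returns (planned_start, nodes) and updates free_at so that later
    jobs are planned around this allocation. Returns None if the
    request cannot be satisfied by this cluster at all.
    """
    if not 1 <= nodes_needed <= len(free_at):
        return None
    # The job can start once the last of its nodes is free, so picking
    # the nodes that free up earliest gives the earliest possible start.
    chosen = sorted(free_at, key=lambda n: (free_at[n], n))[:nodes_needed]
    start = max(free_at[n] for n in chosen)
    for n in chosen:
        free_at[n] = start + runtime_estimate
    return start, chosen


# Plan jobs in priority order; each plan reserves a complete allocation.
free_at = {"n1": 0.0, "n2": 10.0, "n3": 40.0, "n4": 5.0}
print(plan_job(free_at, nodes_needed=3, runtime_estimate=60.0))
print(plan_job(free_at, nodes_needed=1, runtime_estimate=20.0))
```

In this example the three-node job is planned to start at time 10 on n1, n4 and n2, and the one-node job is then planned on n3 at time 40. Note how the plan depends entirely on the run time estimates used to fill in `free_at`.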

To avoid impacting scheduling performance, planning is done in a separate process from the main scheduler process. Jobs can be dispatched without plans, but once a job has a plan, LSF will reserve sufficient resources to execute the plan. Users can view plans for their jobs, including the planned start times, through the CLI.

A prerequisite for plan-based scheduling is that the scheduler needs to have good run time estimates on all (or most) jobs. The better the estimates, the better the plan.

Plan-based scheduling ensures that complex jobs can get the resources they need, avoiding starvation, without adversely impacting cluster performance or utilization.

The same planning approach can be used for jobs that require large amounts of memory, or jobs that have data input requirements.

If you are using LSF Data Manager for pre/post staging of job data files between clusters or to/from the cloud, one of the requirements is that the file staging area is accessible from all compute nodes. Data can therefore be pre-staged before the job starts, and wherever the job ends up running, it has access to the data.

However, in the case of the burst buffer on the two CORAL (Collaboration of Oak Ridge, Argonne, and Lawrence Livermore) systems, Summit & Sierra, data needs to be pre-staged from the global file system to the local NVMe storage on each node where the job will run. After a job completes, the job’s data must be staged out to the global filesystem and the files cleaned from the local storage. Staging to and from a node for a job should happen even while the node is occupied by another job.

The requirement to stage to local storage means that LSF needs to know in advance which nodes a job will land on, and must stick to that decision. For this, LSF relies on plan-based allocations. If a job is planned to dispatch in less time than the time needed for staging, LSF will trigger the stage-in operation for the job. Without a planned allocation for the job, this would be impossible. As Colonel Hannibal Smith said, “I love it when a plan comes together.”
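For readers who prefer code, the stage-in trigger described above amounts to a simple comparison. This is a sketch only: `should_stage_in` and its parameters are hypothetical names, all times are assumed to be in the same unit, and how the staging time is estimated is outside its scope.

```python
def should_stage_in(now: float,
                    planned_start: float,
                    staging_time: float,
                    already_triggered: bool = False) -> bool:
    """Decide whether to start staging data to a job's planned nodes.

    Stage-in is triggered once the time left until the planned
    dispatch is less than the time needed for staging.
    """
    if already_triggered:
        return False
    time_until_dispatch = planned_start - now
    return time_until_dispatch < staging_time
```

For a job planned to start at minute 100 with 15 minutes of staging, the check is false at minute 50 and true at minute 90. The nodes to stage to come from the job's planned allocation, which is why the plan has to be kept once it is made.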

Learn more about the IBM Spectrum LSF family of products here.
