March 30, 2020 — The large scientific simulations that users at the Oak Ridge Leadership Computing Facility (OLCF) run on the center’s IBM AC922 Summit supercomputer almost always require other attendant tasks to fully realize their scientific impact. Resources for these additional “workflow” tasks can be scarce, and pushing this work to the login or data transfer nodes of the machine is not ideal.

Built on Kubernetes, Slate will comprise three different clusters, allowing users to run specialized jobs within the OLCF’s existing HPC environment. Image courtesy of Jason Kincl, ORNL

The OLCF, a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory, is now implementing a new purpose-built resource, Slate, designed with these jobs in mind.

Developed by Jason Kincl, high performance computing (HPC) Linux systems engineer and task lead in the HPC Operations Group at the OLCF, Slate will provide container orchestration services to users, allowing them to run more specialized jobs such as those associated with workflows, databases, data portals, and continuous integration. Workflow systems allow users to automate collections of jobs, but the workflow system itself does not have a concrete start and end time. Databases and data portals, likewise, allow users to host and access curated data sets at any time. Continuous integration refers to the practice of regularly integrating and testing code changes as users make them, another undertaking that doesn’t end at any particular time.

Built on the Kubernetes software, Slate will comprise three separate clusters in three different security enclaves: Granite, the core services cluster; Marble, the moderate-security production cluster; and Onyx, the NCCS-Open cluster. Slate is being incorporated into the OLCF’s existing HPC environment, with access to Summit’s Alpine file system and batch scheduler available inside the container on Slate.

“Summit’s job scheduler is tuned for large-scale modeling and simulation-type workloads,” Kincl said. “Scientific workloads have a beginning and an end, but these types of workloads don’t typically have an end, so it becomes difficult to schedule them on a resource like Summit.”

“Users have previously had to use secure shell port forwarding or run on the login or data transfer nodes for these kinds of jobs, but now they have a dedicated tool that is purpose-built for these workloads,” Kincl said. Secure shell port forwarding tunnels traffic through an encrypted connection between a local computer and a remote machine; Slate instead provides a direct path to running these specialized workloads at the center.

Slate will act as a separate supporting compute resource for users with allocations on Summit.

“The allocations on Slate will be tied to existing allocations that users have for projects on Summit or Rhea,” Kincl said. “We’re trying to augment and extend functionality for our users.”

Slate is undergoing testing prior to its release.

About Oak Ridge National Laboratory

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.


Source: Rachel Harken, Oak Ridge Leadership Computing Facility