March 30, 2020 — Before computational users launch large scientific jobs on the Oak Ridge Leadership Computing Facility’s (OLCF’s) IBM AC922 Summit supercomputer, they need to understand how they will use the full power of the machine’s hybrid nodes, including memory layout, CPU, and GPU usage.
Now, they have a way to see this information visually—and in real time.
An effort led by Jack Morrison, a high-performance computing (HPC) engineer in the OLCF’s User Assistance and Outreach Group, and OLCF post-bachelor’s research associate Travis Young has resulted in a tool called Job Step Viewer for this very purpose. The tool gives users an easy way to understand how their applications are mapped onto the CPU cores and GPUs of Summit’s compute nodes. Summit, the world’s most powerful and smartest supercomputer, is housed at the OLCF, a US Department of Energy (DOE) Office of Science User Facility at DOE’s Oak Ridge National Laboratory.
“Job Step Viewer gives users the ability to see what’s happening when they launch an application,” Morrison said.
The job launcher for Summit, called jsrun (job step run), was built by IBM for the Summit and Sierra systems at Oak Ridge and Lawrence Livermore National Laboratories, respectively. Jsrun is new to the user community of these hybrid supercomputers, and users running on Summit for the first time have a learning curve to overcome.
The functionality of jsrun is similar to that of more common job launchers, such as mpirun, aprun, or srun, but the complexity of the fat, hybrid nodes means users must provide more than a simple node count or number of CPU threads. Through the concept of “resource sets” (user-defined groupings of CPU cores, GPUs, and memory), Summit’s large compute nodes can be subdivided to best fit the needs of a particular application. Job Step Viewer is designed to help users determine the appropriate resource set layout and process binding options for their application.
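To make the resource set idea concrete, the following Python sketch works out how many identical resource sets fit on one node and how many nodes a job would need. It is only an illustration: the `ResourceSet` class, the helper names, and the per-node figures (42 usable CPU cores and 6 GPUs) are assumptions made for this example and are not taken from jsrun or Job Step Viewer.

```python
from dataclasses import dataclass

# Assumed per-node capacity, for illustration only (not stated in this article).
CORES_PER_NODE = 42
GPUS_PER_NODE = 6


@dataclass(frozen=True)
class ResourceSet:
    """Hypothetical model of a user-defined grouping of cores and GPUs."""
    cores: int
    gpus: int

    def __post_init__(self):
        if self.cores < 1 or self.gpus < 0:
            raise ValueError("a resource set needs >= 1 core and >= 0 GPUs")


def sets_per_node(rs, cores=CORES_PER_NODE, gpus=GPUS_PER_NODE):
    """Number of identical resource sets that fit on a single node."""
    by_cores = cores // rs.cores
    if rs.gpus == 0:
        return by_cores  # only the cores limit the count
    return min(by_cores, gpus // rs.gpus)


def nodes_needed(rs, total_sets):
    """Nodes required to host total_sets resource sets (ceiling division)."""
    per_node = sets_per_node(rs)
    if per_node == 0:
        raise ValueError("resource set does not fit on one node")
    return -(-total_sets // per_node)


if __name__ == "__main__":
    one_gpu = ResourceSet(cores=7, gpus=1)
    print(sets_per_node(one_gpu), nodes_needed(one_gpu, 12))
```

Under these assumed figures, a resource set of 7 cores and 1 GPU divides a node evenly, while a larger set such as 21 cores and 3 GPUs yields only two sets per node. This is the kind of trade-off a user weighs when choosing a layout.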
Other tools that let users test their code launches with jsrun report their results as text, but Job Step Viewer provides a new level of understanding by letting users visualize where their jobs are being executed.
Young and Morrison worked with HPC engineer Matt Belhorn to develop Job Step Viewer, modeling it after a previous iteration called jsrunVisualizer. Morrison built jsrunVisualizer in the days of SummitDev, the early access testbed system for the Summit machine, but that visualizer was not informed by the running system.
“The jsrunVisualizer tool could only simulate Summit’s behavior—and only for a single compute node. It was limited to a very basic set of jsrun’s possible configurations,” Morrison said. “This new tool is actually informed by jobs that are running on the system, so it’s able to accurately capture complex jsrun invocations and settings in the user’s runtime environment.”
Additional features of Job Step Viewer include support of explicit resource files (ERF) for layouts not possible with command-line jsrun options, as well as warnings for common mistakes such as underutilizing allocated compute nodes or oversubscribing individual CPU cores.
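The two kinds of mistakes named above can be illustrated with a short Python sketch. It is not how Job Step Viewer itself works: the `check_layout` function, its input format (a list of `(node, core)` pairs, one per task placement), and the warning wording are all hypothetical. For simplicity, "underutilized" is assumed here to mean an allocated node that received no tasks at all.

```python
from collections import Counter


def check_layout(placements, allocated_nodes):
    """Return warning strings for a hypothetical task layout.

    placements: list of (node_name, core_id) pairs, one per task placement.
    allocated_nodes: names of all nodes allocated to the job.
    """
    warnings = []

    # Underutilization (as assumed here): allocated nodes with no tasks.
    used = {node for node, _ in placements}
    idle = sorted(set(allocated_nodes) - used)
    if idle:
        warnings.append(f"underutilized: no tasks on nodes {idle}")

    # Oversubscription: more than one task placed on the same core.
    counts = Counter(placements)
    for (node, core), n in sorted(counts.items()):
        if n > 1:
            warnings.append(f"oversubscribed: {n} tasks on {node} core {core}")

    return warnings


if __name__ == "__main__":
    layout = [("node-a", 0), ("node-a", 0), ("node-a", 1)]
    for message in check_layout(layout, ["node-a", "node-b"]):
        print(message)
```

In this made-up layout, two tasks share core 0 of `node-a` and `node-b` is left empty, so both warnings are reported.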
Job Step Viewer has been available to users since February 18, and the OLCF Conference Call on March 25 provided additional support and training on the tool.
Job Step Viewer Conference Call slides: https://www.olcf.ornl.gov/calendar/userconcall-mar2020/
About Oak Ridge National Laboratory
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.
Source: Rachel Harken, Oak Ridge Leadership Computing Facility