May 6, 2021 — When users of the US Department of Energy’s (DOE’s) supercomputers want to improve their scientific codes, they employ tools called profilers that determine which pieces of their codes are lacking in performance. Last month, the Oak Ridge Leadership Computing Facility (OLCF), together with the National Energy Research Scientific Computing Center (NERSC), hosted a workshop on one such profiler, HPCToolkit, to help users better understand their GPU-centric codes.
The two-part training series, which began on March 29 and concluded on April 2, was led by professor John Mellor-Crummey and other HPCToolkit developers at Rice University. More than 150 people participated in the event.
On the first day, the organizers gave an overview of HPCToolkit and provided a large set of representative sample applications (six for CPUs and six for GPUs) that participants could compile, run, and profile with the tool.
During the following 3 days, users worked on the examples and communicated with organizers and staff members at the OLCF and NERSC to profile the codes on the OLCF’s Summit and NERSC’s Cori systems. Summit, the nation’s most powerful and smartest supercomputer, is located at the OLCF, a DOE Office of Science user facility at DOE’s Oak Ridge National Laboratory (ORNL).
“HPCToolkit ingests all the structures of a code and then instruments the executable so that it can time how long each code structure takes to run,” said OLCF high-performance computing (HPC) engineer Suzanne Parete-Koon, who helped HPC consultant Helen He of Lawrence Berkeley National Laboratory organize the event.
After HPCToolkit profiles the code, users can take advantage of a tool called hpcviewer to inspect the structures in the code. The viewer colorizes pieces of the code to provide users with a visual representation of which pieces run slower.
“Some of the examples had dozens of different colors to visually mark the different pieces of code that were being timed,” Parete-Koon said.
The second part of the event gave users the opportunity to bring their own codes and profile them with the aim of adapting them for future systems, such as the OLCF’s upcoming exascale system, Frontier, and NERSC’s Perlmutter supercomputer. Exascale systems will be capable of 10^18 (1,000,000,000,000,000,000) calculations per second.
OLCF staff members who facilitated the event included HPC engineers Brian Smith and Subil Abraham of the User Assistance Group.
One OLCF user who benefitted from the workshop was ORNL performance engineer David Rogers. Rogers attended the event with PIConGPU, a plasma physics simulation code being developed by a team at the University of Delaware under assistant professor Sunita Chandrasekaran. Having used several kinds of debuggers and performance analysis tools in the past, Rogers noted that there is typically a trade-off between ease of use and the amount of information that can be garnered from profilers. But this is not so with HPCToolkit.
“HPCToolkit takes a straightforward approach to gathering performance information, only requiring the addition of debug symbols during compile time and executing hpcrun during the program run,” Rogers said. “Even though I only started using this tool 2 weeks ago, I have already been able to gather useful data on Summit and other systems. It’s been extremely helpful for understanding my program’s activity.”
Research scientist Seher Acer of the OLCF’s Technology Integration Group attended the workshop to better understand how to use HPCToolkit for codes in the Center for Accelerated Application Readiness (CAAR), an effort aimed at developing codes to run on Frontier.
“HPCToolkit provides code-centric profiles, where you can immediately see the respective pieces of the source code along expensive call paths that create performance bottlenecks,” Acer said.
So far, Acer appreciates HPCToolkit’s ability to analyze the behavior of a running code and the source code and to reveal relationships between the two. She is looking forward to comparing different profilers on several CAAR codes to better understand the application behavior in terms of different performance metrics and resource requirements.
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.