Debugging at Titan Scale
Launching a supercomputer like Oak Ridge National Laboratory’s Titan machine – the current TOP500 chart-topper – is a many-step process. While a lot of work went into standing up the Titan system, ultimately the hardware is only as good as the software running on it. The saying applies to systems of all sizes, but getting software to keep pace with the hardware becomes ever more challenging as systems grow in core count and complexity.
Titan is a 27-petaflops (peak) Cray XK7 supercomputer with a hybrid architecture that combines 18,688 AMD Opteron CPUs with 18,688 NVIDIA Tesla K20X GPUs. The setup challenges software developers to get scientific applications to scale across all of Titan’s nearly 300,000 CPU cores. An article on the Oak Ridge Leadership Computing Facility website describes how scaling to this degree inevitably introduces bugs that sap the system’s productivity if they are not handled appropriately.
Identifying bugs among thousands of lines of code running across 300,000 cores is a tricky problem, but one that OLCF anticipated. The lab’s staff understood they would need a tool that would smooth the transition to the much larger environment. For help with this task, they turned to software vendor Allinea, which had helped create the debugging tool for Titan’s previous incarnation, Jaguar.
Allinea’s distributed debugging tool, Allinea DDT, was designed to quickly locate failures on the largest systems in the world. It does this by displaying a single view of every process in a parallel job, along with the exact line of code each process is executing. Allinea DDT also supports the most popular HPC programming languages, namely Fortran, C, and C++.
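To make the "single view" idea concrete, here is a minimal Python sketch of one way to collapse per-process positions into a short summary: group ranks by the source line they are stopped on, then print each group as a compact rank range. This is an illustration only, not Allinea DDT’s actual implementation or API. The function names, the file names, and the sample data are all hypothetical.

```python
from collections import defaultdict


def group_by_location(locations):
    """Map each (file, line) pair to the sorted list of ranks stopped there."""
    groups = defaultdict(list)
    for rank, where in locations.items():
        groups[where].append(rank)
    return {where: sorted(ranks) for where, ranks in groups.items()}


def compress(ranks):
    """Render a sorted rank list as ranges, e.g. [0, 1, 2, 5] becomes '0-2,5'."""
    if not ranks:
        return ""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def summarize(locations):
    """Return one summary row per source location, largest group first."""
    groups = group_by_location(locations)
    rows = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        f"{path}:{line}  ranks {compress(ranks)}  ({len(ranks)})"
        for (path, line), ranks in rows
    ]


if __name__ == "__main__":
    # Hypothetical job: 8 ranks, with rank 5 stopped somewhere different.
    locations = {rank: ("solver.f90", 120) for rank in range(8)}
    locations[5] = ("halo.f90", 42)
    for row in summarize(locations):
        print(row)
```

Because the summary has one row per distinct source line rather than one row per process, its length stays small even when the job has hundreds of thousands of ranks, and a straggler such as rank 5 above stands out as its own row.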
OLCF and Allinea collaborated to customize the debugger for Titan’s hybrid architecture. The goal was to let the supercomputer’s initial users access large portions of the machine and to support OLCF during Titan’s acceptance phase. By the end of the project, Allinea DDT is expected to reach scales 40 times greater than those of the previous best-in-class debugging tools.
“Before we joined this project, tools weren’t capable of getting anywhere near the size of the hardware,” said Allinea’s COO David Lecomber. “The problem was that a debugging tool might do 5,000 or 10,000 parallel tasks if it was lucky, when the machines and applications wanted to write things that could do 200,000 plus. So the tools just got beaten up by the hardware.”
Once onerous development chores like bug-spotting are streamlined, researchers can spend more of their time advancing their core discipline.
Full story at Oak Ridge Leadership Computing