Debugging at Titan Scale

By Tiffany Trader

April 15, 2013

Launching a supercomputer like Oak Ridge National Laboratory’s Titan machine – the current TOP500 chart-topper – is a many-step process. While a lot of work went into standing up the Titan system, ultimately the hardware is only as good as the software running on it. That holds for systems of all sizes, but making the software measure up becomes ever more challenging as systems grow in core count and complexity.

Titan is a 27 petaflops (peak) Cray XK7 supercomputer with a hybrid architecture that combines 18,688 AMD Opteron CPUs with 18,688 NVIDIA Tesla K20X GPUs. The setup poses a challenge for software developers in getting scientific applications to scale across all of Titan’s nearly 300,000 compute cores. An article at the Oak Ridge Leadership Computing Facility website describes how implementing this degree of scaling inevitably introduces bugs that squash the system’s productivity if they are not handled appropriately.

Identifying bugs among thousands of lines of code running across 300,000 cores is a tricky problem, but one that OLCF anticipated. The lab’s staff understood they would need to develop a tool that would ease the transition to the much larger environment. For help with this task, they turned to software vendor Allinea, which had helped create the debugging tool for Titan’s previous incarnation, Jaguar.

Allinea’s distributed debugging tool, Allinea DDT, was designed to quickly locate failures on the largest systems in the world. It does this by displaying a single view of every process in a parallel job and the exact line of code that is being executed. Allinea DDT also supports the most popular HPC programming languages, i.e., Fortran, C, and C++.
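To make the idea of a "single view of every process" concrete, here is a minimal sketch in Python. It is not Allinea DDT's implementation or API; it only illustrates one plausible way to condense per-process state, assuming each process can report the source file and line it is currently executing. The function name, file names, and snapshot data are all hypothetical.

```python
from collections import defaultdict


def group_by_location(locations):
    """Group process ranks by the (file, line) each one is currently executing.

    `locations` is a hypothetical mapping of rank -> (source file, line number).
    """
    groups = defaultdict(list)
    for rank, location in sorted(locations.items()):
        groups[location].append(rank)
    # Largest groups first, so the outliers end up at the bottom of the summary.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical snapshot of an 8-process job: seven ranks at the same line,
# one straggler somewhere else.
snapshot = {rank: ("solver.f90", 212) for rank in range(8)}
snapshot[5] = ("halo_exchange.f90", 87)

summary = group_by_location(snapshot)
for (source_file, line), ranks in summary:
    print(f"{source_file}:{line}  {len(ranks)} process(es)  ranks={ranks}")
```

Collapsing processes that sit at the same line into one entry keeps the summary short no matter how many processes the job has, so the few that have diverged stand out.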

OLCF and Allinea collaborated to customize the debugger to make it compatible with Titan’s hybrid architecture. The goal was both to let the supercomputer’s initial users access large portions of the machine and to support OLCF during Titan’s acceptance phase. By the end of the project, Allinea DDT is expected to operate at scales 40 times larger than previous best-in-class debugging tools could reach.

“Before we joined this project, tools weren’t capable of getting anywhere near the size of the hardware,” said Allinea’s COO David Lecomber. “The problem was that a debugging tool might do 5,000 or 10,000 parallel tasks if it was lucky, when the machines and applications wanted to write things that could do 200,000 plus. So the tools just got beaten up by the hardware.”

Once onerous development chores like bug-spotting are streamlined, researchers can spend more of their time advancing their core discipline.
