Exterminating at Extreme Scale

By Nicole Hemsoth

May 7, 2013

Since the first bug was eradicated from a Mark II system at Harvard in the 1940s (an actual moth wedged in a relay, which drove the machine to a standstill), system exterminators have faced a constant spray of challenges. Nodes continue to reproduce, architectures alter, and application demands climb ever-higher walls.

This all means it’s getting tougher for code exterminators to reproduce and track down bugs across many thousands of cores. Further, many pre-petascale debuggers couldn’t efficiently relay information about the health of the entire application. They offered only a small portal onto one process at a time, even though hundreds were being debugged alongside it.

Throw coprocessors and accelerators into the mix and it seems there’s a perfect storm brewing for a total rethink of more efficient, scalable bug-zapping, especially with the spectre of exascale in the distance.

According to David Lecomber, co-founder and COO of HPC debugging company Allinea, the scale and complexity of the systems the company has been working with, including Titan and Blue Waters, required new approaches to tackle larger node counts. More pressing and complex, however, is the increased heterogeneity. For top-tier machines like these, he says, scale and core diversity are critical, but at the heart of all of their work is improving debugging speed. The company has targeted all of these areas as it has worked alongside Oak Ridge National Lab, NCSA, and others aiming for extreme-scale computing targets, refining its ability to show thousands of processes in one full view for more effective bug stomping.

In the “moth-plucking” days of debugging, before visually oriented, multi-process, scalable approaches, every single node in a cluster had to connect directly to the workstation where the user was sitting. Naturally, as node counts climbed, the workstations were quickly overloaded, meaning users could handle at most several hundred or a thousand cores. Debugging was a necessary, clunky evil, one that wouldn’t hold up to the demands of core counts in the hundreds of thousands; even if it could keep up, it would slow to a crawl.

Lecomber touts his company’s role in reshaping that long-standing trend via Allinea’s DDT, which offered a UI that could paint the whole landscape of an application, letting users “visualize and compare 200,000 processes as simply as two.” The company’s work at massive scale started in earnest with Jaguar, through its collaboration with Oak Ridge, before wading into Blue Waters and battling Titan. He claims that despite the scale, speed was emphasized, to the point that Allinea could handle node counts even higher than anything we’re set to see soon. He said that debugging with the old node-connected approach took minutes, but they’ve been able to trim this process down to seconds.
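To make the idea of comparing 200,000 processes “as simply as two” concrete, here is a minimal sketch of one way such a summary can work: group processes by the value each holds for some variable, then collapse each group’s rank list into ranges. This is an illustration only, not Allinea’s implementation; the function names `collapse_ranks` and `summarize` and the sample data are hypothetical.

```python
from collections import defaultdict


def collapse_ranks(ranks):
    """Collapse a list of process ranks into compact range strings."""
    ranges = []
    start = prev = None
    for r in sorted(ranks):
        if start is None:
            start = prev = r          # open the first range
        elif r == prev + 1:
            prev = r                  # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = r          # gap found: open a new range
    if start is not None:
        ranges.append((start, prev))
    return [f"{a}-{b}" if a != b else str(a) for a, b in ranges]


def summarize(values_by_rank):
    """Group ranks by the value they report (hypothetical per-process data)."""
    groups = defaultdict(list)
    for rank, value in values_by_rank.items():
        groups[value].append(rank)
    return {value: collapse_ranks(ranks) for value, ranks in groups.items()}


if __name__ == "__main__":
    # 200,000 simulated processes; one rank holds a suspicious value.
    values = {rank: 0 for rank in range(200_000)}
    values[123_456] = -1
    for value, ranks in summarize(values).items():
        print(value, ranks)
```

With a summary like this, a single outlier among hundreds of thousands of ranks shows up as its own short line rather than being buried in per-process output.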

During the company’s early work with Jaguar, and later Titan, Oak Ridge faced a couple of problems: first the limitations of the traditional printf approach to finding bugs, then the addition of GPUs to the mix. Oak Ridge’s Tools Project Technical Officer, Joshua Ladd, said that the ability to see every process in a parallel job allowed the lab to remove the debugging hassles and speed time to result.

On the GPU front, the lab wanted researchers to take advantage of Titan’s accelerators, but they needed more powerful tools that could attack those more complicated bugs. Further, Oak Ridge was able to harness DDT on Jaguar to debug an open source implementation of MPI, roughly a half-million lines of code, across a maximum of 225,000 cores.

Scale aside, as noted, the true challenges relate to the increasing heterogeneity of ever-larger systems. Lecomber said that a lot of work went on behind the scenes to get DDT primed for GPUs and coprocessors, and he expects such challenges are going to persist during the exascale climb. They’ve already done a great deal of work on accelerators and recently looked to address challenges on Xeon Phi, as detailed below.

Beyond new architectures, Allinea is focusing on combining advanced debugging and performance tools so users will be able to better visualize the performance of their applications. In other words, having a petascale machine isn’t incredibly useful if you can’t take advantage of that power—just as computing the fastest wrong answer won’t work either.

When it comes to exascale, and even petascale at this point, “the real gaps are in the tools area, the people writing applications for these large machines need to be able to do performance profiling in a similar way as they handle debugging—visually and with emphasis on speed,” he said. Their MPI profiler, called MAP, highlights lines of code that executed the slowest to demo what happened during the run in a format that will be familiar to those who already use DDT.

While we generally hear about HPC debuggers in the context of national labs, petascale systems are proliferating in the commercial space as well, necessitating enterprise-grade, extreme-scale extermination. Lecomber says that the companies they work with, several of which are in the oil and gas and engineering arenas, are adopting similarly sized systems that present mission-critical challenges. Simulating the performance and safety of an engine, for instance, can have devastating results if not done correctly or, at best, can result in expensive wasted runtime.

Aside from their academic affiliations and work in oil and gas and other key commercial areas, Allinea is working closely with the European Collaborative Research into Exascale Systemware, Tools and Applications (CRESTA) to identify what these future systems will look like and how tool vendors and application artists will need to rework their approaches. Lecomber says this also involves collaboration with system designers, processor-makers and other vendors to make sure the exascale research food chain is aligned.

HPCwire