Dealing with HPC Correctness: Challenges and Opportunities

By Ignacio Laguna, Lawrence Livermore National Laboratory and Ganesh Gopalakrishnan, University of Utah

January 25, 2018

Editor’s Note: HPC Correctness – producing reliable HPC code – is a long-term challenge that’s grown only more difficult with the proliferation of heterogeneous computing and the drive towards exascale. This article by Ignacio Laguna (Lawrence Livermore National Laboratory) and Ganesh Gopalakrishnan (University of Utah) describes the problem and reviews recent efforts to develop solutions. As noted in the article, the report from the DOE HPC Correctness Summit has many valuable insights. Thanks also to Sonia Sachs, Program Manager, Advanced Scientific Computing Research (ASCR), DOE, who coordinated preparation of the article.

Developing correct and reliable HPC software is notoriously difficult. While effective correctness techniques for serial codes (e.g., verification, debugging, and systematic testing) have matured over decades, such techniques are still in their infancy for HPC codes. Why is that?

HPC correctness techniques are burdened with all the well-known problems associated with serial software plus special challenges:

  • growing heterogeneity (e.g., architectures that combine CPUs with special-purpose accelerators)
  • massive scales of computation (some bugs manifest only at very high degrees of concurrency)
  • use of combined parallel programming models (e.g., MPI+X) that often lead to non-intuitive behaviors
  • new scalable numerical algorithms (e.g., to leverage reduced precision in floating-point arithmetic)
  • use of different compilers and optimizations (a floating-point consequence is sketched after this list)
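
To make the last two items concrete, here is a minimal C sketch (ours, not taken from the summit materials) showing that floating-point addition is not associative; a compiler that reassociates a sum under aggressive optimization can therefore change a program’s answer:

    /* Reordering a floating-point sum changes the rounded result. */
    #include <stdio.h>

    int main(void) {
        double big = 1.0e16, tiny = 1.0;

        double left  = (big + tiny) + tiny;  /* each tiny rounds away: 1.0e16    */
        double right = big + (tiny + tiny);  /* tinys combine first: 1.0e16 + 2  */

        printf("left  = %.1f\n", left);
        printf("right = %.1f\n", right);
        printf("equal? %s\n", left == right ? "yes" : "no");  /* prints "no" */
        return 0;
    }

Multiplied across billions of operations and varying reduction orders, this effect is one root of the cross-compiler and cross-platform result variability discussed below.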

HPC practitioners face additional demands on their time as they learn to use effectively the newer machine types that support much larger problem scales. Developing new, scalable algorithms that work well on next-generation machines while also supporting new science imposes further non-trivial demands. Developers often have no time left to graduate beyond printf debugging or traditional debuggers. Unfortunately, mounting evidence suggests that significant productivity losses due to show-stopper bugs do periodically occur, making the development of better debugging methods imperative.

Two recent efforts took aim at these challenges. First, an HPC correctness summit sponsored by the U.S. Department of Energy (DOE) produced a 50+ page report covering a spectrum of issues that can help lay the missing foundation for HPC debugging and correctness.

Second, a well-attended workshop entitled Correctness 2017: First International Workshop on Software Correctness for HPC Applications took place at SC17. This article summarizes these two efforts and concludes with avenues for furthering HPC correctness research. We also invite reader comments on ideas and opportunities to advance this cause.

1. HPC Correctness Summit

Held on January 25–26, 2017, at DOE headquarters (Washington, D.C.), the HPC Correctness Summit included discussions of several show-stopper bugs that have occurred during large-scale, high-stakes HPC projects. Each bug took months of painstaking debugging to rectify, hinting at the far more severe productivity losses and uncertainties that await in the exascale era.

The DOE report distills many valuable nuggets of information not easily found elsewhere. For instance, it compiles one of the most comprehensive tables capturing existing debugging and testing solutions, the family of techniques they fall under, and further details of the state of development of these tools.

The report concludes that we must aim for rigorous specifications, go after debugging automation by emphasizing bug-hunting over formal proofs, and launch a variety of activities that address the many facets of correctness.

These facets include reliable compilation; detecting data races; root-causing the sources of floating-point result variability brought in by different algorithms, compilers, and platforms; combined uses of static and dynamic analysis; focus on libraries; and smart IDEs.
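
To illustrate the data-race facet, consider the following minimal C/OpenMP sketch (ours, not drawn from the report); the file name and compile line are illustrative:

    /* race.c -- compile with: gcc -fopenmp race.c */
    #include <stdio.h>

    int main(void) {
        long sum = 0;

        /* BUG: all threads update the shared variable `sum` without
           synchronization; the pragma needs reduction(+:sum). Lost
           updates make the result vary from run to run, a
           schedule-dependent bug that is hard to reproduce. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            sum += 1;

        printf("sum = %ld (expected 1000000)\n", sum);
        return 0;
    }

Dynamic detectors can flag the conflicting accesses as the program runs, while static analysis can warn without executing the loop at all; combining the two is precisely one of the facets listed above.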

Last but not least, the DOE report laments the near-total absence of a community culture of sharing bug repositories, developing common debugging solutions, and even talking openly about bugs (rather than only about performance and scalability successes). Dr. Leslie Lamport, the 2014 ACM Turing Award winner, observes that the difficulty of verification can be an indirect measure of how ill-structured a software design is. A famous verification researcher, Dr. Ken McMillan, states it even more directly: we design through debugging. Promoting this culture of openness calls for incentives through well-targeted research grants, as it takes real work to reach a higher plane of rigor. While some of the best creations in HPC-land were acts of altruism, experience suggests that more than altruism is usually needed.

The Summit was convened on the recommendation of DOE ASCR program manager Dr. Sonia R. Sachs, under the leadership of research director Dr. William Harrod. In addition to the authors of this article, the participating researchers were Paul Hovland (Argonne National Laboratory), Costin Iancu (Lawrence Berkeley National Laboratory), Sriram Krishnamoorthy (Pacific Northwest National Laboratory), Richard Lethin (Reservoir Labs), Koushik Sen (UC Berkeley), Stephen Siegel (University of Delaware), and Armando Solar-Lezama (MIT).

2. HPC Correctness Workshop

As correctness becomes an increasingly important aspect of HPC applications, the research and practitioner communities have begun to discuss ways to address the problem. Correctness 2017: The First International Workshop on Software Correctness for HPC Applications debuted in the SC conference series on November 12, 2017, demonstrating the growing interest in this topic. Its goal was to discuss ideas for HPC correctness, including novel research methods for challenging problems as well as tools and techniques that can be used in practice today.

A keynote address by Stephen Siegel (Associate Professor, University of Delaware) on the CIVL verification language opened the workshop, followed by seven paper presentations grouped into three categories: applications and algorithms correctness; runtime systems correctness; and code generation and code equivalence correctness.

Topics of discussion included static analysis for finding the root causes of floating-point variability, how HPC communities such as climate modeling deal with platform-dependent result variability, and ambitious proposals aimed at in situ model checking of MPI applications. Participants also examined automated synthesis of HPC algorithms and successes in detecting extremely tricky OpenMP errors by applying rigorous model-level analysis.
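
To see why model checking appeals here, consider this minimal two-rank MPI sketch (ours, not a workshop artifact): both ranks issue a blocking send first, so the program deadlocks whenever the MPI library uses a rendezvous protocol for the (deliberately large) message:

    /* headtohead.c -- run with: mpirun -np 2 ./headtohead */
    #include <mpi.h>

    #define N (1 << 20)       /* a large message defeats eager buffering */
    static int buf[N];

    int main(int argc, char **argv) {
        int rank, peer;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;      /* assumes exactly two ranks */

        /* BUG: both ranks block in MPI_Send, each waiting for a receive
           the other can never post. Small test messages may complete via
           eager buffering, hiding the bug from conventional testing. */
        MPI_Send(buf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }

Because the hang depends on message size and library internals, tools that systematically explore possible message matchings can expose it far more reliably than ad hoc testing.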

While using formal methods to verify large HPC applications is perhaps too ambitious today, a question arose: Can formal methods be applied to verify properties of small HPC programs? (For example, small programs like DOE proxy applications extracted from large production applications could be used to mimic some features of large-scale applications.) Workshop participants agreed that this may be a possibility—at least for some small proxy applications or for some of their key components.
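
One way to act on that consensus is to equip proxy codes with executable specifications. Below is a minimal sketch (ours, with a deliberately toy kernel) of the assertion style that a verifier such as CIVL, or exhaustive testing at small scale, can check:

    /* A toy kernel paired with a checkable specification. */
    #include <assert.h>

    /* Sum of the first n positive integers. */
    static long sum_first(int n) {
        long s = 0;
        for (int i = 1; i <= n; i++)
            s += i;
        return s;
    }

    int main(void) {
        /* Specification: the result must equal the closed form n(n+1)/2.
           At proxy-app scale, such properties can be checked exhaustively
           over small inputs or discharged symbolically. */
        for (int n = 0; n <= 1000; n++)
            assert(sum_first(n) == (long)n * (n + 1) / 2);
        return 0;
    }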

The audience voiced enthusiastic support for continuing correctness workshops at SC. This inaugural workshop was organized by Ignacio Laguna (Lawrence Livermore National Laboratory) and Cindy Rubio-González (University of California at Davis).

3. What’s Next?

As the community comes to depend on in silico experiments for large-scale science and engineering projects, trustworthy platforms and tools will ensure that investments in HPC infrastructure and trained personnel are effective and efficient. While further experience is yet to be gained on cutting-edge exascale machines and their productive use, waiting for the machines to be fully operational before developing effective debugging solutions would be extremely short-sighted. Today’s petaflop machines can, and should, be harnessed for testing and calibrating debugging solutions for the exascale era.

Initiatives to address the correctness problem in HPC, such as the DOE summit and the SC17 workshop, are only the beginning of many more such studies and events to follow. In addition to the DOE, the authors thank their own organizations for their support and for facilitating these discussions.

Overall, we encourage the HPC community to acknowledge that debugging is fundamentally an enabler of performance optimization. While this proposition was not settled in any formal way at the Correctness workshop, the level of interest and keen participation of the attendees suggested that research on rigorous methods at all levels must be encouraged and funded. There was, however, widespread agreement that conventional methods are not delivering the requisite level of incisiveness in eliminating HPC defects.

Ganesh Gopalakrishnan’s work is supported by research grants from divisions under the NSF Directorate for Computer and Information Science and Engineering. Ignacio Laguna’s work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-MI-744729).
