Ten Ways to Fool the Masses When Giving Performance Results on GPUs

By Scott Pakin

December 13, 2011

The performance potential of GPU computing has produced significant excitement in the HPC community. However, as was the case with the advent of parallel computing decades ago, the nascent technology does not equally benefit all applications — or even all components of a single application. Alas, modest speedups from GPU acceleration are rarely publication-worthy, a fact that occasionally leads GPU zealots to adopt scientifically dubious techniques to artificially inflate the performance benefit of GPU computing to more impressive levels.

In this modern revival of David Bailey’s classic report, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers,” I present ten forms of experimental sloppiness I’ve encountered repeatedly in scientific publications, all of which can be used to chicane GPU rookies (and pointy-haired bosses) into believing that GPUs can magically improve any application’s performance by multiple orders of magnitude. With this list as their vade mecum, readers will learn to be skeptical of exaggerated GPU performance claims.

Ready to boost your reported GPU performance results without boosting your actual GPU performance? Read on…

1. Quote performance results only with 32-bit floating-point arithmetic, not 64-bit arithmetic.

GPUs get double the performance when using single-precision arithmetic. Who needs more than about seven decimal digits of precision, anyway? It goes without saying that the CPU version of the code you compare against should use exclusively 64-bit arithmetic because, well, that’s how people write CPU code (even though CPUs also double their flop rate when utilizing 32-bit SIMD arithmetic).
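In source code, the whole trick often amounts to choosing a type. Here is an illustrative sketch (the axpy routines and names are hypothetical stand-ins for whatever your application really computes): the GPU kernel quietly works in float while the CPU “reference” it is benchmarked against works in double.

    // A sketch of rule 1: the GPU kernel computes in 32-bit float...
    __global__ void axpy_f32(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];    // single precision: twice the peak flop rate
    }

    // ...while the CPU "reference" it is compared against computes in 64-bit double.
    void axpy_f64(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];    // double precision, scalar, no SIMD
    }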

2. Don’t time data movement or kernel-invocation overhead.

Copying data between CPU memory and GPU memory is slow and cuts into the amount of GPU performance that one can claim. Hence, to make GPUs look good, be sure to start the clock after all of the program’s data have already been transferred to GPU memory and the kernel has already been launched and stop the clock before the results are copied back to CPU memory. There are two corollaries to this rule:

Corollary 1: Never, ever report the performance of an application running across more than one GPU-accelerated node. Doing so requires all sorts of CPU-managed communication, and *that* requires data movement and additional kernel invocations — bad for speedup numbers.

Corollary 2: Always report performance of single kernels, not of complete applications. This is especially true of applications containing important but hard-to-accelerate subroutines.
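Translated into CUDA, rule 2 looks roughly like the fragment below (a sketch, not anyone’s actual benchmark; the kernel and buffer names are invented, and the allocations are assumed to have happened elsewhere): cudaEvent timers bracket only the kernel launch, while the cudaMemcpy calls and a warm-up launch sit safely outside the timed region.

    // Hypothetical sketch of rule 2: time the kernel, never the transfers.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // not timed
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);   // not timed
    my_kernel<<<blocks, threads>>>(n, d_x, d_y);           // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);                                // the clock starts here...
    my_kernel<<<blocks, threads>>>(n, d_x, d_y);
    cudaEventRecord(stop);                                 // ...and stops here
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                // the only number you report

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);   // also not timed

The honest version would start the clock before the first cudaMemcpy and stop it after the last one — which, of course, is exactly what this sketch avoids.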

3. Quote GPU cost and ubiquity for low-end parts. Measure performance on high-end parts.

Here’s some text you can adapt as necessary: “GPUs are an important platform to target because they cost under $100 and come standard with all modern computer systems. For our experiments we measured performance on an NVIDIA Tesla M2090…”

4. Quote memory bandwidth only to/from on-board GPU memory, not to/from main memory.

Impress your audience with your high-end GPU’s ability to do memory transfers at 177 GB/s. As long as you never need to store, transfer, or utilize the result of your computations, that’s a perfectly honest number to quote.
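For perspective, a back-of-the-envelope comparison (approximate figures): the 177 GB/s applies only to the card’s on-board GDDR5, while the PCIe 2.0 x16 link that actually connects a card like the M2090 to host memory tops out at roughly 8 GB/s per direction in theory, and less in practice. Shipping a mere 1 GB of results back to the CPU therefore costs on the order of 125 ms — a number that conveniently never appears next to the 177.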

5. Disable ECC checks on memory.

GPUs run faster — and provide more usable memory capacity — when they don’t have to try so hard to produce correct data. Besides, what GPU kernel runs long enough that this should be an issue?

6. Compare full (or even multiple) GPU performance to a single CPU core.

Always compare what you started with (a sequential CPU program) with what you ended up producing (a parallel GPU program). A 10x speedup of GPU code over CPU code sure seems a lot more impressive when you neglect to mention that your host system contains two sockets of eight-core CPUs, which you *could* have used instead.
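In source form, rule 6 often boils down to a single commented-out line. A hypothetical sketch (loop and variable names are invented):

    // Rule 6 in practice: the "CPU baseline" on a dual-socket, 16-core host.
    // #pragma omp parallel for       // the one line that would have used all 16 cores
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];       // instead, one core does all the work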

7. Compare heavily optimized GPU code to unoptimized CPU code.

Naturally, you’ve made sure the GPU code runs as fast as possible by restructuring it to exploit data parallelism, memory locality, and other GPU-friendly program characteristics. Now be sure to compare it only against the original, naive CPU code, not a version that exploits the CPU’s SIMD instructions, properly blocks for cache, optimally aligns data structures, or includes any of the other performance optimizations that CPU programmers rarely bother with. Definitely don’t backport your GPU modifications to the CPU, or the reported speedup will be disappointingly small.
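A typical rule 7 “CPU reference” looks something like the sketch below (purely illustrative): the textbook triple loop, striding through B column-wise, with no cache blocking, no SIMD intrinsics, and no alignment hints — benchmarked against a GPU kernel that received weeks of restructuring.

    // Illustrative rule 7 baseline: naive matrix multiply, left exactly as first written.
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];   // B accessed with stride n: cache-hostile
                C[i*n + j] = sum;
            }
    }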

8. Scale the problem size to fit within GPU memory.

This recommendation goes both ways. If your GPU has 6 GB of on-board memory and your application’s problem size is larger than that, then scale it down to 6 GB so you can avoid all the expensive synchronization and messy double-buffering that large problem sizes entail. If your GPU has 6 GB of on-board memory and your application’s problem size is significantly smaller than that, then weak-scale the problem size, even beyond meaningful bounds, so you can reap the performance benefits of increased data parallelism. The following recommendation further develops this point:
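To make the arithmetic concrete (illustrative numbers only): 6 GB of on-board memory holds at most about 6×10⁹ / 8 ≈ 750 million double-precision values, so a 3-D grid of roughly 900³ points (about 5.8 GB) just squeezes in — and, by a remarkable coincidence, that is exactly the resolution at which the “representative production problem” ends up being benchmarked.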

9. Sacrifice meaningful numerics for GPU performance.

GPUs are renowned for their computational throughput. However, reaching peak performance requires amortizing that nasty startup cost of moving kernels and data to the GPU. Hence, to demonstrate good GPU performance, always run far more iterations than are typical, necessary, practical, or even meaningful for real-world usage, numerics be damned!
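The effect of rule 9 is easy to quantify. With purely hypothetical numbers — say a one-time setup cost (transfers plus first launch) of t_setup = 100 ms, a per-iteration GPU time of t_gpu = 1 ms, and a per-iteration CPU time of t_cpu = 20 ms — the reported speedup as a function of the iteration count n is

    speedup(n) = (n × t_cpu) / (t_setup + n × t_gpu)

which works out to about 1.8x at the 10 iterations the science actually calls for, but a far more quotable 19.8x at 10,000 iterations, asymptotically approaching t_cpu / t_gpu = 20x as n grows without bound.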

10. Select algorithms that favor GPUs.

The best CPU algorithms often don’t make the best GPU algorithms and vice versa. Consequently, you should always take whatever algorithm works best on the GPU and benchmark that against a CPU version. What’s great about this approach over comparing the performance of the best CPU algorithm to that of the best GPU algorithm is that it leads to a “fair” comparison. After all, you ran the same algorithm on both systems — fair, right?

Parting thoughts

The good news is that advances in GPU technology are alleviating some of the costs that the preceding trickery attempts to hide. While parts of my list may soon appear anachronistic, there should still be enough deviousness remaining to please even the most discerning GPU fanboy.

As a final, largely unrelated comment, can we please eliminate the oxymoronic noun “GPGPU” from our collective lexicon? If a processor is specialized for graphics processing, then it’s not really a general-purpose device, is it?

Further reading

[Bai91] David H. Bailey. “Highly parallel perspective: Twelve ways to fool the masses when giving performance results on parallel computers”. Supercomputing Review, 4(8):54-55, August 1991. ISSN: 1048-6836. Also appears as NASA Ames RNR Technical Report RNR-91-020.

[BBR10] Rajesh Bordawekar, Uday Bondhugula, and Ravi Rao. “Can CPUs match GPUs on performance with productivity?: Experiences with optimizing a FLOP-intensive application on CPUs and GPU”. IBM T. J. Watson Research Center Technical Report RC25033 (W1008-020). August 5, 2010.

[LKC+10] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. “Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU”, Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA 2010), Saint-Malo, France, June 19-23, 2010. ISBN: 978-1-4503-0053-7, DOI: 10.1145/1815961.1816021.
