Kudos for CUDA

By Dr. Vincent Natoli

July 6, 2010

It’s been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA’s CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU accelerated machines are now represented on the TOP500 list starting at position two. The relatively-rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.

The continued growth of CUDA contrasts sharply with the graveyard of abandoned languages introduced to the HPC market over the last 20 to 25 years. Its success can largely be attributed to i) support from a major corporate backer as opposed to a consortium, ii) the maturity of its compilers iii) adherence to a C syntax easily recognized by developers and iv) a more ephemeral feature that can best be described as elegance or simplicity. Physicists and Mathematicians, often use the word “elegant” as a high compliment to describe particularly appealing solutions or equations that neatly represent complex physical phenomena; where the language of mathematics succinctly and…well…elegantly describes and captures symmetry and physics. CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier more natural expression of parallelism beyond the task-level.

HPC developers writing parallel code today have two enterprise options i) traditional multicore platforms built on CPUs from Intel/AMD and ii) platforms accelerated with GPGPU options from NVIDIA and AMD/ATI. Developing performant, scalable parallel code for multicore architectures is still non-trivial and involves a multi-level programming model that includes inter-node parallelism handled with MPI, intra-node parallelism with MPI, OpenMP or pthreads, and register level parallelism expressed via Streaming SIMD Instructions (SSE). The expression of parallelism in this multi-level model is often verbose and messy, obscuring the underlying algorithm. The developer is often left feeling as though he or she is shoehorning in the parallelism.

The CUDA programming model presents a different, in some ways refreshing, approach to expressing parallelism. The MPI, OpenMP and SSE trio evolved from a world centered on serial processing. CUDA, by contrast, arises from a decidedly parallel world, where thousands of simultaneous threads are managed as the norm. The programming model forces the developer to identify the irreducible level of parallelism in his or her problem. In a world that is rapidly moving to manycore, not multicore, this seems to be a better, more intuitive and extensible way to think about our problems.

CUDA is a programming language with constructs that are designed for the natural expression of data-level parallelism. It’s not hard to understand expressibility in languages and the idea that some concepts are more easily stated in specific languages. Computer scientists do this all the time as they create optimal structures to represent their data. DNA base pairs, for example, are neatly and compactly expressed as a sequence of 2-bit data fields much better than a simple minded ASCII representation. Our Italian exchange student was fond of pointing out the vast superiority of Italian over English for passionate argument.

Similarly, we have found in many cases that the expression of algorithmic parallelism in CUDA in fields as diverse as oil and gas, bioinformatics and finance is more elegant, compact and readable than equivalently-optimized CPU code, preserving and more clearly presenting the underlying algorithm. In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Commodity parallel processing began as a way to divide large tasks over multiple loosely-connected processors. Programming models supported the idea of dividing problems into a number of smaller pieces of equivalent work. Over time those processors have grown closer to one another in terms of latency and bandwidth, first as single operating system multiprocessor nodes and next as multicore processor components of those nodes. Looking towards the future we see only more cores per chip and more chips per node.

Even though our computing cores are more tightly coupled, our view of them is still very much from a top-down, task parallel mindset, i.e., take a large problem, divide it into many small pieces, distribute them to processing elements and just deal with the communication. In this top-down approach, we must discover new parallelism at each level, domain level parallelism for MPI, “for-loop” level for OpenMP, and data level parallelism for SSE. What is intriguing about CUDA is that it takes a bottom-up point of view, identifying the atomic unit of parallelism and embedding that in a hierarchical structure, e.g., thread::warp::block::grid.

The enduring contribution of GPU computing to HPC may well be a programming model that peels us away from the current top-down, multi-level, task-parallel approach, popularizing instead a more scalable bottom-up, data-parallel alternative. It’s not right for every problem but for those that map well to it, such as finite difference stencils and molecular dynamics among many others, it provides a cleaner, more natural language for expressing parallelism. It should be recognized that the simpler, cleaner expression for these applications in code is a main driver for the relatively-rapid adoption by commercial and academic practitioners. Further, there is no intrinsic reason scaling must stop at the grid or device level. One can easily imagine a generalization of CUDA on future architectures that abstracts one or more levels above the grid to accomplish an implementation across multiple devices, effectively aggregating global memory into one contiguous span; a sort of GPU/NUMA approach. If this can be done, then GPU computing will have made a great leap toward solving a key problem in parallel computing by reducing the programming model from three levels to one level for a simpler more elegant solution.

About the Author
Dr. Vincent NatoliDr. Natoli is the president and founder of Stone Ridge Technology. He is a computational physicist with 20 years experience in the field of high performance computing. He worked as a technical director at High Performance Technologies (HPTi) and before that for 10 years as a senior physicist at ExxonMobil Corporation, at their Corporate Research Lab in Clinton, New Jersey, and in the Upstream Research Center in Houston, Texas. Dr. Natoli holds Bachelor’s and Master’s degrees from MIT, a PhD in Physics from the University of Illinois Urbana-Champaign, and a Masters in Technology Management from the University of Pennsylvania and the Wharton School. Stone Ridge Technology is a professional services firm focused on authoring, profiling, optimizing and porting high performance technical codes to multicore CPUs, GPUs, and FPGAs.

Dr. Natoli can be reached at [email protected].

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Automated Optimization Boosts ResNet50 Performance by 1.77x on Intel CPUs

October 23, 2018

From supercomputers to cell phones, every system and software device in our digital panoply has a growing number of settings that, if not optimized, constrain performance, wasting precious cycles and watts. In the f Read more…

By Tiffany Trader

South Africa CHPC: Home Grown Dynasty

October 22, 2018

Before the build up to the final event in the 2018 Student Cluster Competition season (the SC18 competition in Dallas), I want to take a moment to write about one of the great inspirational stories of these competitions. Read more…

By Dan Olds

NSF Launches Quantum Computing Faculty Fellows Program

October 22, 2018

Efforts to expand quantum computing research capacity continue to accelerate. The National Science Foundation today announced a Quantum Computing & Information Science Faculty Fellows (QCIS-FF) program aimed at devel Read more…

By John Russell

HPE Extreme Performance Solutions

One Small Step Toward Mars: One Giant Leap for Supercomputing

Since the days of the Space Race between the U.S. and the former Soviet Union, we have continually sought ways to perform experiments in space. Read more…

IBM Accelerated Insights

Join IBM at SC18 and Learn to Harness the Next Generation of AI-focused Supercomputing

Blurring the lines between HPC and AI

Today’s high performance computers are helping clients gain insights at an unprecedented pace. The intersection of artificial intelligence (AI) and HPC can transform industries while solving some of the world’s toughest challenges. Read more…

Democratization of HPC Part 3: Ninth Graders Tap HPC in the Cloud to Design Flying Boats

October 18, 2018

This is the third in a series of articles demonstrating the growing acceptance of high-performance computing (HPC) in new user communities and application areas. In this article we present UberCloud use case #208 on how Read more…

By Wolfgang Gentzsch and Håkon Bull Hove

Automated Optimization Boosts ResNet50 Performance by 1.77x on Intel CPUs

October 23, 2018

From supercomputers to cell phones, every system and software device in our digital panoply has a growing number of settings that, if not optimized, constrain  Read more…

By Tiffany Trader

South Africa CHPC: Home Grown Dynasty

October 22, 2018

Before the build up to the final event in the 2018 Student Cluster Competition season (the SC18 competition in Dallas), I want to take a moment to write about o Read more…

By Dan Olds

Penguin Computing Launches Consultancy for Piecing AI Strategies Together

October 18, 2018

AI stands before the HPC industry as a beacon of great expectations, yet market research repeatedly shows that AI adoption is commonly stuck in the talking phas Read more…

By Tiffany Trader

When Water Quality—Not Quantity—Hinders HPC Cooling

October 18, 2018

Attention has been paid to the sheer quantity of water consumed by supercomputers’ cooling towers – and rightly so, as they can require thousands of gallons per minute to cool. But in the background, another factor can emerge, bottlenecking efficiency and raising costs: water quality. Read more…

By Oliver Peckham

Paper Offers ‘Proof’ of Quantum Advantage on Some Problems

October 18, 2018

Is quantum computing worth all the effort being poured into it or should we just wait for classical computing to catch up? An IBM blog today posed those questio Read more…

By John Russell

Dell EMC to Supply U Michigan’s Great Lakes Cluster

October 16, 2018

The University of Michigan (U-M) today announced Dell EMC is the lead vendor for U-M’s $4.8 million Great Lakes HPC cluster scheduled for deployment in first Read more…

By John Russell

Houston to Field Massive, ‘Geophysically Configured’ Cloud Supercomputer

October 11, 2018

Based on some news stories out today, one might get the impression that the next system to crack number one on the Top500 would be an industrial oil and gas mon Read more…

By Tiffany Trader

Nvidia Platform Pushes GPUs into Machine Learning, High Performance Data Analytics

October 10, 2018

GPU leader Nvidia, generally associated with deep learning, autonomous vehicles and other higher-end enterprise and scientific workloads (and gaming, of course) Read more…

By Doug Black

TACC Wins Next NSF-funded Major Supercomputer

July 30, 2018

The Texas Advanced Computing Center (TACC) has won the next NSF-funded big supercomputer beating out rivals including the National Center for Supercomputing Ap Read more…

By John Russell

IBM at Hot Chips: What’s Next for Power

August 23, 2018

With processor, memory and networking technologies all racing to fill in for an ailing Moore’s law, the era of the heterogeneous datacenter is well underway, Read more…

By Tiffany Trader

Requiem for a Phi: Knights Landing Discontinued

July 25, 2018

On Monday, Intel made public its end of life strategy for the Knights Landing "KNL" Phi product set. The announcement makes official what has already been wide Read more…

By Tiffany Trader

CERN Project Sees Orders-of-Magnitude Speedup with AI Approach

August 14, 2018

An award-winning effort at CERN has demonstrated potential to significantly change how the physics based modeling and simulation communities view machine learni Read more…

By Rob Farber

House Passes $1.275B National Quantum Initiative

September 17, 2018

Last Thursday the U.S. House of Representatives passed the National Quantum Initiative Act (NQIA) intended to accelerate quantum computing research and developm Read more…

By John Russell

Summit Supercomputer is Already Making its Mark on Science

September 20, 2018

Summit, now the fastest supercomputer in the world, is quickly making its mark in science – five of the six finalists just announced for the prestigious 2018 Read more…

By John Russell

New Deep Learning Algorithm Solves Rubik’s Cube

July 25, 2018

Solving (and attempting to solve) Rubik’s Cube has delighted millions of puzzle lovers since 1974 when the cube was invented by Hungarian sculptor and archite Read more…

By John Russell

D-Wave Breaks New Ground in Quantum Simulation

July 16, 2018

Last Friday D-Wave scientists and colleagues published work in Science which they say represents the first fulfillment of Richard Feynman’s 1982 notion that Read more…

By John Russell

Leading Solution Providers

HPC on Wall Street 2018 Booth Video Tours Playlist

Arista

Dell EMC

IBM

Intel

RStor

VMWare

TACC’s ‘Frontera’ Supercomputer Expands Horizon for Extreme-Scale Science

August 29, 2018

The National Science Foundation and the Texas Advanced Computing Center announced today that a new system, called Frontera, will overtake Stampede 2 as the fast Read more…

By Tiffany Trader

HPE No. 1, IBM Surges, in ‘Bucking Bronco’ High Performance Server Market

September 27, 2018

Riding healthy U.S. and global economies, strong demand for AI-capable hardware and other tailwind trends, the high performance computing server market jumped 28 percent in the second quarter 2018 to $3.7 billion, up from $2.9 billion for the same period last year, according to industry analyst firm Hyperion Research. Read more…

By Doug Black

Intel Announces Cooper Lake, Advances AI Strategy

August 9, 2018

Intel's chief datacenter exec Navin Shenoy kicked off the company's Data-Centric Innovation Summit Wednesday, the day-long program devoted to Intel's datacenter Read more…

By Tiffany Trader

Germany Celebrates Launch of Two Fastest Supercomputers

September 26, 2018

The new high-performance computer SuperMUC-NG at the Leibniz Supercomputing Center (LRZ) in Garching is the fastest computer in Germany and one of the fastest i Read more…

By Tiffany Trader

MLPerf – Will New Machine Learning Benchmark Help Propel AI Forward?

May 2, 2018

Let the AI benchmarking wars begin. Today, a diverse group from academia and industry – Google, Baidu, Intel, AMD, Harvard, and Stanford among them – releas Read more…

By John Russell

Houston to Field Massive, ‘Geophysically Configured’ Cloud Supercomputer

October 11, 2018

Based on some news stories out today, one might get the impression that the next system to crack number one on the Top500 would be an industrial oil and gas mon Read more…

By Tiffany Trader

Aerodynamic Simulation Reveals Best Position in a Peloton of Cyclists

July 5, 2018

Eindhoven University of Technology (TU/e) and KU Leuven research group conducts the largest numerical simulation ever done in the sport industry and cycling discipline. The goal was to understand the aerodynamic interactions in the peloton, i.e., the main pack of cyclists in a race. Read more…

No Go for GloFo at 7nm; and the Fujitsu A64FX post-K CPU

September 5, 2018

It’s been a news worthy couple of weeks in the semiconductor and HPC industry. There were several HPC relevant disclosures at Hot Chips 2018 to whet appetites Read more…

By Dairsie Latimer

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This