Kudos for CUDA

By Dr. Vincent Natoli

July 6, 2010

It’s been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA’s CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU-accelerated machines are now represented on the TOP500 list starting at position two. The relatively rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems, leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.

The continued growth of CUDA contrasts sharply with the graveyard of abandoned languages introduced to the HPC market over the last 20 to 25 years. Its success can largely be attributed to i) support from a major corporate backer as opposed to a consortium, ii) the maturity of its compilers, iii) adherence to a C syntax easily recognized by developers and iv) a more intangible feature that can best be described as elegance or simplicity. Physicists and mathematicians often use the word “elegant” as a high compliment for particularly appealing solutions, or for equations in which the language of mathematics succinctly and…well…elegantly captures the symmetry and physics of a complex phenomenon. CUDA is an elegant solution to the problem of representing parallelism in algorithms; not all algorithms, but enough to matter. It seems to resonate with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.

HPC developers writing parallel code today have two enterprise options: i) traditional multicore platforms built on CPUs from Intel/AMD and ii) platforms accelerated with GPGPU options from NVIDIA and AMD/ATI. Developing performant, scalable parallel code for multicore architectures is still non-trivial and involves a multi-level programming model that includes inter-node parallelism handled with MPI, intra-node parallelism with MPI, OpenMP or pthreads, and register-level parallelism expressed via Streaming SIMD Extensions (SSE). The expression of parallelism in this multi-level model is often verbose and messy, obscuring the underlying algorithm. The developer is often left feeling as though he or she is shoehorning in the parallelism.

The CUDA programming model presents a different, in some ways refreshing, approach to expressing parallelism. The MPI, OpenMP and SSE trio evolved from a world centered on serial processing. CUDA, by contrast, arises from a decidedly parallel world, where thousands of simultaneous threads are managed as the norm. The programming model forces the developer to identify the irreducible level of parallelism in his or her problem. In a world that is rapidly moving to manycore, not multicore, this seems to be a better, more intuitive and extensible way to think about our problems.

CUDA is a programming language with constructs that are designed for the natural expression of data-level parallelism. It’s not hard to understand expressibility in languages and the idea that some concepts are more easily stated in specific languages. Computer scientists do this all the time as they create optimal structures to represent their data. DNA base pairs, for example, are expressed far more neatly and compactly as a sequence of 2-bit data fields than in a simple-minded ASCII representation that spends eight bits on each base. Our Italian exchange student was fond of pointing out the vast superiority of Italian over English for passionate argument.
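To make the DNA example concrete, here is a minimal C++ sketch of 2-bit packing. The function names (pack_dna, unpack_dna) and the A=0, C=1, G=2, T=3 code assignment are assumptions chosen for illustration, not an established format, and the sketch handles only the four uppercase bases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a base to a 2-bit code. The A=0, C=1, G=2, T=3 assignment is an
// arbitrary choice for this sketch.
inline std::uint8_t base_to_code(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "expected one of A, C, G, T");
    return 0;
}

// Inverse of base_to_code; only the two lowest bits of code are used.
inline char code_to_base(std::uint8_t code) {
    static const char kBases[] = {'A', 'C', 'G', 'T'};
    return kBases[code & 0x3u];
}

// Pack four bases per byte, with the first base in the two lowest bits.
std::vector<std::uint8_t> pack_dna(const std::string& seq) {
    std::vector<std::uint8_t> packed((seq.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        packed[i / 4] |= static_cast<std::uint8_t>(
            base_to_code(seq[i]) << (2 * (i % 4)));
    }
    return packed;
}

// Recover the ASCII sequence. The length is passed in because the packed
// form does not record how many bases the last byte holds.
std::string unpack_dna(const std::vector<std::uint8_t>& packed,
                       std::size_t length) {
    assert(length <= packed.size() * 4);
    std::string seq(length, 'A');
    for (std::size_t i = 0; i < length; ++i) {
        seq[i] = code_to_base(
            static_cast<std::uint8_t>(packed[i / 4] >> (2 * (i % 4))));
    }
    return seq;
}
```

With this layout a seven-base sequence such as GATTACA occupies two bytes instead of seven, and unpack_dna(pack_dna(s), s.size()) returns the original string.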

Similarly, we have found in many cases that the expression of algorithmic parallelism in CUDA in fields as diverse as oil and gas, bioinformatics and finance is more elegant, compact and readable than equivalently optimized CPU code, preserving and more clearly presenting the underlying algorithm. In a recent project we reduced 3,500 lines of highly optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, to extract algorithmic meaning from and to extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Commodity parallel processing began as a way to divide large tasks over multiple loosely-connected processors. Programming models supported the idea of dividing problems into a number of smaller pieces of equivalent work. Over time those processors have grown closer to one another in terms of latency and bandwidth, first as single operating system multiprocessor nodes and next as multicore processor components of those nodes. Looking towards the future we see only more cores per chip and more chips per node.

Even though our computing cores are more tightly coupled, our view of them is still very much from a top-down, task-parallel mindset, i.e., take a large problem, divide it into many small pieces, distribute them to processing elements and just deal with the communication. In this top-down approach, we must discover new parallelism at each level: domain-level parallelism for MPI, “for-loop”-level parallelism for OpenMP, and data-level parallelism for SSE. What is intriguing about CUDA is that it takes a bottom-up point of view, identifying the atomic unit of parallelism and embedding that in a hierarchical structure, e.g., thread::warp::block::grid.
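The bottom-up view can be sketched in plain C++ that mimics CUDA's index arithmetic on the host. This is an emulation under stated assumptions, not CUDA code: ThreadCoord, saxpy_thread and launch_saxpy are hypothetical names standing in for CUDA's built-in blockIdx.x, blockDim.x and threadIdx.x and for a kernel launch, and the y = a*x + y update is just a convenient example of a per-element operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in blockIdx.x, blockDim.x and
// threadIdx.x, restricted to one dimension.
struct ThreadCoord {
    std::size_t block_idx;   // which block within the grid
    std::size_t block_dim;   // threads per block
    std::size_t thread_idx;  // which thread within the block
};

constexpr std::size_t kWarpSize = 32;  // warp width on NVIDIA GPUs

// Position of a thread in the whole grid: the same arithmetic as
// blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
inline std::size_t global_index(const ThreadCoord& t) {
    return t.block_idx * t.block_dim + t.thread_idx;
}

// Which warp of its block a thread belongs to.
inline std::size_t warp_in_block(const ThreadCoord& t) {
    return t.thread_idx / kWarpSize;
}

// The atomic unit of parallelism: one thread updates one element.
inline void saxpy_thread(const ThreadCoord& t, float a,
                         const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t i = global_index(t);
    if (i < y.size()) {  // the last block may extend past the end of the data
        y[i] = a * x[i] + y[i];
    }
}

// Host-side emulation of a launch. On a GPU these two loops do not exist in
// the source; the hardware runs saxpy_thread once per thread.
void launch_saxpy(std::size_t threads_per_block, float a,
                  const std::vector<float>& x, std::vector<float>& y) {
    assert(threads_per_block > 0 && x.size() == y.size());
    const std::size_t blocks =
        (y.size() + threads_per_block - 1) / threads_per_block;
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            saxpy_thread({b, threads_per_block, t}, a, x, y);
        }
    }
}
```

The developer writes only the body of saxpy_thread; the block and grid sizes chosen at launch decide how many copies run. Calling launch_saxpy(256, 3.0f, x, y) on 1,000-element vectors covers the data with four blocks, and the bounds check discards the 24 surplus threads in the last one.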

The enduring contribution of GPU computing to HPC may well be a programming model that peels us away from the current top-down, multi-level, task-parallel approach, popularizing instead a more scalable bottom-up, data-parallel alternative. It’s not right for every problem, but for those that map well to it, such as finite difference stencils and molecular dynamics among many others, it provides a cleaner, more natural language for expressing parallelism. It should be recognized that the simpler, cleaner expression of these applications in code is a main driver of the relatively rapid adoption by commercial and academic practitioners. Further, there is no intrinsic reason scaling must stop at the grid or device level. One can easily imagine a generalization of CUDA on future architectures that abstracts one or more levels above the grid to accomplish an implementation across multiple devices, effectively aggregating global memory into one contiguous span; a sort of GPU/NUMA approach. If this can be done, then GPU computing will have made a great leap toward solving a key problem in parallel computing by reducing the programming model from three levels to one, for a simpler, more elegant solution.
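As a small illustration of why finite difference stencils map well to this model, the following C++ sketch writes a one-dimensional, three-point second-derivative stencil as a per-point function. The names laplacian_point and laplacian_1d are hypothetical, and the serial loop stands in for what would be one thread per grid point on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work for one grid point: second-order central difference for d2u/dx2.
// It reads the neighbors i-1 and i+1 but writes only out[i], so no two
// points ever write to the same location.
inline void laplacian_point(std::size_t i, float inv_h2,
                            const std::vector<float>& u,
                            std::vector<float>& out) {
    if (i == 0 || i + 1 >= u.size()) return;  // leave boundary points untouched
    out[i] = (u[i - 1] - 2.0f * u[i] + u[i + 1]) * inv_h2;
}

// Serial driver. Each iteration is independent of the others, which is
// the property a data-parallel model exploits.
void laplacian_1d(float h, const std::vector<float>& u,
                  std::vector<float>& out) {
    assert(h > 0.0f && out.size() == u.size());
    const float inv_h2 = 1.0f / (h * h);
    for (std::size_t i = 0; i < u.size(); ++i) {
        laplacian_point(i, inv_h2, u, out);
    }
}
```

For u[i] = i*i with unit spacing, every interior point of out comes back as 2, the exact second derivative of x squared. Because the algorithm lives entirely in laplacian_point, the same body could serve as a kernel with the loop index replaced by a thread's global index.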

About the Author
Dr. Natoli is the president and founder of Stone Ridge Technology. He is a computational physicist with 20 years of experience in the field of high performance computing. He worked as a technical director at High Performance Technologies (HPTi) and before that for 10 years as a senior physicist at ExxonMobil Corporation, at their Corporate Research Lab in Clinton, New Jersey, and in the Upstream Research Center in Houston, Texas. Dr. Natoli holds Bachelor’s and Master’s degrees from MIT, a PhD in Physics from the University of Illinois Urbana-Champaign, and a Master’s in Technology Management from the University of Pennsylvania and the Wharton School. Stone Ridge Technology is a professional services firm focused on authoring, profiling, optimizing and porting high performance technical codes to multicore CPUs, GPUs, and FPGAs.

Dr. Natoli can be reached at vnatoli@stoneridgetechnology.com.
