Kudos for CUDA
It’s been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA’s CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU accelerated machines are now represented on the TOP500 list starting at position two. The relatively-rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.
The continued growth of CUDA contrasts sharply with the graveyard of abandoned languages introduced to the HPC market over the last 20 to 25 years. Its success can largely be attributed to i) support from a major corporate backer as opposed to a consortium, ii) the maturity of its compilers iii) adherence to a C syntax easily recognized by developers and iv) a more ephemeral feature that can best be described as elegance or simplicity. Physicists and Mathematicians, often use the word “elegant” as a high compliment to describe particularly appealing solutions or equations that neatly represent complex physical phenomena; where the language of mathematics succinctly and…well…elegantly describes and captures symmetry and physics. CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier more natural expression of parallelism beyond the task-level.
HPC developers writing parallel code today have two enterprise options i) traditional multicore platforms built on CPUs from Intel/AMD and ii) platforms accelerated with GPGPU options from NVIDIA and AMD/ATI. Developing performant, scalable parallel code for multicore architectures is still non-trivial and involves a multi-level programming model that includes inter-node parallelism handled with MPI, intra-node parallelism with MPI, OpenMP or pthreads, and register level parallelism expressed via Streaming SIMD Instructions (SSE). The expression of parallelism in this multi-level model is often verbose and messy, obscuring the underlying algorithm. The developer is often left feeling as though he or she is shoehorning in the parallelism.
The CUDA programming model presents a different, in some ways refreshing, approach to expressing parallelism. The MPI, OpenMP and SSE trio evolved from a world centered on serial processing. CUDA, by contrast, arises from a decidedly parallel world, where thousands of simultaneous threads are managed as the norm. The programming model forces the developer to identify the irreducible level of parallelism in his or her problem. In a world that is rapidly moving to manycore, not multicore, this seems to be a better, more intuitive and extensible way to think about our problems.
CUDA is a programming language with constructs that are designed for the natural expression of data-level parallelism. It’s not hard to understand expressibility in languages and the idea that some concepts are more easily stated in specific languages. Computer scientists do this all the time as they create optimal structures to represent their data. DNA base pairs, for example, are neatly and compactly expressed as a sequence of 2-bit data fields much better than a simple minded ASCII representation. Our Italian exchange student was fond of pointing out the vast superiority of Italian over English for passionate argument.
Similarly, we have found in many cases that the expression of algorithmic parallelism in CUDA in fields as diverse as oil and gas, bioinformatics and finance is more elegant, compact and readable than equivalently-optimized CPU code, preserving and more clearly presenting the underlying algorithm. In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.
Commodity parallel processing began as a way to divide large tasks over multiple loosely-connected processors. Programming models supported the idea of dividing problems into a number of smaller pieces of equivalent work. Over time those processors have grown closer to one another in terms of latency and bandwidth, first as single operating system multiprocessor nodes and next as multicore processor components of those nodes. Looking towards the future we see only more cores per chip and more chips per node.
Even though our computing cores are more tightly coupled, our view of them is still very much from a top-down, task parallel mindset, i.e., take a large problem, divide it into many small pieces, distribute them to processing elements and just deal with the communication. In this top-down approach, we must discover new parallelism at each level, domain level parallelism for MPI, “for-loop” level for OpenMP, and data level parallelism for SSE. What is intriguing about CUDA is that it takes a bottom-up point of view, identifying the atomic unit of parallelism and embedding that in a hierarchical structure, e.g., thread::warp::block::grid.
The enduring contribution of GPU computing to HPC may well be a programming model that peels us away from the current top-down, multi-level, task-parallel approach, popularizing instead a more scalable bottom-up, data-parallel alternative. It’s not right for every problem but for those that map well to it, such as finite difference stencils and molecular dynamics among many others, it provides a cleaner, more natural language for expressing parallelism. It should be recognized that the simpler, cleaner expression for these applications in code is a main driver for the relatively-rapid adoption by commercial and academic practitioners. Further, there is no intrinsic reason scaling must stop at the grid or device level. One can easily imagine a generalization of CUDA on future architectures that abstracts one or more levels above the grid to accomplish an implementation across multiple devices, effectively aggregating global memory into one contiguous span; a sort of GPU/NUMA approach. If this can be done, then GPU computing will have made a great leap toward solving a key problem in parallel computing by reducing the programming model from three levels to one level for a simpler more elegant solution.
About the Author
Dr. Natoli is the president and founder of Stone Ridge Technology. He is a computational physicist with 20 years experience in the field of high performance computing. He worked as a technical director at High Performance Technologies (HPTi) and before that for 10 years as a senior physicist at ExxonMobil Corporation, at their Corporate Research Lab in Clinton, New Jersey, and in the Upstream Research Center in Houston, Texas. Dr. Natoli holds Bachelor’s and Master’s degrees from MIT, a PhD in Physics from the University of Illinois Urbana-Champaign, and a Masters in Technology Management from the University of Pennsylvania and the Wharton School. Stone Ridge Technology is a professional services firm focused on authoring, profiling, optimizing and porting high performance technical codes to multicore CPUs, GPUs, and FPGAs.
Dr. Natoli can be reached at firstname.lastname@example.org.