Programming Models for Scalable Multicore Programming

By Michael D. McCool

September 28, 2007

Multicore devices will quickly evolve in both architecture and core count. This will motivate software developers to decouple the code from the hardware, in order to enable applications to move between different architectures and automatically scale as new processor generations are introduced. An appropriate programming model can enable this decoupling while maintaining — and even enhancing — performance.

Moore’s Law is a statement about transistor density increasing over time. It has become harder and harder to squeeze extra performance out of a single core by using more transistors, and because power consumption rises rapidly and nonlinearly with clock rate, scaling to higher clock frequencies no longer yields further performance gains. Therefore, all major processor vendors have now switched to an explicitly parallel, multicore processor strategy. By combining multiple small, efficient cores onto a single chip, it is possible to get much higher overall performance and simultaneously improve power efficiency.

Unfortunately, only parallelized applications can exploit this additional performance. Indeed, since the individual cores are often slower than the large single-core processors of the past, non-parallelized applications may actually run slower on multicore processors. Also, since the number of cores will grow exponentially over time (under the new interpretation of Moore’s Law), an application must be written to use any number of cores in a scalable fashion if its performance is to keep growing.

Autoparallelization tools are unlikely to help. Modern processors already internally exploit much of the implicit parallelism in an application, in the form of low-level instruction-level parallelism (ILP). It has been shown that most applications have relatively small amounts of such implicit parallelism, and that it is already nearly fully utilized by modern processors.

However, there are further complications. The memory system is actually the chief bottleneck in many applications. In order to take advantage of the increased computational performance of a processor, the data must be moved onto the chip and off again as efficiently as possible. If the data rate cannot keep pace with the computational rate, then any increase in on-chip computational performance is useless.

In a multicore processor, all cores must share a finite off-chip bandwidth, making memory access even more of a bottleneck. Also, accessing main memory from the processor, for data that is not in cache, can take hundreds of processor clock cycles to complete. This latency can severely degrade performance, since in the worst case the processor must stall while waiting for the memory access to complete.

There is a solution to this: even more parallelism! If the processor has extra, independent work to do while waiting for long-latency operations to complete, then it can run more efficiently. Single-core simultaneous multithreading, also called hyperthreading, is really a mechanism to hide latency. By having multiple concurrent tasks on a single core, it is possible to switch from one to another when one task encounters a long-latency operation, such as a memory access.

Little’s Law states that, for efficient execution, the number of concurrent tasks “in flight” at any point in time should equal the latency multiplied by the throughput. A modern four-core processor able to issue four floating-point operations per core per clock (using SSE instructions or some other form of instruction-level parallelism) has a total parallelism of 16, since it can issue 16 operations per clock. Suppose that, on average, we access main memory once for every 8 numerical operations, which is an optimistic value; the processor then demands two memory accesses per clock. With a main memory latency of 128 cycles (again optimistic), Little’s Law says we need 2 × 128 = 256 separate, independent tasks in flight to fully utilize the processor.
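
To make the arithmetic concrete, here is a minimal sketch in C++ of that estimate, using the illustrative numbers above (four cores, four-wide issue, one memory access per eight operations, a 128-cycle latency); the figures are this article's examples, not measurements of any particular processor:

    #include <cstdio>

    int main() {
        const double cores          = 4;    // cores per processor
        const double issue_width    = 4;    // floating-point operations per core per clock (e.g., SSE)
        const double ops_per_access = 8;    // numerical operations per main-memory access (optimistic)
        const double latency        = 128;  // main-memory latency in clock cycles (optimistic)

        // Rate at which the chip generates memory accesses:
        double accesses_per_clock = (cores * issue_width) / ops_per_access;  // = 2

        // Little's Law: concurrency = throughput * latency
        double tasks_in_flight = accesses_per_clock * latency;               // = 256

        std::printf("independent tasks needed: %.0f\n", tasks_in_flight);
        return 0;
    }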

In other words, multicore processing only exacerbates an already challenging problem. Most software today is grossly inefficient, because it is not written with sufficient parallelism in mind. Breaking up an application into a handful of tasks is not a long-term solution. First, efficient execution requires a great deal of parallelism: far more than the number of cores. Second, with the number of cores increasing exponentially, more and more parallelism will be needed over time.

The solution to this dilemma is data parallelism. In data parallelism, the structure of the data is used to drive the creation of more and more parallel tasks as needed. Since larger problems with more data naturally result in more parallel tasks, a data-parallel approach results in a scalable solution that can automatically take advantage of more and more cores. Because data-parallel programming models also focus on the data and its movement, they produce predictable memory access patterns, which can in turn be used to improve the efficiency of memory access.
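
As a concrete illustration, the following sketch uses the parallel algorithms of modern standard C++ (an assumption of convenience for this example, not the platform discussed later in the article). The size of the data, rather than an explicit thread count, determines how much parallel work exists, so the same code scales as the data and the core count grow:

    #include <algorithm>
    #include <execution>
    #include <vector>

    // Apply a * x + b to every element; one logical task per element, which
    // the runtime maps onto however many cores are available.
    std::vector<float> scale_and_offset(const std::vector<float>& in,
                                        float a, float b) {
        std::vector<float> out(in.size());
        std::transform(std::execution::par_unseq, in.begin(), in.end(),
                       out.begin(),
                       [=](float x) { return a * x + b; });
        return out;
    }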

There is some concern about the general applicability of data parallelism, but it is important to understand that there are a variety of data parallel programming models available. Some of the simplest forms can only be used on very regular problems, but the most flexible models are capable of dealing with a variety of different kinds of irregularities, and are equivalent in expressive power to task parallelism.

The SIMD (Single Instruction, Multiple Data) model is the simplest data-parallel model but is also the most limited. In this model, a sequence of operations is applied in parallel to all the elements of a collection of data, such as an array or set. A naive implementation of this model is not very efficient on modern memory-constrained architectures, since it reads and writes memory for each simple operation. In practice, several operations should be fused into a single kernel with higher arithmetic intensity, and the kernel should be unrolled and vectorized to exploit instruction-level parallelism as well as multicore parallelism.
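
The difference can be sketched in plain C++ (illustrative only). The naive version makes one pass over memory per simple operation; the fused kernel performs the same arithmetic, (a*x + b) squared, in a single pass that the compiler can unroll and vectorize:

    #include <cstddef>

    // Naive SIMD style: three passes over memory, one per simple operation.
    void naive(float* x, float* tmp, std::size_t n, float a, float b) {
        for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] * a;        // pass 1
        for (std::size_t i = 0; i < n; ++i) tmp[i] = tmp[i] + b;      // pass 2
        for (std::size_t i = 0; i < n; ++i) x[i] = tmp[i] * tmp[i];   // pass 3
    }

    // Fused kernel: the same computation in one pass, with higher
    // arithmetic intensity per byte of memory traffic.
    void fused(float* x, std::size_t n, float a, float b) {
        for (std::size_t i = 0; i < n; ++i) {
            float t = x[i] * a + b;
            x[i] = t * t;
        }
    }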

Even so, the basic SIMD model cannot handle irregularities of control or data access. It cannot, for instance, avoid doing unnecessary work, since every element of the data collection will have exactly the same sequence of operations applied. If there are special cases, such as boundary conditions, that require more or less work, the SIMD model cannot take advantage of this fact, and has to do the worst-case computation all the time. This can degrade performance unacceptably.
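
The worst-case behavior can be sketched as follows (illustrative C++, with scalar code standing in for vector instructions): with no per-element control flow, both sides of a conditional are evaluated for every element, and a select keeps the result that was actually wanted:

    #include <cmath>
    #include <cstddef>

    // Pure SIMD style: every element pays for the expensive case, even
    // when the cheap case would have sufficed.
    void simd_style(const float* in, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            float cheap     = 0.0f;                          // computed for every element
            float expensive = std::sqrt(std::fabs(in[i]));   // also computed for every element
            out[i] = (in[i] < 0.0f) ? cheap : expensive;     // select the desired result
        }
    }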

The SPMD (Single Program, Multiple Data) model is better. In this model, every kernel can also include control flow, which allows a kernel to do more or less work, as the situation demands. This type of model can handle irregularity in workload, but also requires a more sophisticated runtime that can move heavier parts of the workload to more lightly loaded cores.
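
Continuing the sketch above (again plain scalar C++ standing in for a real SPMD kernel language), the same program runs for every element, but ordinary control flow lets each element do only the work it actually needs:

    #include <cmath>
    #include <cstddef>

    // SPMD style: each iteration is one logical program instance, and the
    // branch lets cheap elements skip the expensive computation entirely.
    void spmd_kernel(const float* in, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (in[i] < 0.0f) {
                out[i] = 0.0f;               // cheap special case
            } else {
                out[i] = std::sqrt(in[i]);   // expensive general case
            }
        }
    }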

Collective operations can be added to this model to support irregularity in communication and data access. A general scatter/gather collective operation can express any irregular memory access pattern while remaining a structured, parallel operation. The combination of an SPMD model for kernels with scatter/gather is equivalent in computational power to threading, but is more structured and leads more naturally to the massive parallelism required for performance. It also has certain advantages in safety; for example, it is impossible to express programs with deadlock in this model. Certain refinements to this model are possible; for example, it can be extended to nested or recursive parallelism. Even without these refinements, however, it is applicable to a wide range of applications, as has been shown by extensive research over the last twenty years.
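
Gather and scatter can be sketched as loops over an index array (illustrative C++; the indices are assumed to be in range and, for scatter, free of duplicates, so that every iteration is independent and the whole operation can be executed in parallel):

    #include <cstddef>
    #include <vector>

    // Gather: read from arbitrary (possibly irregular) locations in src.
    std::vector<float> gather(const std::vector<float>& src,
                              const std::vector<std::size_t>& idx) {
        std::vector<float> out(idx.size());
        for (std::size_t i = 0; i < idx.size(); ++i)
            out[i] = src[idx[i]];
        return out;
    }

    // Scatter: write to arbitrary (possibly irregular) locations in dst.
    void scatter(const std::vector<float>& values,
                 const std::vector<std::size_t>& idx,
                 std::vector<float>& dst) {
        for (std::size_t i = 0; i < idx.size(); ++i)
            dst[idx[i]] = values[i];
    }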

Finally, we come to the problem of programming mechanisms. Efficient implementation of a program on a parallel computer requires the coordinated generation of low-level machine language to exploit instruction-level parallelism, as well as a high-level runtime to support such operations as load balancing. This is because, as pointed out above, processors actually support multiple parallelism mechanisms over a range of scales, and exploiting them all simultaneously is crucial for performance.

Many parallel programming languages have been devised, but these systems are not in wide use, and most existing application code is written in languages such as C and C++ that do not natively support parallelism. Tuned parallel libraries can be used, but these are only suitable for the most common, stereotypical tasks, and even library writers need something to program with. Frameworks can also be used, but these typically address only one level of parallelism at a time, such as multiple cores, and cannot coordinate code generation with the runtime, since each is unaware of the other.

In developing the RapidMind platform, we have taken a different approach. We start with a programming model, SPMD data-parallelism, that we know is both general and efficient. We provide access to this from existing C++ compilers, but without using the native C++ code generator: instead we provide our own, so that the generated code can be coordinated with the runtime. This allows us to exploit multiple granularities and mechanisms for parallelism simultaneously. The simplicity of the SPMD model means that the programmer can continue to work with familiar concepts, like functions and arrays, but can also directly express parallel algorithms in a natural and efficient way. We have also been able to map this programming model to widely divergent architectures with excellent performance, including GPUs, the Cell BE, and of course multicore CPUs. This portability is useful today but is also crucial in order to future-proof application code against likely changes in processor architectures. Our system takes a high-level abstraction of parallelism and maps it to what is available, and can do so efficiently.

In summary, efficient performance on modern multicore processors requires an aggressive approach to parallelism. There are many performance mechanisms in modern processors, including but not limited to multiple cores, that depend on parallelism. In addition, memory bandwidth and latency can severely degrade performance if not managed, but “spare” parallelism can be used to hide latency. Data-parallel programming models can be used to express the required level of parallelism but also to expose coherent memory access patterns, which can be used to optimize memory bandwidth. A sufficiently general data-parallel computational model, such as the SPMD model, is as powerful as a task-parallel programming model, so no generality is lost in using this more efficient and scalable approach. Finally, because this model also provides portability, application logic can be decoupled from hardware deployment, giving software developers more choices and a measure of assurance that their code will continue to perform well on future massively multicore processors.

—–

About the Author

Michael McCool is an Associate Professor at the University of Waterloo and co-founder of RapidMind. He continues to perform research within the Computer Graphics Lab at the University of Waterloo. Professor McCool has a diverse set of published papers, and his research interests include high-quality real-time rendering, global and local illumination, hardware algorithms, parallel computing, reconfigurable computing, interval and Monte Carlo methods and applications, end-user programming and metaprogramming, image and signal processing, and sampling. Michael has degrees in Computer Engineering and Computer Science.
