News of the massive CORAL procurement for the next generation of pre-exascale systems stole headlines in November, but now that the excitement has settled, many are beginning to ask critical questions about the architecture, and what it will mean for programmers trying to take advantage of the massive amounts of memory and compute, and the options for blending the best of the CPU and GPU worlds.
For background on the Summit and Sierra machines, which are the first two supercomputers that will form the CORAL triad (the third will be at Argonne, but details haven't been released yet), there are details here. In essence, several aspects of the CORAL machines set them apart. The first two we know about will pair IBM's Power9 architecture, which we first heard of during the announcement, with NVIDIA GPUs. However, they won't coordinate like a traditional CPU and coprocessor. Rather, each will have its own memory, addressable from both CPU and GPU, with traffic handled across NVLink, a special high-bandwidth bus that will let both processing elements read one another's mind.
Each node will pack in two Power9 processors and multiple GPUs using NVIDIA's next-next generation Volta architecture, and offer an expected 40 teraflops of peak performance. With 512 GB of HBM and DDR4 memory, a dual-rail Mellanox EDR InfiniBand full non-blocking fat tree network, and GPFS-based elastic storage, the system is likely to set the stage for future exascale-class systems. But with all this power, finding a programming framework that can take advantage of the novel approach of treating the GPU and CPU as separate but equal processing elements is still a work in progress.
As it stands now, the Power9 processor, with its large well of standard memory, will shine on serial tasks, while on the other half of the node the GPU can tackle large parallel tasks, since it is better at managing many threads and can now outsource the serial sections of HPC code to the CPU. While we are quickly nearing the end of the off-chip coprocessor era, the CORAL architecture is a full realization of how both processors can balance the needs of an application by taking on the tasks they are best at. The issue, however, is that codes will need to evolve significantly to exploit these new possibilities.
“When we look at this from an application perspective, we’re starting to feel through how the GPU, instead of being an adjunct to the CPU, is actually a very high performance processor in its own right,” said James Sexton from IBM T.J. Watson Research Center during a lecture on new programming approaches at SC14. “The GPU has its own memory, it has the capability to do compute, and we’re starting to think that rather than having a CPU with an accelerator, we actually have two different equal peer processors. We have a CPU with CPU memory and the same for GPU—instead of thinking one as a master and another as an accelerator, we’re seeing there are other options from the application development standpoint.”
The traditional approach to programming GPU-accelerated systems is well known. All of the data structures are created in main memory, but the data must be copied over to the GPU for computing, then copied back to the CPU. Since GPUs can only act on data that is in their own memory, the limitations, in performance and even in programming, are clear. The difference with this architecture is that since each processor can see the other's memory, the hop between them is lifted. The memory is coherent, which means objects can be dropped in either compute bucket and the selected processor can work on the data right where it is, without copying.
To be fair, while it sounds simple, this is still a complex NUMA architecture, so there are different memory pools, with different bandwidths and latencies to each, but this is a performance issue rather than a functional one, says Sexton.
Theoretically, you can still use these new systems in the same way. The difference is that there is no copying, since the data can be accessed by the GPU without moving it, even if not at the highest possible performance. But of course, why not just begin with the data on the GPU and think of the CPU as the accelerator? In certain applications, in other words, you might end up copying data from the GPU, or working on data in place on the GPU. Now, isn't that fun?
“As we go forward,” said Sexton, there is a recognition that there are “natural affinities for which structures should live on the GPU and which should be on the CPU. One can think about placing the data in the natural location for a given algorithm. And now that you don’t have to move the data, at a certain phase the CPU may be active, then the GPU and back and forth—they can be a chain and handoff of compute control without data movement.”
There are other variations on this theme that Sexton's team found noteworthy in specific application contexts. For instance, miniFE, LSMS, AMG, MCG, and SNAP were used as examples to highlight how there could be multiple MPI tasks per GPU. Further, multiple GPUs can work together as part of a single MPI task, and in other cases, the sensible approach was to put the data on the GPU and come to the CPU only when an application had a small memory footprint.
“The point is, suddenly there are a variety of ways one can lay out an application. Rather than complicating the problem of how to program this system, it simplifies the problem. You think naturally about your code, making sure you select the right variation. So, for instance, if you have a small memory footprint in an application for a particular input set, you might want to place the data on the GPU. If you make the dataset bigger, you might want to place it on the CPU. When you think about these as peer processors and can quickly and easily shift between variations, we think you’re going to have an easy and portable program.”
As the compilers and programming models evolve, there will be increasing capability for locating data or migrating it automatically, Sexton notes. But for now, his recommendations for developers who are just starting to think about programming systems like this are similar to those put forth by centers looking ahead to the next generation of exascale systems. First, continue to develop threaded code. Second, make sure that when objects are created they are robust and configurable, since the choice to run on either the CPU or the GPU could be made later. Finally, he says programmers should expect very large degrees of parallelism, so high thread counts are even more critical.
Sexton and his team, as well as their partners in the OpenPower Foundation stack, including NVIDIA, are still working out the silicon changes that will be required for both the CPU and GPU, as well as how all of this will take advantage of stacked memory, with its promised 4x boost in bandwidth and 3x boost in capacity. But at the end of the day, if these changes don't aid the performance of actual applications, it will be a lot of potential left on the ground.