New Code Paradigms Cooking for CORAL

By Nicole Hemsoth

December 2, 2014

News of the massive CORAL procurement for the next-generation pre-exascale systems stole headlines in November, but now that the excitement is simmering down, many are beginning to ask critical questions about the architecture—and what it will mean for programmers trying to take advantage of the massive amount of memory, compute, and options to blend the best of the CPU and GPU worlds.

For background on the Summit and Sierra machines, which are the first two supercomputers that will form the CORAL triad (the third will be at Argonne, but details haven’t been released yet), there are details here. In essence, there are several aspects of the CORAL machines that set them apart. The first two we know about will utilize IBM’s Power9 architecture, which we first heard of during the announcement, along with NVIDIA GPUs. However, they won’t coordinate like a traditional CPU and coprocessor. Rather, each will have its own memory, addressable from both the CPU and GPU, with traffic handled across NVLink—a special high-bandwidth bus that will let both processing elements read one another’s memory.

Each of the nodes will pack in two Power9 processors, multiple GPUs using NVIDIA’s next-next-generation Volta architecture, and offer up an expected 40 teraflops of peak performance. With 512 GB of HBM and DDR4 memory, a dual-rail Mellanox EDR InfiniBand network in a full non-blocking fat-tree topology, and GPFS-based elastic storage, the system is likely to set the stage for future exascale-class systems. But with all this power, finding a programming framework that can take advantage of the novel approach to the GPU and CPU as separate but equal processing elements is still a work in progress.

As it stands now, the Power9 processor with its large well of standard memory will shine on serial tasks while on the other half of the node, the GPU can tackle large parallel tasks since it’s better at managing more threads and can now outsource the serial sections of HPC code to the CPU. While we’re quickly seeing the end of the off-chip coprocessor era (or at least getting closer), the CORAL architecture is a full realization of how both processors can balance the needs of an application by taking on the tasks they’re best at. The issue, however, is that codes will need to evolve significantly to exploit these new possibilities.

“When we look at this from an application perspective, we’re starting to feel through how the GPU, instead of being an adjunct to the CPU, is actually a very high performance processor in its own right,” said James Sexton from IBM T.J. Watson Research Center during a lecture on new programming approaches at SC14. “The GPU has its own memory, it has the capability to do compute, and we’re starting to think that rather than having a CPU with an accelerator, we actually have two different equal peer processors. We have a CPU with CPU memory and the same for GPU—instead of thinking one as a master and another as an accelerator, we’re seeing there are other options from the application development standpoint.”

The traditional approach to programming GPU-accelerated systems is well known. All the data structures are created in main memory, but the data must be copied over to the GPU for computing, then copied back to the CPU. Since GPUs can only act on data that sits in their own memory, the limitations—in performance and in programmability—are clear. The difference with this architecture is that since each processor can see the other’s memory, the hop between them is lifted. The memory is coherent, which means objects can be dropped in either compute bucket and the selected processor can work on the data right where it is, without copying.
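The contrast can be sketched in CUDA terms. The first fragment below shows the classic copy-in/copy-out pattern; the second shows the coherent style, where a single allocation is visible to both processors. This is an illustrative sketch only: `scale_kernel` and the sizes are hypothetical, and a managed allocation here stands in for the hardware coherence the CORAL nodes are expected to provide over NVLink.

```cuda
#include <cuda_runtime.h>

__global__ void scale_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void classic_copy_model(float *host_data, int n) {
    // Traditional model: data lives in CPU memory and must be staged
    // into the GPU's memory before a kernel can touch it.
    float *dev_data;
    cudaMalloc(&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev_data, n);
    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}

void coherent_model(int n) {
    // Coherent model: one allocation is addressable from both sides, so
    // whichever processor suits the task works on the data in place.
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; i++) data[i] = (float)i;   // CPU writes directly
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n);  // GPU uses the same pointer
    cudaDeviceSynchronize();                          // no explicit copies either way
    cudaFree(data);
}
```

Note that in the second version there is no `cudaMemcpy` at all—the question shifts from "how do I move the data?" to "where should the data live?"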

To be fair, while it sounds simple, this is still a complex NUMA architecture, so there are different memory pools, with different bandwidths and latencies to each—but this is a performance issue rather than a functional one, says Sexton.

In theory, you can still use these new systems in the same old way. The difference is that there is no copying, since the data can be accessed by the GPU without moving it, even if not at the highest possible performance. But of course, why not just begin with the data on the GPU and think of the CPU as the accelerator? In certain applications, in other words, you might copy data from the GPU to the CPU, or simply work on the data in place on the GPU. Now, isn’t that fun?

“As we go forward,” said Sexton, there is a recognition that there are “natural affinities for which structures should live on the GPU and which should be on the CPU. One can think about placing the data in the natural location for a given algorithm. And now that you don’t have to move the data, at a certain phase the CPU may be active, then the GPU and back and forth—they can be a chain and handoff of compute control without data movement.”
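That chain of compute control might look something like the following sketch, where control alternates between the two peers while the data itself never moves. The phase functions here are hypothetical stand-ins, not any particular application's code.

```cuda
#include <cuda_runtime.h>

__global__ void parallel_phase(double *field, int n) {
    // Stand-in for the highly parallel section the GPU is better at.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] = field[i] * 0.5 + 1.0;
}

void serial_phase(double *field, int n) {
    // Stand-in for the serial section the CPU is better at
    // (a prefix sum has a loop-carried dependence, for example).
    for (int i = 1; i < n; i++) field[i] += field[i - 1];
}

void timestep_loop(double *field, int n, int steps) {
    // One coherent allocation; control is handed back and forth,
    // but the data stays put.
    for (int t = 0; t < steps; t++) {
        serial_phase(field, n);                              // CPU phase
        parallel_phase<<<(n + 255) / 256, 256>>>(field, n);  // GPU phase
        cudaDeviceSynchronize();  // a handoff of control, not of data
    }
}
```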

There are other variations on this theme that Sexton’s team found noteworthy in specific application contexts. For instance, miniFE, LSMS, AMG, MCG and SNAP were used as examples to highlight how there could be multiple MPI tasks per GPU, or multiple GPUs working together as part of a single MPI task. In other cases, the sensible approach was to put the data on the GPU and fall back to the CPU only where an application’s memory footprint was small.

“The point is, suddenly there are a variety of ways one can lay out an application. Rather than complicating the problem of how to program this system, it simplifies the problem. You think naturally about your code in making sure you select the right variation. So, for instance, if you have a small memory footprint in an application for a particular input set, you might want to place the data on the GPU. If you make the dataset bigger, you might want to place it on the CPU. When you think about these as peer processors, you can quickly and easily shift between variations, and we think you’re going to have an easy and portable program.”
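One hedged sketch of that placement choice, using the unified-memory advice calls that appeared in later CUDA releases (so this is an approximation of the idea, not the CORAL toolchain itself): the allocation is the same either way, and only the preferred residence changes with the problem size. The threshold and the GPU device id 0 are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical threshold: below it, the working set fits comfortably
// in GPU memory; above it, keep the data resident on the CPU side.
const size_t GPU_FOOTPRINT_LIMIT = 8UL << 30;  // 8 GB, illustrative only

float *allocate_with_affinity(size_t bytes) {
    float *data;
    cudaMallocManaged(&data, bytes);  // one allocation, visible to both peers
    if (bytes <= GPU_FOOTPRINT_LIMIT) {
        // Small footprint: prefer GPU residence; the CPU can still
        // touch the data when its serial phases need it.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);
    } else {
        // Large footprint: prefer CPU residence; GPU kernels reach it
        // over the coherent link without an explicit copy.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);
    }
    return data;
}
```

Because the advice only changes where pages prefer to live, switching between the two variations is a one-line policy change rather than a rewrite of the data-movement logic—which is the portability argument Sexton is making.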

As the compilers and programming models evolve, there will be increasing capability for locating data or migrating it automatically, Sexton notes. But for now, his recommendations for developers who are just starting to think about programming for systems like this are similar to those put forth by centers looking to the next generation of exascale systems. First, continue to develop threaded code. Second, make sure that when objects are created they are robust and configurable, since the choice to run on either the CPU or the GPU can be made later. Finally, he says programmers should be thinking that very large degrees of parallelism will be possible, so high thread counts are even more critical.

Sexton and his team, as well as their partners in the OpenPower Foundation stack, NVIDIA among them, are still working out the silicon changes that will be required for both the CPU and GPU, as well as how this will take advantage of stacked memory, with its boost of 4x the bandwidth and 3x the capacity. But at the end of the day, if these advances don’t aid the performance of actual applications, it will be a lot of grounded potential.
