The National Energy Research Scientific Computing Center (NERSC) is on track to get its next supercomputer system, Cori, by mid-2016. While that’s more than a year away, it’s not too soon to start preparing for the new 30+ petaflops Cray machine, which will feature Intel’s next-generation Knights Landing architecture. So says Richard Gerber, Senior Science Advisor to NERSC Director Sudip Dosanjh and the NERSC User Services Group Lead.
Although Intel, Cray, NERSC and other parties are working to make this a smooth transition, the move to a manycore architecture means that codes will need to be written and optimized to exploit all that parallelism. Gerber’s article addresses this paradigm change and how to make the most of it.
“Getting your codes to run well (or at all) on NERSC’s first ‘many-core’ system is going to take more than a simple recompile,” says Gerber.
The new system will sport over 9,300 nodes built on the next-generation Knights Landing architecture and housed within a Cray XC environment. Each of these chips is capable of delivering more than three teraflops of double-precision performance, which together should enable a ten-fold application speedup over Hopper (aka NERSC-6) and an estimated peak system performance in the neighborhood of 30 petaflops.
“It’s no surprise that NERSC is getting a system like Cori; the HPC community has known for years what was coming,” Gerber writes. “Driven by the limits of physics and technology, as well as the cost of power and cooling, future HPC systems are going to get most of their processing power from energy-efficient many-core processors like GPUs and Intel Xeon Phis. These chips contain 10s to 100s of relatively slow processing cores, meaning that performance gains are only going to be achieved by codes that can simultaneously use these cores in parallel effectively.”
When it comes to manycore systems, some codes work better than others. Issues arise when a code does not expose sufficient fine-grained parallelism to keep all the cores busy, or when too much data must move between the host processor and the accelerator. The benefit of accelerated computing has to outweigh the penalty of moving the data, and getting codes to run well requires a lot of effort.
Manycore systems like the GPU-based Titan at Oak Ridge have been in play for a while now, but programming for Cori is a little different, explains Gerber, owing to the Intel Xeon Phi “Knights Landing” parts. Unlike GPU-based systems and earlier Phi systems, Cori’s Knights Landing nodes will run in a “self-hosted” mode. This is a different setup from the host/coprocessor model: everything, including the operating system, runs on the node. Getting data to a coprocessor is no longer an issue, but data locality is, as Gerber explains:
“Data locality” is something you’re going to be hearing a lot about. That’s partly because each KNL processor will have up to 16 GB of “on-package” or “High Bandwidth Memory” (HBM), which has extremely high bandwidth and is potentially very good for performance. However, if all your data structures don’t fit in the HBM, you will have to use some of each node’s traditional DRAM, and there will be a price to pay to bring data from there to the compute units. So you’ll want to do as much computing as possible using data that resides in the HBM.
There’s also the matter of keeping 60+ cores busy. So what to do? Gerber recommends starting with high-level, coarse-grained parallelism implemented with MPI, then using OpenMP to thread (parallelize) compute-intensive loops in the code. Vectorization comes next, and it’s particularly important for Phi-based architectures, where performance and efficiency depend on effective use of the vector units.
The advice gets more specific from there, so if this applies to you, read more here. You may also want to check out NERSC’s first “hack-a-thon” taking place February 25th at its Oakland Scientific Facility.
As an additional handy reference, Intel has compiled all publicly available Knights Landing disclosures on one webpage.