In our second video feature from the HPC User Forum panel, “The Who-What-When of Getting Applications Ready to Run On, And Across, Office of Science Next-Gen Leadership Computing Systems,” we learn more about the goals and challenges associated with getting science applications ready for the coming crop of Department of Energy (DOE) supercomputers, which, in addition to being five to seven times faster than today’s generation, represent significant architectural changes.
In part one, we heard from Tjerk Straatsma about the work being done at Oak Ridge Leadership Computing Facility’s Center for Accelerated Application Readiness to prepare for the coming Summit supercomputer. Next up, Katie Antypas, scientific computing and data services department head at the National Energy Research Scientific Computing Center (NERSC), weighs in on the transition to Cori. Named after American biochemist Gerty Cori, the Cray system will be installed at a brand-new, purpose-built facility at Lawrence Berkeley National Laboratory (LBNL) in mid-2016.
Cori: The Big Bet on Knights Landing and Enhanced “Silvermont”
As expected, getting the most from applications running on Cori will require more extensive use of parallelism (domain, threading, and data locality). NERSC’s Katie Antypas took a somewhat deeper look at what that means, including examples of approaches that help and some surprises that don’t.
Cori, of course, is on deck to become the pre-exascale supercomputer for the DOE’s Office of Science. Antypas pointed out that the designation “pre-exascale” suggests that there are technologies used by Cori that will continue into the exascale era. She then highlighted key elements of Cori’s architecture, and reviewed the NERSC Exascale Science Application Program goals, which include preparing the large DOE Office of Science user community for Cori’s manycore architecture.
This slide from Antypas’ talk captures key points:
• The good news – there does seem to be some low-hanging fruit: improving vectorization
– Multiple codes saw 10%–2X speed-ups in a one-afternoon hackathon
– Examine key loops in code that break the compiler’s ability to vectorize
• Understanding whether your code is memory-bandwidth-, CPU-, or latency-bound is key
– Users don’t seem to know this – and it’s complex, changing from kernel to kernel and architecture to architecture
– We’ve found Intel VTune very helpful in measuring memory bandwidth
• Optimization techniques are completely dependent on specific code and algorithm
– We can provide teams guidance on where to go, but there is no fixed recipe
• Portability is a key concern – let’s debate on the panel
The ensuing discussion about the feasibility of achieving portability across next-gen systems, and the true performance cost of that portability, was lively.