OpenACC Reviews Latest Developments and Future Plans

By Tiffany Trader

November 11, 2015

This week during the lead up to SC15 the OpenACC standards group announced several new developments including the release and ratification of the 2.5 version of the OpenACC API specification, member support for multiple new OpenACC targets, and other progress with the standard.

“The 2.5 specification addresses an essential challenge of profiling code where a few simple directives transform serial instructions and spread the work across thousands of cores,” said Duncan Poole, president of OpenACC-Standard.org and director of platform alliances, accelerated computing, at NVIDIA. “Using tools that support OpenACC, developers have an important lead in creating code that performs well across a variety of multi-core host devices and accelerators, including Titan, and the upcoming DOE Coral systems.”

OpenACC simplifies the programming of accelerated computing systems through the use of directives, which identify compute intensive code to a compiler for acceleration or offload, while preserving a single code base. The key aim of OpenACC is to enable performance portability across a growing number of HPC processor types, including GPUs, manycore coprocessors and multicore CPUs. In addition to being available from different compiler vendors, the standard is supported by an expanding range of debuggers, profilers and other programming tools.

This year brought the addition of ARM CPU and x86 CPU code generation using OpenACC, noted Poole. The PGI compiler can now accelerate applications across multiple x86 cores, while the PathScale compiler now supports acceleration across the 48 cores of a Cavium ThunderX ARM processor. In 2016, the standards body and its partners will be focusing on OpenACC running on POWER with GPUs, ARM with GPUs and Xeon Phi. “Basically we’re fulfilling the mission that we set out,” said Poole, “which was to be able to build portable code and work with all of the relevant architectures that are either here or emerging.”

Poole also talked up the OpenACC Toolkit, which was introduced in July to give scientists and researchers the tools and documentation they need to be successful with OpenACC. “If you want to see some pickup by the academic and research community, you need a free, but robust compiler,” he said. “NVIDIA put together a combination of the PGI compiler coupled with some key profiling tools and other information that would help a new academic get started and created a toolkit that is free for academic and student use.”

This is one of the ways that OpenACC is extending its developer base. Adoption has grown to some 10,000+ OpenACC developers, a conservative estimate according to the team. Training courses are well-received too, with Cal Poly, for example, having added the OpenACC curriculum as a four-credit course. Further, around 1,850 participants have registered for the OpenACC online course and OpenACC Hackathons are over-subscribed.

“We’re seeing a mix of direct site enthusiasm coupled with support coming from labs and compiler developers,” Poole said.

Key codes being ported during the 2015 Hackathons span a variety of disciplines, including computational fluid dynamics (INCOM3D, HiPSTAR and Numeca), cosmology and astrophysics (CASTRO and MAESTRO), quantum chemistry (LSDALTON), computational physics (Nek-CEM) and many more.

At the NCSA Hackathon, a team successfully accelerated an advanced MRI reconstruction model using NVIDIA GPUs. The challenge for the team was to take serial code and get it running on Blue Waters. Naturally, runtime took a dip at first, but speedups ensued as directives were added. In just a few days, the team managed to reduce reconstruction time for a single high-resolution MRI scan from 40 days to a couple of hours.

“Now that we’ve seen how easy it is to program the GPU using OpenACC and the PGI compiler, we’re looking forward to translating more of our projects,” said Brad Sutton, associate professor of bioengineering and technical director of the Biomedical Imaging Center University of Illinois at Urbana-Champaign. The implementation may even be suitable for powering clinical work, an exciting idea for the staff at Blue Waters.

OpenACC 2.5 and Beyond

Michael Wolfe, PGI compiler engineer and OpenACC technical committee chair, characterized the just-ratified 2.5 spec as somewhat of an interim release. The group has been working on both highly significant features in addition to a number of minor features, he said, and the aim of 2.5 was to take all the features that they could complete for this deadline and put those out. Beyond making some clarifications and fixing some spelling errors, the new release adds the following features:

• Asynchronous Data Movement
• Queue Management Routines
• Kernel Construct Size Clauses
• Profile and Trace Interface

The biggest feature in OpenACC 2.5, according to Wolfe, is the final bullet point — profile and trace interface — which will allow third-party tools vendors to tie into OpenACC runtime so they can access and present the performance related information. The functionality was initially part of a prototype implementation in the PGI OpenACC compiler and now with some minor changes based on feedback from that effort, the standards body has added this capability to the specification with the expectation that it will start being supported this year. TU Dresden has been using the PGI interface for profiling work and will be running a demonstration in their booth (#1351) at SC next week.

The following major features, however, are still going to take some extra effort, said Wolfe:

• Deep Copy – Nested Dynamic Data Structures
+ Substantial User Feedback
+ Builds on 2014 Tech Report
• Exposed Memory Hierarchy Management
• Multiple Device Support

According to Wolfe, deep copy is the signature feature that OpenACC is pushing to have ready for the 3.0 release. He said users have been asking for this for years and there are current machines that could really benefit, such as Piz Daint and Titan. “It’s a way to handle nested dynamic data structures,” Wolfe explained. “If you have a large array and you’re going to compute this on a device like a GPU, you’d want to move the array over to the device. But what if the array is actually an array of struct and each element of that array is a struct that has a sub-member that’s another allocatable array and each one of those sub-members is a different size and you want to move this whole deep data structure and maybe each of those sub-members is an array that is another struct that has another allocatable sub-member? And yes, we’ve seen this at least three levels deep in real scientific applications, weather applications in particular.

“So we need a way to manage the data traffic between host and device for basically pointer following on these deep data structures. It’s the number one most-requested features we’ve had from users from the past couple of years and in particular at the Hackathons,” said Wolfe. “What we’ve been working on is a way to express this in a way that provides all the functionality that we need, so we capture the cases that we know about in a way that is general enough to capture those cases but simple enough that it’s relatively easy to express and to use.”

Moving on to the next bullet point, Wolfe said OpenACC is getting ready for the exposed memory hierarchies associated with upcoming systems. “Think Knights Landing, Xeon Phi with near and far memory, or AMD high-bandwidth memory systems, or NVIDIA in the Pascal Volta timeframe where you’ve got a true unified address space but separate physical memories,” Wolfe prodded, “There is a need to manage data movement across what is basically an exposed memory hierarchy in a way that respects performance but is also as natural and portable as possible across all the different various systems.”

Another upcoming feature slated for a future release is improved multiple device support. While OpenACC already supports multiple devices, it’s not as convenient to use as people would like, said Wolfe. “We’ve had success with people doing multiple MPI ranks and multiple OpenMP threads and having each process, or each thread attach to a different device, and that works, but maybe there’s a better way to make it within a single thread use multiple devices,” Wolfe stated. “Compiling for an x86 multicore or an ARM multicore — if we treat that like a device and you have GPUs, now you have heterogeneous devices. Can we spread the work across all the devices counting the host multicore as a device in itself?”

“The challenge there is more about the data than it is about the compute,” he clarified. “It’s relatively easy to spread the computation across resources, it’s more of a challenge to make sure the data’s in the right place so we get the performance we want. That’s coming up in the next release or following releases.”

Comparisons to OpenMP

We also had a chance to discuss the relationship between OpenMP and OpenACC, and the potential for merging in light of the fact that there are members that are common to both organizations. “If you think about the act of parallelizing your code, just figuring out where those directives should go, is the hard part, and there is overlap in terms of location and placement of these directives,” said Poole. “Because of the cross-over in membership, there’s some value in having developers getting real-world experience, giving their feedback and having real production compilers implement a standard. I think all of that is goodness flowing back into OpenMP. In some ways OpenACC may be the best thing that ever happened to OpenMP in terms of giving that real feedback ahead of time.”

“From a technical perspective, there are certainly ways that OpenMP, MPI, OpenACC, even CUDA can interoperate, so these are not insurmountable challenges; it doesn’t have to be only a single way of programming as I would call the Highlander approach,” he added.

We’ll be diving deeper into this topic in a future piece, but as a teaser, here is a slide that was shared during the briefing:

OpenACC and OpenMP 4 - Jan 2015 slide

For those of you headed to Austin for SC15 next week, OpenACC members will be participating in number of presentations, talks and discussions — more info is available here.

 

Browse News From SC15

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Why HPC Storage Matters More Now Than Ever: Analyst Q&A

September 17, 2021

With soaring data volumes and insatiable computing driving nearly every facet of economic, social and scientific progress, data storage is seizing the spotlight. Hyperion Research analyst and noted storage expert Mark No Read more…

GigaIO Gets $14.7M in Series B Funding to Expand Its Composable Fabric Technology to Customers

September 16, 2021

Just before the COVID-19 pandemic began in March 2020, GigaIO introduced its Universal Composable Fabric technology, which allows enterprises to bring together any HPC and AI resources and integrate them with networking, Read more…

What’s New in HPC Research: Solar Power, ExaWorks, Optane & More

September 16, 2021

In this regular feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

Cerebras Brings Its Wafer-Scale Engine AI System to the Cloud

September 16, 2021

Five months ago, when Cerebras Systems debuted its second-generation wafer-scale silicon system (CS-2), co-founder and CEO Andrew Feldman hinted of the company’s coming cloud plans, and now those plans have come to fruition. Today, Cerebras and Cirrascale Cloud Services are launching... Read more…

AI Hardware Summit: Panel on Memory Looks Forward

September 15, 2021

What will system memory look like in five years? Good question. While Monday's panel, Designing AI Super-Chips at the Speed of Memory, at the AI Hardware Summit, tackled several topics, the panelists also took a brief glimpse into the future. Unlike compute, storage and networking, which... Read more…

AWS Solution Channel

Supporting Climate Model Simulations to Accelerate Climate Science

The Amazon Sustainability Data Initiative (ASDI), AWS is donating cloud resources, technical support, and access to scalable infrastructure and fast networking providing high performance computing (HPC) solutions to support simulations of near-term climate using the National Center for Atmospheric Research (NCAR) Community Earth System Model Version 2 (CESM2) and its Whole Atmosphere Community Climate Model (WACCM). Read more…

ECMWF Opens Bologna Datacenter in Preparation for Atos Supercomputer

September 14, 2021

In January 2020, the European Centre for Medium-Range Weather Forecasts (ECMWF) – a juggernaut in the weather forecasting scene – signed a four-year, $89-million contract with European tech firm Atos to quintuple its supercomputing capacity. With the deal approaching the two-year mark, ECMWF... Read more…

GigaIO Gets $14.7M in Series B Funding to Expand Its Composable Fabric Technology to Customers

September 16, 2021

Just before the COVID-19 pandemic began in March 2020, GigaIO introduced its Universal Composable Fabric technology, which allows enterprises to bring together Read more…

Cerebras Brings Its Wafer-Scale Engine AI System to the Cloud

September 16, 2021

Five months ago, when Cerebras Systems debuted its second-generation wafer-scale silicon system (CS-2), co-founder and CEO Andrew Feldman hinted of the company’s coming cloud plans, and now those plans have come to fruition. Today, Cerebras and Cirrascale Cloud Services are launching... Read more…

AI Hardware Summit: Panel on Memory Looks Forward

September 15, 2021

What will system memory look like in five years? Good question. While Monday's panel, Designing AI Super-Chips at the Speed of Memory, at the AI Hardware Summit, tackled several topics, the panelists also took a brief glimpse into the future. Unlike compute, storage and networking, which... Read more…

ECMWF Opens Bologna Datacenter in Preparation for Atos Supercomputer

September 14, 2021

In January 2020, the European Centre for Medium-Range Weather Forecasts (ECMWF) – a juggernaut in the weather forecasting scene – signed a four-year, $89-million contract with European tech firm Atos to quintuple its supercomputing capacity. With the deal approaching the two-year mark, ECMWF... Read more…

Quantum Computer Market Headed to $830M in 2024

September 13, 2021

What is one to make of the quantum computing market? Energized (lots of funding) but still chaotic and advancing in unpredictable ways (e.g. competing qubit tec Read more…

Amazon, NCAR, SilverLining Team for Unprecedented Cloud Climate Simulations

September 10, 2021

Earth’s climate is, to put it mildly, not in a good place. In the wake of a damning report from the Intergovernmental Panel on Climate Change (IPCC), scientis Read more…

After Roadblocks and Renewals, EuroHPC Targets a Bigger, Quantum Future

September 9, 2021

The EuroHPC Joint Undertaking (JU) was formalized in 2018, beginning a new era of European supercomputing that began to bear fruit this year with the launch of several of the first EuroHPC systems. The undertaking, however, has not been without its speed bumps, and the Union faces an uphill... Read more…

How Argonne Is Preparing for Exascale in 2022

September 8, 2021

Additional details came to light on Argonne National Laboratory’s preparation for the 2022 Aurora exascale-class supercomputer, during the HPC User Forum, held virtually this week on account of pandemic. Exascale Computing Project director Doug Kothe reviewed some of the 'early exascale hardware' at Argonne, Oak Ridge and NERSC (Perlmutter), while Ti Leggett, Deputy Project Director & Deputy Director... Read more…

Ahead of ‘Dojo,’ Tesla Reveals Its Massive Precursor Supercomputer

June 22, 2021

In spring 2019, Tesla made cryptic reference to a project called Dojo, a “super-powerful training computer” for video data processing. Then, in summer 2020, Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.” Read more…

Berkeley Lab Debuts Perlmutter, World’s Fastest AI Supercomputer

May 27, 2021

A ribbon-cutting ceremony held virtually at Berkeley Lab's National Energy Research Scientific Computing Center (NERSC) today marked the official launch of Perlmutter – aka NERSC-9 – the GPU-accelerated supercomputer built by HPE in partnership with Nvidia and AMD. Read more…

Esperanto, Silicon in Hand, Champions the Efficiency of Its 1,092-Core RISC-V Chip

August 27, 2021

Esperanto Technologies made waves last December when it announced ET-SoC-1, a new RISC-V-based chip aimed at machine learning that packed nearly 1,100 cores onto a package small enough to fit six times over on a single PCIe card. Now, Esperanto is back, silicon in-hand and taking aim... Read more…

Enter Dojo: Tesla Reveals Design for Modular Supercomputer & D1 Chip

August 20, 2021

Two months ago, Tesla revealed a massive GPU cluster that it said was “roughly the number five supercomputer in the world,” and which was just a precursor to Tesla’s real supercomputing moonshot: the long-rumored, little-detailed Dojo system. “We’ve been scaling our neural network training compute dramatically over the last few years,” said Milan Kovac, Tesla’s director of autopilot engineering. Read more…

CentOS Replacement Rocky Linux Is Now in GA and Under Independent Control

June 21, 2021

The Rocky Enterprise Software Foundation (RESF) is announcing the general availability of Rocky Linux, release 8.4, designed as a drop-in replacement for the soon-to-be discontinued CentOS. The GA release is launching six-and-a-half months after Red Hat deprecated its support for the widely popular, free CentOS server operating system. The Rocky Linux development effort... Read more…

Intel Completes LLVM Adoption; Will End Updates to Classic C/C++ Compilers in Future

August 10, 2021

Intel reported in a blog this week that its adoption of the open source LLVM architecture for Intel’s C/C++ compiler is complete. The transition is part of In Read more…

Google Launches TPU v4 AI Chips

May 20, 2021

Google CEO Sundar Pichai spoke for only one minute and 42 seconds about the company’s latest TPU v4 Tensor Processing Units during his keynote at the Google I Read more…

AMD-Xilinx Deal Gains UK, EU Approvals — China’s Decision Still Pending

July 1, 2021

AMD’s planned acquisition of FPGA maker Xilinx is now in the hands of Chinese regulators after needed antitrust approvals for the $35 billion deal were receiv Read more…

Leading Solution Providers

Contributors

Hot Chips: Here Come the DPUs and IPUs from Arm, Nvidia and Intel

August 25, 2021

The emergence of data processing units (DPU) and infrastructure processing units (IPU) as potentially important pieces in cloud and datacenter architectures was Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

HPE Wins $2B GreenLake HPC-as-a-Service Deal with NSA

September 1, 2021

In the heated, oft-contentious, government IT space, HPE has won a massive $2 billion contract to provide HPC and AI services to the United States’ National Security Agency (NSA). Following on the heels of the now-canceled $10 billion JEDI contract (reissued as JWCC) and a $10 billion... Read more…

Quantum Roundup: IBM, Rigetti, Phasecraft, Oxford QC, China, and More

July 13, 2021

IBM yesterday announced a proof for a quantum ML algorithm. A week ago, it unveiled a new topology for its quantum processors. Last Friday, the Technical Univer Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

Frontier to Meet 20MW Exascale Power Target Set by DARPA in 2008

July 14, 2021

After more than a decade of planning, the United States’ first exascale computer, Frontier, is set to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this “1,000x” horizon required overcoming four major challenges: power demand, reliability, extreme parallelism and data movement. Read more…

Intel Unveils New Node Names; Sapphire Rapids Is Now an ‘Intel 7’ CPU

July 27, 2021

What's a preeminent chip company to do when its process node technology lags the competition by (roughly) one generation, but outmoded naming conventions make it seem like it's two nodes behind? For Intel, the response was to change how it refers to its nodes with the aim of better reflecting its positioning within the leadership semiconductor manufacturing space. Intel revealed its new node nomenclature, and... Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire