The Khronos Group today formally launched SYCL 2020, the parallel programming framework based on ISO standard C++ that has been gaining traction in HPC and will, for example, be supported on the forthcoming exascale supercomputer Aurora (ANL) and pre-exascale system Perlmutter (NERSC/LBNL). SYCL 2020 builds on the functionality of SYCL 1.2.1, adds 40-plus new features, and introduces a new naming convention based on the year. SYCL 2020 is based on C++17.
Parallel programming and associated tools are hardly new, but the recent rise of heterogeneous computing has spurred development of several parallel programming frameworks targeting not just multicore CPUs but a whole array of diverse accelerators (GPUs, FPGAs, etc.) and domains. SYCL was introduced by the Khronos Group (consortium) in 2014 as a C++-based, high-level programming model for OpenCL that targets heterogeneous platforms. OpenCL was introduced in 2009 by Khronos.
Loosely, one can think of SYCL as playing a role similar to OpenMP as an HPC language for C++, but with significant technical differences and distinct strengths, drawbacks, and roots. OpenMP first supported Fortran (1997) and then C/C++ (2000). OpenMP has always had strength in incremental parallelism, specifically in C and Fortran. SYCL’s strength is its focus on modern C++: it supports parameterization and dynamic composition of algorithms, making it suitable to compose directly with C++ template libraries such as TensorFlow.
SYCL is described as:
“[A] royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++. SYCL enables single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration, and then re-use them throughout their source code on different types of data.”
“While originally developed for use with OpenCL and SPIR, it is actually a more general heterogeneous framework able to target other systems. For example, the hipSYCL implementation targets ROCm and CUDA via AMD’s cross-vendor HIP. While the SYCL standard started as the higher-level programming model sub-group of the OpenCL working group, it is a Khronos Group workgroup independent from the OpenCL working group since September 20, 2019.”
Calling SYCL 2020 a significant advance, Michael Wong, Codeplay distinguished engineer, ISO C++ Directions Group and SYCL working group chair, told HPCwire in a briefing, “We’re seeing significant adoption in embedded, desktop and HPC markets. We think that it can improve programmability, it will allow smaller code size and have faster performance. It’s based on C++17, and is backwards compatible with SYCL 1.2.1. It should ease porting of standard C++ applications to SYCL and it should enable closer alignment and integration with ISO C++. The other thing, of course, is we now enable multiple different kinds of back-end accelerators.” (Khronos also posted a blog today describing SYCL 2020.)
Here’s a snapshot of SYCL’s major new features:
- Unified Shared Memory (USM) enables code with pointers to work naturally without buffers or accessors.
- Parallel reductions add a built-in reduction operation to avoid boilerplate code and enable maximum performance on hardware with built-in reduction operation acceleration.
- Work group and subgroup algorithms enable efficient parallel operations between work items.
- Class template argument deduction (CTAD) and template deduction guides simplify class template instantiation.
- Simplified use of Accessors with a built-in reduction operation, reducing boilerplate code and simplifying use of C++ software design patterns.
- Expanded interoperability for efficient acceleration by diverse backend acceleration APIs.
- Atomic operations are now closer to standard C++ atomics to enhance parallel programming freedom.
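To make the CTAD item in the list above concrete, here is a minimal sketch in plain C++17, with no SYCL headers required. `Buffer` is a hypothetical stand-in invented for this illustration and is not a SYCL class; the sketch only shows the language mechanism (implicit and explicit deduction guides) that lets class template arguments be left out at the call site.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical buffer-like class template for illustration; not a SYCL type.
template <typename T>
class Buffer {
public:
    // The compiler derives an implicit guide from this constructor:
    // Buffer(std::vector<T>) -> Buffer<T>.
    explicit Buffer(std::vector<T> values) : data_(std::move(values)) {}

    // T does not appear in this signature, so it cannot be deduced
    // without the explicit guide declared below.
    template <typename It>
    Buffer(It first, It last) : data_(first, last) {}

    std::size_t size() const { return data_.size(); }

    const T& operator[](std::size_t i) const {
        assert(i < data_.size());
        return data_[i];
    }

private:
    std::vector<T> data_;
};

// Explicit deduction guide: take the element type from the iterator.
template <typename It>
Buffer(It, It) -> Buffer<typename std::iterator_traits<It>::value_type>;

// Compile-time check: a pair of int* iterators deduces Buffer<int>.
static_assert(std::is_same_v<
    decltype(Buffer(std::declval<int*>(), std::declval<int*>())),
    Buffer<int>>);
```

Given a `std::vector<double> v`, both `Buffer a(v);` and `Buffer b(v.begin(), v.end());` deduce `Buffer<double>`, where earlier C++ standards would have required the template argument to be written out.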
The latest version represents three years of effort, said Wong, who emphasized user input was key in determining new features. For example, the simplified use of accessors with a built-in reduction operator was important, he said, “because our users have asked us to get to a point where hello world no longer looks like it has lots of accessors and buffers. It just looks like plain hello world that you would see in C++.”
Apart from feature growth, it is interesting to look at the SYCL ecosystem. There are many pieces to the parallel programming puzzle. Wong has packed a lot into the next slide.
“This slide shows how SYCL fits within the larger framework of C++ programs, libraries, C++ application codes, and machine learning frameworks. [It also] shows how SYCL can work within those fairly complex applications that do complex machine learning,” said Wong. “There are libraries that involve oneMKL [and] oneDNN – these are just names from oneAPI – and also SYCL BLAS libraries and Eigen libraries. Even though these are used in fairly complex C++ template operations, they can be easily ingested by SYCL,” said Wong.
“The differentiation here is that these libraries would not be easily ingested by OpenMP because OpenMP cannot adapt to C++ template operations as easily. These template libraries, they can be absorbed by the SYCL compiler and separately by the CPU host compiler. The host compiler can be any compiler, could be LLVM, could be GCC, could be Visual C++.
“Now the SYCL compiler would take a pass over the code and send a device code to an OpenCL back-end, or now with SYCL 2020, we can send it to other kinds of back-ends such as a PTX back-end for CUDA or OpenMP back-end [or] even a Vulkan back-end. Each of these back-ends can selectively distribute [code] to any number of heterogeneous devices,” he said.
“The real beauty here and the idea with using a C++ based language with SYCL is that it will enable things like kernel fusion, which gives you better performance on complex applications and libraries than hand-coding. SYCL is basically ideal for accelerating large C++ based engines and applications for performance portability.”
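The idea behind kernel fusion can be sketched in ordinary C++17. This is only an illustration of the concept, not how any SYCL compiler implements it, and the helper names `fuse` and `elementwise` are invented for the example. Because both operations are visible to the compiler as templates and lambdas, they can be composed into one callable and applied in a single pass over the data rather than two.

```cpp
#include <vector>

// Compose two element-wise operations into one callable: x -> g(f(x)).
template <typename F, typename G>
constexpr auto fuse(F f, G g) {
    return [f, g](auto x) { return g(f(x)); };
}

// One pass over the input applying `op` to every element.
// Each call stands in for one "kernel" launch in this sketch.
template <typename T, typename Op>
std::vector<T> elementwise(const std::vector<T>& in, Op op) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T& x : in) {
        out.push_back(op(x));
    }
    return out;
}
```

Calling `elementwise(elementwise(v, f), g)` makes two passes and builds an intermediate vector, while `elementwise(v, fuse(f, g))` produces the same result in one pass with no intermediate storage.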
Perhaps the most prominent new addition to the SYCL ecosystem is Intel’s oneAPI effort, which is built on what Intel calls Data Parallel C++, or DPC++, and is being presented by Intel as an open standard for programming a variety of processor types. It will, for example, be the preferred method for porting code to Intel’s Xe GPU line. (See HPCwire coverage, Intel Debuts oneAPI Gold and Provides More Details on GPU Roadmap)
Wong is a oneAPI fan and has blogged about oneAPI. He told HPCwire, “I’ve been dreaming of something like oneAPI for a long time, basically, something that allows you to program to any device kind, any device workloads, across many different companies. Having said that, if there’s too much of an Intel label attached to it to the point where people aren’t aware that [it’s] for anybody, that’s going to be a challenge.”
Intel is hardly alone. In fact, Wong argues the number of SYCL development efforts is one of the clearest measures of SYCL’s growing traction. Xilinx has an effort as does AMD (with the University of Heidelberg) and it’s natural to wonder if those efforts could be merged if/when AMD’s acquisition of Xilinx is completed. Wong doesn’t think so. There’s a neoSYCL that is quite new targeting NEC and Intel processors. Wong packed a chart showing SYCL implementations. Take a moment to look at SYCL’s growing family tree and then read Wong’s comments.
“The SYCL implementations in development are now ballooning. Actually, we just put one in just in the last couple of weeks. Traditionally, there has always been Codeplay’s ComputeCpp. That’s the company I work for, which generates codes for any number of CPUs. GPUs have gone through OpenCL and SPIR-V that can work for Intel, AMD, Arm Mali, IMG PowerVR, and the Renesas R-Car [devices]. But we also have one that goes through PTX to generate code for Nvidia’s GPUs,” said Wong.
“Then the big player that came in was Intel with their oneAPI. Inside oneAPI is a compiler called data parallel C++ (DPC++). They are doing that so they can generate code for Intel CPUs, GPUs, FPGAs, and I think in future for AI processors. They are using a Clang (compiler) implementation [and] so is Codeplay.
“We will also have the triSYCL from Xilinx, which is specifically for Xilinx FPGAs, and the hipSYCL, which has the support for AMD GPUs and Nvidia GPUs and they do it through an OpenMP back end. So implementers were already using different back-ends and OpenCL. So it just makes sense for us to legitimize that in the specification (as is done in SYCL 2020). On the far right is something we just added in the last couple of weeks based on an announcement from HPC Asia called neoSYCL for NEC for the vector engine. So it [also] supports x86 Intel CPUs, and the NEC vector engines. We’re very excited about that. That will be open source soon as they have an implementation. We don’t put things on unless there’s a confirmed implementation,” he noted.
You get the picture. There is a lot of activity around SYCL at the moment. This is noticeably so at the Department of Energy and in advanced systems generally. Wong argues the need for portable performance and multiple vendor support are driving factors. He contends the science project development path is changing in HPC. Again, he’s packed a lot into a single slide (based on a 2020 SYCLCon keynote by Hal Finkel, the newly promoted computer science program manager for the DOE Office of Advanced Scientific Computing Research in October). Check it out before reading Wong’s description.
“As you are well aware OpenMP has had staying power in HPC for a long time. So why use SYCL here?” said Wong. “HPC workloads persist usually for 20 or more years. But the hardware can change every five years with new exascale or petascale projects from DOE and they often could go to different vendors. They also basically need to serve three pillars of science problems. One is simulation [which] needs a high-performance computing language with solvers and parallel runtimes. [Second] one is data science that needs a high productivity language for big data. The third pillar is learning, training and inference and that needs a high productivity language for machine learning and deep learning,” said Wong.
“These have been supported by the top languages. The idea is that there’s OpenMP that’s mostly for C and Fortran. There’s a mix of CUDA, OpenACC, OpenCL, pthreads (POSIX threads). And now for C++ there’s SYCL, and the National Labs’ frameworks like Kokkos and RAJA. Now, the development workflow over time has changed. These days a science project usually starts with choosing an algorithm. Unlike before when once you chose the algorithm you were done, now they’re finding the choice of the algorithm needs feedback – knowledge of the system architecture and the tool chain. These tools need to have control of the data layout, data movement, data locality, data affinity, so they can be optimized for portable performance.
“SYCL and these other C++ frameworks enable these parameterizations and dynamically configure the algorithms through the C++ capabilities like C++ templates, and inlining. The second thing that they do, after they choose the algorithm for the target, is they implement and test the algorithm. The third thing, of course, is optimizing the algorithm and traditionally this was the only step that needed feedback from the architecture and tool knowledge. Today, it’s no longer the case. The choice of the algorithm also now needs that feedback. Languages like SYCL, Kokkos, and RAJA do that especially well because their template static polymorphism allows them to change the algorithm depending on the type of the parameters,” he explained.
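A small standard C++17 sketch shows what template static polymorphism means in practice: the algorithm is chosen at compile time from the type of a parameter. The policy tags `Sequential` and `Blocked` and the function `sum` are hypothetical names made up for this illustration; they are not types from SYCL, Kokkos, or RAJA, though those frameworks rely on the same language feature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical policy tags for illustration; not SYCL, Kokkos, or RAJA types.
struct Sequential {};
struct Blocked {
    std::size_t block = 4;  // elements per block
};

// The policy's type selects the algorithm at compile time; only the
// chosen branch is instantiated.
template <typename Policy, typename T>
T sum(const Policy& policy, const std::vector<T>& data) {
    if constexpr (std::is_same_v<Policy, Sequential>) {
        (void)policy;  // no tunable state in the sequential policy
        return std::accumulate(data.begin(), data.end(), T{});
    } else {
        static_assert(std::is_same_v<Policy, Blocked>, "unsupported policy");
        assert(policy.block > 0);
        T total{};
        // Sum each block separately, then combine the partial sums.
        for (std::size_t start = 0; start < data.size(); start += policy.block) {
            const std::size_t stop = std::min(start + policy.block, data.size());
            T partial{};
            for (std::size_t i = start; i < stop; ++i) {
                partial += data[i];
            }
            total += partial;
        }
        return total;
    }
}
```

Calling `sum(Sequential{}, v)` and `sum(Blocked{2}, v)` on the same vector gives the same total through different code paths, and switching between them requires no change to the calling code beyond the policy argument.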
“All these steps enable you to reach high performance portable code, but it needs to be using an open standard that everybody can collaborate in. So they are basically required to reach exascale computing for these four major systems of which two are now using SYCL. Aurora and NERSC’s Perlmutter are both adapting to SYCL and there are other ones and it’s not just SYCL. They will also use OpenMP and CUDA and OpenCL and OpenACC and pthreads. The other two systems coming in 2021, of course, Frontier (ORNL) and El Capitan (LLNL) are both AMD systems and SYCL has been demonstrated to work on AMD systems as well. The key is this – parameterization and dynamically composed algorithms, along with compiler optimizations using an open standard programming model, is what we think will enable performance portability. That’s why DOE labs are adapting to it. They know that they need this to reach performance portability,” Wong said.
Wong is realistic but hopeful: “No language is perfect. I’ve been a language designer for the better part of 20 years of my professional life, starting with C++, then OpenMP, and then SYCL. Every language is trying to serve a community, balancing between performance, portability and productivity.”
SYCL will need to determine how it balances those goals. Wong said the growing similarity of workflows in science to those in industry, largely driven by AI, should help SYCL expand its footprint further. The European Processor Initiative and RISC-V represent opportunities along with the embedded market such as automotive. “I think SYCL can do more in the embedded space, as well as some of the FPGA space. And that depends on having more of those vendors being on board, and that’s coming.”
It will be interesting to watch SYCL’s growth. SYCL 2020 seems an important step forward technically and from a market position. Its release cycle going forward, said Wong, will closely mirror the C++ cycle with a major release every three years. He said work has already started on SYCL 2023, which will be based on the just-released C++20. The three-year lag, he said, was a necessary element in making sure all the released code was robust. Moreover, he said safety issues are becoming more important, such as in automotive.
As if to hammer home SYCL’s growing strength, Khronos released an unusually large number of testimonials with SYCL 2020. They are included below. Stay tuned.
TESTIMONIALS PROVIDED BY KHRONOS
“Our users will benefit from features in the SYCL 2020 specification. New features, such as support for unified memory (USM) and reductions, are important capabilities for programming high-performance-computing hardware. In addition, support for C++17 will allow our users to write better C++ code, with both language features (such as deduction guides) and library features (such as std::optional). Other new features (such as softening the requirements on kernel functions and sharing data between host and devices) are an important step for implementing backend support for SYCL in the Kokkos and RAJA performance portability ecosystems,” said Nevin Liber, computer scientist, Argonne National Laboratory’s Leadership Computing Facility.
“At Cineca, based on our experience, we confirm the value that SYCL is bringing to the development of high-performance computing in a hybrid environment. In fact, through SYCL, it is possible to build a common and portable environment for the development of computing-intensive applications to be executed on HPC architectures configured with floating point accelerators, which allows industries and scientific communities to use the common availability of development tools, libraries of algorithms, accumulated experience,” said Sanzio Bassini, director of supercomputing, Application Innovation Dept, Cineca. “Cineca is already running the distributed Celerity runtime on top of several SYCL implementations on the new Marconi100 cluster, ranked no. 11 in the Top500, providing users with a unified API for both about 4,000 NVIDIA Volta V100 GPUs and IBM Power9 host processors. SYCL 2020 is a big step towards a much leaner API that unlocks all the potential provided by modern C++ standards for accelerated data-parallel kernels, making the development of large-scale scientific software easier and more sustainable, both for industry-oriented domain applications and for scientific domain-oriented applications.”
“Codeplay has been deeply involved in SYCL from its original definition and we are now enabling the standard on a range of systems with our ComputeCpp product. We strongly believe SYCL is the only software standard to link all the high performance processors to a unified programming solution,” said Andrew Richards, founder and CEO, Codeplay Software. “Developers will find that SYCL 2020 refines the standard to streamline their development and adds some crucial new enhancements to improve productivity.”
“Imagination recognizes the benefit of SYCL across multiple markets. Our software stacks have been designed to improve SYCL performance, enabling a straightforward path to exploit the teraflops of compute performance in our latest IP,” said Mark Butler, Vice President of Software Engineering, Imagination Technologies. “The ability to quickly port workloads from other proprietary APIs is a huge benefit, easing the transition from development on desktop to deployment on embedded systems. SYCL 2020 is a positive step forward for this API, enabling higher levels of performance, which will benefit developers and platform creators.”
“SYCL 2020 final specification brings significant features to the industry that enable C++ developers to more productively build high-performance heterogeneous applications with unified programming across XPU architectures,” said Jeff McVeigh, Intel vice president, Datacenter XPU Products and Solutions. “Several capabilities pioneered in the open source oneAPI C++/DPC++ compiler, such as unified shared memory, group algorithms, and sub-groups, contributed to this community effort. Open, cross-architecture programming is required for accelerated distributed computing; we look forward to continuing our collaboration to address the needs of the developer ecosystem.”
“With thousands of users and a wide range of applications using NERSC’s resources, we must support a wide range of programming models. In addition to directive-based approaches, we see modern C++ language-based approaches to accelerator programming, such as SYCL, as an important component of our programming environment offering for users of Perlmutter,” said Brandon Cook, application performance specialist at NERSC. “Further, this work supports the productivity of scientific application developers and users through performance portability of applications between Aurora and Perlmutter.”
“NSITEXE supports the SYCL 2020 technology, which is gaining attention in embedded applications,” said Hideki Sugimoto, CTO, NSITEXE, Inc. “SYCL is very important to increase productivity by hiding complexities from users. We are considering adopting this technology in our next generation of IP platforms.”
“For Renesas, SYCL is a key enabler for automotive ADAS/AD software developers that allows them to easily use the highly-efficient, heterogeneous accelerators of the R-Car SoC Series through the open Khronos standard,” said Cyril Cordoba, Director of ADAS Segment Marketing Department, Renesas.
“We are excited about the extensive list of features and improvements released with the new SYCL 2020 specification,” said Thomas Fahringer, head of the Distributed and Parallel Systems Group at the University of Innsbruck. “The API becomes terser and more developer friendly, while also introducing new ways for expert users to exercise fine-grained control over state-of-the-art hardware features. The move to a generalized backend model opens up new possibilities to integrate with existing legacy solutions, which is especially important in scientific research environments. As co-developers of the Celerity project, together with the University of Salerno, we are welcoming these changes and look forward to applying them within distributed-memory research and industry applications, for example as part of the recently launched EuroHPC LIGATE project.”
“Xilinx is excited about the progress achieved with SYCL 2020,” said Ralph Wittig, fellow, Xilinx. “This single-source C++ framework unifies host and device code for various kinds of accelerators in the same C++ program. With host-fallback device execution, developers can emulate device code on a CPU, exploring hardware-software co-design for adaptable computing devices. SYCL is now extensible via customizable back-ends, enabling device plug-ins for FPGAs and ACAPs.”