SYCL 2020 Launches with New Name, New Features, and High Ambition

By John Russell

February 9, 2021

The Khronos Group today formally launched SYCL 2020, the parallel programming framework based on ISO standard C++ that has been gaining traction in HPC and will, for example, be supported on the forthcoming exascale supercomputer Aurora (ANL) and pre-exascale system Perlmutter (NERSC/LBNL). SYCL 2020 builds on the functionality of SYCL 1.2.1, adding 40-plus new features, and introduces a new naming convention based on the year. SYCL 2020 is based on C++17.

Parallel programming and associated tools are hardly new, but the recent rise of heterogeneous computing has spurred development of several parallel programming frameworks targeting not just multicore CPUs but a whole array of diverse accelerators (GPUs, FPGAs, etc.) and domains. SYCL was introduced by the Khronos Group (consortium) in 2014 as a high-level programming model for OpenCL which is also based on C++ and targets heterogeneous platforms. OpenCL was introduced in 2009 by Khronos.

Loosely, one can think of SYCL as playing a role similar to OpenMP as an HPC language for C++, but with significant technical differences and distinct strengths, drawbacks, and roots. OpenMP first supported Fortran (1997) and then C/C++ (2000). OpenMP has always had strength in incremental parallelism, specifically in C and Fortran. SYCL’s strength is focused on modern C++, and it supports parameterization and dynamic composition of algorithms, making it suitable to compose directly with C++ template libraries such as TensorFlow.

SYCL is described as:

“[A] royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++. SYCL enables single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration, and then re-use them throughout their source code on different types of data.”

“While originally developed for use with OpenCL and SPIR, it is actually a more general heterogeneous framework able to target other systems. For example, the hipSYCL implementation targets ROCm and CUDA via AMD’s cross-vendor HIP. While the SYCL standard started as the higher-level programming model sub-group of the OpenCL working group, it is a Khronos Group workgroup independent from the OpenCL working group since September 20, 2019.”

Calling SYCL 2020 a significant advance, Michael Wong, Codeplay distinguished engineer, ISO C++ Directions Group and SYCL working group chair, told HPCwire in a briefing, “We’re seeing significant adoption in embedded, desktop and HPC markets. We think that it can improve programmability, it will allow smaller code size and have faster performance. It’s based on C++17, and is backwards compatible with SYCL 1.2.1. It should ease porting of standard C++ applications to SYCL and it should enable closer alignment and integration with ISO C++. The other thing, of course, is we now enable multiple different kinds of back-end accelerators.” (Khronos also posted a blog today describing SYCL 2020.)

Here’s a snapshot of SYCL’s major new features:

  • Unified Shared Memory (USM) enables code with pointers to work naturally without buffers or accessors.
  • Parallel reductions add a built-in reduction operation to avoid boilerplate code and enable maximum performance on hardware with built-in reduction operation acceleration.
  • Work group and subgroup algorithms enable efficient parallel operations between work items.
  • Class template argument deduction (CTAD) and template deduction guides simplify class template instantiation.
  • Simplified use of Accessors with a built-in reduction operation, reducing boilerplate code and simplifying use of C++ software design patterns.
  • Expanded interoperability for efficient acceleration by diverse backend acceleration APIs.
  • Atomic operations are now closer to standard C++ atomics to enhance parallel programming freedom.
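As a rough illustration of the CTAD item in the list above, here is a minimal sketch in plain C++17. It uses no SYCL headers, and the `Samples` class and `ctad_demo` function are hypothetical names invented for this example rather than anything from the SYCL specification. The point is only to show the underlying language feature: the compiler works out the template argument from the constructor arguments, and a user-written deduction guide covers the case where it cannot do so on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical container invented for this sketch; it is not a SYCL type.
template <typename T>
class Samples {
public:
    // T appears in the parameter type, so C++17 deduces it implicitly.
    explicit Samples(std::vector<T> values) : values_(std::move(values)) {}

    // T does not appear here, so deduction relies on the guide below.
    template <typename Iter>
    Samples(Iter first, Iter last) : values_(first, last) {}

    std::size_t size() const { return values_.size(); }
    const T& operator[](std::size_t i) const { return values_[i]; }

private:
    std::vector<T> values_;
};

// User-defined deduction guide: take T from the iterator's value type.
template <typename Iter>
Samples(Iter, Iter) -> Samples<typename std::iterator_traits<Iter>::value_type>;

// Exercises both forms of deduction and returns the total element count.
inline std::size_t ctad_demo() {
    Samples from_vector(std::vector<int>{1, 2, 3});   // deduced: Samples<int>
    static_assert(std::is_same_v<decltype(from_vector), Samples<int>>);

    const std::vector<double> raw{1.5, 2.5};
    Samples from_range(raw.begin(), raw.end());       // deduced: Samples<double>
    static_assert(std::is_same_v<decltype(from_range), Samples<double>>);

    assert(from_range[0] == 1.5);
    return from_vector.size() + from_range.size();
}
```

Neither declaration spells out `<int>` or `<double>`. SYCL 2020 applies the same mechanism to its own class templates, which is presumably part of what Wong means below when he talks about cutting boilerplate.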

The latest version represents three years of effort, said Wong, who emphasized user input was key in determining new features. For example, the simplified use of accessors with a built-in reduction operator was important, he said, “because our users have asked us to get to a point where hello world no longer looks like it has lots of accessors and buffers. It just looks like plain hello world that you would see in C++.”

Apart from feature growth, it is interesting to look at the SYCL ecosystem. There are many pieces to the parallel programming puzzle. Wong has packed a lot into the next slide.

“This slide shows how SYCL fits within the larger framework of C++ programs, libraries, C++ application codes, and machine learning frameworks. [It also] shows how SYCL can work within those fairly complex applications that do complex machine learning,” said Wong. “There are libraries that involve oneMKL [and] oneDNN – these are just names from oneAPI – and also SYCL BLAS libraries and Eigen libraries. Even though these are used in fairly complex C++ template operations, they can be easily ingested by SYCL,” said Wong.

“The differentiation here is that these libraries would not be easily ingested by OpenMP because OpenMP cannot adapt to C++ template operations as easily. These template libraries, they can be absorbed by the SYCL compiler and separately by the CPU host compiler. The host compiler can be any compiler, could be LLVM, could be GCC, could be Visual C++.

“Now the SYCL compiler would take a pass over the code and send a device code to an OpenCL back-end, or now with SYCL 2020, we can send it to other kinds of back-ends such as a PTX back-end for CUDA or OpenMP back-end [or] even a Vulkan back-end. Each of these back-ends can selectively distribute [code] to any number of heterogeneous devices,” he said.

“The real beauty here and the idea with using a C++ based language with SYCL is that it will enable things like kernel fusion, which gives you better performance on complex applications and libraries than hand-coding. SYCL is basically ideal for accelerating large C++ based engines and applications for performance portability.”

Perhaps the most prominent new addition to the SYCL ecosystem is Intel’s oneAPI effort, which is built on what Intel calls data parallel C++ or DPC++ and is being presented by Intel as an open standard for programming a variety of processor types. It will, for example, be the preferred method for porting code to Intel’s Xe GPU line. (See HPCwire coverage, Intel Debuts oneAPI Gold and Provides More Details on GPU Roadmap)

Wong is a oneAPI fan and has blogged about oneAPI. He told HPCwire, “I’ve been dreaming of something like oneAPI for a long time, basically, something that allows you to program to any device kind, any device workloads, across many different companies. Having said that, if there’s too much of an Intel label attached to it to the point where people aren’t aware that [it’s] for anybody, that’s going to be a challenge.”

Intel is hardly alone. In fact, Wong argues the number of SYCL development efforts is one of the clearest measures of SYCL’s growing traction. Xilinx has an effort as does AMD (with the University of Heidelberg) and it’s natural to wonder if those efforts could be merged if/when AMD’s acquisition of Xilinx is completed. Wong doesn’t think so. There’s a neoSYCL that is quite new targeting NEC and Intel processors. Wong packed a chart showing SYCL implementations. Take a moment to look at SYCL’s growing family tree and then read Wong’s comments.

“The SYCL implementations in development are now ballooning. Actually, we just put one in just in the last couple of weeks. Traditionally, there has always been Codeplay’s ComputeCpp. That’s the company I work for, which generates codes for any number of CPUs. GPUs have gone through OpenCL and SPIR-V that can work for Intel, AMD, Arm, Mali, IMG PowerVR, and the Renesas R-Car [devices]. But we also have one that goes through PTX to generate code for Nvidia’s GPUs,” said Wong.

“Then the big player that came in was Intel with their oneAPI. Inside oneAPI is a compiler called data parallel C++ (DPC++). They are doing that so they can generate code for Intel CPUs, GPUs, FPGAs, and I think in future for AI processors. They are using a Clang (compiler) implementation [and] so is Codeplay.

“We will also have the triSYCL from Xilinx, which is specifically for Xilinx FPGAs, and the hipSYCL, which has the support for AMD GPUs and Nvidia GPUs and they do it through an OpenMP back-end. So implementers were already using different back-ends and OpenCL. So it just makes sense for us to legitimize that in the specification (as is done in SYCL 2020). On the far right is something we just added in the last couple of weeks based on an announcement from HPC Asia, called neoSYCL, for NEC for the vector engine. So it [also] supports x86 Intel CPUs, and the NEC vector engines. We’re very excited about that. That will be open source soon as they have an implementation. We don’t put things on unless there’s a confirmed implementation,” he noted.

You get the picture. There is a lot of activity around SYCL at the moment. This is noticeably so at the Department of Energy and in advanced systems generally. Wong argues the need for portable performance and multiple vendor support are driving factors. He contends the science project development path is changing in HPC. Again, he’s packed a lot into a single slide (based on a 2020 SYCLCon keynote by Hal Finkel, the newly promoted computer science program manager for the DOE Office of Advanced Scientific Computing Research in October). Check it out before reading Wong’s description.

“As you are well aware OpenMP has had staying power in HPC for a long time. So why use SYCL here,” said Wong. “HPC workloads persist usually for 20 or more years. But the hardware can change every five years with new exascale or petascale projects from DOE and they often could go to different vendors. They also basically need to serve three pillars of science problems. One is simulation [which] needs a high-performance computing language with solvers and parallel runtimes. [Second] one is data science that needs a high productivity language for big data. The third pillar is learning, training and inference and that needs a high productivity language for machine learning and deep learning,” said Wong.

“These have been supported by the top languages. The idea is that there’s OpenMP that’s mostly for C and Fortran. There’s a mix of CUDA, OpenACC, OpenCL, pthreads (POSIX threads). And now for C++ there’s SYCL, and the National Labs’ frameworks like Kokkos and RAJA. Now, the development workflow over time has changed. These days a science project usually starts with choosing an algorithm. Unlike before when once you chose the algorithm you were done, now they’re finding the choice of the algorithm needs feedback – knowledge of the system architecture and the tool chain. These tools need to have control of the data layout, data movement, data locality, data affinity, so they can be optimized for portable performance.

“SYCL and these other C++ frameworks enable these parameterizations and dynamically configure the algorithms through the C++ capabilities like C++ templates, and inlining. The second thing that they do, after they choose the algorithm for the target, is they implement and test the algorithm. The third thing, of course, is optimizing the algorithm and traditionally this was the only step that needed feedback from the architecture and tool knowledge. Today, it’s no longer the case. The choice of the algorithm also now needs that feedback. Languages like SYCL, Kokkos, and RAJA do that especially well because their template static polymorphism allows them to change the algorithm depending on the type of the parameters,” he explained.
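To make the idea of template static polymorphism concrete, the following is a minimal sketch in plain C++17, not SYCL, Kokkos, or RAJA code. The policy tags `SerialPolicy` and `BlockedPolicy` and the function `sum` are hypothetical names invented here. The sketch assumes only what Wong describes: the algorithm variant is selected at compile time from the type of a parameter, so no runtime dispatch is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical policy tags invented for this sketch.
struct SerialPolicy {};
struct BlockedPolicy {
    std::size_t block = 4;  // elements per block; must be greater than zero
};

// One entry point; the policy's type decides which algorithm is compiled in.
template <typename Policy, typename T>
T sum(const Policy& policy, const std::vector<T>& data) {
    if constexpr (std::is_same_v<Policy, BlockedPolicy>) {
        // Blocked variant: accumulate per-block partial sums, then combine.
        assert(policy.block > 0);
        T total{};
        for (std::size_t start = 0; start < data.size(); start += policy.block) {
            const std::size_t end = std::min(start + policy.block, data.size());
            T partial{};
            for (std::size_t i = start; i < end; ++i) {
                partial += data[i];
            }
            total += partial;
        }
        return total;
    } else {
        // Default variant: a single left-to-right pass.
        (void)policy;
        return std::accumulate(data.begin(), data.end(), T{});
    }
}
```

A caller switches strategy by changing only the first argument, for example `sum(SerialPolicy{}, v)` versus `sum(BlockedPolicy{3}, v)`. Because the branch is resolved with `if constexpr`, the unused variant is discarded at compile time. That is the property that lets a library retarget an algorithm to a different architecture without rewriting the calling code.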

“All these steps enable you to reach high performance portable code, but it needs to be using an open standard that everybody can collaborate in. So they are basically required to reach exascale computing for these four major systems of which two are now using SYCL. Aurora and NERSC’s Perlmutter are both adapting to SYCL and there are other ones and it’s not just SYCL. They will also use OpenMP and CUDA and OpenCL and OpenACC and pthreads. The other two systems coming in 2021, of course, Frontier (ORNL) and El Capitan (LLNL) are both AMD systems and SYCL has been demonstrated to work on AMD systems as well. The key is this – parameterization and dynamically composed algorithms, along with compiler optimizations using an open standard programming model, is what we think will enable performance portability. That’s why DOE labs are adapting to it. They know that they need this to reach performance portability,” Wong said.

Wong is realistic but hopeful: “No language is perfect. I’ve been a language designer for the better part of 20 years of my professional life, starting with C++, then OpenMP, and then SYCL. Every language is trying to serve a community, balancing between performance, portability and productivity.”

SYCL will need to determine how it balances those goals. Wong said the growing similarity of workflows in science to those in industry, largely driven by AI, should help SYCL expand its footprint further. The European Processor Initiative and RISC-V represent opportunities along with the embedded market such as automotive. “I think SYCL can do more in the embedded space, as well as some of the FPGA space. And that depends on having more of those vendors being on board, and that’s coming.”

It will be interesting to watch SYCL’s growth. SYCL 2020 seems an important step forward technically and from a market position. Its release cycle going forward, said Wong, will closely mirror the C++ cycle with a major release every three years. He said work has already started on SYCL 2023, which will be based on the just-released C++ 2020. The three-year lag, he said, was a necessary element in making sure all the released code was robust. Moreover, he said safety issues are becoming more important, such as in automotive.

As if to hammer home SYCL’s growing strength, Khronos released an unusually large number of testimonials with SYCL 2020. They are included below. Stay tuned.

 

TESTIMONIALS PROVIDED BY KHRONOS

“Our users will benefit from features in the SYCL 2020 specification. New features, such as support for unified memory (USM) and reductions, are important capabilities for programming high-performance-computing hardware. In addition, support for C++17 will allow our users to write better C++ code, with both language features (such as deduction guides) and library features (such as std::optional). Other new features (such as softening the requirements on kernel functions and sharing data between host and devices) are an important step for implementing backend support for SYCL in the Kokkos and RAJA performance portability ecosystems,” said Nevin Liber, computer scientist, Argonne National Laboratory’s Leadership Computing Facility.

“At Cineca, based on our experience, we confirm the value that SYCL is bringing to the development of high-performance computing in a hybrid environment. In fact, through SYCL, it is possible to build a common and portable environment for the development of computing-intensive applications to be executed on HPC architectures configured with floating point accelerators, which allows industries and scientific communities to use the common availability of development tools, libraries of algorithms, accumulated experience,” said Sanzio Bassini, director of supercomputing, Application Innovation Dept, Cineca. “Cineca is already running the distributed Celerity runtime on top of several SYCL implementations on the new Marconi100 cluster, ranked no. 11 in the Top500, providing users with a unified API for both about 4,000 NVIDIA Volta V100 GPUs and IBM Power9 host processors. SYCL 2020 is a big step towards a much leaner API that unlocks all the potential provided by modern C++ standards for accelerated data-parallel kernels, making the development of large-scale scientific software easier and more sustainable, either for industrial oriented domain applications for industries, either for scientific domain oriented applications.”

“Codeplay has been deeply involved in SYCL from its original definition and we are now enabling the standard on a range of systems with our ComputeCpp product. We strongly believe SYCL is the only software standard to link all the high performance processors to a unified programming solution,” said Andrew Richards, founder and CEO, Codeplay Software. “Developers will find that SYCL 2020 refines the standard to streamline their development and adds some crucial new enhancements to improve productivity.”

“Imagination recognizes the benefit of SYCL across multiple markets. Our software stacks have been designed to improve SYCL performance, enabling a straightforward path to exploit the teraflops of compute performance in our latest IP,” said Mark Butler, Vice President of Software Engineering, Imagination Technologies. “The ability to quickly port workloads from other proprietary APIs is a huge benefit, easing the transition from development on desktop to deployment on embedded systems. SYCL 2020 is a positive step forward for this API, enabling higher levels of performance, which will benefit developers and platform creators.”

“SYCL 2020 final specification brings significant features to the industry that enable C++ developers to more productively build high-performance heterogeneous applications with unified programming across XPU architectures,” said Jeff McVeigh, Intel vice president, Datacenter XPU Products and Solutions. “Several capabilities pioneered in the open source oneAPI C++/DPC++ compiler, such as unified shared memory, group algorithms, and sub-groups, contributed to this community effort. Open, cross-architecture programming is required for accelerated distributed computing; we look forward to continuing our collaboration to address the needs of the developer ecosystem.”

“With thousands of users and a wide range of applications using NERSC’s resources, we must support a wide range of programming models. In addition to directive-based approaches, we see modern C++ language-based approaches to accelerator programming, such as SYCL, as an important component of our programming environment offering for users of Perlmutter,” said Brandon Cook, application performance specialist at NERSC. “Further, this work supports the productivity of scientific application developers and users through performance portability of applications between Aurora and Perlmutter.”

“NSITEXE supports the SYCL 2020 technology, which is gaining attention in embedded applications,” said Hideki Sugimoto, CTO, NSITEXE, Inc. “SYCL is very important to increase productivity by hiding complexities from users. We are considering adopting this technology in our next generation of IP platforms.”

“For Renesas, SYCL is a key enabler for automotive ADAS/AD software developers that allows them to easily use the highly-efficient, heterogeneous accelerators of the R-Car SoC Series through the open Khronos standard,” said Cyril Cordoba, Director of ADAS Segment Marketing Department, Renesas.

“We are excited about the extensive list of features and improvements released with the new SYCL 2020 specification,” said Thomas Fahringer, head of the Distributed and Parallel Systems Group at the University of Innsbruck. “The API becomes terser and more developer friendly, while also introducing new ways for expert users to exercise fine-grained control over state-of-the-art hardware features. The move to a generalized backend model opens up new possibilities to integrate with existing legacy solutions, which is especially important in scientific research environments. As co-developers of the Celerity project, together with the University of Salerno, we are welcoming these changes and look forward to applying them within distributed-memory research and industry applications, for example as part of the recently launched EuroHPC LIGATE project.”

“Xilinx is excited about the progress achieved with SYCL 2020,” said Ralph Wittig, fellow, Xilinx. “This single-source C++ framework unifies host and device code for various kinds of accelerators in the same C++ program. With host-fallback device execution, developers can emulate device code on a CPU, exploring hardware-software co-design for adaptable computing devices. SYCL is now extensible via customizable back-ends, enabling device plug-ins for FPGAs and ACAPs.”
