Commentary — In the second of a series of guest posts on heterogeneous computing, James Reinders, who returned to Intel last year after a short “retirement,” follows up on his piece about how SYCL will contribute to a heterogeneous future for C++. He is joined by Michael Wong, of Codeplay Software Ltd., who is also the current SYCL committee chair. Together, they offer their responses to what might be called the ‘Elephants in the SYCL Room.’
The case for C++ programming, with SYCL bringing in full heterogeneous support, has been well articulated by persons close to the SYCL specification including a recent article “Considering a Heterogeneous Future for C++” and numerous other resources enumerated on sycl.tech. SYCL is a Khronos standard that introduces support for fully heterogeneous data parallelism to C++. While SYCL is not a cure-all, it is a solution to one aspect of a larger problem: How do we enable adequately enable full heterogeneous programming given the emerging explosion in hardware diversity?
In this article, we offer our perspective on key questions about SYCL, based on our perspectives of being having worked in this domain for decades. These important questions are asked by software developers looking to understand if SYCL matters to them. Let’s face it: at some point, every major project has Elephants in the Room. Successful projects address their elephants openly.
Elephant 1: Aren’t GPUs enough? Do other accelerators really matter?
Valid questions exist about which accelerators will stay, and which will be a passing fad. For decades, different accelerators have come and gone while CPUs persist. Today, GPUs are present in the vast majority of computer systems. Writing our applications to leverage GPUs makes a lot of sense given their near ubiquity.
As a result, one of the first elephant questions is whether we really need to generalize, i.e., do we need to be multiarchitecture and multivendor?
The expectation that the future will require “dedicated or semi-dedicated hardware accelerators” as a must-have feature for computing in this decade is expected by experts including researchers led by Prof. Masaaki Kondo in “White Paper on Next-Generation Advanced Computing Infrastructure” and by Hennessy & Patterson in their paper “A New Golden Age for Computer Architecture”.
As long as we are talking about dedicated accelerators, why stop at GPUs? Optimizing for different types of accelerators is a great objective, but we don’t want to write different code for different types of accelerators. We believe that the industry will benefit from a standardized language, that everyone can contribute to, collaborate on, is not locked into a particular vendor, and can evolve organically based on its members and public requirements.
SYCL takes an interesting approach that allows us to use common code when we want and specialize when we want. In this way, SYCL embraces accelerators in general, leaving it to us, the developers, to decide when to write common cross-architecture code, and when we feel it is sufficiently advantageous to specialize code.
Its underlying programming model, SPMD, has shown to be usable across many architectures. SPMD is how most programmers using Nvidia CUDA/OpenCL/SYCL think: writing code from the perspective of operating on one work item and expecting it to run concurrently on most hardware such that multiple work-items fill vector hardware lanes.
SYCL offers a large degree of portability across vendors (e.g., many different sources of GPUs) as well as architecture (e.g., GPUs, FPGAs, ASICs).
Elephant 2: Why not just use Nvidia CUDA?
A vibrant GPU eco-system is emerging thanks to competition from multiple GPU vendors. This is part of a trend for more and more competition for accelerators in general. The installed base of CUDA applications that make use of Nvidia GPUs are poised to be able to adapt over time to an open, multivendor, multiarchitecture software approach created to serve all vendors, not just Nvidia.
While CUDA has earned a strong following given its value proposition and the strength of Nvidia GPUs in the ecosystem, there are increasing concerns regarding the lock-in that use of CUDA creates. Such concerns stem from the proprietary focus highlighted by these factors:
- The definition of CUDA, its implementation and evolution, is managed by Nvidia and evolves specifically to serve Nvidia GPU product designs. Details of new features in CUDA, are generally shielded from public view until NVIDIA has both hardware and software to support them. As discussed more fully below, this control stifles innovations from other vendors.
- The licensing for CUDA tools and libraries, from Nvidia, specifically states they must be used to “develop applications only for use in systems with Nvidia GPUs.” Even “open source” from Nvidia includes licensing language restricting key parts in the same manner.
Nvidia CUDA can claim credit for bringing accelerated computing to the masses using Nvidia GPUs.
With the explosion of competition in the accelerator market, it could appear that CUDA has become a walled garden in an increasingly open and transparent world.
The desire for an open, multivendor, multiarchitecture alternative to CUDA is not going away.
Elephant 3: Why not just use AMD HIP?
AMD Heterogeneous-Computing Interface for Portability (HIP) is a C++ dialect. AMD tools include a “HIPify tool” to help transform CUDA code into HIP. AMD states that “HIP code can run on AMD hardware (through the HCC compiler) or Nvidia hardware (through the NVCC compiler) with no performance loss compared with the original CUDA code.”
HIP is a “follow CUDA” strategy – i.e., where AMD develops an update to HIP as quickly as possible after Nvidia has released an update to its CUDA platform. The arguments in favor of HIP rest on the virtue of reuse of a large CUDA codebase for AMD GPUs. Unfortunately, given the opaqueness of CUDA no one can follow CUDA too closely, timely, or accurately. This offers no opportunity for AMD to expose unique AMD hardware innovation without forcing CUDA developers to change their code with #ifdefs for AMD GPUs.
While AMD has created value with HIP for those that seek AMD GPUs as an alternative to Nvidia GPUs, it is not hard to want more. Imagine having a solution that can keep pace with the feature innovation and performance of CUDA!
We believe that innovation will flourish the most in an open field rather than in the shadows of a walled garden.
[Editor’s note: There is a SYCL implementation called hipSYCL that sits on top of HIP and targets AMD GPUs running ROCm and Nvidia GPUs.]
Elephant 4: Why not just use OpenCL?
OpenCL provides an open multivendor alternative, but at a lower layer of the software stack than SYCL or CUDA offers. SYCL grew out of a desire to bring the benefits of OpenCL’s open, multivendor, multiarchitecture approach by providing a standard C++ interface for heterogenous parallel architectures. SYCL implementations often utilize OpenCL for their implementations, but also have the flexibility to use other backends under the hood as of SYCL2020. SYCL delivers on the promise of OpenCL, in a higher productivity form through its C++ abstractions.
Elephant 5: Can’t we just use C++ ?
Let’s start with the assumption that we want to program heterogeneous machines, we value portability, and we do not want to pay a penalty in performance for portability.
We might answer ”yes” – C++ is enough when you have SYCL support too. After all, C++ was built to be extended by template libraries like SYCL. SYCL adds no new keywords, but it does benefit from SYCL-aware C++ compilers to help with cross-compilation, fat binaries, and remote memories. Those are simply things C++ compilers have not historically made easy.
SYCL also offers a solution today, within standard C++, to address programming for full heterogeneous computing built on top of ISO C++. This includes device enumeration (info), defining work (kernels), submitting and coordinating work across devices (queue), and managing remote memories.
That brings us to “No” – the C++ standard does not define support for heterogeneous systems with disjoint (non-coherent) memories. Some think it will add that one day, and there is effort to go in that direction, but even those involved believe the current direction will take at least 10 years and it is limited by the need for C++ to maintain backwards compatibility with millions of lines of existing code. In fact, one of us (MW) has written papers urging C++ in that direction. The response from WG21 (ISO C++), understandably because of the backward compatibility concerns, has been to start with parallel algorithms and executors, and add forward progress guarantees instead of making radical surgical change to the memory and addressing model. Therefore, if you are programming heterogeneous machines it is not likely to be enough to claim “C++ is enough.” There are some trying to move in that direction and that is the beauty of a competitive industry, we can see what will work out in the best interest of the market and consumers. However, today what will work immediately is “C++ plus SYCL” or “C++ plus CUDA” or “C++ plus OpenCL.”
The purpose of adding SYCL support into our C++ compiler and runtimes, is to add capabilities so C++ supports full heterogeneous support that it does not offer today without SYCL. It is also a way to show how C++ can support heterogeneity in the future, as ISO standards tend to standardize best practices of pre-existing knowledge. We will show one such example below.
Elephant 6: Can SYCL queues can make it into ISO C++?
Queues are how SYCL assigns work to heterogeneous devices, including handing off data within complex memory systems (not necessarily unified and coherent).
It is easy to speculate on whether a queue class belongs in C++ long-term, but such speculation is premature.
Proposals for C++23 have included various constructs to direct execution to specific devices, including “std::execution” in p2300. We know C++23 will continue to rely on a unified global memory address space and will not support disjoint remote memories (complex memory systems).
It is easy to get caught up on syntax. Eventually, if C++ expands to include full heterogeneous support, the concepts embodied in SYCL queue will be needed. Until then, SYCL fills this void. Some important capabilities, such as parallel directives, and message passing, have remained independent standards (OpenMP and MPI). While it is possible C++ will not grow to include full heterogeneous support, we believe C++ will eventually add such support incrementally.
C++ aims to standardize established best practice instead of inventing new and unproven features, therefore SYCL is an important steppingstone as one of the many feeders of ‘established best practice’ into the intentionally slower moving C++ standardization process.
As C++23 settles, and C++26 is considered, the future of C++ for heterogeneous computing will begin to take shape, including syntax but likely a full solution will not emerge for another 5-10 years.
SYCL offers a solution today, within standard C++, to address programming for full heterogeneous computing. This includes device enumeration (info), defining work (kernels), submitting work to devices (queue), and managing remote memories.
Elephant 7: Who is behind SYCL? Is it really Open in the true sense of the word?
We believe that open, international standards and Open Source Software (OSS) projects are good for everyone. When individuals from Intel and Codeplay get involved, we have found that they work hard to help develop and promote such standards and OSS – from WiFi, USB, PCIe to OpenMP, MPI, Fortran, C, C++, OpenCL, and SYCL.
Apple was the original force behind OpenCL, which began as a set of C interfaces at a fairly low level. SYCL originally grew out of efforts within OpenCL to consider higher level interfaces, specifically using C++. After multiple years of very open debates, SYCL was born. Codeplay has been instrumental in SYCL from the very beginning. Intel’s interest in SYCL grew after entering both the FPGA market and announcing the Intel Xe architecture to include GPUs for compute. Intel is proud to be an active member in the SYCL committee, and an active contributor to implementations to support SYCL. SYCL is a community effort, and the homes of both authors of this article (Intel and Codeplay) are enthusiastic participants along with many others.
Elephant 8: I see a herd of elephants – why should I believe in SYCL?
If you have not yet needed to program an application for multiple heterogeneous machines, you may not yet feel the pain to really understand why we are so excited about the prospects for SYCL. Questioning the need is quite logical.
There are many use cases for heterogeneous programming. In our CPPCON 2021 tutorial, we taught programmers from large companies, small companies, and national labs, how to offload high throughput workloads to specialized accelerators.
Based on many experiences like that, we have every reason to be confident that interest in SYCL will continue to grow at a rapid pace because of the need for C++ programming for heterogeneous platforms.
If you believe in the power of diversity of hardware and want to harness the impending explosion in architectural diversity, then SYCL is worth a look. Not only it open, multivendor, multiarchitecture play – but it is the key one for C++ programmers (as detailed in “Considering a Heterogeneous Future for C++”).
Open, Industry Standards are Critical to Enable High-Volume Markets
New technology often starts as proprietary developments, which may be sufficient to enable niche applications and markets. But, as these niche applications grow into technology ecosystems, so does the need for competition and industry standardization to enable widespread adoption. Accelerated computing, for many years only a niche capability, has certainly emerged with the status of “here to stay.” Multiple factors contributed to this, and they are not all going away (power wall, IPC wall, memory wall).
SYCL and related efforts, like oneAPI, were introduced to bring open, industry standards to the historically proprietary universe of accelerated computing.
The biggest question is: how many influencers are eager to promote a move to standards, vs. how many are locked up by proprietary interests?
As the Cambrian explosion of novel computer architectures unfolds, the case for open, multivendor, multiarchitecture standards only grow stronger.
SYCL is an open standard that invites feedback and contributions from everyone to the standard and the open source projects to implement it. The shared goal by everyone involved is to unambiguously ensure paths to high performance for all accelerators in this exciting new golden age for computer architecture.
We invite you to https://sycl.tech/ to learn more. There you will find numerous online tutorials, a link for our SYCL book (free PDF download), and a link to the current SYCL 2020 standards specification.
James Reinders believes the full benefits of the evolution to full heterogeneous computing will be best realized with an open, multivendor, multiarchitecture approach. Reinders rejoined Intel a year ago, specifically because he believes Intel can meaningfully help realize this open future. Reinders is an author (or co-author and/or editor) of ten technical books related to parallel programming; his latest book is about SYCL (it can be freely downloaded here).
Michael Wong is the Distinguished Engineer at Codeplay Software. He is a current Director and VP of ISOCPP Foundation, and a senior member of the C++ Standards Committee with more than 25 years of experience. He is a member of the C++ Directions Group. He chairs the WG21 SG19 Machine Learning and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number C++/OpenMP/Transactional memory features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of SG1 Concurrency TS and SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and all Programming Languages for Standards Council of Canada. Previously, he was CEO of OpenMP involved with taking OpenMP toward Accelerator support and the Technical Strategy Architect responsible for moving IBM’s compilers to Clang/LLVM after leading IBM’s XL C++ compiler team.
 Elephants in the Room can be defined as important questions that are obvious, but no one mentions them because they make at least some persons uncomfortable.