State of SYCL – ECP BOF Showcases Progress and Performance

By John Russell

February 28, 2023

Enabling interoperability across U.S. exascale supercomputers is one of the chief goals for the U.S. Exascale Computing Project (ECP), which has broadly overseen development of the early software ecosystem needed to support the new class of supercomputers. Earlier this month, ECP held its annual community BOF days, a virtual event spanning a wide range of topics – including a session on SYCL, which has been gaining momentum as a programming framework for heterogeneous computing.

The rise of heterogeneous computing (typically CPU/GPU pairings) is the big challenge that many hope SYCL can help address, and the first round of U.S. exascale supercomputers is something of a poster child for that challenge.

Frontier, the first U.S. exascale system, now up and running at the Oak Ridge Leadership Computing Facility (OLCF), and El Capitan, the third scheduled system (to be hosted by Lawrence Livermore National Laboratory), both use CPU/GPU pairings from AMD. Aurora, which is now being stood up at the Argonne Leadership Computing Facility (ALCF), will use Intel CPUs and GPUs. Add to this the practical reality that virtually all of the big pre-exascale systems, such as Perlmutter at the National Energy Research Scientific Computing Center (NERSC), use Nvidia GPUs, and you get a sense of the challenge of maintaining interoperability.

As the number of accelerators – mostly GPUs – has grown, so has the number of programming models created to support them. Currently, CUDA (Nvidia), ROCm with HIP (AMD), and most recently SYCL/oneAPI (Intel) are the big players. Others will likely emerge as more vendors bring GPUs and other accelerators to market. These tools will need to be able to talk to each other in productive ways to wring maximum value from exascale systems and other major supercomputers.

At the core of ECP’s recent SYCL BOF were lightning talks from ALCF, OLCF, and NERSC on their efforts to support interoperability, as well as a somewhat longer update from Intel – a SYCL/oneAPI driver – on SYCL’s evolving feature set. Unfortunately, the sessions were not recorded, but BOF leader Abhishek Bagusetty, a computer scientist at ALCF, said he would compile the slides and make them available.

By way of quick background, SYCL is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability, and efficiency of OpenCL. The idea is to enable code for heterogeneous processors to be written in a “single-source” style using standard C++. Intel introduced DPC++ (Data Parallel C++) based on SYCL and used it in the oneAPI framework that will be used with Aurora. SYCL itself is a Khronos Group project begun in 2014, and Codeplay is the company that has been most involved in SYCL development. The latest version is SYCL 2020.

Here’s the Khronos description:

“SYCL defines abstractions to enable heterogeneous device programming, an important capability in the modern world which has not yet been solved directly in ISO C++. SYCL has evolved with the intent of influencing C++ direction around heterogeneous compute by creating productized proof points that can be considered in the context of C++ evolution.

“A major goal of SYCL is to enable different heterogeneous devices to be used in a single application — for example simultaneous use of CPUs, GPUs, and FPGAs. Although optimized kernel code may differ across the architectures (since SYCL does not guarantee automatic and perfect performance portability across architectures), it provides a consistent language, APIs, and ecosystem in which to write and tune code for accelerator architectures. An application can coherently define variants of code optimized for architectures of interest, and can find and dispatch code to those architectures.

“SYCL uses generic programming with templates and generic lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code across an extensive range of acceleration backend APIs, such as OpenCL and CUDA.”

Presented here are a few brief comments from the ALCF, OLCF, and NERSC talks and a bit more from the Intel SYCL update.

ALCF – Leveraging oneAPI and CUDA APIs

Not surprisingly, there’s a significant effort readying Aurora for use with oneAPI/SYCL. Bagusetty provided the ALCF briefing.

“At ALCF, our natural experience is with Aurora, so we use pre-developed or started developing applications for Aurora. One additional machine that we have is a pre-exascale machine, Polaris, which is a similar architecture to NERSC’s Perlmutter. I’ve listed how SYCL can be used on these two machines at ALCF. One important difference is the driver library; as most of you know, Level Zero is the API for Aurora. Similarly, the standard CUDA Driver APIs are the runtime library for Polaris, and that too is driven by SYCL,” said Bagusetty.

“We have modules for ease of access on both of these machines, with different cadences on how they get updated. Aurora has all the libraries that we use for the applications. On Polaris, since it’s a CUDA-based machine, we do have cu-based libraries – it could be cuBLAS, cuSPARSE, all the math libraries. In addition to oneAPI, which provides a single compiler, there is currently a bit of testing going on to make oneMKL and oneDPL available. The first one that will be available is oneMKL. OneDPL is still [undergoing] testing, so watch out for that if you have an application that uses SYCL or oneMKL or oneDPL in that space. It’s worth giving it a shot on Polaris and scaling up to several hundreds of nodes,” he said.

OLCF – Working on Prototype DPC++ for AMD GPUs

The Frontier exascale system at OLCF is an AMD CPU/GPU machine and relies heavily on AMD’s HIP – Heterogeneous Interface for Portability – as a programming model and tool. Balint Joo, a group leader at OLCF, talked about efforts to also support SYCL to gain interoperability with Aurora.

“There has been interest in DPC++ and SYCL at OLCF. The primary focus is facilities interoperability. We’d like to be interoperable with Aurora, and this is sort of complementary to the HIP-LZ (HIP on Level Zero) effort at Argonne, which is to have compatibility with HIP from Frontier. This is a potential additional onramp for other applications and libraries. It’s just another way to try getting on if you already have [SYCL] in your code. So this is a risk mitigation exercise. [Thus far] we have part-funded and participated in prototype work with Codeplay, and that was a prototype for DPC++ on AMD GPUs. It’s basically an AMD plugin like the CUDA plugin. Some of us have been doing personal tests in their own directory looking at hipSYCL, which has now become openSYCL,” said Joo.

“In terms of what’s installed on the systems currently, Codeplay does nightly builds of their prototype work on Spock, which is an MI100 (AMD GPU) machine here. This is what’s called user-managed software. So the modules you need to load are ums for user and then ums15, which I guess is Codeplay’s account there. Then you can use dpcpp and should get a clang++, which can compile SYCL on the back end to the extent the features are available in that prototype (see slide below). John Holman has made the public DPC++ installation on Crusher (development system for Frontier). I’m not 100% sure what the difference between that and the prototype is, [but] I suspect they’re the same. If you log on to Crusher, you can just do a module load of DPC++ and you can get it (see slide). Right now, we’re in discussions with Codeplay to finish the prototype for Frontier. The prototype already has a lot of features, but it’s not fully-featured. So the plan is to make that the case,” said Joo.

NERSC – Attempting to Support a Wide Range of Programming Models

The Perlmutter architecture at NERSC has been optimized for AI workloads and, while not an exascale system, it is formidable at 64.6 petaflops and was number seven on the most recent Top500 list. It uses AMD CPUs and Nvidia GPUs. Brandon Cook, acting group lead for the Programming Environments and Models group at NERSC, provided the update, which was quite brief.

“We support sort of every programming model and compiler we can. One of the key ones is the Intel LLVM and SYCL programming model. We have a new group that’s focused on this space. Throwing some support behind these types of activities is something that NERSC is doing. Last year I mentioned our prototype LLVM programming environment, which we have found to be very successful [based on] the positive feedback. The maturity of the components has really increased in the past year. We’re currently working on integrating that as the default set of modules that will kind of plug right into the standard Cray Programming Environment.”

INTEL – SYCL Performance on Par with CUDA and HIP

John Pennycook, an application engineer from Intel, had the longest presentation of the BOF; he presented snapshots of SYCL conformance and performance against Nvidia and AMD tools, and reviewed the evolving feature set. One sore spot in the latter is SYCL's current lack of formal support for complex numbers and for data structures of more than three dimensions, although there are workarounds.

“I want to give a very quick status update on the oneAPI DPC++ compiler, which is the official name of the open source project (SYCL) at Intel. For those of you who may be new to SYCL and Khronos specifications in general, with a Khronos specification, before an implementer can actually say that they are conformant, or compliant with a particular version of the specification, they have to pass what is known as the conformance test suite, or CTS. SYCL 2020 doesn’t actually have a CTS yet. But Intel is working with contractors and other members of the SYCL Khronos working group to get this CTS ready so that we can start to test all of the different implementations and see if any of them are conforming,” said Pennycook.

“If you look at this graph that I grabbed from GitHub, you can see that over the last two years, we’ve kind of got this steady upward trend in terms of activity in the CTS. I think we’re on track to actually have better coverage in the SYCL 2020 CTS than we did for version 1.2.1. And if you look at the right-hand side of this slide, you can see a list of some of the features we now have tests for. This isn’t an exhaustive list, but stuff I thought would be of interest to this audience. So, things like atomics, group algorithms, reductions, subgroups, unified shared memory – we have tests for all of those things now. DPC++ is tracking these tests quite closely. Every time a new test comes out, we aim to pass it as quickly as we can. And we aim to pass all of the tests by the end of 2023.”

Pennycook presented comparative performance data on SYCL versus CUDA on Nvidia GPUs (A100) and HIP on AMD GPUs.

“Don’t pay too much attention to which bar is higher or lower. For each of these applications, the point that I really want to make is that they’re comparable. So SYCL is getting comparable performance to CUDA across all of these applications, and this demonstrates that SYCL is a high-performance language for NVIDIA GPUs. The reason that there is some difference here is because we’re not just comparing languages; it’s actually also a comparison of compilers. CUDA is being compiled with NVCC (Nvidia CUDA Compiler) and SYCL is being compiled with the open source Clang PTX back end. In some cases, the compilers make different choices and perform different optimizations,” said Pennycook.

Pennycook presented a similar graph comparing SYCL versus HIP on an MI100 (AMD GPU).

“Again, some of the [performance bars] are higher, some of them are lower, but things are comparable. I think that’s a good place for us to be. It used to be until fairly recently that if you wanted to use DPC++ to compile for NVIDIA GPUs or AMD GPUs, you actually had to build everything from source yourself. That’s no longer true. So as of the 2023 release of the Intel DPC++ compiler, you can now actually install these plugins alongside the Intel toolkit and have a single compiler that supports Intel CPUs and GPUs, NVIDIA GPUs and AMD GPUs. These plugins are freely available,” he said.
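In practice, the single-compiler workflow Pennycook describes looks roughly like the following. This is an illustrative sketch based on the documented DPC++ offload flags, not commands shown at the BOF; the exact `--offload-arch` values depend on the target GPU:

```shell
# Intel GPUs/CPUs (default SYCL target)
icpx -fsycl vector_add.cpp -o vadd_intel

# NVIDIA GPUs, via the CUDA plugin
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vector_add.cpp -o vadd_nvidia

# AMD GPUs, via the HIP plugin (architecture must be specified,
# e.g. gfx908 for MI100)
icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
     -Xsycl-target-backend --offload-arch=gfx908 \
     vector_add.cpp -o vadd_amd
```

The same SYCL source is compiled three times, once per backend, which is the “single compiler that supports Intel CPUs and GPUs, NVIDIA GPUs and AMD GPUs” claim in concrete form.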

Pennycook also discussed some of the new features being incorporated into SYCL and a few of those slides are shown below.
