Argonne Researchers Enhance MadGraph Code for Next-Gen Supercomputing Challenges

January 29, 2024

Jan. 29, 2024 — As part of an ongoing Aurora Early Science Program (ESP) project to prepare the ATLAS experiment at CERN’s Large Hadron Collider (LHC) for the exascale era of computing (“Simulating and Learning in the ATLAS Detector at the Exascale,” led by Walter Hopkins), researchers are porting and optimizing the codes that will enable the experiment to run its simulation and data analysis tasks on an array of next-generation architectures.

Among them is the soon-to-launch Intel-HPE Aurora system housed at the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory.

One such code is MadGraph, a simulator that performs the particle-physics calculations needed to generate the particle interactions expected in LHC detectors. As a framework, MadGraph aims at complete Standard Model and Beyond the Standard Model phenomenology, including elements such as cross-section computations as well as event manipulation and analysis.

Nathan Nichols, a postdoctoral appointee at the ALCF, joined the ATLAS ESP project to study performance portability frameworks in the context of the MadGraph code to enable experiments like ATLAS to use modern supercomputers for their large computational needs.

Defining Performance Portability for MadGraph

“To me, to be portable and performant means that the application needs to be capable of running efficiently and effectively on as many devices as possible—irrespective of vendor, whether it’s NVIDIA or Intel or AMD,” Nichols said. “If an application runs really well on Intel GPUs but has problems running on NVIDIA GPUs, it’s not really performance portable. As a developer you also want the code to be easily maintainable, so you don’t want to have a patchwork of different chunks of code as your code base—you want to write one code and have that code be performant on all different devices.”

Because MadGraph, originally developed over a decade ago, has a legacy code base that enables users to simulate most physical processes of interest at the LHC, achieving portability was less straightforward than it would be for a traditional standalone scientific application.

“The application contains Python scripts that write Fortran code to generate a simulation of whatever physics experiment is running at CERN at the time; it’s very generic and writes code on the fly—potentially a new set of source code for each physics process, depending on the experiment,” Nichols said. “It could be challenging to port the application performantly, or to write a performant GPU kernel because the kernel could be any sort of particle configuration, and we need to be able to generate and run on devices effectively and efficiently whatever physics process that physicists might want to explore. That was somewhat daunting to tackle.”

Determining Which Portability Framework to Adopt

The team had already settled on testing three portability frameworks (SYCL, Kokkos, and alpaka) when Nichols joined the project. He would take the lead on developing the SYCL version.

Five representative physics processes were chosen as standard cases to test code performance. When the time came to work in earnest on the SYCL port of MadGraph, Nichols’s first step was to examine the native CUDA code to identify areas in which performance gains could be made.

“After everything was written and we had working versions of the software for different portability frameworks, we needed to narrow down our options to determine which made the most sense to support in the future,” he explained.

Nichols led the Argonne effort to measure and compare the performance of the frameworks across multiple architectures.

Using GitLab, Nichols set up continuous integration (CI) pipelines to carry out regular performance tests on the various devices hosted at Argonne’s Joint Laboratory for System Evaluation (JLSE). The systems on which these nightly performance tests ran included NVIDIA GPUs (V100 and A100), Intel GPUs (early versions of those in Aurora), Intel CPUs (Skylake), and AMD GPUs (MI50, MI100, MI250).

The testing setup afforded by the JLSE systems made it straightforward to judge the frameworks’ performance on the five physics processes under evaluation: Nichols would begin with computationally simple processes before progressively ramping up the level of computational difficulty. Throughout the testing, Nichols made slight alterations to the software stack to evaluate how performance was affected.

He conducted performance-scaling tests to see which portability framework delivered the best performance across the different GPUs. The SYCL port was eventually determined to be the most performant on all tested systems, outperforming even the native CUDA and CPU codes, with Kokkos a close second. Given these metrics, the ATLAS team chose to move forward with SYCL and discontinue development with the other portability frameworks.

With the SYCL port selected, Nichols updated the Python-based code generator in the MadGraph framework to optionally output the SYCL matrix element calculations. Originally, the generator produced only the Fortran-based matrix element code for the user-specified physics process.

While Nichols worked on the SYCL port, the CERN team had developed a code-mixing bridge that allows the Fortran code to call functions in the various C++ portability libraries being tested.

“We needed the SYCL library and the Fortran code to talk to each other,” Nichols said. “But we couldn’t get the SYCL library and Fortran code to link properly, on account of the different programming languages in play, which was the first of many bugs we discovered. Luckily my team had been working on the problem closely with contacts at Intel, and now all of those bugs are being taken care of and we should be able to smooth them out.”
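The general shape of such a bridge is a thin C-linkage layer: the SYCL side exposes plain extern "C" functions that a Fortran routine declared with ISO_C_BINDING can link against. The sketch below illustrates the idea under that assumption; the function name, arguments, and kernel are hypothetical stand-ins for MadGraph's generated matrix element code, not the actual CERN bridge.

```cpp
#include <sycl/sycl.hpp>

// Hypothetical bridge routine; extern "C" disables C++ name mangling so a
// Fortran interface block with bind(C, name="mg5_compute_weights") can call it.
extern "C" void mg5_compute_weights(const double* momenta, double* weights,
                                    int n_events, int n_particles) {
  static sycl::queue q;  // one queue reused across calls from Fortran

  const size_t n_mom = static_cast<size_t>(n_events) * n_particles * 4;

  // Stage the Fortran-owned arrays in device memory.
  double* d_mom = sycl::malloc_device<double>(n_mom, q);
  double* d_wgt = sycl::malloc_device<double>(n_events, q);
  q.memcpy(d_mom, momenta, n_mom * sizeof(double)).wait();

  // Placeholder kernel standing in for the generated matrix element code.
  q.parallel_for(sycl::range<1>(n_events), [=](sycl::id<1> i) {
    d_wgt[i[0]] = d_mom[i[0] * n_particles * 4];  // dummy computation
  }).wait();

  // Hand the results back to the Fortran caller and release device memory.
  q.memcpy(weights, d_wgt, n_events * sizeof(double)).wait();
  sycl::free(d_mom, q);
  sycl::free(d_wgt, q);
}
```

On the Fortran side, a matching interface block would declare the same routine with bind(C) and iso_c_binding types so the two languages agree on the calling convention.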

Dealing with High Register Pressure

Some of those bugs stem from the MadGraph code’s issues with high register pressure. A register file is a small bank of very fast on-chip storage in which a kernel’s working values are kept, and its use has a critical impact on GPU performance. Register pressure, loosely speaking, is the term for when a kernel’s demand for registers approaches the register file’s maximum capacity.

“Once the register file is full, the objects inside have to be transferred to a different memory location, which takes time, and then the emptied register file begins filling with new items,” Nichols said. “Now an application I’m running has to call items not just from the ordinary register file, but from throughout the system’s global memory.”

Apart from correcting the dips in performance that result from the transfers triggered by register pressure, the MadGraph team has had to debug the code in order to take advantage of available performance profiling tools, which themselves require space to be reserved in the register file.

“Since we’re having these register spills, the performance tools can’t reserve that space themselves, so we can’t get really in-depth analysis of what our code is doing. This in turn means that we have to rely on educated guesses informed by prior programming experience,” Nichols said.

Toward Deployment on Aurora

The SYCL port of MadGraph has displayed superior performance on Intel GPUs compared to other versions of the application to date.

After running the SYCL port on the entirety of the Aurora test and development system Sunspot—which has 128 available nodes—Nichols and the ATLAS team began tuning I/O and communication for MadGraph-generated files to ensure efficient functionality when deployed at scale on Aurora.

MadGraph is an embarrassingly parallel application: an independent process can be delegated to each GPU, and performance scales linearly with the number of devices.

“Because MadGraph is embarrassingly parallel, scaling is not a worry as far as Aurora deployment goes,” Nichols pointed out.
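In practice, an embarrassingly parallel launch of this kind often comes down to binding each process to its own GPU and letting every process generate its own batch of events independently. Below is a minimal sketch of that binding step, assuming a launcher that exports a node-local rank in an environment variable; the variable name is hypothetical, and the actual MadGraph/ATLAS workflow may differ.

```cpp
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>

int main() {
  // Enumerate the GPUs visible to this process.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.empty()) {
    std::cerr << "No GPU devices found\n";
    return 1;
  }

  // Hypothetical: the launcher exports a node-local rank for each process.
  const char* lr = std::getenv("LOCAL_RANK");
  const size_t rank = lr ? std::strtoul(lr, nullptr, 10) : 0;

  // Bind this process to "its" GPU; each process then generates its own
  // batch of events with no communication between processes.
  sycl::queue q{gpus.at(rank % gpus.size())};
  std::cout << "Process with local rank " << rank << " bound to "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}
```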

On the other hand, as the full Aurora system comes online, the ATLAS team must measure whether the current workflow developed for the code is effective on the exascale computer’s vast array of compute nodes.

Nichols also developed a custom math function library that allows for swapping primitive data types with SYCL vector types without breaking the MadGraph code.

“The SYCL vector types allow the code to take advantage of vector instruction sets available on CPUs giving a performance boost on those devices,” he said. “Using SYCL vector types in this ad hoc way is in contrast to the recommended approach, which is to rely on auto-vectorization and use function widening. However, for a large legacy codebase like MadGraph, conventional approaches are often insufficient to gain the desired performance.” He added that the library still requires further revision because—while use with the SYCL vector type delivers strong performance on CPUs—on GPUs its performance slows considerably, and its compilation time increases substantially.
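The underlying idea is that the generated physics code is written against overloaded math helpers, so the same source can be instantiated with a primitive type (one event per call) or with a SYCL vector type (several events per call, mapped onto the CPU’s vector registers). Here is a rough sketch of that pattern, with hypothetical names rather than the actual MadGraph library:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper namespace: each math function is overloaded for both a
// scalar and a sycl::vec, so primitive types can be swapped for vector types
// without changing the calling code.
namespace mgmath {

inline float sqrt(float x) { return sycl::sqrt(x); }

template <int N>
inline sycl::vec<float, N> sqrt(sycl::vec<float, N> x) {
  return sycl::sqrt(x);  // SYCL math functions are defined element-wise on vec
}

}  // namespace mgmath

// The same generic routine compiles for FPType = float (one event) or
// FPType = sycl::vec<float, 8> (eight events packed into vector registers).
template <typename FPType>
FPType invariant_mass(FPType e, FPType px, FPType py, FPType pz) {
  return mgmath::sqrt(e * e - px * px - py * py - pz * pz);
}

int main() {
  float m_scalar = invariant_mass(5.0f, 1.0f, 2.0f, 3.0f);

  using vec4 = sycl::vec<float, 4>;
  vec4 m_vec = invariant_mass(vec4{5.0f}, vec4{1.0f}, vec4{2.0f}, vec4{3.0f});

  return (m_scalar > 3.0f && m_vec[0] > 3.0f) ? 0 : 1;  // sqrt(11) ≈ 3.32
}
```

Because only the helper functions know about the vector type, swapping the floating-point type in and out does not require touching the generated physics code, which mirrors the swap-without-breaking property described above.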

Nichols intends to improve the library through systematic testing of the code, and by consulting with other developers to glean a diverse array of perspectives.

Such collaboration has played a large role in bringing MadGraph to exascale and in tuning Aurora for future users; regular workshops with Intel staff at the Center of Excellence, in particular, helped generate ideas for improving the performance of the MadGraph code and for identifying needed compiler improvements. ESP projects in general are an essential vehicle for resolving issues inherent to the rollout of complex, large-scale HPC systems.

The SYCL portability framework allows developers to launch parallel kernels (that is, code that runs on GPUs) using several different methods. This led to experimentation with and comparison of basic data-parallel kernels, work-group data-parallel kernels, and hierarchical data-parallel kernels; the hierarchical data-parallel kernels were found to perform best.
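For reference, those three launch styles correspond to distinct SYCL invocation APIs: a plain range, an nd_range with an explicit work-group size, and the hierarchical parallel_for_work_group / parallel_for_work_item pair. The kernels in this sketch are trivial stand-ins, not MadGraph’s generated code:

```cpp
#include <sycl/sycl.hpp>

int main() {
  constexpr size_t N = 1024;   // total work-items
  constexpr size_t WG = 64;    // work-group size

  sycl::queue q;
  float* data = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = float(i);

  // 1. Basic data-parallel kernel: one work-item per element, no explicit groups.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    data[i[0]] *= 2.0f;
  }).wait();

  // 2. Work-group data-parallel (ND-range) kernel: explicit work-group size.
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{WG}},
                 [=](sycl::nd_item<1> it) {
    data[it.get_global_id(0)] += 1.0f;
  }).wait();

  // 3. Hierarchical data-parallel kernel: an outer scope per work-group and an
  //    inner parallel loop over the work-items in that group.
  q.submit([&](sycl::handler& h) {
    h.parallel_for_work_group(sycl::range<1>{N / WG}, sycl::range<1>{WG},
                              [=](sycl::group<1> g) {
      g.parallel_for_work_item([&](sycl::h_item<1> item) {
        data[item.get_global_id(0)] *= 0.5f;
      });
    });
  }).wait();

  sycl::free(data, q);
  return 0;
}
```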

“The relevant piece of the SYCL specification has been under revision recently, so I’ve been interested in helping with that process, to which end I developed some different test applications,” Nichols mentioned. “Just testing the various options available under the SYCL specification has been the best way to determine how to improve performance.”

The Argonne ATLAS team explored offloading different parts of the full software pipeline to the GPU, but Nichols found that nearly all of the performance bottlenecks could be attributed to the matrix element calculations. Offloading those calculations to the GPU accelerated the code to such a degree that the parts still running on the CPU suffered only minor performance impacts, and the effort required to offload them would have outweighed any performance gains.

Nichols is currently working toward the MadGraph release, which entails completing documentation, testing and ensuring that all physics processes are functioning correctly, and honing the SYCL port to ensure code maintenance is as simple and straightforward as possible for users. These efforts are intended to culminate in extending the ATLAS project to Aurora via eventual deployment of the SYCL port.


Source: Nils Heinonen, ALCF
