Virtual ALCF Workshop Provides Guidance on Using AI and Supercomputing Tools for Science

March 19, 2021

March 19, 2021 — In the world of high-performance computing (HPC), the convergence of artificial intelligence (AI) and data science with traditional modeling and simulation is changing the way researchers use supercomputers for scientific discovery.

To help scientists find their footing in this ever-changing landscape, the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science User Facility, has placed a premium on training researchers to use emerging AI and machine learning tools and techniques efficiently on its supercomputing resources.

Through a regular series of workshops and webinars and a special allocation program called the ALCF Data Science Program (ADSP), the facility has been working to build a community of scientists who can employ these methods at a scale that requires the DOE’s leadership-class computing resources.

“Ten years ago, people were primarily only using our supercomputers for numerical simulations. These large simulations output a lot of data, but they didn’t need a lot of data input to do their analysis,” said Taylor Childers, ALCF computer scientist. “With AI, deep learning and machine learning bursting onto the scene in the last five years, we’ve put a lot of effort into onboarding new research teams that are not accustomed to using HPC.”

Childers was one of the organizers of the ALCF’s recent Simulation, Data, and Learning Workshop, an annual event focused on helping participants improve the performance and productivity of data-intensive and machine learning applications. Held virtually in December, the event welcomed more than 100 attendees from across the world.

“Our workshops not only provide guidance on using AI and HPC for science, they also give people an opportunity to engage with us and find out how ALCF resources can potentially benefit their research,” Childers said.

ALCF researchers Misha Salim (top) and Sam Foreman guide workshop attendees through a session on using the DeepHyper tool on the facility’s Theta system. (Image: Argonne National Laboratory)

In addition to training events, the ALCF has been building the hardware and software infrastructure needed to support research enabled by the confluence of simulation, data, and learning methods. On the hardware side, the facility recently deployed ThetaGPU — an upgrade to its Theta supercomputer powered by graphics processing units (GPUs). The augmented system provides enhanced capabilities for data analytics and AI model training. The ALCF’s next-generation systems, Polaris and Aurora, will also be hybrid CPU-GPU machines designed to handle AI and data-intensive workloads.

In the software space, the ALCF has been building up its support for machine learning frameworks, including TensorFlow, PyTorch, Horovod, and Python-based modules, libraries, and performance profilers, as well as tools for data-intensive science, such as JupyterHub and MongoDB. In addition, the facility has developed and deployed its own tools, such as the Balsam HPC workflow and edge service, and DeepHyper, a scalable, automated machine learning package for hyperparameter optimization and neural architecture search. Together with the expertise of ALCF staff, these hardware and software tools are helping researchers open new frontiers in scientific computing.
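DeepHyper automates the search for good hyperparameters (learning rates, batch sizes, network depths, and the like) so that researchers do not have to tune them by hand. DeepHyper's own API is not shown in this article, but the core idea can be sketched in plain Python as a simple random search over a hyperparameter space; the search-space entries and the `train_and_score` stand-in below are hypothetical placeholders, not DeepHyper's interface:

```python
import math
import random

# Hypothetical search space: names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),       # sampled log-uniformly
    "batch_size": [32, 64, 128, 256],
    "num_layers": [2, 3, 4, 5],
}

def sample_config(rng):
    """Draw one hyperparameter configuration from the space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "num_layers": rng.choice(SEARCH_SPACE["num_layers"]),
    }

def train_and_score(config):
    """Stand-in for real model training; returns a validation score.
    This synthetic objective peaks near lr=1e-2 with 4 layers."""
    lr_term = -abs(math.log10(config["learning_rate"]) + 2)
    return lr_term - 0.1 * abs(config["num_layers"] - 4)

def random_search(num_trials, seed=0):
    """Evaluate num_trials random configurations; return (score, config)."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        cfg = sample_config(rng)
        score = train_and_score(cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best
```

In a real campaign on a system like Theta, each `train_and_score` call is a full training run, and a tool such as DeepHyper distributes those evaluations across many compute nodes while steering the search more intelligently than pure random sampling.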

Going virtual 

In 2020, the ALCF’s traditionally in-person workshops transitioned to virtual events due to the COVID-19 pandemic. While the facility has been hosting online training webinars for years, the ALCF Computational Performance Workshop in May was its first large-scale user workshop to go completely virtual. Leveraging tools like Zoom for video conferencing and Slack for instant messaging and collaboration, the ALCF’s virtual workshops have been successful in recreating the collaborative, hands-on nature of its on-site training events. For the Simulation, Data, and Learning Workshop, the ALCF employed a GitHub repository that contained all of the code and instructions for the planned activities.

To make the experience more engaging, the workshop was structured entirely around hands-on tutorials with opportunities to interact throughout. Day one was designed to help participants get distributed deep learning code and data pipelines running on ALCF systems; day two covered using DeepHyper for hyperparameter optimization and how to profile and improve application performance; and day three was dedicated to getting the attendees’ deep learning networks deployed at scale in a simulation.

“The first day covered what you need to scale up a deep learning problem from your laptop to ALCF resources,” Childers said. “The next two days were successive advancements, focused on improving performance and ultimately on how to use the trained model in your research.”

The virtual format has the added benefit of making events more accessible to a larger base of participants. While a majority of attendees were from U.S. institutions, the workshop also welcomed international participants from England, Argentina, and Ghana.

“Virtual events have always meant an open chance to participate in events that I would never have been able to otherwise,” said Dario Dematties, a postdoctoral researcher at the CONICET Mendoza Technological Scientific Center in Argentina.

Attendees take their research to the next level

For Dematties, the workshop presented an opportunity to advance his work involving contrastive learning for visual representations, an approach that utilizes machine learning to identify similarities and differences in images. He was conducting his research with a Director’s Discretionary allocation on Cooley, the ALCF’s visualization and analysis cluster, but was having trouble getting his workflow to run on several nodes. Working with ALCF staff at the workshop, Dematties transitioned his work to the larger ThetaGPU system.

“After attending the first section on distributed deep learning, I decided to approach some experts. We didn’t succeed on the first try, but we did on the last day of the workshop,” Dematties said. “I had never run a PyTorch machine learning workflow on so many distributed GPUs. That was amazing. Thanks to this event, I now have an approved allocation for running my project on ThetaGPU.”

Maruti Mudunuru, an Earth scientist at DOE’s Pacific Northwest National Laboratory, attended the workshop to learn how distributed deep learning and hyperparameter optimization can be used to enhance watershed modeling.

“The focus of my research at the ALCF is to advance watershed modeling at the system-scale using machine learning and develop reduced-order models of river corridor processes,” Mudunuru said. “My goal is to test a proof of concept with my discretionary allocation. If my outcomes are successful, I plan to submit a proposal for the ALCF Data Science Program next year.”

Improving the performance of ongoing projects

In addition to helping new users scale up their research for ALCF systems, the workshop is also useful for existing facility users looking to employ new tools and techniques that can accelerate their research.

Argonne’s Ming Du, for example, attended the workshop to learn how machine learning frameworks can advance his project aimed at developing an accurate and efficient HPC framework for solving the dense 3D reconstruction problem in X-ray microscopy. The research began as part of an ADSP project and is now being pursued under a project awarded by DOE’s Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge.

During the hands-on sessions, Du worked with ALCF staff members to perform a scaling test for distributed training using the PyTorch DistributedDataParallel (DDP) module.

“This experience has pointed us to a clear pathway for improving the scaling performance of our framework in the future,” Du said. “We are planning to employ the more efficient DDP, which is expected to reduce the communication overhead of our application by a huge factor.”
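In data-parallel training of the kind DDP provides, each worker computes gradients on its own shard of the data, and an allreduce collective then averages those gradients so every worker applies the same update. The following dependency-free sketch illustrates that averaging step; the function names are illustrative, and production code would use `torch.nn.parallel.DistributedDataParallel` rather than anything like this:

```python
def allreduce_average(grads_per_worker):
    """Average each gradient component across workers, as an allreduce
    collective would. grads_per_worker is a list (one entry per worker)
    of equal-length gradient vectors (lists of floats)."""
    num_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    averaged = [
        sum(worker[i] for worker in grads_per_worker) / num_workers
        for i in range(length)
    ]
    # After the collective, every worker holds the same averaged gradient.
    return [list(averaged) for _ in range(num_workers)]

def sgd_step(params, grad, lr=0.1):
    """Apply one gradient-descent update using the averaged gradient."""
    return [p - lr * g for p, g in zip(params, grad)]
```

PyTorch's DDP additionally overlaps this gradient communication with the backward pass, bucketing parameters so that allreduce traffic hides behind computation — which is the kind of communication-overhead reduction Du describes.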

Du also benefitted from the workshop session focused on coupling simulation in C++ with deep learning in Python. He plans to use the knowledge he picked up at the event to train a deep neural network surrogate model that will be used to perform wave propagation simulations more quickly and efficiently than their current method.

The opportunity to use ThetaGPU at the workshop was another perk that will help Du and his colleagues prepare their research for the ALCF’s next-generation systems.

“This was the first time I got a chance to run applications on this machine, and that experience will be extremely helpful for us to optimize our framework for both ThetaGPU and the upcoming Aurora system, which will also be a GPU-accelerated machine,” Du said.

Registration is now open for the 2021 ALCF Computational Performance Workshop, which will be held virtually May 4-6, 2021. For additional training opportunities, visit the ALCF events webpage.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

About the Advanced Photon Source

The U.S. Department of Energy Office of Science’s Advanced Photon Source (APS) at Argonne National Laboratory is one of the world’s most productive X-ray light source facilities. The APS provides high-brightness X-ray beams to a diverse community of researchers in materials science, chemistry, condensed matter physics, the life and environmental sciences, and applied research. These X-rays are ideally suited for explorations of materials and biological structures; elemental distribution; chemical, magnetic, and electronic states; and a wide range of technologically important engineering systems from batteries to fuel injector sprays, all of which are the foundations of our nation’s economic, technological, and physical well-being. Each year, more than 5,000 researchers use the APS to produce over 2,000 publications detailing impactful discoveries, and solve more vital biological protein structures than users of any other X-ray light source research facility. APS scientists and engineers innovate technology that is at the heart of advancing accelerator and light-source operations. This includes the insertion devices that produce extreme-brightness X-rays prized by researchers, lenses that focus the X-rays down to a few nanometers, instrumentation that maximizes the way the X-rays interact with samples being studied, and software that gathers and manages the massive quantity of data resulting from discovery research at the APS.

This research used resources of the Advanced Photon Source, a U.S. DOE Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.

About Argonne National Laboratory 

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

About the U.S. Department of Energy’s Office of Science

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.


Source: Argonne National Laboratory
