Virtual ALCF Workshop Provides Guidance on Using AI and Supercomputing Tools for Science

March 19, 2021

March 19, 2021 — In the world of high-performance computing (HPC), the convergence of artificial intelligence (AI) and data science with traditional modeling and simulation is changing the way researchers use supercomputers for scientific discovery.

To help scientists find their footing in this ever-changing landscape, the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science User Facility, has placed a premium on training researchers to use emerging AI and machine learning tools and techniques efficiently on its supercomputing resources.

Through a regular series of workshops and webinars and a special allocation program called the ALCF Data Science Program (ADSP), the facility has been working to build a community of scientists who can employ these methods at a scale that requires the DOE’s leadership-class computing resources.

“Ten years ago, people were primarily only using our supercomputers for numerical simulations. These large simulations output a lot of data, but they didn’t need a lot of data input to do their analysis,” said Taylor Childers, ALCF computer scientist. “With AI, deep learning and machine learning bursting onto the scene in the last five years, we’ve put a lot of effort into onboarding new research teams that are not accustomed to using HPC.”

Childers was one of the organizers of the ALCF’s recent Simulation, Data, and Learning Workshop, an annual event focused on helping participants improve the performance and productivity of data-intensive and machine learning applications. Held virtually in December, the event welcomed more than 100 attendees from across the world.

“Our workshops not only provide guidance on using AI and HPC for science, they also give people an opportunity to engage with us and find out how ALCF resources can potentially benefit their research,” Childers said.

ALCF researchers Misha Salim (top) and Sam Foreman guide workshop attendees through a session on using the DeepHyper tool on the facility’s Theta system. (Image: Argonne National Laboratory)

In addition to training events, the ALCF has been building the hardware and software infrastructure needed to support research enabled by the confluence of simulation, data, and learning methods. On the hardware side, the facility recently deployed ThetaGPU, an extension of its Theta supercomputer powered by graphics processing units (GPUs). The augmented system provides enhanced capabilities for data analytics and AI training. The ALCF’s next-generation systems, Polaris and Aurora, will also be hybrid CPU-GPU machines designed to handle AI and data-intensive workloads.

In the software space, the ALCF has been building up its support for machine learning frameworks, including TensorFlow, PyTorch, Horovod, and Python-based modules, libraries, and performance profilers, as well as tools for data-intensive science, such as JupyterHub and MongoDB. In addition, the facility has developed and deployed its own tools, such as the Balsam HPC workflow and edge service, and DeepHyper, a scalable, automated machine learning package for hyperparameter optimization and neural architecture search. Together with the expertise of ALCF staff, these hardware and software tools are helping researchers open new frontiers in scientific computing.
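To give a flavor of the data-parallel training pattern these frameworks enable, the sketch below shows a minimal Horovod setup using its standard Keras API. The toy model, synthetic data, and hyperparameters are illustrative placeholders rather than anything drawn from the workshop materials; on an HPC system, the script would be launched with one process per GPU via the system’s MPI launcher.

```python
# Minimal data-parallel training sketch using Horovod's Keras API.
# Model, data, and hyperparameters are illustrative placeholders.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one process per GPU, started by the MPI launcher

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])

# Scale the learning rate by the worker count and wrap the optimizer
# so gradients are averaged across all processes each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

# Broadcast rank 0's initial weights so every worker starts in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x = tf.random.normal((256, 32))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```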

Going virtual 

In 2020, the ALCF’s traditionally in-person workshops transitioned to virtual events due to the COVID-19 pandemic. While the facility has been hosting online training webinars for years, the ALCF Computational Performance Workshop in May was its first large-scale user workshop to go completely virtual. Leveraging tools like Zoom for video conferencing and Slack for instant messaging and collaboration, the ALCF’s virtual workshops have been successful in recreating the collaborative, hands-on nature of its on-site training events. For the Simulation, Data, and Learning Workshop, the ALCF employed a GitHub repository that contained all of the code and instructions for the planned activities.

To make the experience more engaging, the workshop was structured entirely around hands-on tutorials with opportunities to interact throughout. Day one was designed to help participants get distributed deep learning code and data pipelines running on ALCF systems; day two covered using DeepHyper for hyperparameter optimization and how to profile and improve application performance; and day three was dedicated to getting the attendees’ deep learning networks deployed at scale in a simulation.
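DeepHyper’s own API is beyond the scope of this article, but the general shape of a hyperparameter search is easy to illustrate. The generic random-search sketch below stands in for what a tool like DeepHyper automates at scale; `train_and_score` is a hypothetical user-supplied function that trains a model with a given configuration and returns a validation score.

```python
# Generic random-search sketch (not DeepHyper's actual API): sample
# configurations, score each with a hypothetical user-supplied
# train_and_score(cfg) function, and keep the best.
import random

def random_search(train_and_score, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -2),             # log-uniform
            "batch_size": rng.choice([32, 64, 128, 256]),
            "hidden_units": rng.randint(16, 512),
        }
        score = train_and_score(cfg)  # e.g., validation accuracy
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Model-based tools such as DeepHyper replace the blind sampling here with a surrogate model of the objective, so later trials concentrate on promising regions of the search space, and they evaluate many trials in parallel across nodes.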

“The first day covered what you need to scale up a deep learning problem from your laptop to ALCF resources,” Childers said. “The next two days were successive advancements, focused on improving performance and ultimately on how to use the trained model in your research.”

The virtual format has the added benefit of making events more accessible to a larger base of participants. While a majority of attendees were from U.S. institutions, the workshop also welcomed international participants from England, Argentina, and Ghana.

“Virtual events have always meant an open chance to participate in events that I would never have been able to otherwise,” said Dario Dematties, a postdoctoral researcher at the CONICET Mendoza Technological Scientific Center in Argentina.

Attendees take their research to the next level

For Dematties, the workshop presented an opportunity to advance his work involving contrastive learning for visual representations, an approach that utilizes machine learning to identify similarities and differences in images. He was conducting his research with a Director’s Discretionary allocation on Cooley, the ALCF’s visualization and analysis cluster, but was having trouble getting his workflow to run on several nodes. Working with ALCF staff at the workshop, Dematties transitioned his work to the larger ThetaGPU system.

“After attending the first section on distributed deep learning, I decided to approach some experts. We didn’t succeed on the first try, but we did on the last day of the workshop,” Dematties said. “I had never run a PyTorch machine learning workflow on so many distributed GPUs. That was amazing. Thanks to this event, I now have an approved allocation for running my project on ThetaGPU.”

Maruti Mudunuru, an Earth scientist at DOE’s Pacific Northwest National Laboratory, attended the workshop to learn how distributed deep learning and hyperparameter optimization can be used to enhance watershed modeling.

“The focus of my research at the ALCF is to advance watershed modeling at the system-scale using machine learning and develop reduced-order models of river corridor processes,” Mudunuru said. “My goal is to test a proof of concept with my discretionary allocation. If my outcomes are successful, I plan to submit a proposal for the ALCF Data Science Program next year.”

Improving the performance of ongoing projects

In addition to helping new users scale up their research for ALCF systems, the workshop is also useful for existing facility users looking to employ new tools and techniques that can accelerate their research.

Argonne’s Ming Du, for example, attended the workshop to learn how machine learning frameworks can advance his project aimed at developing an accurate and efficient HPC framework for solving the dense 3D reconstruction problem in X-ray microscopy. The research began as part of an ADSP project and is now being pursued under a project awarded by DOE’s Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge.

During the hands-on sessions, Du worked with ALCF staff members to perform a scaling test for distributed training using the PyTorch DistributedDataParallel (DDP) module.

“This experience has pointed us to a clear pathway for improving the scaling performance of our framework in the future,” Du said. “We are planning to employ the more efficient DDP, which is expected to reduce the communication overhead of our application by a huge factor.”
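The article does not include Du’s code, but the core DDP pattern is compact. The sketch below is a minimal illustration assuming a typical one-process-per-GPU launch (e.g., via torchrun), with a toy linear model standing in for the real network.

```python
# Minimal PyTorch DistributedDataParallel (DDP) sketch: one process per
# GPU; MASTER_ADDR/PORT, RANK, WORLD_SIZE, and LOCAL_RANK are assumed
# to be set by the job launcher (e.g., torchrun).
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Toy stand-in for a real network; DDP all-reduces its gradients.
    model = DDP(torch.nn.Linear(32, 1).to(device), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(64, 32, device=device)
    y = torch.randn(64, 1, device=device)

    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()  # gradient all-reduce is overlapped with backward
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```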

Du also benefited from the workshop session focused on coupling simulation in C++ with deep learning in Python. He plans to use the knowledge he picked up at the event to train a deep neural network surrogate model that will perform wave propagation simulations more quickly and efficiently than the team’s current method.
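The article does not specify which coupling mechanism the session taught; one common pattern, sketched below under that assumption, is to export a Python-trained PyTorch model with TorchScript so that C++ simulation code can load it through libtorch without embedding a Python interpreter. The surrogate architecture here is a placeholder, not the model from Du’s project.

```python
# Hypothetical surrogate model exported via TorchScript so C++ code can
# load it with libtorch (torch::jit::load). The architecture is a
# placeholder, not the actual model from the project described above.
import torch

class Surrogate(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, 64),
            torch.nn.Tanh(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

scripted = torch.jit.script(Surrogate().eval())
scripted.save("surrogate.pt")
# C++ side (libtorch): auto module = torch::jit::load("surrogate.pt");
```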

The opportunity to use ThetaGPU at the workshop was another perk that will help Du and his colleagues prepare their research for the ALCF’s next-generation systems.

“This was the first time I got a chance to run applications on this machine, and that experience will be extremely helpful for us to optimize our framework for both ThetaGPU and the upcoming Aurora system, which will also be a GPU-accelerated machine,” Du said.

Registration is now open for the 2021 ALCF Computational Performance Workshop, which will be held virtually May 4-6, 2021. For additional training opportunities, visit the ALCF events webpage.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

About the Advanced Photon Source

The U.S. Department of Energy Office of Science’s Advanced Photon Source (APS) at Argonne National Laboratory is one of the world’s most productive X-ray light source facilities. The APS provides high-brightness X-ray beams to a diverse community of researchers in materials science, chemistry, condensed matter physics, the life and environmental sciences, and applied research. These X-rays are ideally suited for explorations of materials and biological structures; elemental distribution; chemical, magnetic, and electronic states; and a wide range of technologically important engineering systems from batteries to fuel injector sprays, all of which are the foundations of our nation’s economic, technological, and physical well-being. Each year, more than 5,000 researchers use the APS to produce over 2,000 publications detailing impactful discoveries, and solve more vital biological protein structures than users of any other X-ray light source research facility. APS scientists and engineers innovate technology that is at the heart of advancing accelerator and light-source operations. This includes the insertion devices that produce extreme-brightness X-rays prized by researchers, lenses that focus the X-rays down to a few nanometers, instrumentation that maximizes the way the X-rays interact with samples being studied, and software that gathers and manages the massive quantity of data resulting from discovery research at the APS.

This research used resources of the Advanced Photon Source, a U.S. DOE Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.

About Argonne National Laboratory 

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state, and municipal agencies to help them solve their specific problems, advance America’s scientific leadership, and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

About the U.S. Department of Energy’s Office of Science

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.


Source: Argonne National Laboratory
