2 New NSF-Funded Systems at PSC to Scale HPC for Data Science, AI

November 25, 2020

Nov. 25, 2020 — The Oct. 20, 2020, XSEDE ECSS Symposium featured overviews of two new NSF-funded HPC systems at the Pittsburgh Supercomputing Center (PSC). The new resources, called Bridges-2 and Neocortex, will continue the center’s exploration of scaling HPC for data science and AI on behalf of new communities and research paradigms. Both systems are currently preparing their early user programs. The two systems will be available at no cost for research and education, and at cost-recovery rates for other purposes.

Bridges-2: Scaling Deep Learning and Data Science for Expanding Applications

“One of the motivations for us to build Bridges-2 was rapidly evolving science and engineering,” said Shawn Brown, PSC’s director and PI for that system, in introducing that $20-million, XSEDE-allocated HPC platform, integrated by HPE. “The landscape of high performance computing and computational research has changed drastically over the last decade; we really wanted to build a machine that supported the new ways of doing computational science and not necessarily only traditional computational science,” especially in the areas of artificial intelligence and complex data science.

Bridges-2’s predecessor, Bridges, broke new ground in easing entry to heterogeneous HPC for research communities that had never before required computing, let alone supercomputing. Bridges-2 will continue this mission and add expanded capabilities for fields such as scalable HPC-powered AI; data-centric computing, both in fields that require massive datasets and in those that work with many small datasets; and research via popular cloud-based applications, containers and user-focused platforms.

“We’re not just going to be supporting the command line, we want to be able to support all sorts of modes of computation to make this as applicable to [new] communities as possible,” Brown said. “We [want to] remove barriers to people using high performance computing for their research rather than us training them to do it the way that we do things—we want to … enable them to do their research in their own particular idiom.”

Like Bridges, Bridges-2 will offer a heterogeneous system designed to allow complex workflows to leverage different types of computational nodes with speed and efficiency. Its configuration (tallied roughly in the sketch after the list) will include:

  • 488 256-GB-RAM regular-memory (RM) nodes and 16 512-GB-RAM large-memory (LM) nodes, featuring two AMD EPYC “Rome” 7742 CPUs each
  • Four 4-TB extreme-memory (EM) nodes with four Intel Xeon Platinum 8260M “Cascade Lake” CPUs
  • 24 GPU nodes with eight NVIDIA Tesla V100-32 GB SXM2 GPUs, two Intel Xeon Gold “Cascade Lake” CPUs and 512 GB RAM
  • A Mellanox ConnectX-6 HDR InfiniBand 200 Gb/s interconnect
  • An efficient tiered storage system including a flash array with greater than 100 TB usable storage; a Lustre file system with 21 PB raw storage; and an HPE StoreEver MSL6480 Tape Library with 7.2 PB uncompressed, ~8.6 PB compressed space
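
For a rough sense of scale, the node counts and per-node figures above can be tallied directly. The minimal Python sketch below does just that; the numbers come from the list, and the binary convention of 1 TB = 1,024 GB is an assumption of the sketch.

```python
# Back-of-the-envelope tally of the Bridges-2 node specifications listed
# above. Node counts and per-node figures come from the list; the binary
# convention 1 TB = 1,024 GB is an assumption of this sketch.
TB = 1024  # GB per TB

ram_gb = (
    488 * 256       # regular-memory (RM) nodes
    + 16 * 512      # large-memory (LM) nodes
    + 4 * 4 * TB    # extreme-memory (EM) nodes, 4 TB each
    + 24 * 512      # GPU nodes
)
gpus = 24 * 8                 # eight V100-32GB GPUs per GPU node
gpu_ram_gb = gpus * 32        # GPU memory across all GPU nodes

print(f"Aggregate CPU RAM: {ram_gb / TB:.1f} TB")      # ~158.0 TB
print(f"Total V100 GPUs:   {gpus}")                    # 192
print(f"Aggregate GPU RAM: {gpu_ram_gb / TB:.1f} TB")  # 6.0 TB
```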

“We want Bridges-2 … to work interoperably with all sorts of different [computational resources], including workflow engines, heterogeneous computing, cloud resources,” Brown said. “We want this thing to be a member of the ecosystem—not just a standalone machine, but really a resource that’s widely available and applicable to a number of different rapidly evolving research paradigms.”

PSC will integrate Bridges-2 with its existing Bridges-AI system, featuring an NVIDIA DGX-2 enterprise AI research system that tightly couples 16 NVIDIA Tesla V100 (Volta) GPUs, each with 32 GB of GPU memory.

Brown encouraged researchers to take advantage of Bridges-2’s Early User Program, which is now accepting proposals and is scheduled to begin early in 2021. This program will allow users to port, tune and optimize their applications early, and make progress on their research, while providing PSC with feedback on the system and how it can be better tuned to users’ needs. Information on applying as well as program updates can be found at https://psc.edu/bridges-2/eup-apply.

Updates on the system in general can be found at http://www.psc.edu/resources/computing/bridges-2.

Neocortex: Democratizing Access to Game-Changing Compute Power in Deep Learning

Sergiu Sanielevici, Neocortex’s co-PI and director of user support for scientific applications at PSC, introduced the $5-million Cerebras Systems/HPE system on behalf of PI Paola Buitrago, director of artificial intelligence & data science at PSC. Neocortex was funded through the NSF’s new Category II awards, which support systems intended to explore innovative HPC architectures. Neocortex will feature two Cerebras CS-1 systems and an HPE Superdome Flex HPC server robustly provisioned to drive the CS-1 systems simultaneously at maximum speed and to support the complementary requirements of AI and high-performance data analytics workflows.

“Neocortex is specifically designed for AI training—to explore how [the CS-1s] can be used, how that can be integrated into research workflows,” Sanielevici said. “We want to get to this ecosystem that [spans] from what Bridges-2 can do … to the things that really require this specialized hardware that our partners at Cerebras provide.”

The CS-1 is built around a new generation of “wafer-scale” engine, the largest chip ever made: a 46,225-square-millimeter processor featuring 1.2 trillion transistors. It is designed to accelerate training, the critical and most time-consuming stage of deep learning.

“Machine-learning workflows are of course not simple,” Sanielevici said. “Training is not a linear process … it’s a highly iterative process with lots of parameters. The goal here is to vastly shorten the time required for deep learning training and in the larger ecosystem foster integration of deep learning with scientific workflows—to really see what this revolutionary hardware can do.”

The CS-1 fabric connects cluster-scale compute in a single system to eliminate communication bottlenecks and make model-parallel training easy, he added. Without orchestration or synchronization headaches, the system offers a profound advantage for machine learning training with small batches at high utilization, obviating the need for tricky learning schedules and optimizers.
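
To make concrete what “highly iterative” means in practice, the sketch below shows a generic small-batch training loop in PyTorch. It is an ordinary framework example, not Cerebras’s programming interface; the model, data and hyperparameters are placeholders. The point is the nested loops: training repeats this gradient step an enormous number of times, and that repetition is the time the CS-1 is designed to compress.

```python
# A generic small-batch deep-learning training loop in PyTorch, illustrating
# the highly iterative process described above. This is an ordinary framework
# example, not the Cerebras programming interface; the model, data and
# hyperparameters are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Synthetic stand-in data: 10,000 samples consumed in batches of 32.
features = torch.randn(10_000, 128)
labels = torch.randint(0, 10, (10_000,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

for epoch in range(10):          # many passes over the data ...
    for x, y in loader:          # ... each made of many small steps
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()          # every step recomputes gradients
        optimizer.step()         # and updates every parameter
```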

A major design innovation was to connect the two CS-1 servers via an HPE Superdome Flex system. The combination is expected to provide substantial capability for preprocessing and other complementary aspects of AI workflows, enabling training on very large datasets with exceptional ease and supporting both CS-1s independently and together to explore scaling.
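
The division of labor described here—a large shared-memory front end preparing data and streaming it to one or both accelerators—follows a familiar producer/consumer pattern. The Python sketch below illustrates that pattern in the abstract; it is not PSC’s actual software stack, and the queue sizes, transform and batch contents are placeholders.

```python
# Abstract sketch of the host/accelerator division of labor described above:
# a preprocessing stage on the front-end server streams prepared batches to
# one or more accelerator consumers. Illustrative pattern only, not PSC's
# actual software stack; the transform and data are placeholders.
import queue
import threading

def preprocess(raw_batches, out_queues):
    """Host-side stage: transform each batch, then fan out to consumers."""
    for batch in raw_batches:
        prepared = [x * 2.0 for x in batch]   # placeholder transform
        for q in out_queues:                  # feed both accelerators
            q.put(prepared)
    for q in out_queues:
        q.put(None)                           # sentinel: no more work

def accelerator(name, in_queue):
    """Stand-in for a CS-1 consuming prepared batches as they arrive."""
    while (batch := in_queue.get()) is not None:
        print(f"{name}: training step on batch of {len(batch)} samples")

raw = ([float(i), float(i + 1)] for i in range(4))   # placeholder raw data
queues = [queue.Queue(maxsize=8) for _ in range(2)]  # one queue per CS-1
workers = [threading.Thread(target=accelerator, args=(f"cs1-{i}", q))
           for i, q in enumerate(queues)]
for w in workers:
    w.start()
preprocess(raw, queues)
for w in workers:
    w.join()
```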

Neocortex accepted early user proposals from August through September 2020; 42 applications are currently being assessed. Proposals represent research areas including AI theory, bioinformatics, neurophysiology, materials science, electrical and computer engineering, medical imaging, geophysics, civil engineering, IoT, social science, drug discovery, fluid dynamics, ecology and chemistry. Information about the system and its progress can be found at https://www.cmu.edu/psc/aibd/neocortex/.

You can find a video and slides for both presentations at https://www.xsede.org/for-users/ecss/ecss-symposium.


Source: XSEDE
