2 New NSF-Funded Systems at PSC to Scale HPC for Data Science, AI

November 25, 2020

Nov. 25, 2020 — The Oct. 20, 2020, XSEDE ECSS Symposium featured overviews of two new NSF-funded HPC systems at the Pittsburgh Supercomputing Center (PSC). The new resources, called Bridges-2 and Neocortex, will continue the center’s exploration of scaling HPC for data science and AI on behalf of new communities and research paradigms. Both systems are currently preparing their early user programs. The two systems will be available at no cost for research and education, and at cost-recovery rates for other purposes.

Bridges-2 at PSC will continue Bridges’ mission to ease entry to heterogeneous HPC for new research communities by enabling rapidly evolving research such as scalable HPC-powered AI; data-centric computing, both in fields that rely on massive datasets and in those that work with many small datasets; and research via popular cloud-based applications, containers and user-focused platforms.

Bridges-2: Scaling Deep Learning and Data Science for Expanding Applications

“One of the motivations for us to build Bridges-2 was rapidly evolving science and engineering,” said Shawn Brown, PSC’s director and PI for the system, in introducing the $20-million, XSEDE-allocated HPC platform, integrated by HPE. “The landscape of high performance computing and computational research has changed drastically over the last decade; we really wanted to build a machine that supported the new ways of doing computational science and not necessarily only traditional computational science,” especially in the areas of artificial intelligence and complex data science.

Bridges-2’s predecessor, Bridges, broke new ground in easing entry to heterogeneous HPC for research communities that had never before required advanced computing, let alone supercomputing. Bridges-2 will continue this mission and add expanded capabilities for fields such as scalable HPC-powered AI; data-centric computing, both in fields that rely on massive datasets and in those that work with many small datasets; and research via popular cloud-based applications, containers and user-focused platforms.

“We’re not just going to be supporting the command line, we want to be able to support all sorts of modes of computation to make this as applicable to [new] communities as possible,” Brown said. “We [want to] remove barriers to people using high performance computing for their research rather than us training them to do it the way that we do things—we want to … enable them to do their research in their own particular idiom.”

Like Bridges, Bridges-2 will offer a heterogeneous system designed to let complex workflows leverage different types of compute nodes with speed and efficiency (a rough tally of these figures appears after the list). This will include:

  • 488 256-GB-RAM regular-memory (RM) nodes and 16 512-GB-RAM large-memory (LM) nodes, featuring two AMD EPYC “Rome” 7742 CPUs each
  • Four 4-TB extreme-memory (EM) nodes with four Intel Xeon Platinum 8260M “Cascade Lake” CPUs
  • 24 GPU nodes with eight NVIDIA Tesla V100-32 GB SXM2 GPUs, two Intel Xeon Gold “Cascade Lake” CPUs and 512 GB RAM
  • A Mellanox ConnectX-6 HDR InfiniBand 200Gb/s interconnect
  • An efficient tiered storage system including a flash array with greater than 100 TB usable storage; a Lustre file system with 21 PB raw storage; and an HPE StoreEver MSL6480 Tape Library with 7.2 PB uncompressed, ~8.6 PB compressed space
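
For a rough sense of the aggregate capacity behind those figures, here is a minimal back-of-the-envelope sketch in Python that tallies the node counts, RAM and GPU memory listed above. It uses only the numbers from the list plus binary GB-to-TB conversion, and is illustrative rather than an official capacity statement.

```python
# Back-of-the-envelope tally of the Bridges-2 node types listed above.
# All figures come from the list; this is illustrative, not an official spec.

node_types = {
    # name: (node count, RAM per node in GB)
    "RM (regular memory)": (488, 256),
    "LM (large memory)":   (16, 512),
    "EM (extreme memory)": (4, 4096),
    "GPU":                 (24, 512),
}

gpu_nodes, gpus_per_node, gpu_mem_gb = 24, 8, 32

total_nodes = sum(count for count, _ in node_types.values())
total_ram_tb = sum(count * ram for count, ram in node_types.values()) / 1024
total_gpus = gpu_nodes * gpus_per_node
total_gpu_mem_tb = total_gpus * gpu_mem_gb / 1024

print(f"{total_nodes} nodes, ~{total_ram_tb:.0f} TB of CPU RAM")                 # 532 nodes, ~158 TB
print(f"{total_gpus} V100-32GB GPUs, ~{total_gpu_mem_tb:.0f} TB of GPU memory")  # 192 GPUs, ~6 TB
```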

“We want Bridges-2 … to work interoperably with all sorts of different [computational resources], including workflow engines, heterogeneous computing, cloud resources,” Brown said. “We want this thing to be a member of the ecosystem—not just a standalone machine, but really a resource that’s widely available and applicable to a number of different rapidly evolving research paradigms.”

PSC will be integrating Bridges-2 with its existing Bridges-AI system, which features an NVIDIA DGX-2 enterprise AI system that tightly couples 16 NVIDIA Tesla V100 (Volta) GPUs, each with 32 GB of GPU memory.

Brown encouraged researchers to take advantage of Bridges-2’s Early User Program, which is now accepting proposals and is scheduled to begin early in 2021. This program will allow users to port, tune and optimize their applications early, and make progress on their research, while providing PSC with feedback on the system and how it can be better tuned to users’ needs. Information on applying as well as program updates can be found at https://psc.edu/bridges-2/eup-apply.

Updates on the system in general can be found at http://www.psc.edu/resources/computing/bridges-2.

Neocortex: Democratizing Access to Game-Changing Compute Power in Deep Learning

Sergiu Sanielevici, Neocortex’s co-PI and director of user support for scientific applications at PSC, introduced the $5 million, Cerebras Systems/HPE system on behalf of PI Paola Buitrago, director of artificial intelligence & data science at PSC. Neocortex was funded through the NSF’s new Category II awards, which support systems intended to explore innovative HPC architectures. Neocortex will feature two Cerebras CS-1 systems and an HPE Superdome Flex HPC server robustly provisioned to drive the CS-1 systems simultaneously at maximum speed and to support the complementary requirements of AI and high performance data analytics workflows.

“Neocortex is specifically designed for AI training—to explore how [the CS-1s] can be used, how that can be integrated into research workflows,” Sanielevici said. “We want to get to this ecosystem that [spans] from what Bridges-2 can do … to the things that really require this specialized hardware that our partners at Cerebras provide.”

The CS-1 is built around a new generation of “wafer-scale” engine, the largest chip ever built: a 46,225-square-millimeter processor featuring 1.2 trillion transistors. Its guiding design principle is to accelerate training, shortening this critical and lengthy stage of deep learning.

“Machine-learning workflows are of course not simple,” Sanielevici said. “Training is not a linear process … it’s a highly iterative process with lots of parameters. The goal here is to vastly shorten the time required for deep learning training and, in the larger ecosystem, foster integration of deep learning with scientific workflows—to really see what this revolutionary hardware can do.”
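
As a concrete illustration of that iterative loop, and of the hyperparameters that must be tuned around it, here is a minimal, self-contained Python/NumPy sketch of gradient-descent training on a toy linear model. It is generic pedagogy, not Cerebras- or Neocortex-specific code, and every name and value in it is illustrative.

```python
import numpy as np

# Minimal illustration of why training is iterative: compute a loss, take its
# gradient, nudge the parameters, and repeat. Production deep-learning runs do
# this for millions of steps over huge models, which is the time wafer-scale
# hardware aims to cut. Toy linear model; all names and values are illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))               # toy dataset: 256 samples, 10 features
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=256)  # targets with a little noise

w = np.zeros(10)                             # parameters to learn
learning_rate = 0.05                         # one of many hyperparameters to tune
for step in range(500):                      # the iterative part
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)     # gradient of mean squared error
    w -= learning_rate * grad
    if step % 100 == 0:
        print(f"step {step:3d}  loss {np.mean((pred - y) ** 2):.4f}")
```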

The CS-1 fabric connects cluster-scale compute in a single system to eliminate communication bottlenecks and make model-parallel training easy, he added. Without orchestration or synchronization headaches, the system offers a profound advantage for machine learning training with small batches at high utilization, obviating the need for tricky learning schedules and optimizers.

A major design innovation was to connect the two CS-1 servers via an HPE Superdome Flex system. The combination is expected to provide substantial capability for preprocessing and other complementary aspects of AI workflows, enabling training on very large datasets with exceptional ease and supporting both CS-1s independently and together to explore scaling.
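
To make that division of labor concrete, here is a hedged Python sketch of a host-side stage that streams and normalizes a large dataset in chunks, alongside a separate routine that stands in for the accelerator-side training step. The function names and data handling are hypothetical and do not reflect PSC’s or Cerebras’s actual software interfaces.

```python
import numpy as np

def stream_preprocessed_batches(num_chunks=10, batch_size=1024, features=64):
    """Hypothetical host-side stage (the Superdome Flex's role in this sketch):
    read a large dataset chunk by chunk, normalize each chunk, and yield
    accelerator-sized batches. Random data stands in for reads from storage."""
    rng = np.random.default_rng(0)
    for _ in range(num_chunks):
        chunk = rng.normal(size=(batch_size, features)).astype(np.float32)
        chunk = (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-6)  # per-feature normalization
        yield chunk

def accelerator_train_step(batch):
    """Stand-in for the training step that would run on specialized hardware
    (e.g., a CS-1); returns a dummy loss so the sketch runs end to end."""
    return float(np.mean(batch ** 2))

for i, batch in enumerate(stream_preprocessed_batches()):
    print(f"batch {i}: loss {accelerator_train_step(batch):.3f}")
```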

Neocortex accepted early user proposals from August through September 2020; 42 applications are currently being assessed. Proposals represent research areas including AI theory, bioinformatics, neurophysiology, materials science, electrical and computer engineering, medical imaging, geophysics, civil engineering, IoT, social science, drug discovery, fluid dynamics, ecology and chemistry. Information about the system and its progress can be found at https://www.cmu.edu/psc/aibd/neocortex/.

You can find a video and slides for both presentations at https://www.xsede.org/for-users/ecss/ecss-symposium.


Source: XSEDE
