Imagine a Beowulf Cluster of SuperNODEs …
(They did)

By Doug Eadline

December 6, 2023

Clustering resources for faster performance is not new. In the early days of clustering, the Beowulf project demonstrated that high performance was achievable from commodity hardware. These days, the “Beowulf cluster meme” gets used every time some new technology is deployed. For instance, “Imagine a Beowulf cluster of Frontier systems.” The joke is funny enough, but it lands a little closer to reality with the recent announcement by GigaIO and TensorWave.

GigaIO introduced the first 32-GPU single-node supercomputer, the SuperNODE™, in June of this year. The SuperNODE won two coveted HPCwire Editors’ Choice Awards at Supercomputing 2023 in Denver last month: Best AI Product or Technology and Top 5 New Products or Technologies to Watch. HPCwire has reported on the performance of the 32-GPU GigaIO SuperNODE and the 64-GPU SuperDuperNODE. Now, it seems the “imagine a Beowulf cluster of these” idea has been taken to heart by GigaIO and TensorWave.

Today, GigaIO announced the most significant order yet for its flagship SuperNODE™, which will eventually utilize tens of thousands of the AMD Instinct MI300X accelerators that are also launching today at the AMD “Advancing AI” event. GigaIO’s novel infrastructure will form the backbone of a bare-metal specialized AI cloud, code-named “TensorNODE,” to be built by cloud provider TensorWave to supply access to AMD data center GPUs, especially for use in LLMs.

As stated in an interview with GigaIO’s Chief Technical Officer, Global Sales, Matt Demas, “We took our SuperNODE and created a large cluster for TensorWave.” Each SuperNODE has two servers for redundancy and has access to all GPU memory across the TensorNODE. There is also a large amount of scratch disk available on each TensorNODE.

The TensorNODE deployment will extend the GigaIO SuperNODE architecture to a far grander scale, leveraging GigaIO’s PCIe Gen-5 memory fabric to provide a more straightforward workload setup and deployment than is possible with legacy networks while eliminating the associated performance tax.

TensorWave will use GigaIO’s FabreX to create the first petabyte-scale GPU memory pool without the performance impact of non-memory-centric networks. The first installment of TensorNODE is expected to be operational starting in early 2024 with an architecture that will support up to 5,760 GPUs across a single FabreX memory fabric domain. Extremely large models will be possible because all GPUs will have access to all other GPUs’ VRAM within the domain. Workloads can access more than a petabyte of VRAM in a single job from any node, enabling even the largest jobs to be completed in record time. Throughout 2024, multiple TensorNODEs will be deployed.

TensorNODE is an all-AMD solution featuring 4th Gen AMD CPUs and MI300X accelerators. The expected performance of the TensorNODE is made possible by the MI300X, which delivers 192GB of HBM3 memory per accelerator. The memory capacity of these accelerators, combined with GigaIO’s memory fabric, which allows for near-perfect scaling with little performance degradation, solves the challenge of underutilized or idle GPU cores due to distributed memory models.
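The “more than a petabyte” figure can be sanity-checked with simple arithmetic. The following is a minimal sketch that uses only the two numbers quoted in this article (up to 5,760 GPUs per FabreX domain and 192GB of HBM3 per MI300X); the function and constant names are illustrative, and decimal units (1 PB = 1,000,000 GB) are an assumption.

```python
# Back-of-the-envelope check of the pooled GPU memory figure.
# Both inputs come from the article; names and decimal units are assumptions.
GPUS_PER_DOMAIN = 5_760   # "up to 5,760 GPUs" in one FabreX memory fabric domain
HBM3_GB_PER_GPU = 192     # HBM3 capacity per MI300X accelerator, in GB


def pooled_memory_gb(gpus: int, gb_per_gpu: int) -> int:
    """Return the aggregate GPU memory of a domain, in GB."""
    return gpus * gb_per_gpu


total_gb = pooled_memory_gb(GPUS_PER_DOMAIN, HBM3_GB_PER_GPU)
total_pb = total_gb / 1_000_000  # decimal petabytes

print(f"{total_gb:,} GB is about {total_pb:.2f} PB")
```

At the full 5,760-GPU configuration, the pool works out to roughly 1.1 PB, which is consistent with the petabyte-scale claim.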

SuperNODE™ ResNet-50 Scaling (Source: GigaIO)

“TensorWave is excited to bring this innovative solution to market with GigaIO and AMD,” said Darrick Horton, CEO of TensorWave. “We selected the GigaIO platform because of its superior capabilities, in addition to GigaIO’s alignment with our values and commitment to open standards. We’re leveraging this novel infrastructure to support large-scale AI workloads, and we are proud to be collaborating with AMD as one of the first cloud providers to deploy the MI300X accelerator solutions.”

The composable nature of GigaIO’s dynamic infrastructure provides TensorWave with unique flexibility and agility over standard static infrastructure; as LLMs and the needs of AI users evolve, the infrastructure can be tuned on the fly to meet current and future needs. Additionally, TensorWave’s cloud will be greener than alternatives by eliminating GPU server hosts (often 4-8 GPUs per server) and associated networking equipment, saving cost, complexity, space, water, and power.
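To put the host-elimination claim in rough numbers, here is a hedged sketch that counts servers for a 5,760-GPU domain. The conventional case uses the 4-8 GPUs per server range mentioned above. The SuperNODE case assumes 32 GPUs and two servers per SuperNODE, figures quoted elsewhere in this article for the SuperNODE itself; the actual TensorNODE build-out may differ, and the function name is illustrative.

```python
import math

TOTAL_GPUS = 5_760  # upper bound for one FabreX domain, from the article


def servers_needed(total_gpus: int, gpus_per_unit: int, servers_per_unit: int = 1) -> int:
    """Count the servers required when GPUs are grouped into fixed-size units."""
    units = math.ceil(total_gpus / gpus_per_unit)
    return units * servers_per_unit


# Conventional GPU servers: one host per 4 or 8 GPUs.
hosts_at_8 = servers_needed(TOTAL_GPUS, 8)
hosts_at_4 = servers_needed(TOTAL_GPUS, 4)

# Assumption: 32 GPUs and two servers per SuperNODE.
supernode_servers = servers_needed(TOTAL_GPUS, 32, servers_per_unit=2)

print(f"Conventional hosts: {hosts_at_8} to {hosts_at_4}")
print(f"SuperNODE servers (assumed layout): {supernode_servers}")
```

Under these assumptions, the shared-fabric layout needs about half as many servers as 8-GPU hosts and a quarter as many as 4-GPU hosts, before counting any reduction in networking equipment.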

GigaIO SuperNODE™ offers a shared PCIe fabric that eliminates networked GPU server hosts (Source: GigaIO)

“We are thrilled to power TensorWave’s infrastructure at scale by combining the power of the revolutionary AMD Instinct MI300X accelerators with GigaIO’s AI infrastructure architecture, including our unique memory fabric, FabreX. This deployment validates our pioneering approach to reimagining data center infrastructure,” said Alan Benjamin, CEO of GigaIO. “The TensorWave team brings a visionary approach to cloud computing and a deep expertise in standing up and deploying very sophisticated accelerated data centers.”

Given the appetite for memory by GenAI models, the significant memory size and bandwidth offered by GigaIO and AMD should make the TensorWave TensorNODE attractive to many customers who are building and offering AI solutions in the cloud.

HPCwire