Blueprints to Shake up Next-generation Server Designs Emerge at OCP Summit

By Agam Shah

October 24, 2022

A new generation of designs ready to shake up conventional server architecture emerged at the recent Open Compute Project Summit, where Google, Facebook and Microsoft showed off new blueprints for high-performance computers.

The hardware demonstrated at the trade show, which was held in Santa Clara, California, showed that cloud providers were continuing to deprioritize CPUs, while putting more focus on networking, storage and accelerators like GPUs and AI chips. The OCP designs can be replicated and improved upon by server makers.

Grand Teton. Source: Meta.

The headliner was Meta’s server design called Grand Teton, which the company is deploying in datacenters to run artificial intelligence applications. Meta’s goal is to bring more AI capacity, which underpins many functions on its social media platforms, to its mega datacenters, and also to prepare for its metaverse future, said Alexis Bjorlin, vice president of engineering at Meta, in a blog entry.

OCP includes the who’s who of the server world – Meta, Google and others – with all the cool new hardware showing up here before it comes to standard racks from Dell, HPE and Lenovo, said Dylan Patel, founder of SemiAnalysis, a semiconductor research and consulting firm, who attended the show.

“When we’re talking about that hardware it was a lot of much higher power but also efficient. It might be high power because it is used in Facebook’s AI or it might be high power because it is a server that is packed very dense,” Patel said.

Patel also noted that many next-generation servers were shown with Intel’s next-generation Xeon server CPU, codenamed Sapphire Rapids, and AMD’s upcoming Genoa.

Bjorlin said last month that Meta plans to build mega clusters with over 4,000 accelerators by 2025. The cores, which will be organized as a mesh, will have a bandwidth of 1 terabyte per second among accelerators. Bjorlin detailed those plans during a speech at the AI Hardware Summit, but did not share hardware details. The company uses Nvidia GPUs extensively.

Meta’s fundamental approach to server designs includes stripping out unnecessary components, and shrinking hardware at the system and chip levels. The shrinkage of system and chip size would contribute to the creation of AI training clusters that would draw more power, but also deliver significantly more performance per watt.

Deep-learning models are growing to tens of trillions of parameters, and “can require a zettaflop of compute to train,” Bjorlin said in the Grand Teton announcement.

“AI and machine learning models are becoming increasingly powerful and sophisticated and need more high-performance infrastructure to match,” Bjorlin said.

Grand Teton is the successor to the Zion-EX scale-out system introduced in 2021. Grand Teton is significantly faster than its predecessor with four times more host-to-GPU bandwidth, and two times the compute capacity and throughput.

“Grand Teton also has an integrated chassis in contrast to Zion-EX, which comprises multiple independent subsystems,” Bjorlin said.

Microsoft showed off a modular system called Mt. Shasta, which is a chassis that can house accelerators for AI and high-performance computing. The module fits into high-performance servers through a 48-volt power feed. The module can be hot-swapped and can accommodate multiple accelerators. The system was designed with Molex and Quanta, and is compatible with OCP’s Open Rack V3 design, which opens up rack-level disaggregation for systems.

Microsoft Mt. Shasta modular architecture. Source: Microsoft.

The Mt. Shasta module solves common problems faced in implementing accelerators in datacenters, Microsoft said in a blog post. The accelerators can be implemented easily within a datacenter’s power, cooling and connectivity guidelines, and the module automates interfacing with software-based management for hardware control. A node-level hook makes the module hot-swappable, which can be difficult with the PCIe Gen 3.0 interface, which is old but still used on older servers.

Diversifying server hardware for accelerators has always been a priority, but there was a lot of excitement this year around CXL (Compute Express Link), which provides the hooks to easily add a range of accelerators, said Nathan Brookwood, principal analyst at Insight 64, who visited the show.

“Clearly, the guys who are deploying in the cloud – you’re looking at Google, Microsoft, and such – those guys know what they need. They would probably be the ones who would strip out more bells and whistles that HPE and Dell put into general purpose, enterprise-class products,” Brookwood said.

CXL is a critical building block that is set to change the way servers are designed, customized and configured. CXL allows for easier selection and assembly of building blocks of servers. The technology provides a communication link between computing, memory and storage systems, and includes tools to provision and manage computing across the server.

“CXL is moving rapidly toward acceptance, which is surprising, because the general-purpose processors that support it haven’t yet been released, including [Intel’s] Sapphire Rapids, and [AMD’s] Genoa,” Brookwood said.

While Meta’s Grand Teton was an integrated server, Google focused on a “multi-brained” server of the future, which consolidates storage, accelerators, memory and infrastructure processing units into separate trays. The modular hardware architecture is based on interconnects that include CXL and NVMe and distributed system management tools such as OpenBMC and Redfish.

The excitement around CXL was equally shared by smaller server makers, Brookwood said.

“As those come out, I think smaller server makers, especially in the cloud, are going to be looking at that,” Brookwood said.

Source: Wiwynn

IT infrastructure company Wiwynn, a subsidiary of the Taiwan-based Wistron Group, is focusing on the building blocks to tailor server designs. The company previously specialized in integrated server designs for OCP, but this year’s focus was on custom designs built to specific requirements.

Wiwynn’s building blocks include OCP-certified cooling, power, components, interconnects, NICs and security modules. The CXL interconnect was also in the design, sitting in the middle to facilitate communications between storage, memory and processing units.

The design is for a wide range of x86 server chips from Intel and AMD, and Arm server chips like Ampere’s CPUs. It also supports accelerators like Habana Gaudi AI processors from Intel.

The change in focus to building blocks came from clients, who are interested in building servers closer to their datacenter requirements, said Steven Hwang, executive director for sales enablement at Wiwynn, during a press briefing ahead of the OCP Summit.

Specifically, there was a lot of interest in power conversion components, Hwang said, adding, “a lot of datacenters are going green and the energy becomes very, very sensitive… so the power loss from DC to AC and AC back to DC is certainly something that people can benefit [from] immediately.”

Source: Wiwynn

At OCP, Google, Microsoft, Nvidia and AMD also partnered to create a specification, called Caliptra, that lets system makers embed security layers at the chip and system levels. The specification, which is in version 0.5, focuses on creating a root of trust in silicon.

“As a reusable open source, silicon-level block for integration into systems on a chip – such as CPUs, GPUs, and accelerators – Caliptra provides trustworthy and easily verifiable attestation,” said Mark Russinovich, Microsoft’s Azure CTO, in a blog entry.

The Caliptra spec includes a series of blocks to store and encrypt data, and to make sure only authorized parties get access to data in a secure enclave. It also ensures security of data so it is not exposed to hardware-based hacks like Spectre and Meltdown while on-premises or in the cloud. The cloud providers are interested in Caliptra to improve confidential computing offerings and to secure virtual machines.

 
