Steve Scott Lays Out HPE-Cray Blended Product Roadmap

By Tiffany Trader

March 11, 2020

Last week, the day before the El Capitan processor disclosures were made at HPE’s new headquarters in San Jose, Steve Scott (CTO for HPC & AI at HPE, and former Cray CTO) was on hand at the Rice Oil & Gas HPC conference in Houston. He was there to discuss the HPE-Cray transition and blended roadmap, as well as his favorite topic, Cray’s eighth-gen networking technology, Slingshot.

HPE announced its intention to acquire Cray last May (2019) for $1.3 billion. At the time, Cray’s Shasta architecture had been selected for two U.S. exascale contracts — Aurora at Argonne and Frontier at Oak Ridge — and it would soon secure the third and final outstanding Department of Energy exascale contract, El Capitan at Livermore. (Note, as we reported last August, the DOE is not seeking a second exascale system for Argonne under the CORAL-2 procurement project.)

With excitement building around its exascale wins, Cray’s profile has only grown stronger as the company was brought into HPE, bucking the trend that we see in many M&As. As Scott said at the Rice event, “Cray went away as an entity, but the Cray systems and Cray brand will definitely persist.”

The companies were fully merged by Jan. 1, 2020.

The way Scott tells it, the transition went smoothly. “The transaction closed in September of 2019, and it literally took one month for organizations to be fully combined. We’re not running as kind of a little subsidiary off the side,” he said.

Pete Ungaro (SVP & GM, HPC & AI at HPE, formerly Cray CEO) runs the combined HPC and AI organization inside HPE. The group has also pulled in additional HPE units: Mission Critical (includes Superdome Flex), the Edge line, and the Moonshot division all report to Ungaro. “Within one month, we had a combined blended leadership team, and we have one organization working as one team,” said Scott.

Consolidation of the product roadmaps was completed within two months, in time for SC19, according to Scott. “Within the first month and a half, we had pretty much fully blended our storage and compute roadmaps. It was relatively easy to do because they were somewhat complementary. There were some things that we were doing in both companies, and we chose one,” he said.

Liquid-cooled infrastructure was one of the overlapping technologies, and that choice fell to Cray’s design. Some of the air-cooled commodity cabinets that Cray was developing will be replaced by the HPE Apollo. The storage roadmaps were merged together under ClusterStor.

The HPE Apollo line, composed of standard 19″ rack systems, has been brought into the Shasta fold, along with the Cray-developed dense, scale-optimized cabinets, now known as Olympus.

“[Pre-Shasta] from a hardware perspective on the XC systems, you had one liquid-cooled cabinet and then if you wanted to have I/O nodes, you had to have them designed into this customized cabinet,” said Scott. “And if you wanted other stuff that didn’t fit in the customized cabinet, you basically had to buy a separate system and hook it up to the side through the I/O subsystem and share it. That all goes away with Shasta. There’s two physical infrastructures: the Apollo infrastructure coming from HPE, and the Olympus infrastructure, which is the dense-liquid cooled infrastructure. You can kind of get anything under the sun to put into the Apollo infrastructure. But then we do a smaller number of custom blades, high performance blades, that are really intended to optimize for the key computational technologies that go into the Olympus infrastructure and that has very high density and direct warm water cooling.”

Of the combined Shasta lineup, Scott said, “It’s the same interconnect that spans them and it’s the same software environment. It’s literally just a physical infrastructure choice. It makes no difference in terms of the look and feel of the system, in terms of the performance of the software and the user experience–it’s literally just a packaging choice.”

In a follow-up email exchange, Scott clarified that HPE still offers the CS line, the commodity Cray line with Appro roots. Recall that the Fujitsu-Cray A64FX boards are supported by the CS500 architecture, a decision that was driven by time-to-market. Long term, HPE expects the CS line to be supplanted by the Apollo line, which will also provide the expanded physical infrastructure for the Shasta supercomputers. Further, HPE is still selling Cray XC systems even as it introduces the new Shasta systems.

With the Olympus architecture, Scott notes, every cabinet has a stack of four chassis on the left and another four on the right. Each chassis holds eight compute blades that plug into the front, for up to 64 compute blades in a single cabinet. From the back, one to eight network cards can be plugged into each chassis, which makes the amount of network bandwidth configurable.

There is direct liquid cooling to all the components, including the processors, the memories and the optical transceivers for the active optical cables. One cabinet supports up to 512 high-wattage GPUs plus 128 CPUs and up to 64 network switches. Currently, Olympus cabinets are shipping at up to 250 kilowatts; that will soon go to 300 kilowatts and then 400 kilowatts in a single cabinet, said Scott.
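The cabinet figures Scott quotes are consistent with one another, which a few lines of arithmetic make easy to check. The Python sketch below uses only the numbers given above. The per-blade values are an assumption: they are simple averages over a fully populated cabinet, not a published blade specification, and the power figure ignores whatever share goes to switches and cooling.

```python
# Olympus cabinet arithmetic, using only figures quoted in the article.
STACKS_PER_CABINET = 2            # one stack on the left, one on the right
CHASSIS_PER_STACK = 4
BLADES_PER_CHASSIS = 8
MAX_NETWORK_CARDS_PER_CHASSIS = 8
MAX_GPUS_PER_CABINET = 512
MAX_CPUS_PER_CABINET = 128


def cabinet_summary(cabinet_kw: float = 250.0) -> dict:
    """Derive per-cabinet totals and (assumed) per-blade averages."""
    chassis = STACKS_PER_CABINET * CHASSIS_PER_STACK
    blades = chassis * BLADES_PER_CHASSIS
    switches = chassis * MAX_NETWORK_CARDS_PER_CHASSIS
    return {
        "chassis": chassis,
        "blades": blades,
        "switches": switches,
        # Averages over a full cabinet: an illustrative assumption,
        # not a blade specification from HPE.
        "gpus_per_blade": MAX_GPUS_PER_CABINET // blades,
        "cpus_per_blade": MAX_CPUS_PER_CABINET // blades,
        # Rough upper bound: treats all cabinet power as blade power.
        "kw_per_blade_upper_bound": cabinet_kw / blades,
    }


if __name__ == "__main__":
    for kw in (250.0, 300.0, 400.0):
        print(kw, cabinet_summary(kw))
```

Eight chassis of eight blades give the 64 blades cited, and eight network cards per chassis give the 64 switches. Spread evenly, 512 GPUs and 128 CPUs work out to eight GPUs and two CPUs per blade in the densest configuration.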

In discussing the reasons for moving to Shasta, Scott emphasized flexibility and upgradability. “The XC system that we’ve been shipping for a number of years is great in lots of ways but it’s not flexible; you can have any number of nodes you want for your blade as long as it’s four; you can have any amount of network bandwidth per node as long as it’s one PCI Gen3 x16 and the node can be any size you want, as long as it’s that, right? It’s a very inflexible design. And we hit limits there especially as we started looking at hotter and hotter processors, and we just couldn’t accommodate in that system design. So Shasta is designed to have a wide diversity of processors, all shapes and sizes, and particularly, of increasingly higher and higher wattages.”

Cray’s first wins with Shasta reflect this diversity at the node. There’s Aurora at Argonne (two Intel Xeon CPUs and six of the Xe GPUs, connected by Compute Express Link [CXL]); Perlmutter at NERSC (two AMD Epyc Milan CPUs and four Nvidia Volta-Next GPUs connected by PCIe Gen4); Frontier at Oak Ridge (one custom AMD Epyc CPU plus four Radeon GPUs connected by an enhanced Infinity Fabric); and El Capitan at Livermore (AMD’s ‘Genoa’ Zen4 Epyc CPU plus Radeon Instinct GPUs in a one-to-four ratio, connected by AMD’s 3rd Gen Infinity Fabric).

While all of these are heterogeneous systems targeting “big flops,” Scott also discussed the CPU-based compute blades going into NERSC Perlmutter, each encompassing four dual-socket AMD Epyc “Milan” nodes.

Cray also supports Marvell ThunderX2 Arm processors in its XC50 architecture, as in the recent refresh win (Isambard2) for GW4/UK Met, and supports the Fujitsu A64FX chips in its CS500 system, as already mentioned.

Scott, who has led the design of several generations of supercomputing interconnects, made sure to save time in his presentation to discuss Cray’s new Slingshot network, a pillar of the Shasta architecture and the interconnect on the three DOE exascale systems that are underway. Scott highlighted Slingshot’s pivot to Ethernet, its feeds and speeds, quality of service and congestion control features.

“We have decided to stop building proprietary networks and instead adopt Ethernet. The world is going to Ethernet, so we decided to stop fighting them and instead join them, but instead of just using a commodity Ethernet, we are redesigning Ethernet and bringing HPC to Ethernet. So we have standard Ethernet connectivity at the edges; we can talk to standard NICs; you can talk to other datacenter switches, etc. But inside it’s…a state of the art HPC fabric,” said Scott.

“Our Rosetta switch is 64 ports times 200 gigabits per second,” Scott continued. “This allows you to build really big systems–like those exascale systems–with a network diameter of just three switch to switch hops. The diameter is three whether it’s two cabinets or 20 cabinets or 200 cabinets, which is the neat thing about this Dragonfly topology and having high-radix switches…. It also turns out that we can build a system the size of Frontier, a multi-exaflop system where 90 percent of the cables in the system are short, cheap, reliable electrical copper cables, and only 10 percent of them have to be optics.”
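Scott's point about high-radix switches can be illustrated with the textbook "balanced" dragonfly sizing rule. The Python sketch below is a minimal illustration under stated assumptions: it splits a switch's ports as a = 2p = 2h (p endpoints per switch, a switches per group, h global links per switch), which is the balance commonly used in the dragonfly literature. The article does not say how Slingshot actually allocates Rosetta's 64 ports, so the function name and the resulting counts are hypothetical, not a description of any shipping system.

```python
PORTS = 64          # Rosetta radix, from the article
PORT_GBPS = 200     # per-port speed, from the article
DIAMETER_HOPS = 3   # local hop, global hop, local hop on a minimal route


def balanced_dragonfly(radix: int) -> dict:
    """Size a dragonfly using the textbook balance a = 2p = 2h (assumed)."""
    p = radix // 4              # endpoint ports per switch
    h = radix // 4              # global (group-to-group) ports per switch
    a = 2 * p                   # switches per group, all-to-all connected
    ports_used = p + (a - 1) + h
    if ports_used > radix:
        raise ValueError("port split does not fit the switch radix")
    groups = a * h + 1          # each group has one link to every other group
    return {
        "endpoints_per_switch": p,
        "switches_per_group": a,
        "global_links_per_switch": h,
        "ports_used": ports_used,
        "groups": groups,
        "switches": a * groups,
        "endpoints": p * a * groups,
    }


if __name__ == "__main__":
    print("switch bandwidth (Tb/s per direction):", PORTS * PORT_GBPS / 1000)
    print(balanced_dragonfly(PORTS))
```

Under these assumptions a 64-port switch supports 513 groups of 32 switches and a little over 260,000 endpoints, while any minimal route still crosses at most three switch-to-switch links. That is why the diameter stays at three as the cabinet count grows.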

“[With Shasta], the way that the network blades and the compute blades are put in the system also gives you the flexibility to put a second generation network, and a third generation network. We will absolutely be going to optics everywhere within a couple of generations without having to worry about redesigning the electrical backplane and that sort of thing. So it gives us a lot of flexibility on the interconnect side as well,” he said.

Watch Scott’s full presentation from the 2020 Rice Oil and Gas conference below for additional details on the Slingshot network, the E1000 ClusterStor system, and the Cray software environment.
