GTC21: Dell Building Cloud Native Supercomputers at U Cambridge and Durham

By John Russell

April 14, 2021

In conjunction with GTC21, Dell Technologies today announced new supercomputers at universities across DiRAC (Distributed Research utilizing Advanced Computing) in the UK with plans to explore use of Nvidia BlueField DPU technology.

The University of Cambridge will expand its Cambridge Service for Data Driven Discovery (CSD3) system featuring:

  • More than 400 PowerEdge C6520 servers with 3rd Gen Intel Xeon Scalable processors;
  • More than 80 PowerEdge XE8545 servers with 3rd Gen AMD Epyc processors and Nvidia A100 GPUs with NVLink;
  • And four petaflops of application performance to advance research across astrophysics, nuclear fusion power generation and clinical medicine applications.

Durham University’s COSMA8 supercomputer, currently in prototype, is set to fully deploy in October 2021 with:

  • More than 90 PowerEdge C6525 servers with 2nd and 3rd Gen AMD Epyc processors;
  • Direct Liquid Cooling and Nvidia Mellanox HDR InfiniBand networking;
  • And plans to expand COSMA8 to more than 600 compute nodes over the next year to deliver computational power and efficiency for research into dark energy and black holes.

(DiRAC supports a significant portion of the science program of the UK’s STFC, providing simulation and data modelling resources for the UK Frontier Science theory community in particle physics, astroparticle physics, astrophysics, cosmology, solar system and planetary science, and nuclear physics (PPAN; collectively STFC Frontier Science). DiRAC services are optimized for these research communities and operate as a single distributed facility which provides the range of architectures needed to deliver our world-leading science outcomes.)

Nvidia BlueField-2 DPU

“The Cambridge University system would easily rank in the top 70-to-75 of the Top500 list and is or will be the world’s first academic cloud native supercomputer,” said Gilad Shainer, senior vice president of marketing, Mellanox networking, Nvidia, in a pre-briefing. “The system will be used as part of the continuous development of the capabilities of cloud native supercomputing as part of a collaboration with OpenStack [in which] the university wants to bring in OpenStack and run it on the DPU.”

Shainer noted Cambridge University collaborates widely with medical institutes in the UK, and said that building a ‘cloud native supercomputing architecture’ would strengthen its security capabilities and “make it easier to be able to bring personal information or clinical information into supercomputers as part of doing analysis.”

The CSD3 employs a new cloud-native supercomputing platform enabled by Nvidia and a cloud HPC software stack, called Scientific OpenStack, developed by the University of Cambridge and StackHPC with funding from the DiRAC HPC Facility and the IRIS Facility.

The Durham University system is focused on cosmology and physics.

“COSMA 8 is aiming to model the entire universe, over time, from the big bang to today. It will allow humankind to continue advancing our understanding of where we came from and our place in the cosmos, using larger-scale simulations than ever before,” said Alastair Basden, technical manager for the DiRAC Memory Intensive Service at Durham University. “The massive scale of these simulations relies on the bandwidth only InfiniBand can deliver to make this research possible. It’s one example of how DiRAC and Durham University continue to advance the field of supercomputing through their ongoing collaboration with Nvidia.”

Nvidia posted a blog post on the Cambridge University system (by Gilad Shainer) and a brief description of cloud native supercomputing (by Rick Merritt), as well as the Durham announcement.

Basden also gave a talk at GTC (On the edge of Exascale, Nvidia Bluefield at Durham University) looking at Durham’s early work with BlueField-1. The COSMA8 prototype system entered service in October 2020 (full specs from Durham website), and the full COSMA8 system is currently being installed and is due to enter service in October 2021.

“Effectively, [COSMA] is part of the UK DiRAC tier-1 national facility. It started life in 2001 as COSMA1, and it’s now in its eighth generation as COSMA8,” said Basden in his talk. “We have a terabyte of RAM per node and a full HDR 200 non-blocking fabric with a fat tree topology. We also have direct liquid cooling on-chip, five petabyte bulk storage system attached to it and a 1.25 petabyte scratch storage system for dumping our restart files onto in this work. This runs at about 400 gigabytes per second, not gigabits, [and that’s] something fairly fast during.”
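Basden’s distinction between gigabytes and gigabits matters by a factor of eight. As a rough sketch, assuming decimal units and treating the quoted figures as round numbers, the conversion and its consequence for dumping a node’s memory can be worked out as follows. The idea of a restart file as large as a node’s full terabyte of RAM is an illustrative assumption, not a figure from the talk, and protocol overhead is ignored.

```python
BITS_PER_BYTE = 8

# Figures quoted in the talk, treated as round decimal numbers.
SCRATCH_GBYTES_PER_S = 400.0   # scratch storage rate, gigabytes per second
HDR_LINK_GBITS_PER_S = 200.0   # one HDR 200 InfiniBand link, gigabits per second
NODE_RAM_GBYTES = 1000.0       # "a terabyte of RAM per node"


def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert a rate in gigabytes per second to gigabits per second."""
    return gbytes_per_s * BITS_PER_BYTE


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbits_per_s / BITS_PER_BYTE


def transfer_seconds(size_gbytes: float, rate_gbytes_per_s: float) -> float:
    """Ideal time to move size_gbytes at a sustained rate."""
    return size_gbytes / rate_gbytes_per_s


if __name__ == "__main__":
    link_gbytes = gbits_to_gbytes(HDR_LINK_GBITS_PER_S)
    print("scratch rate in gigabits/s:", gbytes_to_gbits(SCRATCH_GBYTES_PER_S))
    print("one HDR link in gigabytes/s:", link_gbytes)
    # Assumption: a restart file as large as the node's full memory.
    print("seconds to dump one node over one link:",
          transfer_seconds(NODE_RAM_GBYTES, link_gbytes))
    print("links needed to match scratch rate:",
          SCRATCH_GBYTES_PER_S / link_gbytes)
```

Under these assumptions a single HDR link carries 25 gigabytes per second, so the scratch system’s quoted rate is equivalent to 16 such links running flat out.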

Currently, Basden’s team is using BlueField-1 in exploratory work to determine if it can be used, for example, to help solve delays associated with what’s called the MPI Progression Problem. His presentation is best watched directly, but broadly Basden reported progress and is hopeful that what he calls SmartNICs (i.e., the DPUs) can help with data traffic housekeeping and MPI issues.

Here are a couple of excerpts (lightly edited):

“Basically what we’re looking at doing is…the hosts are doing the science calculations. The BlueField tasks are then the ones that are responsible for moving data around if that’s appropriate. Now all of this is in fairly early stages. So we haven’t got something yet that we’d want to put into production code. It’s all ideas that are formulating and we’re trying things that we’re gradually working out,” noted Basden in his talk.

When asked about overall issues, he said:

“The first thing I’ll say is [using BlueField] is not trivial. It’s not simple. [The cards] can be run in two different modes. One is what they call an embedded mode, where it kind of acts like an embedded switch. The other is host separated mode, where both the host and your card have their own MAC addresses, they can then address each other, but they can also be addressed. That’s the mode that we’re using that we found most useful, because it gives you most flexibility, gives you most power to do exactly what you want to do,” said Basden.

“Now, of course, what we do with our codes is compile two versions of the code; compile the x86 version that runs on the host, and then an Arm version that runs on the BlueField. And then we do an MPI run that will launch all of those at once, in the right places. So that’s a useful thing to have. [It] simplifies things a lot. We find it useful — even if we’re not using MPI for communications underneath, it’s a good way of launching these tasks at the same time in the right places, etc.”
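The launch pattern Basden describes, a single MPI job that starts an x86 binary on the hosts and an Arm binary on the BlueField cards, resembles what MPI launchers call MPMD (multiple program, multiple data) mode. As a minimal sketch, assuming an Open MPI-style `mpirun` that separates application contexts with a colon, the command line could be assembled like this. The host names and binary paths are hypothetical, and the talk excerpt does not say which launcher or options the Durham team actually uses.

```python
import shlex
from typing import List, Sequence


def build_mpmd_command(
    x86_hosts: Sequence[str],
    arm_hosts: Sequence[str],
    x86_binary: str,
    arm_binary: str,
    launcher: str = "mpirun",
) -> List[str]:
    """Return an argv list that starts one rank per listed host or card."""
    if not x86_hosts or not arm_hosts:
        raise ValueError("need at least one x86 host and one Arm card")

    def context(hosts: Sequence[str], binary: str) -> List[str]:
        # One application context: rank count, placement, executable.
        return ["-np", str(len(hosts)), "--host", ",".join(hosts), binary]

    # Assumed Open MPI-style MPMD syntax: contexts separated by a literal ':'.
    return [
        launcher,
        *context(x86_hosts, x86_binary),
        ":",
        *context(arm_hosts, arm_binary),
    ]


if __name__ == "__main__":
    argv = build_mpmd_command(
        x86_hosts=["node01", "node02"],        # hypothetical host names
        arm_hosts=["node01-bf", "node02-bf"],  # hypothetical BlueField names
        x86_binary="./sim.x86_64",             # hypothetical x86 build
        arm_binary="./sim.aarch64",            # hypothetical Arm build
    )
    # Print the command for inspection rather than executing it.
    print(shlex.join(argv))
```

Building the command as a list rather than a string keeps each argument intact if it is later passed to `subprocess.run`. Mixing architectures in one job also generally requires an MPI installation built with that in mind, which this sketch does not check.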

Here are a couple of slides from his talk, but the session is best seen in full.

Link to Nvidia blog on the University of Cambridge: https://blogs.nvidia.com/blog/2021/04/14/csd3-cloud-native-supercomputer-cambridge-university/

Link to the Nvidia brief description of Cloud Native Supercomputing: https://blogs.nvidia.com/blog/2021/04/14/what-is-a-cloud-native-supercomputer/

Link to the Durham announcement: https://nvidianews.nvidia.com/news/durham-university-and-diracs-new-nvidia-infiniband-powered-supercomputer-to-accelerate-our-understanding-of-the-universe

Link to Basden’s GTC21 talk: https://gtc21.event.nvidia.com/media/1_2zpi3u45?ncid=ref-spo-38311
