Architecting for AI Workloads

March 4, 2019

Artificial intelligence has come of age. To capitalize fully on the opportunities, organizations need to design high-performance computing architectures for AI workloads.

After years of talking about the promise of artificial intelligence, enterprises around the world are diving headfirst into AI-driven processes and business models. From financial services to manufacturing, from healthcare to retail, enterprises are “all in” with AI and its supporting computing models, notably machine learning and deep learning. The same holds true for universities and government agencies, which are using AI for countless pursuits, from driving groundbreaking scientific discoveries to protecting national security.

This widespread embrace of all things AI is fueled by the rise of more powerful processors and accelerators, advanced tools and techniques for data analytics, more precise algorithms and — most of all — an explosion of data, driven to a large degree by the Internet of Things. When you put it all together, you’ve got what it takes to put AI to work in countless applications.

Architecting for AI workloads

To capitalize fully on the opportunities in today’s data-driven world, IT organizations need to design high-performance computing architectures to accommodate demanding AI workloads. The HPC and AI community has started optimizing AI frameworks and developer tools to address performance needs, allowing much larger batch sizes to be processed on industry-standard CPUs. Within the last year, Intel® has seen training performance gains of up to 241x from frameworks optimized with the Intel® Math Kernel Library (MKL) on Intel Xeon® Scalable processors, compared with Haswell processors. Gains of this size can cut time to train from hours to minutes, and the optimizations give the ecosystem greater access to AI capabilities.
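To put the speedup figure in perspective, the following sketch converts a baseline training time into the time implied by a given speedup factor. The 10-hour baseline is a hypothetical value chosen for illustration; only the 241x factor comes from the text, and it is an "up to" figure, so real gains may be considerably smaller.

```python
def accelerated_minutes(baseline_hours: float, speedup: float) -> float:
    """Return the training time in minutes after applying a speedup factor."""
    if baseline_hours < 0 or speedup <= 0:
        raise ValueError("baseline_hours must be >= 0 and speedup must be > 0")
    return baseline_hours * 60.0 / speedup


# Hypothetical 10-hour baseline run; 241x is the best-case factor quoted above.
baseline_hours = 10.0
minutes = accelerated_minutes(baseline_hours, speedup=241.0)
print(f"{baseline_hours:.0f} h at 241x -> about {minutes:.1f} min")
```

Under these assumptions, a run that took most of a working day would finish in a few minutes, which is the "hours to minutes" shift described above.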

This shift to AI-focused infrastructure is happening today as organizations roll out systems that bring together the capabilities of HPC, data analytics and AI. This is the case with the University of Cambridge’s latest supercomputer, called Cumulus. This groundbreaking system was designed to serve as a single HPC cluster that supports researchers’ needs for data analytics, machine learning and large-scale data processing. The goal is to solve extremely difficult big data, simulation and AI challenges.

To meet this goal, the Cumulus architecture was designed to address a broad range of system challenges, including those at the compute, network, storage and software layers. A key objective was to make the infrastructure perform well for diverse, data-intensive research workloads.

The Cumulus system provides more than 2 petaflops of performance, powered by Dell EMC PowerEdge™ servers and Intel Xeon Scalable processors, all connected via the Intel Omni-Path Architecture (OPA). The system incorporates OpenStack® software to control pools of compute, storage and networking resources and make them readily accessible to users via a cloud interface.

Solving for I/O bottlenecks

This architectural foundation alone doesn’t necessarily solve today’s persistent I/O challenges in HPC clusters. Here’s the problem: While data-processing power has raced forward in recent years, storage I/O limitations have created bottlenecks that slow time to insight, particularly for researchers running data-centric workloads that interact continuously with data storage systems.

The Cumulus system removes these bottlenecks with a unique solution called the Data Accelerator (DAC), which is designed into the network topology. DAC incorporates technologies from Dell EMC, Intel and the University of Cambridge. In this architecture, the DAC nodes work in conjunction with the Distributed Namespace Environment (DNE) feature in the Lustre file system and Intel® Omni-Path switches to accelerate system I/O.

The results of this accelerated architecture are striking. With DAC under the hood, Cumulus provides more than 500 GB/s of I/O read performance. According to the university’s Research Computing Service, which operates the Cumulus cluster, that makes it the UK’s fastest HPC I/O platform.[1]
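A quick calculation shows what a read rate of this size means in practice. The sketch below estimates the ideal time to stream a dataset once at a sustained rate, using decimal units (1 TB = 1,000 GB). The 100 TB dataset and the 50 GB/s comparison rate are hypothetical values for illustration; only the 500 GB/s figure comes from the text, and real workloads rarely sustain peak throughput.

```python
def read_seconds(dataset_terabytes: float, throughput_gb_per_s: float) -> float:
    """Ideal time in seconds to read a dataset once at a sustained rate."""
    if dataset_terabytes < 0 or throughput_gb_per_s <= 0:
        raise ValueError("dataset size must be >= 0 and throughput must be > 0")
    return dataset_terabytes * 1000.0 / throughput_gb_per_s


DATASET_TB = 100.0  # hypothetical dataset size

rates = (
    ("500 GB/s (figure quoted above)", 500.0),
    ("50 GB/s (hypothetical comparison)", 50.0),
)
for label, rate in rates:
    seconds = read_seconds(DATASET_TB, rate)
    print(f"{label}: {seconds:.0f} s, or {seconds / 60.0:.1f} min")
```

Under these assumptions, the same pass over the data drops from over half an hour to a few minutes, which is the kind of difference that matters for workloads that reread their input repeatedly.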

In benchmark testing, the Cumulus system achieved an IO-500 score of 158.7, which ranked the system third on the November 2018 IO-500 list. For system users, these numbers equate to big improvements in I/O performance for data-intensive HPC and AI workloads — and faster time to insight.
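For readers unfamiliar with the benchmark, the IO-500 overall score is reported as the geometric mean of a bandwidth score (in GiB/s) and a metadata score (in kIOP/s). The sketch below shows only that final combination step. The sub-scores used here are hypothetical and are not Cumulus results.

```python
import math


def io500_style_score(bandwidth_gib_s: float, metadata_kiops: float) -> float:
    """Combine a bandwidth score and a metadata score by geometric mean."""
    if bandwidth_gib_s <= 0 or metadata_kiops <= 0:
        raise ValueError("both sub-scores must be positive")
    return math.sqrt(bandwidth_gib_s * metadata_kiops)


# Hypothetical sub-scores, chosen only to show how the mean behaves.
balanced = io500_style_score(100.0, 100.0)
lopsided = io500_style_score(400.0, 25.0)
print(f"balanced system: {balanced:.1f}")
print(f"lopsided system: {lopsided:.1f}")
```

Both hypothetical systems receive the same score, even though the second has four times the bandwidth. Because of the geometric mean, a system cannot rank highly on raw bandwidth alone; metadata performance has to keep pace.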

Building the right foundation for new and emerging workloads

For organizations searching for the right IT foundation for AI workloads, Intel offers expert insights in its high-level Guide to Developing an AI Infrastructure Strategy. The options outlined in this guide range from starting from scratch with your current systems to outsourcing your entire solution. One of these options is to build a broad platform that is designed to support a wide range of AI workloads — which is the approach the University of Cambridge took with its Cumulus system.

The guide explains: “This approach is similar to the emerging ‘platform’ architecture we now see prevalent across IT — that is, an approach that provides a highly scalable infrastructure layer that can be managed as a single pool, using virtualization and software-defined orchestration across server processing, storage and networking.”[2]

The guide presents this broad-platform infrastructure strategy in terms of a three-tier stack, with hardware, software and process layers that work together to enable AI workloads. A few highlights from this architecture:

  • At the hardware layer, communication between devices and systems is based around an ultra-high-speed backbone, such as Intel® Omni-Path Architecture.
  • The software layer includes operating system and virtualization layers, which support a library of AI-specific modules. These modules enable algorithmic processing and analytics, data management and I/O, as well as the delivery of data sources and the visualization of analysis results.
  • The process layer runs the business logic of the AI application, using library modules to deliver capabilities like image recognition.

Intel notes that this architecture results in a platform-based approach that offers a single point of configuration and a unique deployment target.
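The three-tier stack described above can be sketched as a small set of classes. This is a minimal illustration of the layering idea, not Intel's reference design: the class names, the module registry and the image-recognition stub are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HardwareLayer:
    """Compute, storage and the high-speed fabric that links them."""
    fabric: str


@dataclass
class SoftwareLayer:
    """OS and virtualization, plus a library of AI-specific modules."""
    platform: str
    modules: Dict[str, Callable[[bytes], str]] = field(default_factory=dict)

    def register(self, name: str, module: Callable[[bytes], str]) -> None:
        self.modules[name] = module


@dataclass
class ProcessLayer:
    """Business logic that calls library modules by name."""
    software: SoftwareLayer

    def run(self, module_name: str, payload: bytes) -> str:
        if module_name not in self.software.modules:
            raise KeyError(f"no module registered as {module_name!r}")
        return self.software.modules[module_name](payload)


def recognize_image_stub(image: bytes) -> str:
    """Placeholder for a real image-recognition module; returns a fixed label."""
    return "unlabeled" if not image else "label-placeholder"


hardware = HardwareLayer(fabric="Intel Omni-Path")
software = SoftwareLayer(platform="hypothetical OS + virtualization stack")
software.register("image_recognition", recognize_image_stub)
process = ProcessLayer(software=software)
print(process.run("image_recognition", b"\x89PNG"))
```

The point of the sketch is the separation of concerns: the process layer only knows module names, so modules can be added or swapped in the software layer without changing the business logic.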

Key takeaways

The rise of artificial intelligence creates unprecedented opportunities for today’s enterprises. To fully capitalize on these opportunities, your organization needs a scalable HPC infrastructure that is specifically designed to incorporate the latest processor and fabric technologies, accommodate massive amounts of data, and accelerate both data storage I/O and AI workloads.

To learn more

For a closer and more technical look at the University of Cambridge’s use of the Data Accelerator, visit the Research Computing Service’s Data Accelerator site. And for a broader look at the university’s Cumulus cluster, read the Dell EMC case study “UK Science Cloud.”

 



[1] Dell EMC case study, “UK Science Cloud,” November 2018.

[2] Intel, “Select the Best Infrastructure Strategy to Support Your AI Solution,” March 2018.

 
