Summit Has Real-Time Analytics: Here’s How It Happened and What’s Next

By Oliver Peckham

October 3, 2019

Summit – the world’s fastest publicly ranked supercomputer – now has real-time streaming analytics. At the 2019 HPC User Forum at Argonne National Laboratory, Arno Kolster (principal and co-founder of HPC consultancy Providentia Worldwide) took the stage to explain how it happened – and what it means for the future.

The need for a smarter supercomputer

Summit launched at Oak Ridge National Laboratory (ORNL) in the second half of 2018. As of the June 2019 Top500 list, it still held the top spot among the world’s supercomputers, its 2.41 million cores delivering 148.6 Linpack petaflops. That performance comes with a correspondingly massive power draw: Summit’s power consumption is rated at around 13 megawatts, equivalent to the energy draw of over 10,000 homes. That power produces an enormous amount of heat, requiring regular operation of power-hungry water chillers for Summit’s cooling system.

For ORNL, that means that a huge priority is reducing power consumption (and its costs) wherever possible. But a major obstacle remained: there was no mechanism in place that understood Summit’s second-to-second operations at a granular enough level to effectively optimize them.

Left to right: Merle Giles, Arno Kolster and S. Ryan Quick in front of the Summit supercomputer. Image courtesy of Arno Kolster.

This led Jim Rogers, director of computing and facilities at ORNL, to seek out Kolster. Rogers and Kolster, who knew each other from a partnership some six years ago, reconnected at a conference in 2017, where Kolster was speaking about streaming analytics.

“At the end of the talk, Jim pulled me aside and said, ‘Hey, can you help us do streaming analytics on Summit? Because I’ve got a problem: I’ve got 4,600 nodes, all streaming data off them, and I have no idea what to do with them,’” Kolster recalled. “And I said, ‘Yeah, we can help you with that.’”

The goal? To have Summit’s immense data streaming directly off the nodes in real time using a resilient system that could be scaled up without proportional staff increases.

Summiting a mountain of data

Kolster broke down the magnitude of the data at hand: first and foremost, there were 4,608 nodes, each with 99 metrics to capture per second – most importantly, power to the fans, the node and individual components within the node, such as individual CPU cores, GPU cores, DIMMs and HBMs. Outside the node, there was data from the job scheduler, polled every ten seconds; weather data from the National Oceanic and Atmospheric Administration (NOAA) once an hour; and continuous water flow data from the chillers. 

All in all, about 460,000 metrics per second – with an eye toward expansion.
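As a rough check on that figure, the node-level metrics alone can be tallied as follows. This is a back-of-the-envelope sketch, not Providentia’s own accounting: it assumes every one of the 99 per-node metrics arrives once per second and ignores the scheduler, weather and chiller feeds, which are small by comparison.

```python
# Back-of-the-envelope tally of Summit's node-level metric rate.
# Assumption: all 99 per-node metrics are emitted once per second;
# scheduler, NOAA weather and chiller data are left out of the count.
NODES = 4_608
METRICS_PER_NODE = 99
SECONDS_PER_DAY = 86_400

node_metrics_per_second = NODES * METRICS_PER_NODE
metrics_per_day = node_metrics_per_second * SECONDS_PER_DAY

print(f"{node_metrics_per_second:,} metrics per second")  # 456,192
print(f"{metrics_per_day:,} metrics per day")             # 39,414,988,800
```

That is roughly 456,000 per second from the nodes, in line with the “about 460,000” total once the other sources are added, and on the order of 39 billion data points per day.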

This didn’t worry Providentia – the founders had handled upwards of seven million events when they were at PayPal – and so, about a year ago, Providentia set about the long process of building streaming analytics for Summit. “The majority of the time was spent on [addressing] the legal hurdles for small business to work with the government. That took probably a couple of months,” Kolster said in an interview with HPCwire. “The other long-term time was spent on crafting the statement of work in a way that we were both happy with it.” Providentia also had to navigate around Summit’s tight security and scheduling. In the end, Kolster said, it was about three months of development over an eight-month period – all done remotely on three small nodes.

Providentia began with a Kafka-based event message bus linked into the data sources. It added data persistence tools: Prometheus as a time series database and Elasticsearch for log metrics and understanding, among others. Docker was used to containerize and scale, and Spark streaming was added for on-the-wire data analytics. Finally, Grafana and Seaborn came into play for data visualization. (“Young people like to play with this stuff,” Kolster said of the long list of technologies, “so it’s a way of getting some of the younger people involved with HPC.”)

And the result? “It’s pretty spectacular,” Kolster said. Near-instant, agnostic data that could be custom-formatted; overlapping metrics with real-time visualizations. Kolster pulled up an example of one of the visualizations, thousands of glittering green cells fading in and out as Summit’s power-per-job fluctuated across its nodes.

A still from a visualization of Summit’s power per job per second fluctuating across its nodes. Image courtesy of Arno Kolster.

A new paradigm

“It’s a new paradigm,” Kolster said. “There’s no more looking at databases for data. There’s no more waiting until tomorrow to look at the data. It’s basically real-time data. What you see right now is what’s happening right now.”

“The largest supercomputer in the world is now being micromanaged by microservices – a cloud thing,” he continued. 

For now, the infrastructure primarily provides instant analytics and visualization, which help system operators manually adjust Summit’s cooling and optimize job scheduling. It already has a couple of neat tricks up its sleeve, though – notably, the ability to alert operators if the temperature in a specific area goes up a certain amount. Kolster also hopes that the new infrastructure will help clients ask (and answer) crucial questions, such as “why is my job spending more time on the CPU than on the GPU?” or “why does my job consume more power than someone else’s job?”
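The temperature alert mentioned above could be sketched roughly as follows. This is an illustration only, not Providentia’s implementation: the class name `RiseAlert`, the area labels, the 5-degree threshold and the window size are all hypothetical, and a production version would presumably run inside the streaming layer rather than as a standalone class.

```python
from collections import defaultdict, deque


class RiseAlert:
    """Flag an area whose temperature has risen by at least `threshold`
    degrees above the lowest of its last `window` readings.
    (Hypothetical helper; names and defaults are illustrative.)"""

    def __init__(self, threshold=5.0, window=60):
        self.threshold = threshold
        # One bounded history per area; old readings fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, area, temp_c):
        """Record a reading and return True if it should raise an alert."""
        readings = self.history[area]
        readings.append(temp_c)
        rise = temp_c - min(readings)
        return rise >= self.threshold


alert = RiseAlert(threshold=5.0, window=3)
results = [alert.observe("row-A", t) for t in (30.0, 32.0, 35.5, 36.0)]
print(results)  # [False, False, True, False]
```

Comparing against the minimum of a short sliding window, rather than a fixed baseline, means the alert fires on a sudden rise but goes quiet again once the area settles at its new temperature.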

Still, Kolster seems to have his heart truly set on “phase two” of the project, which (for now) remains a speculative endeavor. Phase two would involve leveraging the massive data stream for robust predictive analytics that would, for example, allow Summit to automatically schedule jobs to cooler areas of the cluster. “You could actually have the job scheduler be smart enough to schedule jobs according to their power consumption, based on historical metrics,” Kolster said. “And that’s very powerful, because that’s something that can be done right now.”

“That’s basically where things are heading,” he continued. “You hear about predictive analytics, prescriptive analytics – the basic problem right now is that everyone’s reacting to things instead of being proactive about it. And so we’ve always been more, you know, ‘You’ve got the machinery, you’ve got the computers, you have the analytics – let’s be more proactive about how things are working.’”

Looking ahead

Whether or not Providentia is invited back for phase two of Summit’s analytics infrastructure, Kolster is happy with the results. “We’d love to finish off the second phase of the Oak Ridge project, because we have some really interesting things around AI and machine learning that we think we can bring to bear there,” he said. “But we know full well that they’ve also got some really smart people there that might want to delve into those areas on their own. We’ve built the ‘highway,’ if you will, for them to move cars and trucks around, and now they can do whatever they want with the on-ramps and off-ramps.”

Musing on future applications of Providentia’s approach to Summit, Kolster said that he would prefer to showcase the “full vision” rather than arriving in medias res. “I would rather do it up front and be part of the stack that goes in instead of doing it afterwards and retrofitting it in,” he said. He mentioned that Providentia is talking to two different verticals where the model can be used – and, of course, he said that they would love to work with Frontier, which is expected to be the world’s most powerful system when it launches in 2021.

“It’s not because it’s a new thing,” Kolster said of organizations’ interest in this approach. “It’s just that people … don’t understand – moving messages around by the millions of messages a second, they don’t understand that this can be accomplished. … And then it opens up a whole new discussion as to possibilities they never knew existed.”
