How Lawrence Livermore Is Facing Exascale Power Demands

By Tiffany Trader

June 9, 2016

The old adage “you cannot improve what you do not measure” is fresh again in the age of ubiquitous data. When considering the challenges of exascale computing, power is right at the top of the list and the major leadership-class centers want to make sure they’re doing everything they can to manage the demands of power today – which can run as high as 10 MW at peak for the largest machines – and in the coming exascale era, when the number could be three times that high. At loads of this magnitude, the largest HPC facilities need to have all the relevant power data within arm’s reach.

Managing power demands is a priority at Lawrence Livermore National Laboratory (LLNL), the Department of Energy (DOE) center entrusted with ensuring nuclear security for the nation. With a peak speed of 20 petaflops, the center’s top supercomputer, Sequoia, draws more than 9 MW of power, equivalent to the energy draw of more than 10,000 average homes.

When tens of megawatts of power are on the line, advanced power management is needed to balance the highly fluctuant power demands and power availability. This requires orchestration of resources and real-time insight into the entire operational facility and energy grid. Even small interruptions during high performance compute cycles can derail the job and disrupt power grid management as well.

Facing the challenge of balancing demands at exascale, LLNL sought out the assistance of OSIsoft, a company with deep roots in data collection, aggregation and storage. OSIsoft helps LLNL track and analyze streams of operational data from computing racks, cooling systems, energy utilities and other equipment, and stores the data at a central control point for the life of the assets. This affords administrators like Anna Maria Bailey, LLNL’s high performance computing facility manager, the opportunity to spot efficiency gains, glean what data is important, and coordinate forecasted load demands with utility companies in real time.

Since implementing OSIsoft’s software product, the PI system, LLNL has been able to identify troubling anomalies, including inter-hour power swings of several megawatts. The facility has also earned LEED Gold status, and LLNL reports increased operational assurance for its future operations and coming big iron, like Sierra, Livermore’s next advanced technology high-performance computing system, which is spec’d at 120-150 petaflops peak.

OSIsoft has been in business for 35 years building a software platform that collects, aggregates and stores high-fidelity data for the life of assets. The company connects the sensor data that has existed for some time – now commonly referred to as big data or IoT – to enable real-time decision making as well as historical performance tracking and ultimately predictive analytics.

OSIsoft started in the refining industry — then moved into the paper industry, upstream oil and gas, metals, and mining. In the last 10 years, it added datacenters to its customer list. “It was a very logical extension because we had been involved with the heavy industry of the previous industrial age, as well as now the heavy industry of the digital age,” said OSIsoft’s Steve Sarnecki, vice president of federal and public sectors. “Datacenters, especially high-performance computing datacenters, are literally the factories of the future and the type of data they generate fits very well in the software we produce, the PI system.”

When the product was expanded to commercial datacenters operated by eBay, Dell, HP and others, OSIsoft built interfaces and data collection software to gather data from those unique pieces of equipment or types of systems, with the aim of empowering teams to make better decisions.

Sarnecki further shared that about 80 percent of the megawatts of power generated in the US run through a PI system. 100 percent of the independent system operators (ISOs) that do dispatch of power within the US use the PI system and 78 out of 104 nuclear licensees use the PI system with 104 out of 104 feeding their data up to the Nuclear Regulatory Commission, who is one of OSIsoft’s federal users that looks at emergency response on the PI system.

Asked if the product was modified for Livermore, Sarnecki said it is the same product – his company provides the toolset for the expert who understands the business problem as well as solutions providers in the space.

“At Livermore, our job is to take the sensors in the field that are spread out all over that campus, different types, and make them intimately close to the intelligent resources be they computer simulations or be they scientists so they have immediate access to that data as if they were standing right in front of this plethora of meters at the same time,” said Sarnecki.

Livermore’s relationship with OSIsoft goes back to 2010. LLNL High Performance Computing Facility Manager Anna Maria Bailey explains that it started with the development of a high-performance computing master plan. “We were looking at how we were going to achieve petascale and exascale computing going forward,” she said. “We had created a master plan that had many core competencies in it, from sustainable HPC solutions, doing computational fluid dynamics, benchmarking, leveraging our existing HPC capabilities, facilitating LEED certifications, free cooling, liquid cooling, innovative electrical distribution and developing gap analysis – and another area was power management.”

In looking at the master plan of all the core competencies, Bailey said they all reflected a need for data, but although the data was in the institution of Livermore, it wasn’t all easily accessible within the HPC facilities. For example, when facilities asked for the metering data of particular transformers or the flow rates of particular chillers, they encountered issues with data being in different formats, or not up to date, or infrequently read or downloaded only when needed.

Livermore began looking at different organizations that could help compile this data, and Bailey, an electrical engineer who came from the utility industry, knew about OSIsoft. After determining that the software had the functionality they were looking for, a relationship was forged.

“The PI software allowed us to bring all of the numerous data streams that we had into one area,” said Bailey. “We needed to aggregate the data into a single source – not necessarily to view on a common dashboard but that is the capability – but actually to aggregate the data to manipulate on a common platform and it allowed us to determine what data was significant.”

Before having PI, Livermore was unable to correlate events from the various sources because of the different time stamps and formats, said Bailey. OSIsoft’s software facilitates a common time stamp and format, and the PI system provides an operational event and real-time data management infrastructure for all internal and external data sources.
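To illustrate the kind of alignment described here, the following is a minimal Python sketch, not LLNL's or OSIsoft's actual code. The two feeds, their time stamp formats, the function names and the sample readings are all hypothetical. It converts both formats to UTC so that readings from different sources can be placed on one timeline.

```python
from datetime import datetime, timezone

def parse_meter_stamp(text):
    """Hypothetical power-meter feed: ISO 8601 text with a UTC offset."""
    return datetime.fromisoformat(text).astimezone(timezone.utc)

def parse_chiller_stamp(epoch_seconds):
    """Hypothetical chiller feed: seconds since the Unix epoch."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

def merge_streams(meter_rows, chiller_rows):
    """Return all readings as (utc_time, source, value), sorted by time."""
    merged = [(parse_meter_stamp(t), "meter_mw", v) for t, v in meter_rows]
    merged += [(parse_chiller_stamp(t), "chiller_gpm", v) for t, v in chiller_rows]
    # sorted() is stable, so readings with equal time stamps keep input order.
    return sorted(merged, key=lambda row: row[0])

# Made-up sample readings, for illustration only.
meter_rows = [
    ("2016-06-09T01:00:00-07:00", 9.6),
    ("2016-06-09T02:00:00-07:00", 0.18),
]
chiller_rows = [(1465459200, 1200.0)]  # 2016-06-09 08:00:00 UTC

timeline = merge_streams(meter_rows, chiller_rows)
for when, source, value in timeline:
    print(when.isoformat(), source, value)
```

Once every reading carries the same kind of time stamp, events from separate systems that happen at the same moment line up next to each other, which is the precondition for the correlation work Bailey describes.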

PI enabled Livermore to bring in data from the rack level, the equipment level, the metering level, the building level, the management level and the utility level. With those hundreds of real-time data stream interfaces, Bailey and her team were able to gather, manage and evaluate the large amount of data, analyze it, and convert it into real-time information. The system gives the team the ability to send notifications, triggers and alarms, and provides visualizations to support decision-making.

“Our overall goal of doing this was to lower our power utilization and obviously achieve exascale that’s the long term goal because the better we use these resources, we can actually manage our facilities and infrastructure more appropriately,” said Bailey. “When Sierra, the next machine that we’re bringing online in 2017, every rack will be metered just like Sequoia is and the data will come into PI.”

The project started as a facilities operations tool, but then the team brought in some of the resource manager data from SLURM. So now they have several scientists who use it and they use a solver on it. They migrate the data in PI out to a solver, so they can fine tune the correlated time stamps.

The facilities team uses it for performance but also for looking at anomalies. Bailey shared that while they were bringing up Sequoia, they saw some large variations in the load; specifically, there were recurring inter-hour variabilities exceeding 8 MW because the machine was dropping from 9.6 MW to 180 kW. Maintenance was considered as a cause, but they insisted they were not responsible for dropping the machines. Working with their utility company, Bailey’s team was able to correlate that data back to maintenance periods.
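As a rough illustration of how such swings could be flagged, here is a short Python sketch. It assumes that "inter-hour variability" means the change in load from one hourly reading to the next, which the article does not define, and the function name and sample series are hypothetical. The 8 MW threshold and the 9.6 MW and 180 kW levels come from Bailey's account.

```python
def inter_hour_swings(hourly_mw, threshold_mw=8.0):
    """Return (hour_index, change_mw) for each hour-to-hour change
    whose magnitude exceeds threshold_mw."""
    swings = []
    for hour in range(1, len(hourly_mw)):
        change = hourly_mw[hour] - hourly_mw[hour - 1]
        if abs(change) > threshold_mw:
            swings.append((hour, round(change, 2)))
    return swings

# Hypothetical hourly load in MW: full load, a drop to idle, then recovery.
hourly_mw = [9.6, 9.6, 0.18, 0.18, 9.6]
print(inter_hour_swings(hourly_mw))  # [(2, -9.42), (4, 9.42)]
```

A drop from 9.6 MW to 180 kW is a change of 9.42 MW, which is why it clears the 8 MW mark.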

“PI was able to focus in, pick all these event stamps of the power as well as what was going on with the chilled water plant, what was going on in the condenser water plant and we were able to think it up to notice that there was a correlation at that given time,” she explained. “It helped us clue in what the problem was and give us a frame to actually shut the maintenance down slower on the machine, so now we drop it from say 7.5 MW to 5.5 MW then we wait a while, then we keep dropping it so we’re not having that large inter-hour variability.”
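The stepped shutdown Bailey describes can be sketched in a few lines of Python. This is an illustration only, not LLNL's procedure: the function name is hypothetical, the 2 MW step is inferred from her example of going from 7.5 MW to 5.5 MW, the 180 kW end point is borrowed from the idle level mentioned earlier, and the waiting time between steps is left out because she does not give one.

```python
def ramp_down_steps(start_mw, target_mw, max_step_mw=2.0):
    """List the successive load levels for a stepped ramp-down, never
    dropping more than max_step_mw at once. Assumes start_mw >= target_mw."""
    steps = []
    level = start_mw
    while level - target_mw > max_step_mw:
        level -= max_step_mw
        steps.append(round(level, 3))
    steps.append(target_mw)  # final, possibly smaller, step
    return steps

print(ramp_down_steps(7.5, 0.18))  # [5.5, 3.5, 1.5, 0.18]
```

Each entry is a level the operators would hold before continuing, so no single step exceeds the chosen limit.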

There are analytics use cases too. Fellow LLNL’er Ghaleb Abdulla of the Data Science group is manipulating PI data on a large capacity resource called Cab. Bailey shared that her colleague brings the data into a solver and correlates it with the data that’s on the node of the machine and does some visualizations off of that. The work made it possible to pinpoint sensor locations in the field that if moved around would get better data.

Abdulla is also working on another project that compares two machines with the same architecture, one liquid-cooled and one air-cooled, working from the facility level down to the rack level and the node level.

“He likes it because he’s got all the data in one location,” said Bailey of her colleague. “The thing that’s really nice about PI is that all of these interfaces are different so the PI interface nodes that you connect to these feeder systems that come in – can be SQL, can be HTML, can be Modbus, can be BACnet, can be any type of open protocol and as long as you have the interface node you are able to bring the data in, where we were finding other systems weren’t that flexible. You were having to bring data in, you were either having to manipulate the data first and then bring it in and then we were finding that there was incompatibility with the data, where this is nice because you bring it in and they can come in the PI server and it works really well that way.”

Bailey said that her team is expecting more use cases and is looking at grid integration, which provides further assurance of meeting exascale-class power demands. Abdulla and Bailey are working together on strategies for fine-grained power management, coarse-grained power management, job scheduling, backup scheduling, and shutting down and shedding load.

“This is a big topic for us because as we go to exascale and we have a machine that could be 20-30 MW, the difference between the peak and shutting that unit as it goes offline could be huge to the utility,” said Bailey. “We actually met with one of our power providers who also has PI. One of our goals in the future is to have data that we can share amongst ourselves and them – they are also a DOE entity as well – that is huge for us. We are looking at collaboration with them and that’s a big challenge coming up in 2022 – how do we do grid integration with the utility having an exascale machine on the floor, having 20-30 MW in 20,000 sq ft of space, that’s just crazy. How do we take the environmental monitoring system, how do we integrate it to respond to these demand changes and how does the grid integration implementation require energy transactions to the power management system. We’re really heavily involved in that but it’s going to take some time so we use PI a lot on granular studies.”

Livermore reports real results with PI. Bailey said they’ve seen an improvement in PUE (power usage effectiveness) across all of the datacenters in their HPC complex, which was tied to energy savings. In the mechanical system, environmental monitoring data coming into PI showed that they were having some leakage issues. PI data also revealed that a chiller that was going on- and offline sporadically actually had a mechanical problem. Bailey noted that the building management system doesn’t store the data long enough, so the data that comes into PI was what made it possible to determine when the unit was going on and off. They reprogrammed the system so that it would use less chilled water.
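For readers unfamiliar with the metric, PUE (power usage effectiveness) is the ratio of total facility power to the power delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The Python sketch below shows the arithmetic. The figures are invented for illustration and are not Livermore's numbers.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

# Hypothetical readings: same IT load, lower cooling overhead afterwards.
before = pue(total_facility_kw=12000, it_equipment_kw=9600)
after = pue(total_facility_kw=11520, it_equipment_kw=9600)
print(before, after)
```

In this made-up case the IT load is unchanged, and the improvement comes entirely from the facility drawing less power for everything else.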

So far, Livermore is OSIsoft’s only customer in the HPC facilities space. Asked about the prospect of her colleagues at other centers deploying the PI system, Bailey said there’s a need, but there’s also the matter of organizational alignment.

“I’m not matrixed into HPC, I live in HPC, so my supervisor is the same supervisor as the system administrator, as the facility operations manager as the system engineer and the system architects. We all are very aligned here,” she said. “What happens at other laboratories is that their facility people or their system engineers are matrixed in from another organization so they are not completely aligned with their line managers so it’s difficult to convince your line management that you really need this because the bottom line affects the program manager; we have the support of him which makes it huge.

“If you don’t have that support, it’s difficult. So that’s what I’ve seen with the other laboratories, a lot of them want to do it, but the way that they are structured it doesn’t allow them to have that complete backing so who’s going to pay for it, right? It always comes down to that. At our organization, we all have the same direction and the same focus so when you have everyone in alignment that this is what they need to improve their projections and to get to exascale as a common goal, you have the backing.”

HPCwire