How Lawrence Livermore Is Facing Exascale Power Demands

By Tiffany Trader

June 9, 2016

The old adage “you cannot improve what you do not measure” is fresh again in the age of ubiquitous data. When considering the challenges of exascale computing, power is right at the top of the list and the major leadership-class centers want to make sure they’re doing everything they can to manage the demands of power today – which can run as high as 10 MW at peak for the largest machines – and in the coming exascale era, when the number could be three times that high. At loads of this magnitude, the largest HPC facilities need to have all the relevant power data within arm’s reach.

Managing power demands is a priority at Lawrence Livermore National Laboratory (LLNL), the Department of Energy (DOE) center entrusted with ensuring nuclear security for the nation. With a peak speed of 20 petaflops, the center’s top supercomputer, Sequoia, draws more than 9 MW of power, equivalent to the energy draw of more than 1,000 average homes.

When tens of megawatts of power are on the line, advanced power management is needed to balance highly fluctuating power demands against power availability. This requires orchestration of resources and real-time insight into the entire operational facility and energy grid. Even small interruptions during high-performance compute cycles can derail the job and disrupt power grid management as well.

Facing the challenge of balancing demands at exascale, LLNL sought out the assistance of OSIsoft, a company with deep roots in data collection, aggregation and storage. OSIsoft helps LLNL track and analyze streams of operational data from computing racks, cooling systems, energy utilities and other equipment, and stores that data at a central control point for the life of the assets. This affords administrators like Anna Maria Bailey, LLNL high performance computing facility manager, the opportunity to spot efficiency gains, glean what data is important, and coordinate forecasted load demands with utility companies in real time.

Since implementing OSIsoft’s software product, the PI system, LLNL has been able to identify troubling anomalies, including inter-hour power swings of several megawatts. The facility has also earned LEED Gold status, and LLNL reports increased operational assurance for its future operations and coming big iron, like Sierra, Livermore’s next advanced technology high-performance computing system, which is spec’d at 120-150 petaflops peak.

OSIsoft has been in business for 35 years building a software platform that collects, aggregates and stores high-fidelity data for the life of assets. The company connects the sensor data that has existed for some time – now commonly referred to as big data or IoT – to enable real-time decision making as well as historical performance tracking and ultimately predictive analytics.

OSIsoft started in the refining industry, then moved into the paper industry, upstream oil and gas, metals, and mining. In the last 10 years, it added datacenters to its customer list. “It was a very logical extension because we had been involved with the heavy industry of the previous industrial age, as well as now the heavy industry of the digital age,” said OSIsoft’s Steve Sarnecki, vice president of federal and public sectors. “Datacenters, especially high-performance computing datacenters, are literally the factories of the future and the type of data they generate fits very well in the software we produce, the PI system.”

When the product was expanded to commercial datacenter operators such as eBay, Dell, HP and others, OSIsoft built interfaces and data collection software to gather data from those unique pieces of equipment or types of systems, with the aim of empowering teams to make better decisions.

Sarnecki further shared that about 80 percent of the megawatts of power generated in the US run through a PI system. All of the independent system operators (ISOs) that dispatch power within the US use the PI system, as do 78 of the 104 nuclear licensees. All 104 licensees feed their data up to the Nuclear Regulatory Commission, one of OSIsoft’s federal users, which looks at emergency response on the PI system.

Asked if the product was modified for Livermore, Sarnecki said it is the same product – his company provides the toolset for the expert who understands the business problem as well as solutions providers in the space.

“At Livermore, our job is to take the sensors in the field that are spread out all over that campus, different types, and make them intimately close to the intelligent resources be they computer simulations or be they scientists so they have immediate access to that data as if they were standing right in front of this plethora of meters at the same time,” said Sarnecki.

Livermore’s relationship with OSIsoft goes back to 2010. LLNL High Performance Computing Facility Manager Anna Maria Bailey explains that it started with the development of a high-performance computing master plan. “We were looking at how we were going to achieve petascale and exascale computing going forward,” she said. “We had created a master plan that had many core competencies in it, from sustainable HPC solutions, doing computational fluid dynamics, benchmarking, leveraging our existing HPC capabilities, facilitating LEED certifications, free cooling, liquid cooling, innovative electrical distribution and developing gap analysis – and another area was power management.”

Looking across the master plan’s core competencies, Bailey said, they all reflected a need for data. Although that data existed within the institution, it wasn’t all easily accessible within the HPC facilities. For example, when facilities asked for the metering data of particular transformers or the flow rates of particular chillers, they found that the data was in different formats, not up to date, infrequently read, or downloaded only when needed.

Livermore began looking at different organizations that could help compile this data, and Bailey, an electrical engineer who came from the utility industry, knew about OSIsoft. After determining that the software had the functionality they were looking for, a relationship was forged.

“[What] the PI software allowed us to do was bring all of the numerous data streams that we had into one area,” said Bailey. “We needed to aggregate the data into a single source – not necessarily to view on a common dashboard but that is the capability – but actually to aggregate the data to manipulate on a common platform and it allowed us to determine what data was significant.”

Before having PI, Livermore was unable to correlate events from the various sources because of the different time stamps and formats, said Bailey. OSIsoft facilitates having a common time stamp and format, and the PI system provides operational-event and real-time data management infrastructure for all internal and external data sources.
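To make the common time stamp idea concrete, here is a minimal Python sketch of normalizing readings from differently formatted feeds onto a single UTC time base so they can be correlated. It is not the PI system's API. The source names, the three time formats, and the fixed site offset are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset for a site clock on Pacific Daylight Time.
SITE_TZ = timezone(timedelta(hours=-7))


def parse_epoch(raw):
    """Seconds since the Unix epoch, as a rack meter might report."""
    return datetime.fromtimestamp(raw, tz=timezone.utc)


def parse_iso(raw):
    """ISO 8601 string that carries its own UTC offset."""
    return datetime.fromisoformat(raw).astimezone(timezone.utc)


def parse_site_local(raw):
    """US-style local time with no offset; assume the site clock."""
    local = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")
    return local.replace(tzinfo=SITE_TZ).astimezone(timezone.utc)


# Source names are illustrative only, not taken from LLNL's deployment.
PARSERS = {
    "rack_meter": parse_epoch,
    "chiller_plant": parse_iso,
    "utility_feed": parse_site_local,
}


def normalize(records):
    """Turn (source, raw_time, value) tuples into UTC-stamped rows sorted by time."""
    rows = [(PARSERS[src](raw), src, value) for src, raw, value in records]
    return sorted(rows, key=lambda row: row[0])


# Three readings that describe the same instant in three different formats.
records = [
    ("utility_feed", "06/09/2016 10:00:00", 9.6),
    ("rack_meter", 1465491600, 9.5),
    ("chiller_plant", "2016-06-09T10:00:00-07:00", 41.0),
]
rows = normalize(records)
```

Once every row carries the same kind of time stamp, events from the rack, the cooling plant and the utility can be lined up side by side, which is the correlation step Bailey says was not possible before.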

PI enabled Livermore to bring in data from the rack level, the equipment level, the metering level, the building level, the management level and the utility level. With those hundreds of real-time data stream interfaces, Bailey and her team were able to gather, manage and evaluate the large amount of data, analyze it, and convert it into real-time information. The system lets the team send notifications, triggers and alarms, and provides visualizations to support decision-making.

“Our overall goal of doing this was to lower our power utilization and obviously achieve exascale that’s the long term goal because the better we use these resources, we can actually manage our facilities and infrastructure more appropriately,” said Bailey. “When Sierra, the next machine that we’re bringing online in 2017, every rack will be metered just like Sequoia is and the data will come into PI.”

The project started as a facilities operations tool, but then the team brought in some of the resource manager data from SLURM. Several scientists now use it as well: they export the PI data to a solver so they can fine-tune the correlated time stamps.

The facilities team uses it for performance but also for looking at anomalies. Bailey shared that while they were bringing up Sequoia, they saw some large variations in the load: recurring inter-hour swings exceeding 8 MW, because the machine was dropping from 9.6 MW to 180 kW. Maintenance was considered as a cause, but the maintenance team insisted they were not responsible for dropping the machine. Working with their utility company, Bailey’s team was able to correlate that data back to maintenance periods.
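As a rough illustration of how such swings might be flagged, the Python sketch below treats inter-hour variability as the absolute change between consecutive hourly average loads. That definition is an assumption, since the article does not give one. The 8 MW threshold and the 9.6 MW and 180 kW levels come from Bailey's account; the hourly series itself is invented.

```python
def inter_hour_swings(hourly_mw):
    """Absolute change in load between each hour and the next, in MW."""
    return [abs(later - earlier) for earlier, later in zip(hourly_mw, hourly_mw[1:])]


def flag_swings(hourly_mw, threshold_mw=8.0):
    """Return (hour_index, swing_mw) for every hour-to-hour change above the threshold.

    hour_index is the index of the later hour in each pair.
    """
    swings = inter_hour_swings(hourly_mw)
    return [(i + 1, swing) for i, swing in enumerate(swings) if swing > threshold_mw]


# Invented hourly averages: the machine drops from 9.6 MW to 0.18 MW (180 kW)
# and later returns to full load.
hourly_load_mw = [9.6, 9.6, 0.18, 0.18, 9.6]
flagged = flag_swings(hourly_load_mw)
```

With this series, both the drop and the recovery are flagged, each a swing of roughly 9.4 MW.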

“PI was able to focus in, pick all these event stamps of the power as well as what was going on with the chilled water plant, what was going on in the condenser water plant and we were able to link it up to notice that there was a correlation at that given time,” she explained. “It helped us clue in what the problem was and give us a frame to actually shut the maintenance down slower on the machine, so now we drop it from say 7.5 MW to 5.5 MW then we wait a while, then we keep dropping it so we’re not having that large inter-hour variability.”
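The staged shutdown Bailey describes can be sketched as a simple schedule generator in Python. This is only an illustration under stated assumptions: the function name, the 2 MW maximum step (inferred from her 7.5 MW to 5.5 MW example), the 30-minute hold and the 0.5 MW target are hypothetical and are not LLNL's actual procedure.

```python
def staged_ramp_down(start_mw, target_mw, max_step_mw, hold_minutes):
    """Build a list of (minutes_from_start, load_mw) steps.

    Each step lowers the load by at most max_step_mw, then holds for
    hold_minutes before the next drop, until target_mw is reached.
    """
    if max_step_mw <= 0:
        raise ValueError("max_step_mw must be positive")
    schedule = [(0, start_mw)]
    elapsed = 0
    level = start_mw
    while level > target_mw:
        level = max(target_mw, level - max_step_mw)
        elapsed += hold_minutes
        schedule.append((elapsed, level))
    return schedule


# Hypothetical plan: 7.5 MW down to 0.5 MW in steps of at most 2 MW,
# holding 30 minutes between drops.
plan = staged_ramp_down(7.5, 0.5, max_step_mw=2.0, hold_minutes=30)
largest_drop = max(a[1] - b[1] for a, b in zip(plan, plan[1:]))
```

Capping the size of each step and spacing the steps out is what keeps any single hour-to-hour change small from the utility's point of view.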

There are analytics use cases too. Fellow LLNL’er Ghaleb Abdulla of the Data Science group is manipulating PI data on a large-capacity resource called Cab. Bailey shared that her colleague brings the data into a solver, correlates it with the data on the nodes of the machine and builds visualizations from the result. The work made it possible to pinpoint sensors in the field that would yield better data if they were relocated.

Abdulla is also working on a project to compare two machines of the same architecture, one with a liquid cooling solution and one with an air-cooled system, working from the facility level down to the rack level and the node level.

“He likes it because he’s got all the data in one location,” said Bailey of her colleague. “The thing that’s really nice about PI is that all of these interfaces are different so the PI interface nodes that you connect to these feeder systems that come in – can be SQL, can be HTML, can be Modbus, can be BACnet, can be any type of open protocol and as long as you have the interface node you are able to bring the data in, where we were finding other systems weren’t that flexible. You were having to bring data in, you were either having to manipulate the data first and then bring it in and then we were finding that there was incompatibility with the data, where this is nice because you bring it in and they can come in the PI server and it works really well that way.”

Bailey said that her team is expecting more use cases and they are looking at grid integration, which provides further assurance of meeting exascale-class power demands. Abdulla and Bailey are working together on strategies for fine-grained power management, coarse-grained power management, job scheduling, backup scheduling, and shutting down and shedding load.

“This is a big topic for us because as we go to exascale and we have a machine that could be 20-30 MW, the difference between the peak and shutting that unit as it goes offline could be huge to the utility,” said Bailey. “We actually met with one of our power providers who also has PI. One of our goals in the future is to have data that we can share amongst ourselves and them – they are also a DOE entity as well – that is huge for us. We are looking at collaboration with them and that’s a big challenge coming up in 2022 – how do we do grid integration with the utility having an exascale machine on the floor, having 20-30 MW in 20,000 sq ft of space, that’s just crazy. How do we take the environmental monitoring system, how do we integrate it to respond to these demand changes and how does the grid integration implementation require energy transactions to the power management system. We’re really heavily involved in that but it’s going to take some time so we use PI a lot on granular studies.”

Livermore reports real results with PI. Bailey said they’ve seen an improvement in PUE across all of the datacenters in their HPC complex, which translated into energy savings. Environmental monitoring data coming into PI revealed leakage issues in the mechanical system. A chiller was also going on and off line sporadically, and PI data showed that it had a mechanical problem. Bailey noted that the building management system doesn’t store data long enough, so the data that comes into PI was what made it possible to determine when the unit was going on and off. They reprogrammed the system so that it would use less chilled water.
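For readers unfamiliar with the metric, PUE (power usage effectiveness) is total facility energy divided by the energy delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The Python sketch below shows the arithmetic. The kWh figures are invented for illustration, since the article reports no PUE numbers for Livermore.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Invented figures for one reporting period, before and after a cooling fix.
before = pue(total_facility_kwh=1_400_000, it_equipment_kwh=1_000_000)
after = pue(total_facility_kwh=1_200_000, it_equipment_kwh=1_000_000)

# Overhead energy saved for the same IT load, in kWh.
saved_kwh = (before - after) * 1_000_000
```

In this made-up case, PUE improves from 1.4 to 1.2, and the same IT load consumes about 200,000 kWh less in total.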

So far, Livermore is OSIsoft’s only customer in the HPC facilities space. Asked about the prospect of her colleagues at other centers deploying the PI system, Bailey said there’s a need, but there’s also the matter of organizational alignment.

“I’m not matrixed into HPC, I live in HPC, so my supervisor is the same supervisor as the system administrator, as the facility operations manager as the system engineer and the system architects. We all are very aligned here,” she said. “What happens at other laboratories is that their facility people or their system engineers are matrixed in from another organization so they are not completely aligned with their line managers so it’s difficult to convince your line management that you really need this because the bottom line affects the program manager; we have the support of him which makes it huge.

“If you don’t have that support, it’s difficult. So that’s what I’ve seen with the other laboratories, a lot of them want to do it, but the way that they are structured it doesn’t allow them to have that complete backing so who’s going to pay for it, right? It always comes down to that. At our organization, we all have the same direction and the same focus so when you have everyone in alignment that this is what they need to improve their projections and to get to exascale as a common goal, you have the backing.”
