How Lawrence Livermore Is Facing Exascale Power Demands

By Tiffany Trader

June 9, 2016

The old adage “you cannot improve what you do not measure” is fresh again in the age of ubiquitous data. When considering the challenges of exascale computing, power is right at the top of the list and the major leadership-class centers want to make sure they’re doing everything they can to manage the demands of power today – which can run as high as 10 MW at peak for the largest machines – and in the coming exascale era, when the number could be three times that high. At loads of this magnitude, the largest HPC facilities need to have all the relevant power data within arm’s reach.

Managing power demands is a priority at Lawrence Livermore National Laboratory (LLNL), the Department of Energy (DOE) center entrusted with ensuring nuclear security for the nation. With a peak speed of 20 petaflops, the center’s top supercomputer, Sequoia, draws more than 9 MW of power, equivalent to the energy draw of more than 10,000 average homes.

When tens of megawatts of power are on the line, advanced power management is needed to balance the highly fluctuant power demands and power availability. This requires orchestration of resources and real-time insight into the entire operational facility and energy grid. Even small interruptions during high performance compute cycles can derail the job and disrupt power grid management as well.

Facing the challenge of balancing demands at exascale, LLNL sought out the assistance of OSIsoft, a company with deep roots in data collection, aggregation and storage. OSIsoft helps LLNL track and analyze streams of operational data from computing racks, cooling systems, energy utilities and other equipment, and stores the data at a central control point for the life of the assets. This affords administrators, like Anna Maria Bailey, LLNL high performance computing facility manager, the opportunity to spot efficiency gains, glean what data is important, and coordinate forecasted load demands with utility companies in real time.

Since implementing OSIsoft’s software product, the PI system, LLNL has been able to identify troubling anomalies, including inter-hour power swings of several megawatts. The facility has also earned LEED Gold status, and LLNL reports increased operational assurance for the future of its operations and coming big iron, like Sierra, Livermore’s next advanced technology high-performance computing system, which is spec’d at 120-150 petaflops peak.

OSIsoft has been in business for 35 years building a software platform that collects, aggregates and stores high-fidelity data for the life of assets. The company connects the sensor data that has existed for some time (now commonly referred to as big data or IoT) to enable real-time decision making as well as historical performance tracking and ultimately predictive analytics.

OSIsoft started in the refining industry, then moved into the paper industry, upstream oil and gas, metals, and mining. In the last 10 years, it added datacenters to its customer list. “It was a very logical extension because we had been involved with the heavy industry of the previous industrial age, as well as now the heavy industry of the digital age,” said OSIsoft’s Steve Sarnecki, vice president of federal and public sectors. “Datacenters, especially high-performance computing datacenters, are literally the factories of the future and the type of data they generate fits very well in the software we produce, the PI system.”

When the product was expanded to commercial datacenter customers such as eBay, Dell, HP and others, OSIsoft built interfaces and data collection software to collect the data from those unique pieces of equipment or types of systems, with the aim of empowering teams to make better decisions.

Sarnecki further shared that about 80 percent of the megawatts of power generated in the US run through a PI system. All of the independent system operators (ISOs) that dispatch power within the US use the PI system, and 78 out of 104 nuclear licensees use it, with 104 out of 104 feeding their data up to the Nuclear Regulatory Commission, one of OSIsoft’s federal users, which looks at emergency response on the PI system.

Asked if the product was modified for Livermore, Sarnecki said it is the same product – his company provides the toolset for the expert who understands the business problem as well as solutions providers in the space.

“At Livermore, our job is to take the sensors in the field that are spread out all over that campus, different types, and make them intimately close to the intelligent resources be they computer simulations or be they scientists so they have immediate access to that data as if they were standing right in front of this plethora of meters at the same time,” said Sarnecki.

Livermore’s relationship with OSIsoft goes back to 2010. LLNL High Performance Computing Facility Manager Anna Maria Bailey explains that it started with the development of a high-performance computing master plan. “We were looking at how we were going to achieve petascale and exascale computing going forward,” she said. “We had created a master plan that had many core competencies in it, from sustainable HPC solutions, doing computational fluid dynamics, benchmarking, leveraging our existing HPC capabilities, facilitating LEED certifications, free cooling, liquid cooling, innovative electrical distribution and developing gap analysis – and another area was power management.”

In looking at the master plan of all the core competencies, Bailey said they all reflected a need for data, but although the data was in the institution of Livermore, it wasn’t all easily accessible within the HPC facilities. For example, when facilities asked for the metering data of particular transformers or the flow rates of particular chillers, they encountered issues with data being in different formats, or not up to date, or infrequently read or downloaded only when needed.

Livermore began looking at different organizations that could help compile this data, and Bailey, an electrical engineer who came from the utility industry, knew about OSIsoft. After determining that the software had the functionality they were looking for, a relationship was forged.

“The PI software allowed us to bring all of the numerous data streams that we had into one area,” said Bailey. “We needed to aggregate the data into a single source – not necessarily to view on a common dashboard but that is the capability – but actually to aggregate the data to manipulate on a common platform and it allowed us to determine what data was significant.”

Before having PI, Livermore was unable to correlate events from the various sources because of the different time stamps and formats, said Bailey. OSIsoft facilitates a common time stamp and format, and the PI system provides an operational-event and real-time data management infrastructure for all internal and external data sources.
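To make the time stamp problem concrete, here is a minimal Python sketch of the general idea: readings that arrive with different time stamp formats are converted to one common UTC representation and merged into a single time-ordered list. It is not the PI system's API. The function names, source names and sample readings are hypothetical, and treating time stamps without a zone as UTC is an assumption made only for illustration.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Convert an epoch number or an ISO 8601 string to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption for this sketch: zone-less time stamps are already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def merge_streams(streams):
    """Merge {source: [(timestamp, value), ...]} into one time-ordered list."""
    merged = [
        (to_utc(ts), source, value)
        for source, readings in streams.items()
        for ts, value in readings
    ]
    merged.sort(key=lambda row: row[0])
    return merged


# Hypothetical readings: one source reports epoch seconds,
# the other reports local time with a UTC offset.
streams = {
    "rack_meter": [(1465491600, 4.2)],
    "chiller": [("2016-06-09T09:59:30-07:00", 310.0)],
}
merged = merge_streams(streams)
```

Once every reading carries the same kind of time stamp, an event in one source can be lined up against what the other sources recorded at the same moment.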

PI enabled Livermore to bring in data from the rack level, the equipment level, the metering level, the building level, the management level and the utility level. With those hundreds of real-time data stream interfaces, Bailey and her team were able to gather, manage and evaluate the large amount of data, analyze it, and convert it into real-time data. The system gives the team the ability to send notifications, triggers and alarms, and provides visualizations to support decision-making.

“Our overall goal of doing this was to lower our power utilization and obviously achieve exascale that’s the long term goal because the better we use these resources, we can actually manage our facilities and infrastructure more appropriately,” said Bailey. “When Sierra, the next machine that we’re bringing online in 2017, every rack will be metered just like Sequoia is and the data will come into PI.”

The project started as a facilities operations tool, but then the team brought in some of the resource manager data from SLURM. Several scientists now use it: they migrate the data in PI out to a solver so they can fine-tune the correlated time stamps.

The facilities team uses it for performance but also for looking at anomalies. Bailey shared that while they were bringing up Sequoia, they saw some large variations in the load: recurring inter-hour swings exceeding 8 MW, because the machine was dropping from 9.6 MW to 180 kW. Maintenance was considered as a cause, but the maintenance team insisted they were not responsible for dropping the machine. Working with their utility company, Bailey’s team was able to correlate that data back to maintenance periods.

“PI was able to focus in, pick all these event stamps of the power as well as what was going on with the chilled water plant, what was going on in the condenser water plant and we were able to think it up to notice that there was a correlation at that given time,” she explained. “It helped us clue in what the problem was and give us a frame to actually shut the maintenance down slower on the machine, so now we drop it from say 7.5 MW to 5.5 MW then we wait a while, then we keep dropping it so we’re not having that large inter-hour variability.”
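The two ideas in this account, spotting large hour-to-hour swings and then stepping the load down gradually, can be sketched in a few lines of Python. This is only an illustration, not Livermore's tooling: the function names are hypothetical, the hourly load series is made up, and the 2 MW step is inferred from Bailey's "7.5 MW to 5.5 MW" example. The 9.6 MW, 180 kW and 8 MW figures come from the article.

```python
def inter_hour_swings(hourly_mw, threshold_mw):
    """Return (hour_index, delta_mw) for hour-to-hour changes above the threshold."""
    swings = []
    for i in range(1, len(hourly_mw)):
        delta = hourly_mw[i] - hourly_mw[i - 1]
        if abs(delta) > threshold_mw:
            swings.append((i, delta))
    return swings


def staged_ramp(start_mw, end_mw, max_step_mw):
    """List load setpoints from start to end with no step larger than max_step_mw."""
    if max_step_mw <= 0:
        raise ValueError("max_step_mw must be positive")
    setpoints = [start_mw]
    current = start_mw
    while abs(current - end_mw) > max_step_mw:
        current += max_step_mw if end_mw > current else -max_step_mw
        setpoints.append(current)
    if current != end_mw:
        setpoints.append(end_mw)
    return setpoints


# Made-up hourly averages showing an abrupt drop from 9.6 MW to 180 kW.
swings = inter_hour_swings([9.6, 9.6, 0.18, 0.18], threshold_mw=8.0)

# Stepped shutdown in 2 MW decrements, pausing between steps in practice.
plan = staged_ramp(7.5, 0.18, max_step_mw=2.0)
```

In this sketch the abrupt drop is flagged as a single swing of about 9.4 MW, while the stepped plan never changes the load by more than 2 MW at a time.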

There are analytics use cases too. Fellow LLNL’er Ghaleb Abdulla of the Data Science group is manipulating PI data on a large capacity resource called Cab. Bailey shared that her colleague brings the data into a solver, correlates it with the data from the machine’s nodes and builds visualizations from the result. The work made it possible to pinpoint sensors in the field that, if relocated, would yield better data.

Abdulla is also working on another project that compares two machines of the same architecture, one with a liquid cooling solution and one with an air-cooled system, working from the facility level down into the rack level and into the node level.

“He likes it because he’s got all the data in one location,” said Bailey of her colleague. “The thing that’s really nice about PI is that all of these interfaces are different so the PI interface nodes that you connect to these feeder systems that come in – can be SQL, can be HTML, can be Modbus, can be BACnet, can be any type of open protocol and as long as you have the interface node you are able to bring the data in, where we were finding other systems weren’t that flexible. You were having to bring data in, you were either having to manipulate the data first and then bring it in and then we were finding that there was incompatibility with the data, where this is nice because you bring it in and they can come in the PI server and it works really well that way.”

Bailey said that her team is expecting more use cases and they are looking at grid integration, which provides further assurance of meeting exascale-class power demands. Abdulla and Bailey are working together on figuring out strategies for fine-grained power management, coarse-grained power management, job scheduling, backup scheduling, and shutting down and shedding load.

“This is a big topic for us because as we go to exascale and we have a machine that could be 20-30 MW, the difference between the peak and shutting that unit as it goes offline could be huge to the utility,” said Bailey. “We actually met with one of our power providers who also has PI. One of our goals in the future is to have data that we can share amongst ourselves and them – they are also a DOE entity as well – that is huge for us. We are looking at collaboration with them and that’s a big challenge coming up in 2022 – how do we do grid integration with the utility having an exascale machine on the floor, having 20-30 MW in 20,000 sq ft of space, that’s just crazy. How do we take the environmental monitoring system, how do we integrate it to respond to these demand changes and how does the grid integration implementation require energy transactions to the power management system. We’re really heavily involved in that but it’s going to take some time so we use PI a lot on granular studies.”

Livermore reports real results with PI. Bailey said they’ve seen an improvement in PUE across all of the datacenters in their HPC complex, which was tied to energy savings. In the mechanical system, the environmental monitoring data coming into PI revealed some leakage issues. A chiller was going on and off line sporadically, and PI helped uncover that it had a mechanical problem. Bailey noted that the building management system doesn’t store the data long enough, so it was the data coming into PI that made it possible to determine when the unit was going on and off. They reprogrammed the system so that it would use less chilled water.
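For readers unfamiliar with the metric, PUE (power usage effectiveness) is the ratio of total facility energy to the energy delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The short Python sketch below shows the arithmetic. The function name and the kWh figures are hypothetical and are not Livermore's numbers.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power usage effectiveness: total facility energy divided by IT energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("it_equipment_kwh must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures for the same IT load before and after efficiency work.
before = pue(total_facility_kwh=1200.0, it_equipment_kwh=1000.0)
after = pue(total_facility_kwh=1100.0, it_equipment_kwh=1000.0)

# Energy no longer spent on overhead for this IT load, in kWh.
overhead_saved_kwh = (before - after) * 1000.0
```

With these made-up inputs, PUE falls from 1.2 to 1.1, which for the same 1,000 kWh of IT load means roughly 100 kWh less spent on overhead.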

So far, Livermore is OSIsoft’s only customer in the HPC facilities space. Asked about the prospect of her colleagues at other centers deploying the PI system, Bailey said there’s a need, but there’s also the matter of organizational alignment.

“I’m not matrixed into HPC, I live in HPC, so my supervisor is the same supervisor as the system administrator, as the facility operations manager as the system engineer and the system architects. We all are very aligned here,” she said. “What happens at other laboratories is that their facility people or their system engineers are matrixed in from another organization so they are not completely aligned with their line managers so it’s difficult to convince your line management that you really need this because the bottom line affects the program manager; we have the support of him which makes it huge.

“If you don’t have that support, it’s difficult. So that’s what I’ve seen with the other laboratories, a lot of them want to do it, but the way that they are structured it doesn’t allow them to have that complete backing so who’s going to pay for it, right? It always comes down to that. At our organization, we all have the same direction and the same focus so when you have everyone in alignment that this is what they need to improve their projections and to get to exascale as a common goal, you have the backing.”
