The old adage “you cannot improve what you do not measure” is fresh again in the age of ubiquitous data. When considering the challenges of exascale computing, power is right at the top of the list and the major leadership-class centers want to make sure they’re doing everything they can to manage the demands of power today – which can run as high as 10 MW at peak for the largest machines – and in the coming exascale era, when the number could be three times that high. At loads of this magnitude, the largest HPC facilities need to have all the relevant power data within arm’s reach.
Managing power demands is a priority at Lawrence Livermore National Laboratory (LLNL), the Department of Energy (DOE) center entrusted with ensuring nuclear security for the nation. With a peak speed of 20 petaflops, the center’s top supercomputer, Sequoia, draws more than 9 MW of power, equivalent to the energy draw of more than 1,000 average homes.
When tens of megawatts of power are on the line, advanced power management is needed to balance highly variable power demand against power availability. This requires orchestration of resources and real-time insight into the entire operational facility and energy grid. Even small interruptions during high performance compute cycles can derail the job and disrupt power grid management as well.
Facing the challenge of balancing demands at exascale, LLNL sought out the assistance of OSIsoft, a company with deep roots in data collection, aggregation and storage. OSIsoft helps LLNL track and analyze streams of operational data from computing racks, cooling systems, energy utilities and other equipment, and stores that data at a central control point for the life of the assets. This affords administrators, like Anna Maria Bailey, LLNL high performance computing facility manager, the opportunity to spot efficiency gains, glean what data is important, and coordinate forecasted load demands with utility companies in real time.
Since implementing OSIsoft’s software product, the PI system, LLNL has been able to identify troubling anomalies, including inter-hour power swings of several megawatts. The facility has also earned LEED Gold status, and LLNL reports increased operational assurance for the future of its operations and coming big iron, like Sierra, Livermore’s next advanced technology high-performance computing system, which is spec’d at 120-150 petaflops peak.
OSIsoft has been in business for 35 years building a software platform that collects, aggregates and stores high-fidelity data for the life of assets. The company connects the sensor data that has existed for some time – now commonly referred to as big data or IoT – to enable real-time decision making as well as historical performance tracking and ultimately predictive analytics.
OSIsoft started in the refining industry, then moved into the paper industry, upstream oil and gas, metals, and mining. In the last 10 years, it added datacenters to its customer list. “It was a very logical extension because we had been involved with the heavy industry of the previous industrial age, as well as now the heavy industry of the digital age,” said OSIsoft’s Steve Sarnecki, vice president of federal and public sectors. “Datacenters, especially high-performance computing datacenters, are literally the factories of the future and the type of data they generate fits very well in the software we produce, the PI system.”
When the product was expanded to commercial datacenters such as those of eBay, Dell, HP and others, OSIsoft built interfaces and data collection software to gather data from those unique pieces of equipment or types of systems, with the aim of empowering teams to make better decisions.
Sarnecki further shared that about 80 percent of the megawatts of power generated in the US run through a PI system. All of the independent system operators (ISOs) that dispatch power within the US use the PI system, and 78 out of 104 nuclear licensees use it, with 104 out of 104 feeding their data up to the Nuclear Regulatory Commission, one of OSIsoft’s federal users, which looks at emergency response on the PI system.
Asked if the product was modified for Livermore, Sarnecki said it is the same product – his company provides the toolset for the expert who understands the business problem as well as solutions providers in the space.
“At Livermore, our job is to take the sensors in the field that are spread out all over that campus, different types, and make them intimately close to the intelligent resources be they computer simulations or be they scientists so they have immediate access to that data as if they were standing right in front of this plethora of meters at the same time,” said Sarnecki.
Livermore’s relationship with OSIsoft goes back to 2010. LLNL High Performance Computing Facility Manager Anna Maria Bailey explains that it started with the development of a high-performance computing master plan. “We were looking at how we were going to achieve petascale and exascale computing going forward,” she said. “We had created a master plan that had many core competencies in it, from sustainable HPC solutions, doing computational fluid dynamics, benchmarking, leveraging our existing HPC capabilities, facilitating LEED certifications, free cooling, liquid cooling, innovative electrical distribution and developing gap analysis – and another area was power management.”
Looking across the core competencies in the master plan, Bailey said they all reflected a need for data. Although the data existed within Livermore as an institution, it wasn’t all easily accessible within the HPC facilities. For example, when facilities asked for the metering data of particular transformers or the flow rates of particular chillers, they found that the data was in different formats, was not up to date, or was read infrequently or downloaded only when needed.
Livermore began looking at different organizations that could help compile this data, and Bailey, an electrical engineer who came from the utility industry, knew about OSIsoft. After determining that the software had the functionality they were looking for, a relationship was forged.
“What the PI software allowed us to do was bring all of the numerous data streams that we had into one area,” said Bailey. “We needed to aggregate the data into a single source – not necessarily to view on a common dashboard but that is the capability – but actually to aggregate the data to manipulate on a common platform and it allowed us to determine what data was significant.”
Before having PI, Livermore was unable to correlate events from the various sources because of the different time stamps and formats, said Bailey. OSIsoft facilitates a common time stamp and format, and the PI system provides an operational event and real-time data management infrastructure for all internal and external data sources.
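To make the time stamp problem concrete, here is a minimal sketch of normalizing readings from two feeds onto a common UTC time base so that events can be compared. It is an illustration only, not how the PI system works internally: the function name `to_utc`, the two format strings, the sample readings and the fixed UTC offset are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc(raw, fmt, utc_offset_hours=0):
    """Parse a source-specific timestamp string into an aware UTC datetime.

    `fmt` and `utc_offset_hours` describe how the (hypothetical) source
    system writes its local time.
    """
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(raw, fmt).replace(tzinfo=source_tz)
    return local.astimezone(timezone.utc)


# Hypothetical electrical meter logging local time (UTC-7), US date order.
meter = to_utc("03/14/2015 10:15:00", "%m/%d/%Y %H:%M:%S", utc_offset_hours=-7)

# Hypothetical chiller controller logging ISO-style timestamps in UTC.
chiller = to_utc("2015-03-14T17:15:00", "%Y-%m-%dT%H:%M:%S")

# Once both are on the same time base, the two events line up.
same_instant = meter == chiller
print(meter.isoformat(), chiller.isoformat(), same_instant)
```

With every stream converted this way, a power drop and a cooling-plant event recorded by different systems can be placed on one timeline.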
PI enabled Livermore to bring in data from the rack level, the equipment level, the metering level, the building level, the management level and the utility level. With those hundreds of real-time data stream interfaces, Bailey and her team were able to gather and manage the large amount of data, analyze it, and convert it into real-time information. The system gives the team the ability to send notifications, triggers and alarms, and provides visualizations to support decision-making.
“Our overall goal of doing this was to lower our power utilization and obviously achieve exascale – that’s the long-term goal – because the better we use these resources, we can actually manage our facilities and infrastructure more appropriately,” said Bailey. “When Sierra, the next machine that we’re bringing online in 2017, [arrives], every rack will be metered just like Sequoia is and the data will come into PI.”
The project started as a facilities operations tool, but then the team brought in some of the resource manager data from SLURM. Several scientists now use it as well: they migrate the data in PI out to a solver so they can fine-tune the correlated time stamps.
The facilities team uses it for performance but also for looking at anomalies. Bailey shared that while they were bringing up Sequoia, they saw some large variations in the load. Specifically, there were recurring inter-hour variabilities exceeding 8 MW because the machine was dropping from 9.6 MW to 180 kW. Maintenance was considered as a cause, but they insisted they were not responsible for dropping the machines. Working with their utility company, Bailey’s team was able to correlate that data back to maintenance periods.
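As a rough illustration of how swings of that size could be flagged, the sketch below takes a list of hourly load readings and reports every hour-to-hour change above a threshold. The article does not define how Livermore computes inter-hour variability, so treating it as the absolute difference between consecutive hourly readings is an assumption, as are the function names and the sample series (which reuses the 9.6 MW and 180 kW figures above).

```python
def inter_hour_swings(hourly_mw):
    """Absolute change in load (MW) between consecutive hourly readings."""
    return [abs(later - earlier)
            for earlier, later in zip(hourly_mw, hourly_mw[1:])]


def flag_swings(hourly_mw, threshold_mw):
    """Return (hour_index, swing_mw) for each swing above the threshold.

    hour_index is the position of the later reading in each pair.
    """
    swings = inter_hour_swings(hourly_mw)
    return [(i + 1, swing) for i, swing in enumerate(swings)
            if swing > threshold_mw]


# Hypothetical hourly series: full load, a drop to 0.18 MW, then recovery.
loads = [9.6, 9.6, 0.18, 0.18, 9.6]
flagged = flag_swings(loads, threshold_mw=8.0)

for hour, swing in flagged:
    print(f"hour {hour}: swing of {swing:.2f} MW")
```

On this sample series, the drop and the recovery are each flagged as a swing of roughly 9.4 MW.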
“PI was able to focus in, pick all these event stamps of the power as well as what was going on with the chilled water plant, what was going on in the condenser water plant and we were able to link it up to notice that there was a correlation at that given time,” she explained. “It helped us clue in what the problem was and give us a frame to actually shut the maintenance down slower on the machine, so now we drop it from say 7.5 MW to 5.5 MW then we wait a while, then we keep dropping it so we’re not having that large inter-hour variability.”
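The staged shutdown Bailey describes can be sketched as a list of load setpoints in which no single step exceeds a chosen limit. This is an assumption-laden illustration rather than Livermore’s actual procedure: the function name, the 2 MW step (taken from her 7.5 MW to 5.5 MW example) and the 0.18 MW floor (the 180 kW figure cited earlier) are chosen for the example, and the waiting time between steps is left to the operator.

```python
def ramp_down_steps(start_mw, target_mw, max_step_mw):
    """Return successive load setpoints from start_mw down to target_mw.

    Each setpoint is at most max_step_mw below the previous one; the
    caller is expected to wait between setpoints.
    """
    if max_step_mw <= 0:
        raise ValueError("max_step_mw must be positive")
    steps = []
    level = start_mw
    # Take full-size steps while more than one step of load remains.
    while level - target_mw > max_step_mw:
        level -= max_step_mw
        steps.append(round(level, 3))
    # The final step lands exactly on the target.
    steps.append(target_mw)
    return steps


plan = ramp_down_steps(start_mw=7.5, target_mw=0.18, max_step_mw=2.0)
print(plan)  # [5.5, 3.5, 1.5, 0.18]
```

Spacing the steps far enough apart is what keeps the change within any single hour small, in contrast to a single drop of more than 9 MW.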
There are analytics use cases too. Fellow LLNL’er Ghaleb Abdulla of the Data Science group is manipulating PI data on a large capacity resource called Cab. Bailey shared that her colleague brings the data into a solver and correlates it with the data that’s on the node of the machine and does some visualizations off of that. The work made it possible to pinpoint sensor locations in the field that if moved around would get better data.
Abdulla is also working on another project that compares two machines of the same architecture, one with a liquid cooling solution and one with an air-cooled system, working from the facility level down to the rack level and the node level.
“He likes it because he’s got all the data in one location,” said Bailey of her colleague. “The thing that’s really nice about PI is that all of these interfaces are different so the PI interface nodes that you connect to these feeder systems that come in – can be SQL, can be HTML, can be Modbus, can be BACnet, can be any type of open protocol and as long as you have the interface node you are able to bring the data in, where we were finding other systems weren’t that flexible. You were having to bring data in, you were either having to manipulate the data first and then bring it in and then we were finding that there was incompatibility with the data, where this is nice because you bring it in and they can come in the PI server and it works really well that way.”
Bailey said that her team is expecting more use cases and is looking at grid integration, which provides further assurance of meeting exascale-class power demands. Abdulla and Bailey are working together on strategies for fine-grained power management, coarse-grained power management, job scheduling, backup scheduling, and shutting down and shedding load.
“This is a big topic for us because as we go to exascale and we have a machine that could be 20-30 MW, the difference between the peak and shutting that unit as it goes offline could be huge to the utility,” said Bailey. “We actually met with one of our power providers who also has PI. One of our goals in the future is to have data that we can share amongst ourselves and them – they are also a DOE entity as well – that is huge for us. We are looking at collaboration with them and that’s a big challenge coming up in 2022 – how do we do grid integration with the utility having an exascale machine on the floor, having 20-30 MW in 20,000 sq ft of space, that’s just crazy. How do we take the environmental monitoring system, how do we integrate it to respond to these demand changes and how does the grid integration implementation require energy transactions to the power management system. We’re really heavily involved in that but it’s going to take some time so we use PI a lot on granular studies.”
Livermore reports real results with PI. Bailey said they’ve seen an improvement in PUE across all of the datacenters in their HPC complex, which was tied to an energy savings. In the mechanical system, the environmental monitoring data coming into PI revealed some leakage issues. A chiller was going on and off line sporadically, and the PI data showed that it had a mechanical problem. Bailey noted that the building management system doesn’t store the data long enough, so the data that comes into PI was what made it possible to determine when the unit was going on and off. They reprogrammed the system so that it would use less chilled water.
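For readers unfamiliar with the metric, PUE (power usage effectiveness) is the ratio of the total energy a facility consumes to the energy consumed by its IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The sketch below shows the arithmetic; the article gives no PUE figures for Livermore, so the function name and all of the numbers are hypothetical.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("it_equipment_kwh must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures for one interval, with the same IT load before and after.
it_kwh = 1000.0
before = pue(total_facility_kwh=1500.0, it_equipment_kwh=it_kwh)
after = pue(total_facility_kwh=1300.0, it_equipment_kwh=it_kwh)

# With IT load held constant, the PUE improvement is pure overhead savings.
overhead_saved_kwh = (before - after) * it_kwh

print(f"PUE before: {before:.2f}, after: {after:.2f}, "
      f"overhead saved: {overhead_saved_kwh:.0f} kWh")
```

Because the IT load is the denominator, trimming chilled water use or fixing a cycling chiller lowers PUE without touching the machines themselves.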
So far, Livermore is OSIsoft’s only customer in the HPC facilities space. Asked about the prospect of her colleagues at other centers deploying the PI system, Bailey said there’s a need, but there’s also the matter of organizational alignment.
“I’m not matrixed into HPC, I live in HPC, so my supervisor is the same supervisor as the system administrator, as the facility operations manager as the system engineer and the system architects. We all are very aligned here,” she said. “What happens at other laboratories is that their facility people or their system engineers are matrixed in from another organization so they are not completely aligned with their line managers so it’s difficult to convince your line management that you really need this because the bottom line affects the program manager; we have the support of him which makes it huge.
“If you don’t have that support, it’s difficult. So that’s what I’ve seen with the other laboratories, a lot of them want to do it, but the way that they are structured it doesn’t allow them to have that complete backing so who’s going to pay for it, right? It always comes down to that. At our organization, we all have the same direction and the same focus so when you have everyone in alignment that this is what they need to improve their projections and to get to exascale as a common goal, you have the backing.”