Cray Returns to NERSC: A Q&A with Bill Kramer

By Steve Conway

August 18, 2006

The DOE's National Energy Research Scientific Computing Center (NERSC) recently announced that Cray won the bid to deliver a 100-teraflop, $52 million supercomputer that can be significantly expanded over time. Bill Kramer, general manager of NERSC, talked with HPCwire about the evaluation criteria and how Cray came out ahead in performance and price/performance.

HPCwire: Talk about the NERSC user base and workload.

Kramer: NERSC's mission is to support not just big science, but the entire range of open science. We serve the largest, most diverse group of users within the DOE. Each year, we support about 2,500 users working on 300-400 projects. Our users produce 1,200-1,400 peer-reviewed scientific papers every year.

HPCwire: Does that include the INCITE program?

Kramer: Absolutely. We've been involved in INCITE since the program began four years ago. The DOE approached NERSC to be the sole site for the prototype program, which was called “Big Splash,” and that worked out well enough to justify the larger INCITE program. In 2005, 15 percent of our cycles were dedicated to three INCITE projects.

HPCwire: Other than your big, diverse user base, what sets NERSC apart from other DOE centers and labs?

Kramer: We're leaders in extreme scaling with high utilization. We've run codes efficiently on our full 6,000-CPU “Seaborg” Power system, and in 2003 we established our Scaling Project, with the aim of making it possible to run codes on tens of thousands of processors, and eventually hundreds of thousands of processors, with high utilization. The project identified codes with good potential for running at very large scales, and also identified the bottlenecks that would need to be addressed for the codes to scale efficiently. This work is also important for the INCITE program.

HPCwire: You have some of the world's best performance evaluation experts, people like Lenny Oliker and Kathy Yelick.

Kramer: We're fortunate to have access to them. On the performance modeling side, we have Erich Strohmaier at NERSC, and we work with Allan Snavely in San Diego.

HPCwire: How often does NERSC procure big systems?

Kramer: We go through a major, full-competitive system procurement every three years or so. This is the biggest investment we make as a center.

HPCwire: Can you describe the benchmarks you used in your latest procurement?

Kramer: They're our own. We call them the Sustained System Performance Metric, or SSP for short. The suite includes seven key applications drawn from NERSC's workload. We ask vendors to run the applications at three different sizes, from 256 to 2048 CPUs, then we combine the system's performance by taking the geometric mean of the performance rates, without any weighting.
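As a rough illustration of the combining step Kramer describes, the sketch below takes an unweighted geometric mean of a set of performance rates. The function name, the units and the sample rates are assumptions made for the example; they are not NERSC's actual benchmark code or data.

```python
import math


def geometric_mean(rates):
    """Unweighted geometric mean of positive performance rates."""
    if not rates:
        raise ValueError("need at least one rate")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive")
    # Average the logarithms, then exponentiate. This equals the n-th root
    # of the product but avoids overflow when there are many rates.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Hypothetical per-CPU rates (gigaflops), one per application run.
example_rates = [0.5, 1.0, 2.0]
per_cpu_ssp = geometric_mean(example_rates)
print(f"per-CPU SSP: {per_cpu_ssp:.3f} gigaflops")
```

One property of a geometric mean is that a single very fast application cannot mask poor results on the others as easily as it could with an arithmetic mean.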

HPCwire: Does the mix of applications ever change?

Kramer: Each procurement is different as the workload changes, but there is also consistency. Usage by some disciplines grows and usage by others shrinks. Astrophysics is getting more emphasis right now, for example, and climate temporarily lessened a bit after an important phase of IPCC computational work was completed. At any given time, the SSP metric represents 85 percent of our disciplines.

HPCwire: How do you account for price to come up with price/performance scores?

Kramer: We multiply the sustained SSP performance per CPU by the number of CPUs in the proposed system, and we look at this over a three-year period, since a system may evolve over the first several years it is installed. This gives us a figure for the total number of sustained teraflops we can expect from the system during the three years until our next major procurement. We then divide this SSP number by the total cost of the system, which includes software and electricity but not my staff costs for the three years. The result is expressed as teraflop-years per dollar.
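The arithmetic Kramer outlines can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are invented for the example, the system is modeled as a list of phases with a constant SSP in each phase, and all figures are made up rather than taken from any vendor proposal.

```python
def system_ssp_tflops(per_cpu_ssp_gflops, num_cpus):
    """Scale a per-CPU SSP (gigaflops) to a whole-system SSP (teraflops)."""
    return per_cpu_ssp_gflops * num_cpus / 1000.0


def teraflop_years(phases):
    """Sum SSP over time.

    phases is a list of (ssp_tflops, years) pairs, one pair per stage of
    the system's evolution during the evaluation period.
    """
    return sum(ssp * years for ssp, years in phases)


def price_performance(phases, total_cost_dollars):
    """Sustained teraflop-years delivered per dollar of total cost."""
    if total_cost_dollars <= 0:
        raise ValueError("total cost must be positive")
    return teraflop_years(phases) / total_cost_dollars


# Hypothetical system: 10 sustained teraflops in year one, then an upgrade
# to 16 sustained teraflops for the remaining two years of the period.
phases = [(10.0, 1.0), (16.0, 2.0)]
hypothetical_cost = 50_000_000  # hardware, software and electricity (assumed)

print(teraflop_years(phases))                        # sustained teraflop-years
print(price_performance(phases, hypothetical_cost))  # teraflop-years per dollar
```

Modeling the period as phases reflects the point that a system may evolve during its first several years, so a later upgrade contributes only for the time it is actually installed.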

HPCwire: And Cray won.

Kramer: Cray came out ahead on both total performance and price/performance. We're not disclosing price/performance results for any of the vendors, but on the performance side we were looking for between 7.5 and 10 sustained teraflops across our SSP benchmarks. The Cray system proposed was above that range, which we thought was very impressive. Now that we have finished our discussions with Cray, we expect the system to have an SSP of 16.1 teraflops.

HPCwire: Were you satisfied with your SSP benchmarks themselves?

Kramer: There's always room for improvement, but overall the SSP benchmarks did a nice job of meeting our requirements. We think benchmarks have four purposes. First, they need to be able to evaluate candidate systems. Second, they need to be able to validate that you're getting the system performance you expected to get. Third, they need to be able to do this throughout the machine's lifetime. We have seen cases where system performance degrades over time, and unless you have a formal way to monitor it, it gets complicated figuring out what happened. Finally, benchmarks should be able to provide guidance, especially to vendors, for designing future systems. Most well-known benchmarks address mainly the first purpose, but the SSP does the first three very well, and can be used for the fourth.

HPCwire: Do you collaborate with anyone else in constructing your SSP benchmarks?

Kramer: We're now coordinating across agencies for the first time. That includes the DOE, NSF and the DOD MOD program. You might remember that the NRC and HECRTF reports recommended that agencies coordinate more of their activities. The goal here is to identify areas of overlap, where we can use common benchmarks. This can save time in procurements for us and for the vendors. We've looked at this with the other agencies. Right now, the overlap isn't large because our users' needs are fairly different, but we did adopt some things in common. For example, for chemistry we're using the DOD's GAMESS application in our benchmark suite. More of this will happen over time.

HPCwire: Did you look at using any synthetic benchmarks, such as the ones Allan Snavely and his team have been working on?

Kramer: We're certainly open to that. For now, we agree with the DOD MOD program that synthetic benchmarks have a lot of promise but are too new to rely heavily on for making procurement decisions. The accuracy of modeling needs to be within five percent to be useful in making a real decision. Once there is stronger correlation with actual application results, synthetic benchmarks will have a more important role to play.
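The five percent threshold Kramer mentions can be read as a bound on relative error between a model's prediction and the measured application result. The sketch below assumes that reading; the function name and the sample values are hypothetical.

```python
def within_tolerance(predicted, measured, tolerance=0.05):
    """Return True if the prediction is within `tolerance` of the measurement.

    The error is taken relative to the measured value, which is an
    assumption about how the five percent figure would be applied.
    """
    if measured <= 0:
        raise ValueError("measured value must be positive")
    return abs(predicted - measured) / measured <= tolerance


# Hypothetical run times in seconds: model prediction vs. actual run.
print(within_tolerance(predicted=104.0, measured=100.0))  # 4 percent off
print(within_tolerance(predicted=110.0, measured=100.0))  # 10 percent off
```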

HPCwire: Do the vendors also benefit from going through a procurement process like yours?

Kramer: I think there have been some important benefits for Cray. The process helped to crystallize Cray's software roadmap. For example, Cray is planning to use Berkeley Lab's checkpoint/restart as the basis for their implementation.

NERSC and Cray will also benefit in the area of the new petascale I/O interface. It will take about 12 to 18 months of work to fully realize, but then Cray will be able to integrate their system with the NERSC Global Filesystem, a high performance, facility-wide file system, based on the GPFS system, that we're using with all of our architectures. This will make Cray more portable and better able to integrate into a variety of existing environments. Our contract calls for the establishment of a Cray Center of Excellence at NERSC. The first two areas that will be addressed through this collaborative center are system management and storage management.

HPCwire: In closing, how would you characterize this procurement?

Kramer: We think it worked really well. We're getting an excellent system from Cray that will have a tremendous positive impact on our users. As measured on our SSP benchmark, the Cray system will boost our computational power by 9x over today. That's a major improvement. It will also be fun to work with Cray to help make the new “Hood” system really solid for NERSC and the broader community.

Topics: Systems

Sectors: Academia & Research, Government
