HPC@Intel: Go Beyond the Kernel

By Lance Shuler and Tyler Thessin

March 5, 2009

Refocusing HPC Benchmarking on Total Application Performance

Want to improve application performance by 10x or 100x? Few HPC customers would say no. Yet in some cases, the promises of tremendous performance improvements from accelerators, attached processors, field-programmable gate arrays, and the like evaporate when total application performance is evaluated. Benchmarks that focus on kernel performance can provide important information, but only total application benchmarking can give customers a true picture of how an HPC system will function back in their data center.

Why benchmark?

Benchmarking is an essential means for helping end users choose and configure HPC systems. An end user has a problem and needs to know the best way to solve it. More specifically, the end user has a specific workload to run and needs to find hardware that can deliver the best performance, reliability, application portability, and ease of application maintenance. As Purdue University researchers wrote in a recent IEEE article that argued for real application benchmarking, an HPC benchmark should, among other things, produce metrics that help customers evaluate the overall solution time for their problems.

The claim of a 10x to 100x improvement from a particular product can easily grab someone’s attention. But what does that 10x measurement really mean? In many cases, these claims are derived from kernel benchmarking, which might fail to tell the whole story. While an increase in floating-point performance or the addition of a CPU accelerator could deliver a significant improvement for one kernel, the total application improvement depends on additional HPC system elements. As one participant argued at a recent HPC conference covered by IDC, solution time can be represented as an equation:

 Solution time = processing time + memory time + communication time + I/O time
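
To make the decomposition concrete, here is a minimal sketch in Python, using purely hypothetical component times, of how a headline 10x gain in processing time alone translates into total solution time:

    # Hypothetical per-run component times, in seconds (illustration only).
    processing_time = 600.0
    memory_time = 250.0
    communication_time = 100.0
    io_time = 50.0

    solution_time = processing_time + memory_time + communication_time + io_time

    # Suppose an accelerator makes the processing portion 10x faster while
    # memory, communication, and I/O times stay unchanged.
    accelerated_time = processing_time / 10 + memory_time + communication_time + io_time

    print(f"Baseline solution time:    {solution_time:.0f} s")     # 1000 s
    print(f"Accelerated solution time: {accelerated_time:.0f} s")  # 460 s
    print(f"Overall speedup:           {solution_time / accelerated_time:.2f}x")  # ~2.17x, not 10x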

Kernel benchmarking has its place, but benchmarking total (or “real”) application performance is critical for accurately evaluating HPC systems.

The value of kernel benchmarking

Kernel benchmarking is unlikely to disappear any time soon. For one thing, it is often easier to create and run a benchmark that focuses on a small part of an application than one that measures performance across the entire application. In addition, a kernel benchmark is frequently more portable across systems. Since a key goal of benchmarking is to compare application performance on more than one system or system configuration, having a portable benchmarking test is extremely important.

Kernel benchmarking can also produce valuable information. If a developer has identified application subroutines that are essential for a certain workload, kernel benchmarking can offer a good means of quickly and easily measuring performance for those subroutines.

Still, kernel benchmarking alone will rarely be sufficient for evaluating HPC systems. In some cases, the kernels measured might make up only 30 or 40 percent of the total application. By evaluating the total application performance, the benchmarking team would see that improvements in those kernels might result in total application improvements of only 1.4x and 1.6x. Stopping at kernel benchmarking could deliver a deceptive result.
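
The arithmetic behind those figures follows from Amdahl’s Law. Assuming, purely for illustration, a 10x speedup on the kernel itself, and letting f be the kernel’s share of total run time and s the kernel speedup:

 Overall speedup = 1 / ((1 - f) + f / s)

 f = 0.30, s = 10: 1 / (0.70 + 0.03) ≈ 1.4x
 f = 0.40, s = 10: 1 / (0.60 + 0.04) ≈ 1.6x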

Kernel benchmarking would provide sufficient information only in cases where one workload consumes more than 90 percent of HPC system time and the kernel takes more than 80 percent of the time within a workload. If the ultimate goal of benchmarking is to determine how a particular application will perform in the real world, stopping at the kernel level will not be as helpful as total application benchmarking. The group conducting the test must zoom back out to the application level and analyze how kernel performance will affect overall application performance.

Going beyond the kernel

Why doesn’t total application benchmarking replace kernel benchmarking? Measuring total application performance can be challenging. It takes time, money, and resources to develop a benchmark that accurately reflects a real-world workload. In some cases, it might also be difficult to gain access to proprietary commercial software in order to create a benchmark that can be tested on different platforms.

Measuring overall application performance might also require a large HPC system. An end user might plan to run an application on 1,000 cores, but building such a large system just for benchmarking would be too costly. Consequently, the benchmarking team would need to scale the test down to something smaller and more manageable.

Are there ways to realize the benefits of total application analysis while still capitalizing on the simplicity and manageability of kernel benchmarking? Yes. For example, benchmarking teams can use OS- and system-level software tools to accurately assess how much time is spent in the kernel, in I/O, and in the application. Other tools, such as the Intel® VTune™ Performance Analyzer, can drill down further into the application itself to identify pain points and determine what percentage of time is spent in particular subroutines. Once the performance of individual components is measured, the benchmarking team can extrapolate from those results to project total application time. By understanding how each component affects total application time, HPC system designers and developers can better identify ways to improve application performance, and end users can better assess the value of specific HPC system components.
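
As a rough illustration of that extrapolation step, the following sketch uses hypothetical profile fractions and projected speedups (not measurements from any real system) to project total application performance:

    # Hypothetical breakdown of total application time, e.g. as reported by
    # OS-level tools or a profiler; fractions must sum to 1.0.
    time_fractions = {
        "compute kernels": 0.40,
        "memory-bound code": 0.30,
        "MPI communication": 0.20,
        "I/O": 0.10,
    }

    # Projected per-component speedups from a proposed change (assumed values
    # for illustration; untouched components stay at 1.0).
    projected_speedups = {
        "compute kernels": 10.0,
        "memory-bound code": 1.2,
        "MPI communication": 1.0,
        "I/O": 1.0,
    }

    # Project the new run time as a fraction of the old one, then invert it
    # to get the overall application speedup.
    projected_time = sum(fraction / projected_speedups[name]
                         for name, fraction in time_fractions.items())
    overall_speedup = 1.0 / projected_time

    for name, fraction in time_fractions.items():
        print(f"{name:18s} {fraction:4.0%} of run time, "
              f"projected {projected_speedups[name]:.1f}x faster")
    print(f"Projected total application speedup: {overall_speedup:.2f}x")  # ~1.69x

Projections like this are only as good as the underlying measurements, which is why accurate component-level profiling matters.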

Working toward balance

Once we agree that total application performance is what truly matters, the next question is, how can we improve total application performance?

Focusing on total application performance helps to underscore the importance of achieving balance in HPC systems. Unless HPC systems balance processor performance with memory capacity and I/O bandwidth, floating-point improvements or increased core counts will do little to help end users answer bigger questions or solve problems faster. Only balanced systems can help customers attain real, sustainable improvements in HPC performance.

Creating balanced systems can and should be a priority for commercial hardware vendors as well as academics. Clearly there’s nothing wrong with highlighting ways to improve floating-point operations. But how do we enhance performance across the entire system to achieve overall application performance improvements? Tackling the challenges of I/O and memory is essential to designing better systems and helping end users in all fields answer their questions.

At Intel, we are dedicated to delivering balanced HPC systems. We gather important information from customer interactions and from internal research on software applications, and we feed that information back into platform development. For example, we know that memory bandwidth is a critical issue for many HPC applications, from seismic applications to weather forecasting. Consequently, we are about to deliver processors with an integrated memory controller, point-to-point interconnects, and larger and smarter caches. We are also working on a new generation of solid-state disk technology to improve I/O speeds.

We also update the Intel® software tools to support the latest processors and platforms. At the microarchitecture level, our tools help developers parallelize code and optimize it for new capabilities, easing the transition from one processor generation to the next. We also provide cluster tools that take a system-level approach, helping developers take full advantage of system and network capabilities.

The Intel® Software and Services Group also works closely with end users to ensure that they are evaluating all the necessary HPC elements for the workload they are testing. With in-depth knowledge of HPC applications, we can help end users characterize applications so they understand the big picture.

Giving them what they need

For end users, the message about kernel benchmarking is simple: Be skeptical of the hype. Look past “wow” numbers derived from kernel benchmarking and make sure you understand how each new product will really affect your overall application performance today and how you can best protect your investment for tomorrow.

Meanwhile, commercial vendors and those in academia should remember to focus on what end users really need. End users have specific problems they want to solve, and they need to know the best ways to find the answers. To design better HPC systems and to help end users succeed at their goals, we need to deliver balanced systems that can boost total application performance, not just the performance of a single kernel.

Lance Shuler is a senior manager, applications enabling, in the Intel® Software and Services Group. He has 15 years of experience in high-performance computing and workstation application optimization, with a special focus on the manufacturing and oil and gas industries.

Tyler Thessin is director of the Intel® Performance Libraries Lab in the Intel® Software and Services Group. He is responsible for various high-performance software developer library products (e.g., Intel® Math Kernel Library and Intel® Integrated Performance Primitives) as well as technologies and products for future multi-core and many-core processor-based systems. He has more than 21 years of experience developing and managing software products.
