Refocusing HPC Benchmarking on Total Application Performance
Want to improve application performance by 10x or 100x? Few HPC customers would say no. Yet in some cases, the promises of tremendous performance improvements from accelerators, attached processors, field-programmable gate arrays, and the like evaporate when total application performance is evaluated. Benchmarks that focus on kernel performance can provide important information, but only total application benchmarking can give customers a true picture of how an HPC system will function back in their data center.
Why benchmark?
Benchmarking is an essential means for helping end users choose and configure HPC systems. An end user has a problem and needs to know the best way to solve it. More specifically, the end user has a specific workload to run and needs to find hardware that can deliver the best performance, reliability, application portability, and ease of application maintenance. As Purdue University researchers wrote in a recent IEEE article that argued for real application benchmarking, an HPC benchmark should, among other things, produce metrics that help customers evaluate the overall solution time for their problems.
The claim of a 10x to 100x improvement from a particular product can easily grab someone’s attention. But what does that 10x measurement really mean? In many cases, these claims are derived from kernel benchmarking, which might fail to tell the whole story. While an increase in floating-point performance or the addition of an accelerator could deliver a significant improvement for one kernel, the total application improvement depends on additional HPC system elements. As one participant argued at a recent HPC conference covered by IDC, solution time can be represented as an equation:
Solution time = processing time + memory time + communication time + I/O time
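The consequence of this equation can be made concrete with a small sketch. The component times below are assumed, illustrative numbers, not measurements from any real system; the point is that accelerating only one term leaves the others untouched.

```python
# Illustrative only: these component times are assumed numbers for the sketch,
# not measurements from any real system or benchmark.
processing_time = 40.0     # seconds spent computing
memory_time = 25.0         # seconds waiting on memory
communication_time = 20.0  # seconds in interconnect traffic
io_time = 15.0             # seconds reading and writing files

solution_time = processing_time + memory_time + communication_time + io_time

# A 10x improvement in processing alone shrinks only the first term.
faster_solution_time = (processing_time / 10
                        + memory_time + communication_time + io_time)

total_speedup = solution_time / faster_solution_time
print(f"Baseline: {solution_time:.0f} s, accelerated: {faster_solution_time:.0f} s")
print(f"Total application speedup: {total_speedup:.2f}x")  # 1.56x, far from 10x
```

Under these assumed numbers, a 10x processing speedup yields only about a 1.56x reduction in total solution time, because memory, communication, and I/O still dominate the remaining runtime.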
Kernel benchmarking has its place, but benchmarking total (or “real”) application performance is critical for accurately evaluating HPC systems.
The value of kernel benchmarking
Kernel benchmarking is unlikely to disappear any time soon. For one thing, it is often easier to create and run a benchmark that focuses on a small part of an application than one that measures performance across the entire application. In addition, a kernel benchmark is frequently more portable across systems. Since a key goal of benchmarking is to compare application performance on more than one system or system configuration, having a portable benchmarking test is extremely important.
Kernel benchmarking can also produce valuable information. If a developer has identified application subroutines that are essential for a certain workload, kernel benchmarking can offer a good means of quickly and easily measuring performance for those subroutines.
Still, kernel benchmarking alone will rarely be sufficient for evaluating HPC systems. In some cases, the kernels measured might account for only 30 or 40 percent of total application time. Even if those kernels were made infinitely fast, Amdahl’s law caps the resulting total application improvement at roughly 1.4x or 1.7x, respectively. By evaluating total application performance, the benchmarking team would see those limits; stopping at kernel benchmarking could deliver a deceptive result.
Kernel benchmarking would provide sufficient information only in cases where one workload consumes more than 90 percent of HPC system time and the kernel takes more than 80 percent of the time within a workload. If the ultimate goal of benchmarking is to determine how a particular application will perform in the real world, stopping at the kernel level will not be as helpful as total application benchmarking. The group conducting the test must zoom back out to the application level and analyze how kernel performance will affect overall application performance.
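The ceiling on total application improvement described above is a direct application of Amdahl’s law. A minimal sketch, with the kernel fractions taken from the discussion above and the 10x figure used purely as an illustration:

```python
def amdahl_speedup(kernel_fraction, kernel_speedup):
    """Total application speedup when only a fraction of the runtime is accelerated
    (Amdahl's law). kernel_fraction is the share of total runtime spent in the kernel."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Even with an infinitely fast kernel, a 30-40 percent kernel fraction caps total gains:
for fraction in (0.3, 0.4):
    limit = 1.0 / (1.0 - fraction)  # the limit as kernel_speedup -> infinity
    print(f"kernel = {fraction:.0%} of runtime: at most {limit:.2f}x overall")

# A finite (illustrative) 10x kernel speedup on a 40 percent kernel:
print(f"10x kernel on 40% of runtime: {amdahl_speedup(0.4, 10):.2f}x overall")
```

The function makes the trade-off easy to explore: no matter how impressive the kernel speedup, the portion of the application outside the kernel sets a hard upper bound on total improvement.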
Going beyond the kernel
Why doesn’t total application benchmarking replace kernel benchmarking? Measuring total application performance can be challenging. It takes time, money, and resources to develop a benchmark that accurately reflects a real-world workload. In some cases, it might also be difficult to gain access to proprietary commercial software in order to create a benchmark that can be tested on different platforms.
Measuring overall application performance might also require a large HPC system. An end user might plan to run an application on 1,000 cores, but building such a large system just for benchmarking would be too costly. Consequently, the benchmarking team would need to scale the test down to something smaller and more manageable.
Are there ways to realize the benefits of total application analysis while still capitalizing on the simplicity and manageability of kernel benchmarking? Yes. For example, benchmarking teams can use OS/system-level software tools to accurately assess the amount of time spent in the kernel, I/O, and application. Other tools, such as the Intel® VTune™ Performance Analyzer, can help drill down further into the application itself to identify pain points and determine what percentage of time is spent in certain subroutines. Once the performance of individual components is measured, the benchmarking team can extrapolate from those results and project total application time. By understanding how each component affects total application time, HPC system designers and developers can better identify ways to improve application performance, while end users can better assess the value of certain HPC system components.
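The extrapolation step described above can be sketched in a few lines. The profile fractions and per-component speedups below are hypothetical numbers invented for illustration; in practice the fractions would come from profiling tools and the speedups from component-level benchmarks on the candidate system.

```python
# Hypothetical numbers for illustration only: the profile fractions would come
# from a profiler, and the speedups from component benchmarks on the new system.
profile = {            # fraction of total runtime spent in each component
    "compute": 0.45,
    "memory": 0.25,
    "communication": 0.20,
    "io": 0.10,
}
projected_speedup = {  # measured or projected per-component speedup on the new system
    "compute": 4.0,
    "memory": 1.5,
    "communication": 1.2,
    "io": 1.0,
}

# Scale each component's share of runtime by its speedup, then sum the new shares.
projected_time_fraction = sum(frac / projected_speedup[part]
                              for part, frac in profile.items())
print(f"Projected total application speedup: {1.0 / projected_time_fraction:.2f}x")
```

A projection like this is only as good as its inputs, but it lets a benchmarking team estimate total application time on a large system without building that system first, and it shows exactly which component improvements would matter most.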
Working toward balance
Once we agree that total application performance is what truly matters, the next question is, how can we improve total application performance?
Focusing on total application performance helps to underscore the importance of achieving balance in HPC systems. Unless HPC systems balance processor performance with memory capacity and I/O bandwidth, floating-point improvements or increased core counts will do little to help end users answer bigger questions or solve problems faster. Only balanced systems can help customers attain real, sustainable improvements in HPC performance.
Creating balanced systems can and should be a priority for commercial hardware vendors as well as academics. Clearly there’s nothing wrong with highlighting ways to improve floating-point operations. But how do we enhance performance across the entire system to achieve overall application performance improvements? Tackling the challenges of I/O and memory is essential to designing better systems and helping end users in all fields answer their questions.
At Intel, we are dedicated to delivering balanced HPC systems. We gather important information from customer interactions and from internal research on software applications, and we feed that information back into platform development. For example, we know that memory bandwidth is a critical issue for many HPC applications, from seismic applications to weather forecasting. Consequently, we are about to deliver processors with an integrated memory controller, point-to-point interconnects, and larger and smarter caches. We are also working on a new generation of solid-state disk technology to improve I/O speeds.
We also update the Intel® software tools to help developers ensure support for the latest processors and platforms. At the microarchitecture level, our tools help developers parallelize code and optimize it for new capabilities. These tools help developers to achieve a seamless transition from one processor generation to the next. We also provide cluster tools to help developers take a system-level approach. Our tools can help developers take full advantage of system and network capabilities.
The Intel® Software and Services Group also works closely with end users to ensure that they are evaluating all the necessary HPC elements for the workload they are testing. With in-depth knowledge of HPC applications, we can help end users characterize applications so they understand the big picture.
Giving them what they need
For end users, the message about kernel benchmarking is simple: Be skeptical of the hype. Look past “wow” numbers derived from kernel benchmarking and make sure you understand how each new product will really affect your overall application performance today and how you can best protect your investment for tomorrow.
Meanwhile, commercial vendors and those in academia should remember to focus on what end users really need. They have specific problems that they want to solve, and they need to know the best ways to find the answers. To design better HPC systems and to help end users succeed at their goals, we need to deliver balanced systems that can boost total application performance, not just the performance of a single kernel.
Lance Shuler is a senior manager, applications enabling in the Intel® Software and Services Group. He has 15 years of experience in the areas of high-performance computing and workstation application optimization, with special focus on manufacturing and oil and gas industries.
Tyler Thessin is director of the Intel® Performance Libraries Lab in the Intel® Software and Services Group. He is responsible for various high-performance software developer library products (e.g., Intel® Math Kernel Library and Intel® Integrated Performance Primitives) as well as technologies and products for future multi-core and many-core processor-based systems. He has more than 21 years of experience developing and managing software products.