How Meltdown and Spectre Patches Will Affect HPC Workloads

By Rosemary Francis

January 10, 2018

There have been claims that the fixes for the Meltdown and Spectre security vulnerabilities, in particular the KPTI (aka KAISER) patches that mitigate Meltdown, are going to affect application performance by 10-30 percent. The patch makes any call from user space into the operating system much more expensive, so I/O-intensive applications are likely to be the worst hit. What does this really mean for HPC workloads?

Optimisation has always been important to HPC, but the new patches have moved the goalposts. Profiling applications and how they access data on disk and over the network is going to be key to getting a handle on what the worst problems are going to be. Losing up to a third of your compute capacity is an understandably scary prospect, but this doesn’t have to be a thoroughly lose-lose situation.

Consider the following aspects of your workloads now to mitigate potential performance damage as much as possible.

1.     My workflows use third-party tools – there’s nothing I can do, right?

It’s common to use a mix of third-party and in-house tools. Optimising in-house applications is a resourcing issue, but possible. When third-party tools have performance problems, you are often on your own, even if they are open source.

It is, however, still worth investigating performance drops in third-party workflows. As well as being able to feed data back to your vendor or to the user community, there are always changes you can make to the way a program runs that affect its I/O. For example, do you use environment variables to configure that application? Long PATH variables and similar settings can cause applications to trawl the file system with sequences of failed open() or stat() calls many times in a single execution.

These failed I/O calls or distributed accesses cause problems for shared file systems even before you have installed the KPTI patch. Once the patch has been applied it’s likely that the effect of those unnecessary metadata operations will increase for the program performing them as well as for the shared file system struggling under the load.
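As a rough illustration of the PATH problem, the sketch below counts how many directories are probed unsuccessfully before a command is found. The function name `failed_lookups_before_hit` is hypothetical, and the logic is only an approximation of what a shell or launcher does (for example, empty PATH entries are ignored here); each probe corresponds to roughly one stat()-style call.

```python
import os


def failed_lookups_before_hit(command, path=None):
    """Return (misses, hit) for a PATH-style search.

    misses: number of directories probed without finding `command`
    hit:    full path of the first executable match, or None
    """
    if path is None:
        path = os.environ.get("PATH", "")
    misses = 0
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # empty entries are skipped in this simplified sketch
        candidate = os.path.join(directory, command)
        # Each probe costs about one metadata call against the file system.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return misses, candidate
        misses += 1
    return misses, None


if __name__ == "__main__":
    misses, hit = failed_lookups_before_hit("ls")
    print(f"{misses} failed probes before finding {hit}")
```

If the number of misses is large for the tools a job launches most often, reordering or trimming the PATH in the wrapper script is a cheap experiment to try.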

So even if you have a workflow with just a wrapper script to launch it and a single third-party binary that won’t ever be updated, it is still a very good idea to work out where you are spending your time. Even something as simple as moving temporary files from shared to local storage is likely to give you a win.

2.     Genome pipelines and other fans of small files

Genome pipelines are well known for using a lot of small files to map DNA segments against a reference genome. Other HPC applications in oil and gas, EDA and finance sometimes do the same. Part of the reason for this is a need to honour legacy working practices because in science it can be hard to prove the validity of your work if it is using entirely new software, but it’s also a limitation of the algorithms used in these applications.

Small files necessarily mean small I/O and lots of metadata, but worse than that, often small files are accessed in smaller chunks, further exacerbating the problem. Why is this a problem? Let’s assume that small I/O is anything under 32KB, depending on the architecture of your system. Files under 4KB can be easily written off as an unsolvable problem, but if you access the data in less than 4KB blocks then there is definitely scope for the problem to be a lot worse than it needs to be. Closing a file after every small write or checking the existence of every file before opening it will also magnify the performance impact of the security patches.
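The two habits mentioned above, checking existence before opening and closing after every small write, can be sketched side by side. Both function names are hypothetical and the record format is invented for illustration; the point is only the difference in the number of open, close and metadata calls per record.

```python
import os


def append_records_chatty(path, records):
    """Anti-pattern: an existence check plus open/close around every record."""
    for record in records:
        if not os.path.exists(path):       # extra metadata call per record
            open(path, "w").close()
        with open(path, "a") as handle:    # open + write + close per record
            handle.write(record + "\n")


def append_records_batched(path, records):
    """Open once and let the buffered file object coalesce the small writes."""
    with open(path, "a") as handle:        # mode "a" creates the file if missing
        for record in records:
            handle.write(record + "\n")
```

Both functions leave the same bytes on disk, but the second issues one open and one close in total, and the writes reach the kernel in larger blocks.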

Small I/O isn’t limited to applications that use small files. Older libraries and code that hasn’t been profiled will often use system calls that perform small reads and writes. Sometimes gigabytes of data are read or written one byte at a time. This behaviour is very common in almost all compute environments and will be catastrophic for local performance as well as any shared file system or database.
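To make the cost of byte-at-a-time access concrete, here is a minimal sketch that counts how often the lowest layer is asked for data. `CountingRaw` is a hypothetical in-memory stand-in for a file descriptor: each call to its `readinto()` represents one read() system call. The same data is then consumed one byte at a time, with and without a user-space buffer in between.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts readinto() calls.

    Each readinto() call stands in for one read() system call.
    """

    def __init__(self, data):
        self._data = memoryview(data)
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        self.calls += 1
        n = min(len(buffer), len(self._data) - self._pos)
        buffer[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n  # 0 signals end of file


def count_reads_unbuffered(data):
    """Consume `data` one byte at a time straight from the raw stream."""
    raw = CountingRaw(data)
    while raw.read(1):
        pass
    return raw.calls


def count_reads_buffered(data, buffer_size=64 * 1024):
    """Consume `data` one byte at a time through a user-space buffer."""
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(1):
        pass
    return raw.calls
```

Without the buffer, the number of low-level reads grows with the number of bytes; with it, the count depends on the buffer size instead, which is why wrapping or replacing an old byte-wise reader is often an easy win.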

3.     Will the performance drop caused by the patches impact local or shared storage the most?

Reports so far of the impact of the patches on shared file systems have not been good. In time, those who maintain the file system will be able to make some performance improvements based on the new compute landscape, but no one is going to be able to escape the performance drop entirely. Shared storage is an important part of most HPC clusters and cloud infrastructure. When accessing anything remotely there will be a performance hit over local storage, but the KPTI patch will be hitting performance at both ends.

Some workloads will be slowed down mainly by their own behaviour and I/O patterns. This is good news for the shared file system because anything that throttles accesses gives the file system a bit more time to keep up.

Unfortunately, I/O-intensive workloads that have bad local performance are also likely to be those hitting the file system the hardest and will be the worst affected by any slowdown of the file system. The only way to know the impact of the patch on your cluster and workloads is to try it and measure it; the complexity of modern HPC systems means that an effect on one resource cannot predict the performance of another.
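A very crude way to start measuring is to time a cheap metadata call in a loop on the same node before and after patching, once against local storage and once against a shared mount. The sketch below is only that: the function name `mean_stat_latency` is hypothetical, the figure includes Python interpreter overhead, and it is no substitute for profiling a real application.

```python
import os
import time


def mean_stat_latency(path=".", iterations=100_000):
    """Return the average wall-clock seconds per os.stat() call on `path`."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    # Point this at a local directory and at a shared mount to compare.
    print(f"{mean_stat_latency('.') * 1e6:.2f} microseconds per stat()")
```

Comparing the numbers from patched and unpatched kernels gives a first indication of how much each round trip into the operating system has changed on your hardware.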

4.     My application doesn’t do that much I/O so do I still have to worry?

Much has been made of the effect the KPTI patches will have on I/O performance, but the impact will be seen on all system calls. This means calls such as gettimeofday() will get more expensive. Applications with strict timing constraints will make lots of such calls and may well have poorly constructed timing constraints broken by the new delays in accessing even small amounts of data.

Ask yourself, does your program really need that fancy progress bar? You could be paying a lot more for features like this in the future.
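If the progress bar has to stay, one option is to consult the clock far less often. The sketch below is a hypothetical `ThrottledProgress` helper that only asks for the time every `check_every` steps and only reports when at least `min_interval` seconds have passed. Whether a given timing call actually enters the kernel depends on the platform, so treat the saving as something to measure, not assume.

```python
import time


class ThrottledProgress:
    """Progress reporter that consults the clock only every `check_every` steps."""

    def __init__(self, total, check_every=1000, min_interval=1.0,
                 clock=time.monotonic):
        self.total = total
        self.check_every = check_every
        self.min_interval = min_interval
        self.clock = clock
        self.done = 0
        self.reports = []          # step counts at which a report was made
        self.clock_calls = 1       # the call on the next line
        self._last_report = clock()

    def step(self):
        self.done += 1
        finished = self.done == self.total
        if self.done % self.check_every != 0 and not finished:
            return  # common case: no timing call at all
        now = self.clock()
        self.clock_calls += 1
        if finished or now - self._last_report >= self.min_interval:
            self._last_report = now
            self.reports.append(self.done)


if __name__ == "__main__":
    progress = ThrottledProgress(total=1_000_000)
    for _ in range(progress.total):
        progress.step()
    print(f"clock consulted {progress.clock_calls} times for {progress.done} steps")
```

The `reports` list stands in for whatever drawing code the real progress bar would run; the design choice is simply to make the expensive check proportional to elapsed work, not to every iteration.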

5.     Is MPI I/O better or worse?

HPC applications don’t just do POSIX I/O: MPI libraries are a popular way of sharing data and coordinating applications across many thousands of compute ranks. All MPI libraries use POSIX I/O under the hood, which is perhaps not a surprise to many, but what will surprise some is the way they do it.

MPI libraries have evolved over time and seemingly similar calls can have very different implementations with varying reliance on small system calls. We are back to discussing small I/O because high-level constructs such as tables and matrices of data are often accessed with very small reads and writes.

The good news is that many MPI libraries are binary compatible for the most part, so changing the library once a performance problem has been identified is not as difficult as optimising other types of I/O, but it is something that you may not have control over. Artificial benchmarks such as IOR that let you compare MPI libraries are unlikely to give you much insight into the real impact of the KPTI patch because real workloads are so different from the orderly I/O that those benchmarks stress. Again, the only way to find out where time is being spent is to profile a real application.

6.     Can I escape the problem by moving my workloads to the cloud?

Unfortunately, virtualised workloads are still affected by the vulnerability and need to be patched. That goes for containerised workloads as well.

Moving applications to the cloud usually involves some kind of re-architecture to reduce the reliance on shared storage and to take advantage of the high-performance and low-cost storage options available. Getting a handle on how your applications use storage and where their dependencies are is part of this process, so optimising for the KPTI patch is work that can almost come for free as part of the efforts to embrace the future. Some I/O patterns will be worse in the cloud and some will be better, but that was true before anyone knew about Meltdown and Spectre.

So all in all, HPC applications are going to be affected by the patches in varying degrees. The cheerful news is that a lot of the impact on performance can be mitigated by optimisation efforts where the resources are available. Using third-party tools and libraries doesn’t render you helpless – there still might be room for easy wins in performance.

Let’s hope that the situation doesn’t become even more complicated as the HPC industry works to come up with more solutions. Given the complexity of most HPC workloads and systems already, anything that can be done to simplify and optimise systems rather than add layers of fixes will surely be good for future growth.

About the Author

Dr. Rosemary Francis is CEO and founder of Ellexus, the I/O profiling company (www.ellexus.com). Ellexus makes application profiling and monitoring tools that can be run on a live compute cluster to protect from rogue jobs and noisy neighbours, make cloud migration easy and allow a cluster to be scaled rapidly. The system- and storage-agnostic tools provide end-to-end visibility into exactly what applications and users are up to. We don’t just give you data about what your programs are doing; our tools include expertise on what is going wrong and how you can fix it.
