Historically, compute has gotten faster and storage has struggled to keep up. This year, however, that balance is shifting. Moore's Law is grinding to a halt, which means compute gains are slowing down, while NVMe and DIMM-mounted flash are becoming more affordable, which means storage is catching up.
Systems will get even more complex as new storage products are brought out to close the compute-storage performance gap and new compute architectures aim to extend Moore's Law. Soon we will have a wide range of options at wide-ranging prices, as hybrid cloud designs replace the steady technology gains we have enjoyed for so long.
However, when many users are accessing shared storage, it can be all too easy to overload the system. Users can save files in the wrong place or submit jobs that take up the entire bandwidth by mistake. This sort of activity will reduce the performance of even the newest IT system.
As a result, it is becoming more important than ever for organisations involved in HPC to know exactly what users and their applications are doing within shared storage systems. Faster, more complex systems mean that I/O problems will only be magnified. Legacy applications can cause real problems if you try to migrate them without first untangling all their historical dependencies and network links. Bad I/O can drag a system's performance down dramatically, wasting the potential of an upgraded IT system, whether that's on-premises or in the cloud. Monitoring becomes vital.
Knowing what your users are doing and giving them easy access to storage profiles is also key to making the right compute and storage choices for applications, both in hybrid cloud and on-prem. In the short term, system telemetry can be used to get the most out of existing systems; in the longer term, it can inform the right choices when procuring for the future.
The industry needs solutions that help organisations get the most out of their compute architecture. Thankfully, a few tools are starting to emerge that can make monitoring far simpler. One such offering comes from Ellexus, the I/O profiling company, in the form of a range of tools with an accompanying IBM reference architecture.
A case study: EDA software vendor overloading shared storage
One customer approached Ellexus with this problem. The customer is a software vendor in the EDA industry that runs its own HPC cluster for continuous integration testing. They wanted to protect their IBM Spectrum Scale storage in-house, but also to make sure that bad I/O patterns were not passed on to their customers.
The customer has a host-leaf architecture with about 1,000 compute nodes in total and eight shared filers. The host nodes have a high-speed network connection to the IBM Spectrum Scale storage, while the leaf nodes mount the file system over NFS on a slower network. Users are supposed to aggregate data on the host node and write to the storage from there, but this doesn't always happen. Distributed writes from the leaf nodes can overload the storage; for example, when users leave the debug flag on when handing over a job to be run at scale, both host and leaf nodes generate a lot of extra data.
After consulting with Ellexus, the team deployed Mistral, Ellexus' tool for continuous monitoring and live system telemetry, by integrating it with IBM Spectrum LSF. Mistral profiles I/O and sends an alert if a job exceeds an I/O bandwidth or metadata limit. That alert contains information about the area of storage affected, as well as the IBM Spectrum LSF job ID, hostname, MPI rank and the program that carried out the most I/O.
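To make the idea concrete, the check described above can be sketched as a small per-job threshold monitor. This is an illustrative sketch only, not Mistral's actual implementation or configuration format; all names, fields and limit values here are assumptions.

```python
# Hypothetical sketch of a per-job I/O threshold check of the kind Mistral
# performs. Field names and limits are illustrative, not Mistral's real API.
from dataclasses import dataclass

@dataclass
class JobIOSample:
    job_id: str              # e.g. the IBM Spectrum LSF job ID
    hostname: str
    mpi_rank: int
    program: str             # program that carried out the I/O
    mount_point: str         # area of storage affected
    write_mb_per_s: float
    metadata_ops_per_s: float

BANDWIDTH_LIMIT_MB_S = 500.0   # illustrative limits, set per site policy
METADATA_LIMIT_OPS_S = 2000.0

def check_limits(sample: JobIOSample) -> list[str]:
    """Return an alert message for each limit the sample exceeds."""
    alerts = []
    if sample.write_mb_per_s > BANDWIDTH_LIMIT_MB_S:
        alerts.append(
            f"bandwidth alert: job {sample.job_id} on {sample.hostname} "
            f"(rank {sample.mpi_rank}, {sample.program}) wrote "
            f"{sample.write_mb_per_s:.0f} MB/s to {sample.mount_point}")
    if sample.metadata_ops_per_s > METADATA_LIMIT_OPS_S:
        alerts.append(
            f"metadata alert: job {sample.job_id} on {sample.hostname} "
            f"(rank {sample.mpi_rank}, {sample.program}) issued "
            f"{sample.metadata_ops_per_s:.0f} metadata ops/s "
            f"on {sample.mount_point}")
    return alerts

# A leaf node writing directly to the shared filer at high bandwidth
sample = JobIOSample("123456", "leaf-042", 3, "simulate",
                     "/mnt/scale", 750.0, 150.0)
for msg in check_limits(sample):
    print(msg)
```

The point of carrying the job ID, hostname, rank and program name in every alert is that an administrator can trace a storage slowdown straight back to the offending job rather than hunting across a thousand nodes.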
By deploying Mistral, the team wanted to provide IT managers with a way to profile applications before they were run at scale. This would integrate with their existing fault management framework and provide users with the storage profile of their applications so they could find performance solutions quickly.
The Ellexus-IBM reference architecture
The customer team deployed Mistral to all production jobs using IBM Spectrum LSF and the reference architecture developed by Ellexus with IBM. This reference architecture combines the analytics already available in the IBM Spectrum product suite with Ellexus' knowledge of job-level I/O patterns and their impact on the storage.
To make life easy for users, IBM has made a Mistral plugin for the IBM Spectrum LSF RTM job reporter. IBM Spectrum LSF RTM puts all the analytics from IBM Spectrum LSF and Mistral in one place, giving an overview of all jobs as well as a way to zoom in on a single job for more detail. For customers who use Elasticsearch or Grafana, there are plugins for those dashboards too, giving a similar overview with the ability to drill down into jobs, users and projects.
When combined with the IBM Spectrum Computing solutions, Mistral gives real insight into the system. In the case of this customer, applications that run as part of the R&D workflow are periodically checked with a more detailed I/O healthcheck queue that has more alerts and can give a more detailed breakdown of good and bad I/O.
The customer has seen a dramatic reduction in the time taken to resolve a problem as well as the number of rogue jobs submitted. Now that users have concrete information about what causes a problem, they are more proactive about preventing bad I/O.
Monitoring is only going to increase in importance as compute and storage get more complex. No doubt more solutions will appear over time to help organisations ensure they get the highest performance from their increasingly complex architectures.
To find out more about Mistral and the IBM Spectrum LSF reference architecture, visit the IBM Global Solutions Directory. For more on Mistral and other I/O profiling solutions, visit the Ellexus website.