Tiering in HPC storage has a bad rep. No one likes it. It complicates things and slows I/O. At least one storage technology newcomer, VAST Data, advocates dumping the whole idea. One large-scale user, NERSC storage architect Glenn Lockwood, sort of agrees. The challenge, of course, is that tiering is a practical necessity for the vast majority of HPC systems. Faster, cheaper, and more flexible as the new solid state hardware choices are, they are not yet cheap enough. And there’s a huge base of existing storage infrastructure (software and hardware) that’s not going away quickly.
An SC20 panel held last week – Diverse Approaches to Tiering HPC Storage – dug into current tiering trends and although it produced little agreement on when or precisely how storage tiering will enter its own end-of-life phase, the conversation covered a lot of ground. This was a large group of prominent panelists: Glenn Lockwood, storage architect, NERSC/LBNL; Wayne Sawdon, CTO and strategy architect, IBM (Spectrum Scale); Jeff Denworth, VP of products and marketing, and co-founder, VAST Data; Liran Zvibel, CEO and founder, WekaIO; Curtis Anderson, senior software architect, Panasas; Matthew Starr, CTO, Spectra Logic; and Andreas Dilger, Lustre principal architect, Whamcloud (owned by DDN). The moderator was Addison Snell, CEO and founder, Intersect360 Research.
Lockwood’s early comments set the stage well and describe, at least in an aspirational sense, how many in HPC view storage tiering. He’s on the team standing up NERSC’s next supercomputer, Perlmutter. Here’s a somewhat extended (and lightly edited) excerpt from his opening comments:
“We have a very large and very broad user base comprised of thousands of active users, hundreds of projects and hundreds of different applications every year. We also support not only traditional simulation workloads, but the large-scale data analytics workflows coming from experimental facilities such as telescopes and beam lines, and because of this breadth, our aggregate workload is very data intensive.
“Our storage systems typically see over an exabyte of I/O annually. Balancing this I/O intensive workload with the economics of storage means that at NERSC, we live and breathe tiering. And this is a snapshot of the storage hierarchy we have on the floor today at NERSC. Although it makes for a pretty picture, we don’t have storage tiering because we want to, and in fact, I’d go so far as to say it’s the opposite of what we and our users really want. Moving data between tiers has nothing to do with scientific discovery.
“To put some numbers behind this, last year we did a study that found that between 15% and 30% of that exabyte of I/O is not coming from our users’ jobs, but instead coming from data movement between storage tiers. That is to say that 15% to 30% of the I/O at NERSC is a complete waste of time in terms of advancing science. But even before that study, we knew that both the changing landscape of storage technology and the emerging large-scale data analysis and AI workloads arriving at NERSC required us to completely rethink our approach to tiered storage.
“Back in 2015, we began devising a strategic plan [for storage]. Our goal is ultimately to have as few tiers as possible for our users’ sake, while balancing the economic realities of storage media as they evolve over the next decade. Fortunately, economic trends are on our side for storage. And in anticipation of falling flash prices, we gave ourselves a goal of collapsing our performance tiers by 2020 to coincide with our 2020 HPC procurement, which is now called Perlmutter.
“I’m happy to report that we are on track with our 2020 goal. We’re currently in the process of deploying the world’s first all-NVMe parallel file system at scale that will replace both our previous burst buffer tier and disk-based scratch tier. Once the dust is settled with Perlmutter, we’re aiming to further reduce down to just two tiers that map directly to the two fundamental categories of data that users store at NERSC: one, a hot tier for data that users are actively using to carry out their work, and two, a cold tier for data that users don’t actively need but are storing just in case they or someone else needs it somewhere down the road.
“By only having these two tiers, the only data movement that users need to worry about is moving data before and after their entire scientific project is carried out. At NERSC, this means that they will only have to worry about this once or twice per year. We will still keep two separate tiers because they represent two fundamentally different ways users interact with their data.
“The hot tier will be optimized for performance and getting data to and from user applications as fast as possible, so that they can spend their time advancing science rather than waiting for I/O. And their cold tier will be optimized for making data easy to search, index and share with their collaborators reliably. Because we don’t expect any magical media to hit the market between now and 2025, we’re relying on software to bridge the gap between different media types, so that even if there are both, say, flash and non-volatile memory in our hot tier, users don’t have to worry about which media is backing their data within that tier.”
Keep in mind Perlmutter is a government-funded supercomputer supported by technical and financial resources that are beyond the scope of most HPCers. That said, Lockwood’s description of the problem and NERSC’s target solution echo the perspectives of many in the HPC user community.
Snell pointed out in shaping the discussion, “The implementation of solid state or flash storage has continued to grow among HPC users, and in most cases it exists together with, not separate from, conventional spinning disk hard drives. Having data on the right tier at the right time has become a compelling conversation point in determining what constitutes high performance storage, together with the potential latency in moving data between tiers. Meanwhile, the need for long term data stewardship is no less important. And to make things even more complicated, cloud computing is increasingly common in high performance computing, bringing in the notion of cloud as yet another data location.”
Because the panel was quite long (1.5 hours), presented here are just a few comments from each of the panelists on how their companies implement tiering.
“Panasas’s view is conventional tiering, when all things are considered, just hurts price performance. You’ve got your compute cluster. You have a hot tier, that’s probably made of NVMe flash or something expensive. You have a cold tier that’s made out of hard drives and lower cost technologies. Then you have the storage management software layer that’s moving data back and forth. What do you get out of that? You get unpredictable performance. If you have guessed right, or your data has been recently used, then it’s in the hot tier [and] performance is not bad. If you guessed wrong or the workloads have changed or the compute cluster is overloaded, the data is still in the cold tier, and your nodes are idle while you’re pulling data up to the hot tier,” said Anderson.
“In addition, you get three separate management domains. There’s probably three separate products, they all need management. This is an example of temperature based data placement, how recently something has been accessed, determines where it lives, unless you’re reaching in and manually managing where data lives which has its own set of costs and issues,” he said.
“We believe that data placement based upon size, not on temperature, is the right solution. So we’ve built the ActiveStor Ultra around this concept. Hard drives, just as an example, are not terrible products from an HPC storage perspective, they’re great products for HPC. They deliver a tremendous amount of bandwidth, as long as you only give them large files and large sequential transfers to operate on. They’re very inexpensive per terabyte of storage, and they deliver great performance. So this [ActiveStor Ultra] is an architecture that’s based on device efficiency, getting the most out of the fundamental building blocks of SSDs and HDDs that are part of the solution.”
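To make the distinction concrete, here is a minimal sketch contrasting size-based placement with the temperature-based placement Anderson criticizes. The 1 MiB cutoff, the 30-day window, the rule that small files land on SSD, and all names are assumptions made for illustration; they are not Panasas settings and do not describe how ActiveStor Ultra is actually implemented.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; not a Panasas setting.
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # 1 MiB


@dataclass
class FileInfo:
    name: str
    size_bytes: int
    days_since_access: int


def place_by_size(f: FileInfo) -> str:
    """Size-based placement: small files go to SSD, large files to HDD."""
    return "ssd" if f.size_bytes < SMALL_FILE_THRESHOLD else "hdd"


def place_by_temperature(f: FileInfo, hot_days: int = 30) -> str:
    """Temperature-based placement: recently used data goes to the hot tier."""
    return "hot" if f.days_since_access <= hot_days else "cold"


files = [
    FileInfo("params.cfg", 4 * 1024, 90),       # small, not touched in months
    FileInfo("checkpoint.h5", 50 * 1024**3, 1),  # large, used yesterday
]
for f in files:
    print(f.name, place_by_size(f), place_by_temperature(f))
```

In this toy model the small configuration file stays on SSD no matter how long ago it was read, and the large checkpoint stays on HDD even while it is in active use, so placement never changes as data ages and nothing has to be migrated between tiers.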
Panasas is getting a 2x performance advantage over competitive products, according to Anderson, although he didn’t specify which products. Using what he called a single tier approach, where data placement is based on size using a mix of technologies, is the key. “Comparing HPC storage products is difficult, because there’s such wildly different hardware. But if you boil it down to gigabytes delivered per 100 spindles of hard drive, then you get a comparison we believe is fair. The reason we’re getting this benefit is the architectural difference,” he said.
VAST Data has vast aspirations and Denworth’s enthusiastic pitch matched those aspirations. Time will tell how this young company fares. About tiering he said, “Basically just throw the whole concept right out the window. We don’t see a lot of utility in the topic in 2020.” The key ingredients in VAST’s formula for dispensing with tiering include QLC flash memory, 3D XPoint, NVMe fabric, fresh coding, and buying scale.
“What we’ve done is we’ve combined a next generation computing layer that’s built in stateless Docker containers over a next generation NVMe fabric which can be implemented over Ethernet or InfiniBand, essentially, to disaggregate the containers from the underlying storage media, and make it such that they share all of the media, all of the NVMe devices within the cluster,” said Denworth. “That allows us to basically implement a new class of global codes, global codes around how we get to really, really efficient data protection. Imagine two and a half percent overhead for RAID at scale without compromising on resilience, and global data reduction codes, such that you can get to now dramatic efficiency gains from where customers have come from in the past. And finally, global flash translation codes that get up to 10 years of use out of low-cost QLC flash that other storage systems can’t even use in their architectures.”
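For a sense of what a 2.5 percent protection overhead implies, the arithmetic below computes parity overhead for a generic erasure-coded stripe as parity strips divided by data strips. The stripe geometries are hypothetical examples chosen for illustration; the panel did not give VAST’s actual stripe layout.

```python
def parity_overhead(data_strips: int, parity_strips: int) -> float:
    """Extra capacity spent on parity, as a fraction of the data stored."""
    return parity_strips / data_strips


# Hypothetical geometries: overhead shrinks as the stripe gets wider.
for k, m in [(8, 2), (40, 4), (160, 4)]:
    print(f"{k}+{m}: {parity_overhead(k, m):.1%}")
```

Under these assumptions, only a very wide stripe (here 160 data strips protected by 4 parity strips) reaches 2.5 percent, which suggests why sharing all of the media across the cluster matters to the claim.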
“When you put this together, you now have this stateless containerized architecture that can scale to exabytes, doesn’t cost any more for your infrastructure than what you paid for hard drive-based infrastructure, and the cost of flash is declining at a rate that’s much more aggressive than hard drives, such that we feel that we’re on the right side of that curve,” he said.
Denworth noted the rise of AI workloads as an important factor driving change, “AI changes the I/O paradigm altogether and where HPC systems of yesterday were designed to optimize for writes and then burst buffers came in and further optimized for writes. On the flip side, these new algorithms want to randomly read through the largest amounts of data at the fastest rates of speed, and you just can’t prefetch a random read. And we’re not alone on the [flash] island. Organizations like the DOE Office of Science have concluded that the only way to efficiently feed next generation AI algorithms is to use random access flash.”
WekaIO’s take on tiering was interesting. “It’s a central part of what we do,” said Zvibel. He describes WekaIO as an enterprise-grade storage file system for customers that need the NAS feature set but also need the performance of a parallel file system. “If you go with NAS, you get feature rich enterprise grade, but you’re usually limited by your scale and performance. On the other hand, if you’re going with a parallel file system, you’re going to get a lot of scale, great throughput [but] be limited on mixed workloads and low latency I/Os.”
“You buy your commodity servers from vendors you like, put the Weka software on, you’re getting a parallel file system for NVMe over fabric. Then you install our clients on your compute and the I/O runs parallel to all of the NVMe-over-fabric servers and [provides] the extra performance. We also support GPUDirect Storage for the Nvidia GPUs, NFS, SMB and S3,” said Zvibel.
Regarding tiering, Zvibel said, “A lot of the projects have a portion of the data that is their active window. This could be hundreds of terabytes. It could be a few petabytes or dozens of petabytes, but usually they have a lot more capacity that is stored, and they don’t need to access it for every run. For that we’re enabling tiering to any S3 enabled object storage and we can tier it to more than one, on-prem or in the cloud. You can have your data stored in separate locations. We’re not just tiering, we’re actually enabling a feature we call snap-to-object. So if you’re tiering to two locations, you can save snapshots and have the Weka system spun up on the other side, and basically keep running from that point in time.”
Starr, from long-time tape powerhouse Spectra Logic, briefly reviewed the storage technology pyramid and noted its relative inefficiencies. He singled out the weakness of the metadata handling capabilities of file systems as an ongoing issue. He also singled out 3D XPoint’s load-store functionality as a potential game-changer “for what applications can do for doing things like snapshots; instead of writing to a disk or an NVMe system to a disk drive interface, or file system interface, [they can use] XPoint to do load-store and actually getting persistent storage.”
On balance, Starr sees the emergence of dynamic tiering across hybrid technologies as the wave of the future. “I think the model is going to [be one] where you end up with CPU RAM, XPoint, NVMe and compute together. That system will be the scratch file system, the sandbox for people to play in. Then [you’ll have] a separate storage area, made of hybrid [technologies] of the flash arrays, HDD, tape, cloud, and most likely that’s going to have an object RESTful interface.” He thinks that latter interface is likely to “be an immutable interface so that as data’s coming into this scratch area, and being written back out, new versions are being written back out on the right side, when they need to be written out.”
Overall, Starr suggested the following trends: “First, we’re seeing a move off HSMs (hierarchical storage management systems) – not saying those systems are going away tomorrow, but the idea that HSMs are going to be replaced with object storage interfaces. I think the new storage tiers like XPoint are going to start changing how applications are written, especially when you think about snapshots, or VM farms, how much RAM you can actually get into a server with 3D XPoint. Customers are going to look at open standards a lot more, like LTFS (Linear Tape File System). Those are the winners; just like Linux won the Unix war, LTFS is going to win the tape format standard war.
“I think we are going to see a lot more RESTful object storage interfaces and open standards being deployed where you can deploy content in the cloud [but still keep] a copy of that data on site for easy retrieval, but have a copy in the cloud to share it with other people. [I think we’ll see more] immutable archives. Lastly, I think that we’re going to be looking a lot more at search [capability] and how we perform data capture and search before we put data deep into an archive. Those are the things I think are trending in the storage areas today.”
Spectrum Scale, formerly GPFS, has a long history of storage tiering support. “By 2000, two years after we introduced the product, we supported it for tape using the XDSM DMAPI standard. In 2006, we introduced Information Lifecycle Management, which divides the online storage into pools, allowing us to distinguish fast storage from slower storage. Along with the pools we introduced policy rules. In 2010, we introduced Active File Management, which allows us to tier data across the wide area network to other sites in other locations,” said Sawdon.
“In 2016, we introduced Transparent Cloud Tiering to allow us to move data to and from cloud storage. Spectrum Scale continues to invest in data tiering to extend its common namespace across data lakes and object storage. We are also investing to transparently tier data into and out of client compute nodes using storage rich servers with local NVMe and persistent memory.”
Unlike VAST Data, IBM sees storage tiering as something that will continue to be important. The goal, reiterated Sawdon, is to move the data as close to the computation as possible to reduce the time required to derive value from the data. IBM believes Spectrum Scale will keep evolving to meet the task of serving different media (and functional) cost-performance requirements.
“With today’s analytics on data, we see an increase in both the volume of the data and then the value of the data. High volume increases the demand for tiering to cheaper storage; high value increases the demand for tiering to high-performance storage to reduce the time for analytics. Thus, we conclude that data tiering is important now and will be even more important in the future,” said Sawdon.
Sawdon also emphasized Spectrum Scale’s software defined nature. “The benefit of being software defined is we can run on any hardware, we don’t care if we’re running on cloud hardware, we don’t care if it’s on prem. So we do have HPC deployments in the cloud. What we’re seeing for cloud deployments are [that] these are in fact, larger than our on-prem ones. Virtual machines are easy to spin up and getting 100,000 node clusters just happens. So we’re seeing lots of activity in that space. The interesting thing for Spectrum Scale is we’ve built common namespaces across different installations. You can build common namespaces between your cloud deployment and an on-prem deployment, and transparently move data if the customer wants to do it. Today, customers aren’t doing that yet. But we do have customers in the cloud who are running HPC.”
Lustre has long been an HPC mainstay. Since DDN acquired Whamcloud (Lustre) from Intel, it has been working to incorporate common features used in enterprise file systems and to add ease of use features. Lustre underpins DDN’s EXAScaler product line (see HPCwire coverage). Dilger provided a brief overview of tiering in Lustre.
“The primary mechanism by which Lustre achieves storage tiering is through the use of storage pools to identify different storage technologies, and then the file layout which is stored on every file individually, and can be specified either as a default for the whole file system, per directory or per file. This allows a great deal of flexibility in terms of where files are located. To avoid applications or users trying to consume all of the storage space in, say, a flash pool on a system, there are quotas to prevent abuse of the resources,” said Dilger.
Typically, files are initially placed on a flash pool, then files can be mirrored to a different storage pool. “If the flash storage pool is becoming full, the policy agent can release the copy of older files and leave the one copy on the disk. It’s even possible to split a single file across different storage types. This is practical for very large files that can’t necessarily all fit into a single storage pool, or for files that have different types of data in them, for instance, an index at the start of the file, and then large data stored at the end of the file,” said Dilger.
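The release behavior Dilger describes can be sketched as a simple watermark policy. This is a toy model under stated assumptions: the class, the function name, and the 90/70 percent watermarks are hypothetical and do not correspond to Lustre’s actual policy engine or any Lustre API.

```python
from dataclasses import dataclass, field


@dataclass
class MirroredFile:
    name: str
    size: int            # bytes
    last_access: float   # seconds since epoch
    copies: set = field(default_factory=lambda: {"flash", "disk"})


def release_old_flash_copies(files, flash_capacity, high_water=0.9, low_water=0.7):
    """Drop flash mirrors of the least recently used files once the pool is
    above high_water, stopping when usage falls to low_water."""
    used = sum(f.size for f in files if "flash" in f.copies)
    if used <= high_water * flash_capacity:
        return []
    released = []
    for f in sorted(files, key=lambda f: f.last_access):
        if used <= low_water * flash_capacity:
            break
        # Only release when a disk copy exists, so no data is lost.
        if "flash" in f.copies and "disk" in f.copies:
            f.copies.discard("flash")
            used -= f.size
            released.append(f.name)
    return released


files = [
    MirroredFile("old.dat", 40, 100.0),
    MirroredFile("mid.dat", 40, 200.0),
    MirroredFile("new.dat", 15, 300.0),
]
print(release_old_flash_copies(files, flash_capacity=100))
```

With a 100-byte flash pool holding 95 bytes, only the oldest file loses its flash copy; its disk mirror remains, so the file stays readable while the newer files keep their fast copy.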
In addition to managing storage on the server, Lustre can now also manage storage directly in the client. “This is through the persistent client cache. This allows files to be stored on local storage that are very large or need very low latency access. This leverages a local file system such as ext4 or NOVA. There are two types of storage for persistent client cache, either a read-only mirror that’s shared among the file system and multiple clients or an exclusive copy where the client can write locally and then the data is mirrored asynchronously back to the Lustre file system,” he said.
Discussion around how quickly NVMe, flash and 3D XPoint technology would overturn the current tiering paradigm was lively but ultimately inconclusive. Best to watch the SC20 video for that interchange. Broadly, in the near term, most panelists think dynamic tiering across media types will collapse tiers, at least in the sense that the tiers are increasingly invisible to users and applications. Also, the persistence of POSIX, installed infrastructure (HDD et al.) and improving tape performance suggest some form of discrete storage tiering will remain for quite some time.
There was a fair amount of discussion around cloud storage economics and interestingly also around AI application I/O patterns. The thinking on AI I/O patterns was that they were largely developed on smaller devices (laptops) without much thought about large-scale systems or storage systems behavior – that may change.
On balance, the panelists agree NVMe flash and 3D XPoint are the future; it’s a matter of when, and most of the panelists expect an evolution rather than an abrupt replacement of existing storage tech.
Lockwood noted, “Ideally, economics is what allows us to keep collapsing tiers and not the other way around. So the project costs, how much we spend on storage, has no change. For Perlmutter, that is approximately true. Our previous system, Cori, I want to say the storage budget was somewhere between 10% and 15% of the total system cost. For Perlmutter, it was about the same. So we’re just leveraging the falling economics of flash. Where additional money probably will be needed is in the software and software enablement part of this, and whether or not you consider that part of the capital expenditure or non-recurring engineering or internal R&D effort is a much more complicated question.”