pNFS Provides Performance and New Possibilities

By Molly Presley, SVP of Marketing, Hammerspace

February 29, 2024

At the cusp of a new era in technology, enterprise IT stands on the brink of the most profound transformation since the Internet’s inception. This seismic shift is propelled by the advent of artificial intelligence (AI), a force as groundbreaking as the cloud; together they are redefining the landscape of 21st-century IT.

While the cloud revolutionized the “where” of IT operations, AI is dramatically reshaping the “how,” ushering in an era where the boundaries of enterprise computing are being redrawn. In this transformative time, we explore how these advancements are altering the fabric of IT work and setting the stage for a future where the possibilities are boundless.

Organizations have long demanded “the need for speed” when evaluating their compute and data storage requirements. As larger quantities of data are created, analyzed, and processed – industry forecasts anticipate that global data creation will reach 180 zettabytes by 2025 – performance demands aren’t just growing; they are accelerating. It has become an urgent business imperative that an organization’s data be both more accessible and more usable to support modern applications and workflows.

At first, large-scale computing was used in specific domains, such as supercomputing centers and high-performance computing (HPC) business units in government and enterprise organizations. The advent of AI has rapidly driven the infrastructure demands of HPC environments into the IT world. Never before has the enterprise seen a technology with so many applications to IT activities and use cases.

Today, IT’s main challenge in supporting AI is supplying its infrastructure requirements and having that infrastructure meet enterprise standards for application integration, data security, data protection, and data governance. A key part of this challenge is the interface to storage. Most high-performance applications prefer open-standard NFS, yet until now it has lacked the performance and flexibility that HPC workloads require.

Standards to Deliver the Benefits of Specialized Workflows to Everyone 

As HPC and enterprise datasets grow and the applications for data-driven insights grow in tandem with them, continual focus is needed to chase down and eliminate those obstacles that prevent putting data to work efficiently. While most innovations are incremental advancements in a single vendor’s technology, on occasion, there is a massive leap forward in the standards community that changes the game globally. 

Our industry is now at one of those game-changing moments in advancing data path performance from the Linux community. The timing could not be better with the rise of AI, ML, and data analytics applications. 

Let us look at how this communication, storage, and computing technology convergence will result in a new storage architecture enabling data-intensive, distributed applications. 

The Evolution of Network-Attached Storage (NAS) Performance 

There was a time when redundant array of independent disks (RAID) controllers were a limiting performance factor. The introduction of NVMe and Peripheral Component Interconnect Express (PCIe) allowed direct-attached NVMe-based storage systems to remove the RAID controller from the data path.

For shared storage environments, however, exploiting the native performance of the underlying solid-state drives (SSDs) is more challenging. In shared storage use cases, high-performance erasure coding is needed for efficient data protection, and additional physical layers are required within the data path for the file system to map files to blocks and then map blocks to flash addresses. To accomplish all of these tasks, internal networking is also needed to connect multiple nodes and scale them out into a single system.
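To make the efficiency argument concrete, here is a small back-of-envelope comparison of usable capacity under an assumed 8+2 erasure-coding layout versus 3-way replication. The 8+2 layout is an illustrative assumption, not a figure from the article.

```c
/* Illustrative arithmetic only: usable fraction of raw capacity for an
 * assumed 8+2 erasure-coding layout versus 3-way replication. */
#include <stdio.h>

int main(void) {
    int data_shards = 8, parity_shards = 2;            /* assumed 8+2 layout */
    double ec_usable = (double)data_shards / (data_shards + parity_shards);
    double replica_usable = 1.0 / 3.0;                 /* three full copies  */

    printf("8+2 erasure coding: %.0f%% of raw capacity usable\n", ec_usable * 100);
    printf("3-way replication:  %.0f%% of raw capacity usable\n", replica_usable * 100);
    return 0;
}
```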

What is important to note is that both the efficiency of the file system software and the physical architecture are important in high-performance environments. 

Let us delve into the physical architecture further. In scale-out (NFS-based NAS) solutions using an InfiniBand or Ethernet backplane, there are no fewer than nine data retransmissions (see Figure 1), each a high-speed serial-bus move from chip to chip or over a cable. Each retransmission takes a bit of the underlying hardware’s native performance away from the data path, so the client never sees the full native performance of the available infrastructure.

Figure 1: Nine or more hops in the data path in traditional scale-out NAS (Source: Hammerspace)

Scale-out NAS solution providers tackling performance workloads saw these challenges and designed their technology upon NVMe over Fabrics (NVMe-oF). These solutions take the CPU out of the storage server data path and use PCIe for routing to the network adapters. This data path optimization is a big step forward, but it still requires extensive data retransmissions and unnecessary hardware costs. It is also power-inefficient, since a network port is needed for each NVMe device to realize the device’s full performance, and metadata updates remain in-band with the data path.

The Origin of the NFSv4.2 Client

With Network File System (NFS) v4.2, the Linux community introduced an open, standards-based solution within the standard Linux kernel to drive speed and efficiency with the NFS protocol. NFSv4.2 allows workloads to remove metadata, file-server, and directory-mapping operations from the data path, which gives the NFSv3 data path an uninterrupted connection to the storage.

The NFS client resides under the application interface, making data access transparent to the user and application. NFS users typically perform the same operations that they would on a local file system, such as opening, reading, and writing files. Underneath this activity, the NFS client converts these commands into remote procedure calls (RPCs) to the server, which executes the required operations.
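A minimal sketch of that transparency: the program below uses only ordinary POSIX calls, and if the path happens to sit on an NFS mount, the in-kernel NFS client turns them into the corresponding RPCs. The mount point /mnt/nfs is a hypothetical example.

```c
/* Minimal sketch: ordinary POSIX I/O on a path assumed to live on an NFS
 * mount. The in-kernel NFS client translates these calls into OPEN, WRITE,
 * READ, and CLOSE RPCs; nothing NFS-specific appears in application code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *path = "/mnt/nfs/results.dat";    /* hypothetical NFS-backed path */
    char buf[64];

    int fd = open(path, O_CREAT | O_RDWR, 0644);  /* becomes an OPEN RPC on NFSv4 */
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello over NFS\n";
    if (write(fd, msg, strlen(msg)) < 0)          /* becomes WRITE RPCs */
        perror("write");

    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof buf - 1);    /* becomes READ RPCs */
    if (n > 0) { buf[n] = '\0'; printf("%s", buf); }

    close(fd);                                    /* CLOSE RPC; writes flushed per NFS semantics */
    return 0;
}
```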

The Vision and Evolution of pNFS and the Introduction of Flex Files 

Most high-performance computing (HPC) applications require a parallel file system to ingest and process massive data sets, loading data onto clients by communicating with them directly. Historically, specialized parallel file systems such as IBM Spectrum Scale (previously IBM GPFS), Lustre, Panasas PanFS, and Quantum StorNext have been required to deliver the performance needed for HPC, as well as for media and entertainment post-production environments. These workflows have not had open solutions available to them; only specialized workflows and a limited set of users and applications could work with the data sets in the parallel file system.

The vision of the pNFS protocol to solve this problem has been under development for the last 20 years. NFSv4.1 introduced pNFS, providing a parallel file system client as part of a standard Linux distribution. It adapts NFS for modern use cases involving large-scale operations in high-performance computing data centers and on cloud platforms.

The contribution of Flex Files by Hammerspace to the Linux community made the innovation of pNFS useful in the real world. pNFS with Flex Files offers live file mobility, which means hot files can be moved to hot storage while cold data is moved to cold storage. This capability works entirely in the background, even on active, in-use files, and can significantly improve access performance for large and small files, even on legacy storage infrastructure.
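As an illustration of the hot/cold placement idea (not Hammerspace’s actual implementation), the sketch below applies a hypothetical policy that chooses a target tier from how recently a file was accessed; the thresholds are assumed values. Flex Files is what lets such a move happen while the file stays accessible.

```c
/* Hypothetical tiering-policy sketch -- not Hammerspace's implementation.
 * It only illustrates the idea behind live file mobility: recently accessed
 * ("hot") files belong on fast storage, rarely accessed ("cold") files on
 * cheaper storage. The day thresholds are assumed values. */
#include <stdio.h>
#include <time.h>

enum tier { TIER_HOT_NVME, TIER_WARM_DISK, TIER_COLD_OBJECT };

static enum tier choose_tier(time_t now, time_t last_access) {
    double idle_days = difftime(now, last_access) / 86400.0;
    if (idle_days < 7.0)  return TIER_HOT_NVME;   /* assumed: hot if touched within a week    */
    if (idle_days < 90.0) return TIER_WARM_DISK;  /* assumed: warm if touched within 90 days  */
    return TIER_COLD_OBJECT;                      /* otherwise, candidate for cold storage    */
}

int main(void) {
    time_t now = time(NULL);
    const char *names[] = { "hot NVMe tier", "warm disk tier", "cold object tier" };

    /* Example: a file last touched 30 days ago lands on the warm tier. */
    enum tier t = choose_tier(now, now - 30 * 86400);
    printf("placement decision: %s\n", names[t]);
    return 0;
}
```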

Metadata Operations Outside the Data Path 

pNFS clearly defines the abstraction between metadata and data. Metadata operations go to a metadata server, which performs all the opens and closes on the data, while the data itself can stream unencumbered from the compute to the storage. When a client wants to talk to a data server, it requests a layout that maps the data’s location and its means of access.

pNFS with Flex Files can request a new layout even as the I/O from a previous layout is completing. This feature allows the pNFS model to manage data on the fly, which is highly useful under many circumstances. For example, leveraging these capabilities, the Hammerspace Global Data Environment performs storage tiering, metadata replication, and high-performance data movement across different sites.
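A conceptual sketch of that flow is below. It is not kernel code and not the actual pNFS wire protocol; every type and function is a hypothetical stand-in, meant only to show the metadata path (layout request) staying out of the data path (direct I/O to the data server the layout names).

```c
/* Conceptual sketch only -- not kernel code and not the real pNFS protocol.
 * It models the separation pNFS defines: a metadata server hands out a
 * layout describing where a file's data lives, and the client then performs
 * I/O directly against the data server named by that layout. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct layout {                 /* what a LAYOUTGET-style request conceptually returns */
    const char *data_server;    /* which data server (or NFS-eSSD) holds the range     */
    uint64_t    offset, length; /* byte range the layout covers                        */
};

/* Metadata path: ask the (hypothetical) metadata server for a layout. */
static struct layout request_layout(const char *mds, const char *path,
                                    uint64_t off, uint64_t len) {
    printf("MDS %s: granting layout for %s [%llu..%llu)\n", mds, path,
           (unsigned long long)off, (unsigned long long)(off + len));
    struct layout l = { "data-server-1", off, len };
    return l;
}

/* Data path: stream bytes straight from the data server, bypassing the MDS. */
static size_t read_from_data_server(struct layout l, char *buf, size_t n) {
    snprintf(buf, n, "bytes served directly by %s", l.data_server);
    return strlen(buf);
}

int main(void) {
    char buf[128];

    struct layout l = request_layout("mds.example", "/data/model.ckpt", 0, sizeof buf);
    size_t got = read_from_data_server(l, buf, sizeof buf);

    /* Because the layout can be recalled and re-granted at any time, the
     * server can move the underlying data while the file stays accessible. */
    printf("read %zu bytes: %s\n", got, buf);
    return 0;
}
```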

pNFS in the Global Data Environment and Hyperscale NAS Solutions

Hammerspace has deployed pNFS with Flex Files as part of its Hyperscale NAS capabilities. Compared with the nine hops required in scale-out NAS, Hyperscale NAS requires only four hops and takes metadata updates out of the data path. This approach, shown in Figure 2, delivers close to 100% of the underlying hardware’s performance and reduces costs by eliminating network ports and controllers.

 

Figure 2: Only four hops in the data path in Hyperscale NAS (Source: Hammerspace)

Hammerspace leverages pNFS with Flex Files in its Global Data Environment solution to provide uninterrupted access to data for applications and users while performing live data movement across storage tiers and geographic locations. Flex Files at work in the Global Data Environment can non-disruptively recall layouts, so data access and integrity are maintained even as files are being moved or copied. This approach has enormous ramifications for enterprises, as it can eliminate the downtime associated with data migrations and technology upgrades. Enterprises can combine this capability with software, such as a metadata engine, that can virtualize data across heterogeneous storage types and automate the movement and placement of data according to IT-defined business objectives.

Hammerspace has contributed heavily to both the Linux community and to this development effort to make high-performance access to data stored across multiple storage silos a reality.

In the Hammerspace Global Data Environment:

  • Enterprises with high-availability requirements and stringent security policies benefit from parallel file system performance with no additional software installation on client machines. 
  • Parallel file system performance is possible because the Hammerspace Global Data Environment servers offload metadata and data-service operations from the data path.
  • Secure sharing of data across Linux and Windows platforms in the Global Data Environment is possible because NFSv4.2 access control lists (ACLs) are compatible with Windows ACLs.

What is the Next Big Opportunity to Reduce Data Path Latency? 

The next big innovation for standards-based file storage performance is rapidly approaching: Ethernet-attached SSDs (eSSDs) that embed the NFS server locally on the drive. With the SSD attached directly to Ethernet (an NFS-eSSD), the drive can speak NFSv3 natively, and the file-to-block and block-to-flash-address mappings can be collapsed into a single level of mapping. The outcome of this architecture is a further reduction in the number of data retransmissions, from nine in traditional NAS and five in parallel file systems to just three when leveraging NFSv4.2 and NFS-eSSDs.
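A back-of-envelope model of how those hop counts translate into added latency is below. The hop counts are the ones cited in the article; the per-hop cost is an assumed, illustrative value, not a measurement.

```c
/* Back-of-envelope latency model using the hop counts cited in the article
 * (9 for traditional scale-out NAS, 5 for parallel file systems, 3 for
 * NFSv4.2 with NFS-eSSDs). The per-hop cost is an assumed value. */
#include <stdio.h>

int main(void) {
    const double per_hop_us = 2.0;   /* assumed cost per retransmission, microseconds */
    const struct { const char *name; int hops; } paths[] = {
        { "Traditional scale-out NAS", 9 },
        { "Parallel file system",      5 },
        { "NFSv4.2 + NFS-eSSD",        3 },
    };

    for (int i = 0; i < 3; i++)
        printf("%-28s %d hops -> ~%.1f us added to the data path\n",
               paths[i].name, paths[i].hops, paths[i].hops * per_hop_us);
    return 0;
}
```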

Benefits of NFS in the eSSD 

The benefits of this new architecture are significant. It provides lower latency with fewer data retransmissions and lower power consumption since the serial transmission of data across chip-to-chip connections or over wires consumes significant power in data centers. This approach also has lower operational and capital costs since it eliminates extra hardware. For these NFS-eSSDs, write amplification can be reduced since the SSDs become better aware of their free space, avoiding overwrites and improving SSD endurance. Since this reduces the need for overprovisioning, higher capacity densities and higher access densities can be achieved without sacrificing the ability to fully utilize the actual available performance. The overall reliability, availability, and serviceability can be optimized with less hardware and fewer data retransmissions than conventional NVMe SSDs. As a result, a much wider dynamic scale is possible, allowing storage systems to scale up or down with ease directly on existing Ethernet networks and using standards-based storage connectivity. 
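To illustrate the overprovisioning point with simple arithmetic (the capacities and percentages below are assumed examples, not figures from the article), reducing the spare capacity a drive must reserve exposes more of the same raw NAND to the host:

```c
/* Illustrative arithmetic only: usable capacity exposed from the same raw
 * NAND at two overprovisioning levels. Here overprovisioning is treated
 * simply as the fraction of raw capacity held back as spare; the 28% and 7%
 * levels and the 32 TB raw size are assumed examples. */
#include <stdio.h>

static double usable_tb(double raw_tb, double spare_fraction) {
    return raw_tb * (1.0 - spare_fraction);
}

int main(void) {
    double raw_tb = 32.0;                               /* assumed raw NAND capacity */
    printf("Heavy overprovisioning (28%%): %.1f TB usable of %.0f TB raw\n",
           usable_tb(raw_tb, 0.28), raw_tb);
    printf("Light overprovisioning  (7%%): %.1f TB usable of %.0f TB raw\n",
           usable_tb(raw_tb, 0.07), raw_tb);
    return 0;
}
```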

“Many storage systems involve multiple layers of communication and data transfer. By embedding NFS directly into an Ethernet-attached SSD array, many of these layers are bypassed, resulting in significantly lower latency,” said Thomas Isakovich, CEO at Nimbus Data. “We are excited to work with Hammerspace as we partner to continue to deliver previously unmatched low latency and data path speed to high-performance applications.”

Why Now?

Enormous value can be derived from data with modern data analytics and AI models, but creating this value requires rapid data access and processing. These data analysis workloads require large datasets stored on efficient and high-performance storage devices and the ability to intelligently orchestrate data. The required high-speed, orchestrated data pipeline must provide a shared view of an organization’s data to all systems. It needs to be optimized for small, random I/O patterns and provide high peak system performance and high aggregate file system performance to meet the variety of training workloads an organization may encounter. The data pipeline needs to be driven by a standards-based solution that will be easy to deploy on machines with diverse security and build environments. The storage systems need to deliver the full performance of the SSD to the workflow to maximize opportunity while containing infrastructure costs. 

Conclusion

As we stand at the dawn of a new AI epoch in enterprise and HPC computing, it’s clear that the convergence of groundbreaking technologies – high-speed Ethernet, 64-bit processor IP, IPv6, NFSv4 in Linux, and high-performance, open-source, standards-based file systems – is not just a testament to our current technological achievements but a beacon for future possibilities. These advancements, integral to the fabric of modern IT, are the building blocks enabling the monumental shift towards an AI-driven, cloud-empowered future. The journey from the early days of the internet to today’s AI and cloud revolution has been marked by continuous innovation, with each step forward making the next leap possible.

As we harness these technologies, we’re not just adapting to changes but creating a future where enterprise and technical computing transcends today’s limitations, opening new horizons for business, government, and academia. In this era of unprecedented change, one thing remains certain: the path forward is illuminated by the open advancements we embrace today, promising a tomorrow where the potential of enterprise IT is boundless.

 
