Scaling a read-intensive, low-latency file system to 10M+ IOPs

Many shared file systems are used in supporting read-intensive applications. These applications typically exploit copies of datasets whose authoritative copy resides somewhere else. One application of a read-heavy application is financial backtesting. For small datasets, in-memory databases and caching techniques can yield impressive results. However, low latency flash-based scalable shared file systems can provide both massive IOPs and bandwidth. They’re also easy to adopt because of their use of a file-level abstraction.

In this post, I’ll share how to easily create and scale a shared, distributed POSIX compatible file system that performs at local NVMe speeds for files opened read-only. For this configuration, I’ll be using i3en.24xlarge Amazon EC2 instances on AWS. Figure 1 shows the basic architecture, in which there are N file servers acting as file system clients and application servers, as well as one or more remote clients that can access the file system, but which don’t have high-speed local access.

*Figure 1: High level architecture of the file system.*

In general, this architecture is designed for any type of workflow where some amount of data is pushed to the file system via a single writer and with N clients reading and analyzing the data – eventually producing output that’s highly concentrated thus not requiring a lot of file system writes. In this example, I’ll assume a shared file system size of 54TB, which is the total NVMe capacity of a i3en.24xlarge.

Capacity can be scaled simply by adding row(s) of additional i3en.24xlarge instances that will act as storage expansion servers. We do this by configuring the expansion servers as NVMe over fabric (NVMe-oF) targets, and the the file server instances as initiators. This expansion is shown in figure 1 as the capacity expansion row.

NVMe-oF has been included in mainline Linux kernels for over 5 years, and best practices for setting up a pair of targets and initiators for RHEL can be found here. This enables a total shared file system size that is a multiple of 54 TB, with each row of expansion instances adding an additional 54TB of capacity, but not aggregate performance. If the expansion instances are provisioned within the same cluster placement group, the additional “hop” to the expansion NVMe-oF averages less than 200 microseconds, courtesy of Elastic Fabric Adapter (EFA), our high-speed, low-latency network adapter.

While we are on the topic of the goodness of shared file-level abstraction, AWS offers a robust managed Lustre implementation called Amazon FSx for Lustre suitable for the vast majority of high performance workloads. If your application needs a more balanced Read/Write profile, FSx for Lustre is a great choice. For those of you that have an extreme read-only distribution use case like ours, read on for more details on how to roll your own filesystem.

Read-Mostly Distributed File System

For ultra-low latency read access, each file server instance has a local XFS file system, striped via LVM (the Logical Volume Manager) across all 8 local NVMe devices, and optionally concatenated with NVMe-oF drives presented from the capacity row. This arrangement allows the addition of capacity and growth of the shared file system, without disrupting the overall file system structure.

The local file systems that exist on all file servers are all local copies of a shared POSIX-compatible FUSE (Filesystem in USErspace) file system. This is based on open-source GLUSTER, and any number of clients (including the servers themselves) can directly read and write to it through the FUSE mount. When a FUSE client writes to the file system, each write is replicated to all the server nodes, and can be seen immediately on each server and by each client.

Because the file system is designed to scale reads, in a low (or no) write environment, the write performance penalty due to FUSE abstraction and GLUSTER replication to the N underlying file systems isn’t a concern. In practice, the write speed is limited to a few hundred MBytes/s through the remote FUSE client, and reads from a remote client are in the range of a few GBytes/s. However, when accessing files on the file server via the read-only (non-FUSE) mount, read access occurs at PCI backplane speeds. Because each file server is reading directly from a local copy, the aggregate read performance scales perfectly linearly with the number of file servers.

Read-intensive applications and jobs can run directly on the file server, opening the files for read on the local read-only mount point, and writing output files through the FUSE enabled mount point. Both mount are to the same file system directory structure, which are constantly in sync. Non-consistency across the local file systems is avoided by mounting the local copies read-only, and using the shared file system’s consistency for all write and metadata operations. Thus, the full power of a shared file system is at the application’s disposal, but operating at the speed of localized NVMe reads.

Method

To create a read-mostly distributed file system, I provisioned 13 x Amazon EC2 i3en.24xlarge file servers using Red Hat Enterprise Linux 8, but in general, any common Linux distribution can be used. Following the GLUSTER installation guide, I built a single XFS GLUSTER “brick” file system on each server, which was striped using LVM, with a width of 1024 KB. Once all 13 peers (GLUSTER parlance for file server) had been added to the cluster, a volume of type “replica” was created that is automatically replicated to all 13 peers. This configuration leverages GLUSTER to create a shared file system construct that duplicates the entire file system directory and contents to all cluster members. Once created, we start the volume and mount the GLUSTER file system on all peers to an appropriate mount point, in this case, /sharedfs. This is our shared read/write file system.

Read the full blog to learn more about the setup, performance test and the results.

Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.