Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

July 22, 2011

IBM Demos Record-Breaking Parallel File System Performance

Michael Feldman

A research group at IBM has come up with a prototype parallel storage system that they claim is an order of magnitude faster than anything demonstrated before. Using a souped-up version of IBM’s General Parallel File System (GPFS) and a set of Violin Memory’s solid-state storage arrays, the system was able to scan 10 billion files in 43 minutes. They say that’s 37 times faster than the last time IBM topped out GPFS performance in 2007.
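To put those figures in per-second terms, here is a quick back-of-the-envelope sketch. It uses only the numbers quoted above; reading the 37x claim as a ratio of files scanned per second is an assumption, and the function name is purely illustrative.

```python
def scan_rate(files: int, minutes: float) -> float:
    """Average number of files scanned per second."""
    return files / (minutes * 60)

# Figures quoted in the article: 10 billion files in 43 minutes.
rate_2011 = scan_rate(10_000_000_000, 43)

# Assumption: the "37 times faster" claim compares files-per-second rates.
implied_2007_rate = rate_2011 / 37

print(f"2011 demo: {rate_2011:,.0f} files/s")
print(f"Implied 2007 rate: {implied_2007_rate:,.0f} files/s")
```

That works out to roughly 3.9 million files per second for the new demo, and on the order of 100,000 files per second for the 2007 run if the 37x figure is taken at face value.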

The idea behind the 10-billion-file scan is to demonstrate that GPFS can keep pace with the enormous flood of data that organizations are amassing. According to IDC, there will be 60 exabytes of digitized data this year, and these data stores are expected to increase 60 percent per year. In a nutshell, we’re heading for a zettabyte world.
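How quickly does that growth rate get to a zettabyte? A rough sketch, assuming the 60-exabyte starting point and the 60 percent annual growth hold steady (a simplification, not an IDC forecast), and taking a zettabyte as 1,000 exabytes:

```python
def years_to_reach(start_eb: float, target_eb: float, annual_growth: float) -> int:
    """Count whole years of compound growth needed to reach target_eb."""
    years = 0
    size = start_eb
    while size < target_eb:
        size *= 1 + annual_growth
        years += 1
    return years

# Assumption: 1 zettabyte = 1,000 exabytes; growth stays at 60% per year.
print(years_to_reach(60, 1000, 0.60))
```

Under those assumptions the total crosses the zettabyte mark after six years of compounding.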

But it’s not just the aggregate size of storage. Individual businesses and government organizations will soon be expected to actively manage 10 to 100 billion files in a single system. DARPA’s HPCS program requires a trillion files in a single system.

That’s certainly beyond the capabilities of storage systems today. Even parallel file systems designed for extreme scalability, like GPFS and Lustre, currently top out at about 2 billion files. But the limit is not storage capacity; it’s performance.

While hard drive capacity is increasing at about 25 to 40 percent per year, performance is more in the range of 5 to 10 percent. That’s a problem for all types of storage I/O, but especially for operations on metadata. Metadata is the information that describes file attributes, like name, size, data type, permissions, etc. This information, while small in size, has to be accessed often and quickly — basically every time you do something with a file. When you have billions of files being actively managed, the metadata becomes a choke point.
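The mismatch between those two growth rates compounds over time. The sketch below estimates how much more data each unit of drive performance has to serve after a number of years; the ten-year horizon is an arbitrary assumption chosen for illustration, and the rates are the ranges quoted above.

```python
def compound(rate: float, years: int) -> float:
    """Growth factor after compounding an annual rate for a number of years."""
    return (1 + rate) ** years

def capacity_per_performance(cap_rate: float, perf_rate: float, years: int) -> float:
    """How much faster capacity grows than performance over the period."""
    return compound(cap_rate, years) / compound(perf_rate, years)

YEARS = 10  # assumed horizon, not from the article

# Narrowest gap: slow capacity growth against fast performance growth.
best_case = capacity_per_performance(0.25, 0.10, YEARS)
# Widest gap: fast capacity growth against slow performance growth.
worst_case = capacity_per_performance(0.40, 0.05, YEARS)

print(f"After {YEARS} years the gap widens by {best_case:.1f}x to {worst_case:.1f}x")
```

Even at the friendly end of the range, a drive a decade from now would hold several times more data per unit of I/O performance than one today.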

Typically, metadata itself doesn’t require lots of capacity. To store the attributes for 10 billion files, you only need four 2 TB disks; they just aren’t fast enough for this level of metadata processing. To get the needed I/O bandwidth, you’d actually need around 200 disk drives. (According to IBM, their 2007 scanning demo of 1 billion files under GPFS required 20 drives.) Using lots of disks to aggregate I/O for metadata is a rather inefficient approach, considering the amount of power, cooling, floor space and system administration associated with disk arrays.
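The gap between sizing for capacity and sizing for I/O can be sketched with the article's own numbers. This assumes the 6.5 TB metadata footprint reported for the demo and a drive count that scales linearly from the 2007 run (20 drives per billion files); both the linear scaling and the names below are illustrative assumptions.

```python
import math

METADATA_TB = 6.5            # metadata footprint reported for 10 billion files
DISK_TB = 2.0                # capacity of one disk
DRIVES_PER_BILLION = 20      # from the 2007 demo: 20 drives for 1 billion files

def drives_for_capacity(metadata_tb: float, disk_tb: float) -> int:
    """Disks needed just to hold the metadata."""
    return math.ceil(metadata_tb / disk_tb)

def drives_for_bandwidth(billions_of_files: float) -> int:
    """Disks needed for scan throughput, assuming linear scaling from 2007."""
    return math.ceil(DRIVES_PER_BILLION * billions_of_files)

by_capacity = drives_for_capacity(METADATA_TB, DISK_TB)
by_bandwidth = drives_for_bandwidth(10)
print(by_capacity, by_bandwidth, by_bandwidth // by_capacity)
```

In other words, a disk-based design would deploy about 50 times more drives than the data itself calls for, purely to get the spindles.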

The obvious solution is solid-state storage, and that is indeed what the IBM researchers used for their demo this week. In this case, they used hardware from Violin Memory, a maker of flash storage arrays. According to the IBM researchers, the Violin gear provided the attributes needed for the extreme levels of file scan performance: high bandwidth; low I/O access time, with a good transaction rate at medium-sized blocks; sustained performance when mixing different I/O access patterns; multiple access paths to shared storage; and reliable data protection in case of NAND failure.

When I asked the IBM team why they opted for Violin in preference to other flash memory offerings, they told me the Violin storage met all of these requirements as well as or better than any other SSD approach they had seen. “For example, SSDs on a PCI-e card will not address the high availability requirement unless it replicates with another device,” they said. “This will effectively increase the solution cost. Many SSDs we sampled and evaluated do not sustain performance when mixing different I/O access patterns.”

The storage setup for the demo consisted of four Violin Memory 3205 arrays, with a total raw capacity of 10 TB (7.2 TB usable) and an aggregate I/O bandwidth of 5 GB/second. The four arrays can deliver on the order of a million IOPS with 4K blocks, with a typical write latency of 20 µs and read latency of 90 µs.
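As a sanity check, the IOPS and bandwidth figures can be related to one another. The sketch below assumes "4K" means 4,096-byte blocks, that GB/second is decimal, and that load is spread evenly across the four arrays; none of that is spelled out by IBM or Violin.

```python
def bandwidth_gb_per_s(iops: float, block_bytes: int) -> float:
    """Throughput implied by an IOPS figure at a given block size (decimal GB)."""
    return iops * block_bytes / 1e9

ARRAYS = 4
TOTAL_IOPS = 1_000_000       # "on the order of a million IOPS"
BLOCK_BYTES = 4096           # assumption: "4K" means 4,096 bytes

total = bandwidth_gb_per_s(TOTAL_IOPS, BLOCK_BYTES)
per_array = total / ARRAYS   # assumption: load is split evenly

print(f"Implied throughput: {total:.2f} GB/s total, {per_array:.2f} GB/s per array")
```

A million 4K operations per second implies about 4.1 GB/second, which sits comfortably under the quoted 5 GB/second aggregate, so the two figures are at least consistent with each other.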

Driving the storage were ten IBM 3650 M2 dual-socket x86 servers, each with 32 GB of memory. The 3650 cluster was connected with InfiniBand, with the Violin boxes hooked to the servers via PCIe.

All 6.5 TB of metadata for the 10 billion files was mapped to the four 3U Violin arrays. No disk drives were required since, for demonstration purposes, the files themselves contained no data. To provide a more or less typical file system environment, the files were spread out across 10 million directories. Scaled up to 100 billion files, the researchers estimated that just half a rack of flash storage arrays would be needed for the metadata, compared to five to ten racks of disks required for the same performance.
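Those numbers also give a feel for how small each file's metadata is. A minimal sketch, assuming decimal terabytes and metadata that grows linearly with file count (the 100-billion-file figure below is an extrapolation, not something IBM measured):

```python
METADATA_BYTES = 6.5e12          # 6.5 TB, decimal terabytes assumed
FILES = 10_000_000_000
DIRECTORIES = 10_000_000

bytes_per_file = METADATA_BYTES / FILES
files_per_directory = FILES // DIRECTORIES

def projected_metadata_tb(files: int) -> float:
    """Metadata footprint in TB, assuming it scales linearly with file count."""
    return files * bytes_per_file / 1e12

print(f"{bytes_per_file:.0f} bytes of metadata per file")
print(f"{files_per_directory} files per directory on average")
print(f"{projected_metadata_tb(100_000_000_000):.0f} TB for 100 billion files")
```

So each file carries about 650 bytes of metadata, directories average 1,000 files apiece, and a 100-billion-file system would need on the order of 65 TB of flash for metadata alone.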

It’s noteworthy that the researchers selected Violin gear for this particular demo, especially considering that IBM is currently shipping Fusion-io PCIe-based flash drives with its System x servers. Even though the work described here was just a research project, with no timetable for commercialization, it’s not too big a stretch to imagine future IBM systems with Violin technology folded in. The larger lesson, though, is that solid-state storage is likely to figure prominently in future storage systems, IBM or otherwise, when billions of files are in the mix.