When researchers in Germany sat down nearly a decade ago to create a brand new parallel file system for HPC clusters, they had three goals: maximum scalability, maximum flexibility, and ease of use. What they came up with was the Fraunhofer Parallel File System (FhGFS), which is now in use on supercomputers.
The initial design considerations and inner workings of FhGFS are described in a ClusterMonkey paper on the file system by Tobias Götz, a researcher at the Fraunhofer Institute for Industrial Mathematics (ITWM).
Götz, who now lives and works in Berkeley, California, says ITWM researchers were frustrated with the limitations of existing parallel file systems. “There has to be a better way!” was the rallying cry of a group led by Franz-Josef Pfreundt, head of ITWM’s Competence Center High-Performance Computing (CC-HPC).
Pfreundt’s team started from scratch to create an ideal file system that used a “scalable, multi-threaded architecture that distributes metadata and doesn’t require any kernel patches, supports several network interconnects including native InfiniBand, and is easy to install and manage,” Götz writes.
The distributed metadata architecture is a key component of FhGFS, and contributes to the high level of scalability and flexibility that FhGFS was designed to provide to HPC applications. “The metadata is distributed over several metadata servers on a directory level, with each server storing a part of the complete file system tree. This approach allows much faster access on the data,” he writes.
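To make the directory-level distribution concrete, here is a minimal sketch in Python. It is not FhGFS code: the function name `metadata_server_for` and the use of a hash to pick a server are assumptions for illustration, since Götz's description does not say how directories are assigned to metadata servers. It only shows the general idea that all entries of one directory land on the same server, while different directories spread across servers.

```python
import hashlib
import posixpath


def metadata_server_for(path, num_servers):
    """Pick a metadata server index for the directory that contains `path`.

    Hypothetical helper: hashing the parent directory is an assumed
    placement policy, not the documented FhGFS algorithm.
    """
    if num_servers <= 0:
        raise ValueError("num_servers must be positive")
    # Normalize the path, then keep only the parent directory, so every
    # entry in the same directory maps to the same server.
    directory = posixpath.dirname(posixpath.normpath(path)) or "/"
    digest = hashlib.sha256(directory.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_servers


if __name__ == "__main__":
    for p in ("/home/alice/a.dat", "/home/alice/b.dat", "/scratch/run1/out.log"):
        print(p, "-> metadata server", metadata_server_for(p, 4))
```

In this sketch, two files in /home/alice always resolve to the same server index, so a lookup within one directory touches a single metadata server.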
Similarly, the storage system breaks file contents into “chunks” and stripes them across several storage servers, according to Götz’s paper. The chunk size can be set by the file system administrator.
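The striping idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FhGFS implementation: the function name `stripe_layout` is hypothetical, and round-robin placement is assumed because the paper, as summarized here, says only that chunks are striped across servers.

```python
def stripe_layout(file_size, chunk_size, num_servers):
    """Return a list of (chunk_index, server_index, offset, length) tuples.

    Assumes simple round-robin striping: chunk i goes to server
    i % num_servers. The last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0 or num_servers <= 0:
        raise ValueError("chunk_size and num_servers must be positive")
    layout = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        layout.append((index, index % num_servers, offset, length))
        offset += length
        index += 1
    return layout


if __name__ == "__main__":
    # A 10-byte file, 4-byte chunks, 3 storage servers.
    for chunk in stripe_layout(10, 4, 3):
        print(chunk)
```

With a 10-byte file, 4-byte chunks, and three servers, the sketch yields three chunks, one per server, with the final chunk holding the remaining 2 bytes. A larger chunk size means fewer, bigger requests per server, which is the kind of trade-off an administrator would be tuning.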
FhGFS does not require dedicated hardware for the storage and metadata servers. In fact, they can reside on the same physical server if necessary, Götz writes. This approach also enables users to add as many storage and metadata servers as needed, without requiring any downtime.
Administrators can easily create a new FhGFS instance over a set of nodes, which makes it easy to set up a new test environment, either on physical hardware or in the cloud. A Java-based GUI is provided for management and monitoring tasks. The FhGFS file system itself runs on the Linux kernel, and is commercially supported by Fraunhofer.
FhGFS was officially unveiled in November 2007 at the SC07 conference in Reno, Nevada. Since then, it has been put to use on several systems, including the Top 500 system at Goethe University in Frankfurt, Germany.
Benchmark tests for FhGFS show near-linear scalability (94 percent of maximum) on read/write operations on clusters of up to 20 storage servers. Tests of the metadata server demonstrate the capability to create up to 500,000 files per second. In other words, the creation of 1 billion files would take about half an hour.
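The half-hour figure follows directly from the quoted rate, as the short Python check below shows. The function names are hypothetical, and reading "94 percent of maximum" as parallel efficiency relative to perfectly linear scaling is an assumption made here for illustration, not something the benchmark summary spells out.

```python
def creation_time_minutes(total_files, files_per_second):
    """Minutes needed to create total_files at a steady files_per_second."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return total_files / files_per_second / 60


def effective_speedup(num_servers, efficiency):
    """Speedup over one server, assuming `efficiency` is the fraction of
    perfectly linear scaling that is actually achieved (an assumption)."""
    return num_servers * efficiency


if __name__ == "__main__":
    minutes = creation_time_minutes(1_000_000_000, 500_000)
    print(f"1 billion files: about {minutes:.1f} minutes")
    speedup = effective_speedup(20, 0.94)
    print(f"20 servers at 94% efficiency: about {speedup:.1f}x one server")
```

One billion files at 500,000 files per second is 2,000 seconds, or a little over 33 minutes, which matches the "about half an hour" estimate.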
In head-to-head competition against Lustre and GPFS on 37-mile and 250-mile 100 Gigabit Ethernet test tracks in Dresden, Germany, the group backing FhGFS was one of a few to publish results. In those tests, FhGFS demonstrated throughput of 89.6 percent of theoretical maximum on the 250-mile track in bi-directional mode, and 99.2 percent of maximum in uni-directional mode, according to Götz’s paper.
As the HPC community moves towards exascale computing, the folks behind FhGFS think the new file system can provide part of the solution, especially with regard to power consumption, fault tolerance, and software scalability. “Fraunhofer has experience that can be used to attack the exascale problem from several directions, the parallel file system being one of them,” Götz writes.