MapReduce is well known for its relative ease of use in today’s ubiquitous world of parallelism. The beauty of the model is in its ability to absolve or abstract away details of parallelism, fault tolerance, synchronization, and input management from the user. A user typically writes her algorithm using two uniform functions — a map and a reduce — with a single computing node in mind. The MapReduce framework then takes these functions and automatically parallelizes them on commodity clusters. This no fuss, no hassle way of programming provides high programmer productivity that scales well across many servers.
To date, the most popular MapReduce implementation is Apache Hadoop which leverages the Hadoop Distributed File System (HDFS). However, current HPC environments such as those employed by the National Energy Research Scientific Computing Center (NERSC) uses a shared-disk filesystem. As Madhusudhan Govindaraju, an associate professor in the Department of Computer Science at SUNY Binghamton notes, “HDFS is not preferred within many HPC environments that already rely on POSIX compliant high performance file systems.”
It’s still possible to integrate HDFS with HPC environments that use shared-disk file systems. However, shoehorning HDFS into such HPC environments requires layers of indirection that adversely affects performance. The software stack is modified to work under the constraints of Apache Hadoop. Govindaraju emphasizes that “the partitioning of the cluster for specific software stacks results in inefficient utilization of resources and MapReduce users find themselves not able to make full use of the HPC infrastructure.”
To address the above problem of supporting HPC environments for MapReduce, Govindaraju and his team have developed a new MapReduce framework that is suited for popular HPC environments such as those provided by NERSC. Their framework, called MARIANE (MApReduce Implementation Adapted for HPC Environments), is designed with shared-disk filesystems in mind maintaining high-performance without the performance tax associated with Apache Hadoop
MARIANE follows the same basic tenets for any MapReduce-like implementation: (1) fault tolerance, (2) high-throughput, and (3) data management. Fault tolerance focuses on automatically detecting and recovering from serviceable errors and is the most critical goal for such frameworks. Next, MapReduce jobs require high-throughput for very large datasets on the order of 100s of petabytes of data.Finally, data management in such frameworks subdivides data into smaller chunks and schedule them appropriately to the respective hardware.
With MARIANE, the applicability of the MapReduce paradigm is extended to a wider array of HPC environments without compromising system performance. Govindaraju and his researchers demonstrate a significant increase in performance and decrease in application overhead compared to Apache Hadoop under typical HPC environments.
We recently talked about this in the San Diego Supercomputer Center context in How HPC is Hacking Hadoop, a solid read for those looking for a more personal take on how this trend is shaping up.