The rapid maturation of the Apache Hadoop ecosystem has caught the eyes of HPC professionals who are eager to take advantage of emerging big data tools, such as Spark. One HPC group presenting on the topic at the SC15 show this week in Austin, Texas, is Rutgers University’s RADICAL team.
The Research in Advanced Distributed Cyberinfrastructure and Applications Laboratory (RADICAL) duo of professors Shantenu Jha (Rutgers) and Andre Luckow (Clemson University) are giving a three-and-a-half-hour tutorial Sunday morning demonstrating how the power of Hadoop-resident frameworks, such as MapReduce and Spark, as well as the group’s own RADICAL-Cybertools suite, can further the analytical goals of the HPC professional.
In his introduction to the course, Professor Luckow discusses how the HPC world can learn and benefit from the tools and analytic approaches that have been championed in the Hadoop world. “High performance computing (HPC) environments have traditionally been designed to meet the compute demands of scientific applications; data has only been a second order concern,” Professor Luckow says in the introduction, which can be viewed on YouTube.
“However, with science moving toward data-driven discoveries relying on correlations and patterns in data to form scientific hypotheses,” Professor Luckow continues, “the limitations of HPC approaches become apparent. Low-level abstractions and architectural paradigms, such as the separation of storage and compute, are not optimal for data-intensive applications.”
While there are powerful kernels and libraries available for traditional HPC, the lack of “functional completeness” of analytical libraries is holding them back, the professor says. “In contrast, the Apache Hadoop ecosystem has grown to be rich with analytical libraries, e.g. Spark MLlib,” he says. “Bringing the richness of the Hadoop ecosystem to traditional HPC environments will help address some gaps.”
The RADICAL team’s tutorial at SC15 is aimed at giving attendees some hands-on experience with tools that can help close that gap. The class will be broken into three parts. The first will cover the conceptual bases for understanding and characterizing the different workloads of interest.
In the second, the team will introduce the Pilot-Abstraction, the infrastructure abstraction layer of the RADICAL-Cybertools stack. Last, the attendees will learn how to combine Apache Hadoop and Spark tools with the Pilot-Abstraction to implement algorithms for advanced data-intensive analysis, such as K-means clustering and logistic regression.
The audience will learn how to efficiently use Spark and Hadoop on HPC to carry out advanced analytics and will understand deployment performance tradeoffs for these tools, Professor Luckow says.