A 256-node Hadoop system at the University of Texas at Austin is breaking down the barriers that have traditionally kept high performance computing relegated to technical experts. Nearly 70 students and researchers at the Texas Advanced Computing Center (TACC) have used the cluster to crunch big datasets and provide potential answers to questions in the fields of biomedicine, linguistics, and astronomy.
There’s been a lot of hype over Apache Hadoop in the last few years, and with good reason. With the emergence of big data, technologies like Hadoop promise to make it easier to sort through huge datasets and tease out the patterns, without burdening users with low-level plumbing such as I/O, memory management, and job queuing.
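To make that division of labor concrete, here is a minimal sketch of the MapReduce model that Hadoop popularized. The `mapper`, `reducer`, and `run_local` names are illustrative, not part of any TACC code; a real Hadoop job would run the same two user-written functions across many nodes, with the framework handling the splitting, scheduling, and shuffling that `run_local` fakes in memory here.

```python
from collections import defaultdict

def mapper(line):
    """The 'map' step a user writes: emit (word, 1) for each word."""
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    """The 'reduce' step a user writes: sum the counts for one word."""
    return word, sum(counts)

def run_local(lines):
    """Simulate the framework's shuffle/sort phase in memory.

    On a real cluster, Hadoop splits the input, schedules map and
    reduce tasks across nodes, and moves the intermediate data;
    the user supplies only mapper and reducer.
    """
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(w, c) for w, c in sorted(grouped.items()))

if __name__ == "__main__":
    data = ["big data big clusters", "big answers"]
    # → {'answers': 1, 'big': 3, 'clusters': 1, 'data': 1}
    print(run_local(data))
```

The appeal for researchers is that everything inside `run_local` is the part Hadoop takes off their hands: they describe *what* to compute per record and per key, and the framework decides *where and when* it runs.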
What’s notable about TACC’s Hadoop cluster is that it represents the first Hadoop implementation running on a supercomputer at a U.S. high performance computing center. Until the folks at TACC loaded Hadoop on their 256-node Dell cluster (dubbed Longhorn) in the fall of 2010, you couldn’t find Hadoop running on an academic supercomputer, according to Aaron Dubrow, a science and technology writer at TACC.
In the 3.5 years that the TACC cluster has been online, it’s seen more than one million hours of data-intensive computations across 19 different projects, and has been the basis for dozens of papers and presentations on topics ranging from flow cytometry (FCM) to natural language processing.
Longhorn helped accelerate the identification of cell types using FCM, a cell-analysis technique widely used by medical researchers. Thanks to the cluster’s ability to automatically create and schedule parallel tasks based on the user’s job specification, FCM processing got an immediate speed boost without the open-source software having to be rewritten to handle big datasets.
The cluster was also used by linguistic researchers to show how language is connected across time and space. A UT linguistics professor applied the TextGrounder algorithm to a collection of British and American books from a century ago. The results were then fed into a geobrowser to display where words have their roots.
Others are using the 96-TB Hadoop cluster to help sort the wheat from the chaff on the Internet as it relates to one topic in particular: autism. UT researchers are using visualization techniques to help the parents of autistic children find information and support on the Web more quickly.
TACC is also working with Intel to find out how Hadoop clusters can be tuned to run scientific workloads faster, particularly through speedier interconnects. The two groups recently published their findings in a white paper.