Vendors in the HPC universe are jumping on the Hadoop bandwagon. This week SGI announced that it was marrying Cloudera’s CDH (Cloudera’s Distribution including Apache Hadoop) software with its own cluster machines. This is not too surprising, considering Hadoop’s role as the leading open source framework for data-intensive analytics on distributed platforms, and Cloudera’s position as a top Hadoop distributor and supporter.
According to the press release, the SGI-Cloudera partnership will “enable the two companies to jointly build, sell and deploy integrated, high performance Apache Hadoop-based commercial solutions.” But as pointed out by Derrick Harris over at GigaOM, this is not necessarily an HPC play in the conventional sense. Even though Hadoop can be used for technical workloads like genomics and seismology, its more typical application is for search engines, social media analytics, and advertising optimization.
According to Harris, the Cloudera integration with SGI gear appears to be targeted more toward the latter. On SGI’s website, the pre-configured Hadoop clusters come in two flavors: Rackable Servers and CloudRack Servers. Both are from the non-HPC side of the house. That doesn’t mean such systems won’t be running technical computing workloads, however, given the somewhat different nature of these data-intensive applications (i.e., you don’t necessarily need top-bin CPUs, or even InfiniBand, for I/O-bound Hadoop apps).
Harris also points out that Microsoft recently announced its Hadoop integration with Windows Server and Azure. This is an even more nuanced move, considering that Microsoft already has a Hadoop alternative for HPC called LINQ to HPC (formerly Dryad). The latter is also packaged with HPC Server 2008 R2, and eventually will be supported in Azure as well.
The implication is that Microsoft will position its LINQ technology for HPC-type applications, and its standard Hadoop integration for non-HPC use cases. There are other Hadoop alternatives designed specifically for performance-obsessed users. In this category are platforms like LexisNexis’ Data Analytics Supercomputer (DAS) offering, as well as non-standard flavors of Hadoop that are being tweaked for performance.
Unfortunately, this is the ultimate endorsement of a successful technology: copycats and derivatives. If successful, though, at least some of these performance-minded frameworks for data-intensive analytics could find a happy home in HPC.