Wolfgang Hoschek is a software engineer at Cloudera, working on the Hadoop platform and the Cloudera Search team. He is a committer on the Apache Flume and Apache Lucene/Solr projects. A former CERN fellow and computer scientist at Lawrence Berkeley National Laboratory, Hoschek has over 15 years of experience in large-scale distributed systems, data-intensive computing, and real-time analytics.
He will be talking at the upcoming ISC Big Data conference in Heidelberg, Germany, on “Adding Search as a First-Class Citizen to Hadoop.”
Q1: What is the best indication that big data has left the world of hype behind and is now at center stage, both for business and academia?
In academia, big data has been central to many efforts for a long time. For example, high energy physics, genomics, space agencies, climate research, and the social sciences routinely use big data systems in cost-effective ways, often at even larger scale.
In business, big data projects are no longer mere pilots either. Security, reliability, and high availability have come a long way. Big data production services provide mission-critical functions to leading businesses today, for example in sectors such as internet and technology, financial services, healthcare, energy, industrials, utilities, and telcos. These companies provide novel, differentiated services that set them apart from the competition. Big data analysis also assists electoral campaigns and governments.
Q2: The Hadoop ecosystem is maturing, but also getting more complex. How can users cope, especially those who just want “the right answer?”
The Hadoop ecosystem is evolving and expanding rapidly, much as Linux did. Companies have emerged to fill the gap between the bleeding-edge pieces and the need for an integrated, rock-solid production system. These companies offer turnkey Hadoop distributions and enterprise data hubs that connect with existing legacy systems and fully integrate all the important components of the Hadoop ecosystem in secure, reliable, fault-tolerant, and cost-effective ways. Vendor products take care of installation, configuration, monitoring, troubleshooting, tuning, upgrades, maintenance, and other operational aspects.
Many of these companies also offer training, support, consulting, and professional services. They also employ a large share of the Apache Hadoop open source committer community and correspondingly fund key open source development in response to customer bug reports and feature requests.
Q3: Is choosing the right hardware infrastructure largely a solved issue for big data?
The larger the data, the more the optimization of the hardware configuration matters relative to people costs. The hardware landscape and its growth curves continue to change, for example with the introduction of flash as another storage tier via NVRAM, cheaper RAM, manycore CPUs, GPUs with higher bandwidth, spine switches with high port counts, and so on. Currently, a typical commodity Hadoop node consists of 8 to 12 SATA drives (in the terabyte range), a dual-socket server with 6-8 cores per socket, 128-256 GB of memory, and a 10Gb Ethernet link.
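For illustration, here is roughly how such a node's hardware might be surfaced to HDFS and YARN in a Hadoop 2 deployment. This is a minimal sketch: the mount paths and resource values are assumptions chosen to match the node described above, not tuning recommendations.

```xml
<!-- hdfs-site.xml: one HDFS data directory per physical SATA drive
     (hypothetical mount points; a real node would list all 8-12) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>

<!-- yarn-site.xml: advertise memory and cores to the YARN scheduler,
     leaving headroom for the OS and Hadoop daemons on an assumed
     2 x 8-core, 256 GB node (illustrative values only) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>229376</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
```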
Q4: At the ISC Big Data conference in October, you will talk about adding sophisticated search capabilities to the Hadoop framework. We will also have presentations at the conference on the evolution of analytics more broadly. Will the Hadoop ecosystem continue to evolve rapidly in the future?
The Hadoop ecosystem will continue to evolve rapidly in response to demand, innovation, and lessons learned. This evolution is fueled by the observed value of the software and by the growing adoption and large investments of a wide range of companies and individuals. All of this evolution happens in shared open source projects through worldwide, cross-cultural collaboration.
Q5: Do you see Hadoop as the unifying framework for big data applications? In other words, do you think it can encapsulate all the functionality needed to become a de facto standard for the whole domain, or will there still be a place for alternative frameworks?
Today, Hadoop is the de facto unifying framework for big data applications. Somewhat astoundingly, no serious competitor has emerged, and the software industry has instead rallied behind Hadoop. Some pieces of functionality are still missing or quite limited in Hadoop, but rapid progress is being made in many areas to address these gaps. For example, YARN now enables resource management for a wide range of applications on top of the platform, not just MapReduce. This way, new data processing frameworks such as Spark can leverage the Hadoop ecosystem and participate in it.
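To make that concrete, the sketch below shows a minimal Spark word count that reads and writes the same HDFS data the rest of the platform uses; when submitted to a YARN cluster (e.g. with spark-submit --class WordCount --master yarn-cluster wordcount.jar), YARN allocates its containers just as it would for a MapReduce job. The HDFS paths and application name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // On YARN, the master is supplied by spark-submit rather than
    // hard-coded here; only the application name is set.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // Read from HDFS, so Spark operates on the same underlying
    // data platform as MapReduce and the other Hadoop frameworks.
    val counts = sc.textFile("hdfs:///user/example/input")  // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///user/example/output")    // hypothetical path
    sc.stop()
  }
}
```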
Hadoop is fundamentally open, and if there is a large class of repeatable, high-value big data applications that isn't served well, an open source framework will likely evolve to address those needs and be integrated with the Hadoop platform. Facilitating a variety of processing applications on the same underlying data platform provides great synergies.