High Performance Computing has always been about Big Data. It’s not uncommon for research datasets to contain millions of files and many terabytes, or even petabytes, of data. But data volumes in this range weren’t common in the enterprise until the Big Data revolution of the last 10 years. Now tools designed to manage and analyze enterprise Big Data are being adopted by HPC users as they introduce modern analytics frameworks into the HPC workflow.
One open-source initiative that has transformed enterprise Big Data analytics is Apache Spark (Spark). It represents a new approach that shatters the previously daunting barriers to designing, developing, and distributing solutions capable of processing the colossal volumes of Big Data that enterprises are accumulating each day.
To thoroughly comprehend Spark and its full potential, it’s beneficial to view it in the context of earlier generations of enterprise Big Data technologies. This is the first in a series of articles that will discuss the factors leading to the development of Spark, its capabilities, and the techniques and technologies organizations are using today to reap the benefits Spark offers.
Enterprise Data Growth
Commercial enterprise datasets have been growing rapidly, and the most notable data growth drivers include –
- New data sources
- Larger information quantities
- Broader data categories
- Commoditized hardware and software
- The rapid adoption and growth of artificial intelligence (AI)
Of course, the advent of machines exchanging information with machines – what we call the Internet of Things (IoT) – tops the list of new enterprise data sources. The number of connected devices already runs into the tens of billions.
AI will become another major driver of data growth at the enterprise level. Machine learning and deep learning applications need huge quantities of training data to become accurate and effective, and they depend heavily on high-quality, diverse, and dynamic data inputs. Spark for Dummies (Wiley) estimates that emerging AI solutions will consume 8-10 times the data used in current Big Data solutions.
The data volumes flooding enterprises and other organizations are approaching HPC levels – and are predicted to keep rising. But that isn’t the big news. The truly important story lies in how enterprises are finding ways to derive more and more value from their expanding data assets with modern data processing technologies such as MapReduce, Hadoop, and eventually Spark.
The explosive growth of Big Data placed IT organizations in every industry under great stress. The old procedures for handling all this information no longer scaled, and organizations needed a new approach. Distributed processing had proven to be an excellent way of coping with massive amounts of input data. Commodity hardware and software made it cost-effective to employ hundreds or thousands of servers working in parallel to answer a question. From these ingredients, MapReduce, and the open-source software framework known as Hadoop, were born.
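The map-shuffle-reduce pattern at the heart of MapReduce can be sketched in a few lines of plain Python. This is a toy, single-process illustration with hypothetical helper names; a real Hadoop job distributes each phase across many machines and moves the intermediate data over the network:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big analytics", "big data at scale"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1, 'at': 1, 'scale': 1}
```

Because the map and reduce functions are independent per document and per key, the framework can run thousands of them in parallel on commodity servers.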
Much can be written about the benefits and accomplishments of the MapReduce/Hadoop combination. It was a great start, and it continues to be heavily utilized. But for modern enterprises and HPC installations alike, it isn’t the end of the story for optimal Big Data processing. Techniques and tools have continued to evolve, driven in part by some notable solution development and operational headaches such as –
- Batch processing. From the beginning, Hadoop running MapReduce has been most commonly deployed in batch processing environments. But once users get a taste of what’s possible with Big Data, they only want more — and they want answers right away.
- Disk dependency. Hadoop and MapReduce rely heavily on time-consuming read/write interactions with disk drives. As both data volumes and user expectations grow, these slowdowns become less and less acceptable.
- Solution development complexity. While Hadoop and MapReduce application construction methodologies are vastly simpler than earlier hand-cobbled application development techniques, they remain well out of reach for the vast majority of software developers: the learning curve is still far too steep.
- Silo proliferation. With users enticed by the possibilities of rapid access to reams of valuable data – and in the absence of proper oversight – matters can speedily devolve into an expensive collection of isolated information silos. Naturally, each data island will have its own dedicated, underutilized hardware and will be inaccessible from the rest of the organization. Along with being an administrative and security nightmare, this proliferation results in higher expenses and diminished ROI.
- Multiple version support. Open-source software, especially critical infrastructure such as Hadoop, is characterized by endless updates. Many enterprises find it necessary to run multiple versions of their open-source technologies at the same time, with the upshot being an ever-expanding collection of isolated data silos.
- Multitenancy support. When an organization makes the not-insubstantial expenditures for Big Data hardware, software, and expertise, it expects to derive maximum return from its investment. This means making data and related solutions available to as many users as possible. Sharing, however, opens up all sorts of potential security risks, especially when these users get a taste of what’s possible and encourage even more people to participate. And once again, it’s very common for this requirement to result in individual Hadoop implementations for each project, and ensuing information silos.
- Born-in-the-cloud technologies. We’re reaching a point where the vast majority of enterprise software is designed, developed, and deployed in the cloud. This has major implications for your Big Data strategy.
- Multicloud maturation. In the early days of Hadoop and MapReduce, enterprises were just beginning to fully grasp the potential of cloud computing. But now, most sophisticated organizations have a cloud computing strategy that incorporates a blend of public, private, and hybrid instances.
A New Approach
Challenges such as these involve many of the most important issues facing IT infrastructure teams today. To solve them, a new approach is required.
This is the point in the story where we meet our star.
Spark is a memory-optimized data processing engine that offers much better performance than the MapReduce/Hadoop solution set. Spark can also run outside Hadoop – on other cluster environments or on its own standalone clusters – and can process data from stores such as Cassandra. By minimizing disk reads and writes, Spark delivers very high-performance access to dynamic, complex data collections and provides interactive functionality, such as ad hoc query support, in a scalable way.
Spark for Dummies provides this introduction to Spark.
“The designers of Spark created it with the expectation that it would work with petabytes of data that were distributed across a cluster of thousands of servers – both physical and virtual. From the beginning, Spark was meant to exploit the potential – particularly speed and scalability – offered by in-memory information processing. These time savings really add up when compared with how MapReduce jobs tend to continually write intermediate data back to disk throughout an often lengthy processing cycle. A rule of thumb metric is that Spark delivers results on the same dataset up to 100 times faster than MapReduce….”
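The performance gap described in the quote comes down to where intermediate data lives between processing steps. A toy, single-process sketch of the two styles in plain Python (hypothetical function and variable names; real MapReduce and Spark jobs distribute this work across a cluster):

```python
import json
import os
import tempfile

records = [{"user": "a", "amount": 10}, {"user": "b", "amount": 25},
           {"user": "a", "amount": 5}]

# MapReduce style: each stage writes its intermediate output to disk,
# and the next stage must read it back in before doing any work.
def disk_pipeline(data):
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump([r for r in data if r["amount"] > 6], f)  # stage 1 -> disk
    with open(path) as f:
        staged = json.load(f)                               # disk -> stage 2
    os.remove(path)
    return sum(r["amount"] for r in staged)

# Spark style: load and filter the dataset once, keep ("cache") the result
# in memory, then run any number of ad hoc queries against it with no
# further disk round-trips.
cached = [r for r in records if r["amount"] > 6]  # computed once, kept in RAM
total = sum(r["amount"] for r in cached)
by_user_a = sum(r["amount"] for r in cached if r["user"] == "a")

print(disk_pipeline(records), total, by_user_a)  # 35 35 10
```

Both approaches produce the same answers; the difference is that the in-memory style pays the load-and-filter cost once, while the disk-based style repeats a serialize/deserialize round-trip at every stage of every job.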
Deploying Spark in the Enterprise
Because they now underpin numerous production applications, Big Data technologies, including Apache Spark, warrant being treated as mission-critical resources by the enterprise. Deploying them this way requires significant time and expertise, because most Big Data infrastructure offerings are composed of numerous open-source components, and downloading the various elements and then assembling a stable, scalable, manageable environment isn’t straightforward.
IBM has made — and continues to make — enormous contributions and investments in open-source Spark. To complement these efforts, IBM also created an all-inclusive, turnkey commercial distribution that delivers all of Spark’s advantages while making adoption easier for enterprises basing new applications on it. IBM Spectrum Conductor is a secure and scalable platform for orchestrating multi-tenant enterprise Big Data applications, such as Spark, on a shared computing grid, both on-premises and in the cloud.
For Spark and other workloads, IBM Spectrum Conductor provides a better infrastructure for scalable enterprise analytics because it –
- Supports diverse applications and frameworks
- Allows users to share resources dynamically while maintaining a secure environment with runtime isolation
- Delivers superior performance and scalability
- Allows compute resources to be scaled independently of storage
- Simplifies environments, helping reduce administrative costs
Spark represents the next generation in Big Data infrastructure, and it’s already supplying an unprecedented blend of power and ease of use to those organizations that have eagerly adopted it. Rising from the need for faster analytics and for solutions to MapReduce/Hadoop challenges, Spark has become one of the most widely used technologies in Big Data today. In the next article, we will dive deeper into a comparison of MapReduce/Hadoop and Spark capabilities across a number of use cases.