Apache Spark represents a revolutionary new approach that shatters the previously daunting barriers to designing, developing, and distributing solutions capable of processing the colossal volumes of Big Data that enterprises are accumulating each day. [Spark for Dummies, 2019]
In the first installment of this series on Apache Spark (Spark), we met the hero of our story and learned how Spark, Big Data, and high-performance computing (HPC) all relate. Essentially, thanks to sources such as the Internet of Things (IoT) and artificial intelligence (AI), and to emerging technologies such as blockchain, enterprises are now being flooded by volumes of data that previously were seen only in HPC environments.
Pioneering enterprises such as Google looked to massively parallel processing models to cope with their Big Data challenges. From these efforts arose MapReduce and, soon after, the open-source software framework called Hadoop. Thanks to its early successes and advantages over prior methods, for years MapReduce was the de facto approach for distributed Big Data batch processing. However, batch processing, while very useful, isn’t the answer for every computational situation. Driven by continuing, widespread adoption of Big Data and user hunger for additional use cases, researchers began looking at ways to improve on MapReduce’s capabilities. Spark was one of the most interesting outcomes of this research.
From the beginning, Spark was designed to be a general-purpose computational engine for interactive, batch, and streaming tasks that was capable of leveraging the same types of distributed processing resources that had powered MapReduce initiatives. The designers of Spark created it with the expectation that it would work with petabytes of data that were distributed across clusters of thousands of servers – both physical and virtual. From the beginning, Spark was meant to exploit the potential – particularly speed and scalability – offered by in-memory information processing. These time savings really add up when compared with how MapReduce jobs tend to continually write intermediate data back to disk throughout an often lengthy processing cycle. A rule of thumb is that Spark delivers results on the same data set up to 100 times faster than MapReduce.
Another complaint about MapReduce was the somewhat awkward way applications were developed, which led to steep learning curves. MapReduce solutions can be a bit disjointed, requiring software developers to cobble together multiple components from the Hadoop open-source ecosystem. Spark, on the other hand, can inherently blend multiple software development tactics when processing data, abstracting away much of the complexity associated with MapReduce.
This flexibility means that Spark can coordinate interactive, streaming, and batch workflows at the same time. It’s also able to abstract other computational engines, which lets developers benefit from others’ work without having to decipher deep technical details. The resulting data pipelines can be very rich, complex, fast – and reusable too.
In fact, Spark offers a number of advantages for developing Big Data solutions in comparison with earlier approaches based on MapReduce, including much better performance, greater simplicity, easier administration, and faster application development.
When you combine Spark’s architecture, programming model, and development tools with its inherent speed, the outcome is a new breed of high-performance, more intuitive applications. Its speed, simplicity, and developer productivity make enormous numbers of previously lengthy and expensive projects much more practical than with earlier Big Data tooling.
Spark’s advantages shine especially bright when we look at several common Big Data use cases:
- Streaming data. Many enterprises are now guzzling a torrent of streaming data from sources such as sensors, financial trades, mobile devices, medical monitors, and social media updates. Using financial services as an example, a Spark application can analyze raw incoming credit card authorizations and completed transactions in essentially real time, rather than wading through this information after the fact. This is a game-changer for use cases such as security and fraud prevention (instead of mere detection): financial firms can make fact-based decisions to reject bogus credit card transactions at the point of sale, rather than permitting fraudsters to acquire merchandise and thus trigger losses for banks and retailers.
- Artificial Intelligence. In AI-based applications, algorithms calculate enormous numbers of potential outcomes, make decisions, and then examine the results to learn whether what transpired matched what was predicted. Enterprises are increasingly using AI to develop powerful new analytic capabilities across many use cases, from computer vision, through natural language processing, to sophisticated anomaly detection. In-memory processing and highly efficient data pipelines make Spark an ideal analytics engine for AI projects.
[Learn more: IBM AI Infrastructure Reference Architecture]
- Business intelligence. To be more effective at their jobs, business analysts and end users alike continually seek more up-to-date, accurate visibility into the full spectrum of an organization’s data. Rather than being handed a canned report, users ideally should be able to run these queries on demand, with the flexibility to change inquiries, drill down into the data, and so on. Spark provides the performance necessary to allow on-the-fly analysis – not only by algorithms but by people, too. In a financial services investment scenario, this combination of computerized and human analysis of risk factors could result in faster, yet more accurate trading decisions.
Spark is good, but it’s far from perfect. As with most open-source-based technologies, Spark has some obstacles that must be overcome. For example, it’s easy for a Spark environment to devolve into a collection of siloed servers and data stores. If not carefully managed, enterprise assets can be poorly utilized, which wastes time and money and places undue burdens on administrators and operational staff. Also, open-source Spark excels at interacting with data hosted in the Hadoop environment, but easily getting to other sources can be problematic. Finally, an open-source Big Data environment has enormous numbers of moving parts – even without Spark in the mix. Spark has numerous components that require administration, and each of these is continually being revised. Keeping pace with the torrent of open-source Spark updates – while not violating SLAs or introducing security risks – can tax the capabilities of any IT team.
IBM has made — and continues to make — enormous contributions and investments in open-source Spark. To complement these efforts, IBM also created IBM Spectrum Conductor, which accelerates Spark workloads and provides an all-inclusive, turnkey commercial distribution that delivers Spark’s advantages while mitigating many of its challenges.
These days, many Spark-based applications are considered mission-critical, but such solutions are composed of numerous open-source components, and downloading various elements and then assembling a stable, scalable, manageable environment isn’t straightforward.
An integrated solution from a vendor provides a single point of contact to help get the Big Data infrastructure up and running — and keep it running if you have problems. These integrated commercial solutions are becoming a popular way for enterprises to reap the benefits of Spark, while minimizing operational overhead.
In our next article, we’ll take a much deeper dive into how Spark, Hadoop, and MapReduce work together to see why so many enterprises are introducing Spark into Hadoop environments.