The Spark for Smarter Business

By Jeff Karmiol, Offering Manager, IBM Spectrum Computing

April 30, 2019

Analysts predict that artificial intelligence (AI) may add about 16%, or around $13 trillion, to current global economic output by 2030, an average annual productivity growth rate of about 1.2%. By comparison, the introduction of steam engines during the 1800s boosted labor productivity by an estimated 0.3% a year, and the spread of IT during the 2000s accelerated productivity by 0.6%.

It’s clear that AI will be a big deal.

But AI does not exist in isolation; it is a particular type of Big Data analytics that requires high-performance, highly functional underlying IT infrastructure. Organizations such as the Center for Open-Source Data and AI Technologies (CODAIT), formerly the Spark Technology Center, are demonstrating that open-source Apache Spark and its commercial implementations offer powerful foundations upon which to build effective AI-enhanced Big Data analytics solutions.

[For in-depth information about Spark see Spark for Dummies]

When it comes to Spark – the memory-based high-performance complement to more traditional MapReduce/Hadoop Big Data analytics frameworks – we don’t need to wait for decades to see its impacts across many different industry sectors. Spark-based solutions are providing real value to enterprises around the globe right now. And some of these Spark use cases are literally out of this world.

The Search for Extraterrestrial Intelligence (SETI) Institute is a nonprofit research organization dedicated to exploring, understanding, and explaining the origin and nature of life in the universe. Among its tools is the Allen Telescope Array (ATA), an array of 42 radio telescope dishes that can output 25 gigabytes of data per second. But because of limitations in SETI’s existing on-premises distributed computing, the organization could not analyze the majority of the incoming ATA data; it needed a more powerful solution to handle that volume effectively.

Working with IBM, SETI recently implemented Spark. Spark has enabled the organization to introduce multiple new ways to look at the constantly growing archive of data from the ATA and other sources. In one instance, using the lowest number of computational processors offered by Spark, the SETI Institute was able to process 200 million records in nine hours, finding six bodies of interest that had not been, and would not otherwise have been, discovered. Bill Diamond, CEO of the SETI Institute, stated: “Spark’s capabilities give us the opportunity to look at complex and massive data sets with a powerful new set of tools. We are now able to look for and extract structure and patterns in the data that were previously invisible to us.”

[Learn how to accelerate and simplify Spark at scale with IBM Spectrum Conductor]

There are many good reasons why Spark has become popular:

  • Real-time applications: By leveraging in-memory processing efficiencies, Spark has the potential to return results much more quickly than MapReduce. This is important in all applications, but especially for solutions that will provide guidance in real-time.
  • Heterogeneous data: The most compelling Big Data solutions blend information that’s been stored in diverse silos that are built on other architectures. If you’re developing a solution that crosses data boundaries, Spark will be a good choice.
  • Less experienced software developers: The learning curve for creating Spark solutions is less intimidating than for MapReduce. In essence, Spark offers a simpler application development methodology than MapReduce. The upshot of this ease-of-use is faster delivery of more reliable code.
  • Streaming data: Sensors, mobile devices, and trading systems — to name just a few sources — are generating colossal volumes of streaming data. Spark is particularly adept at coping with these torrents and making it easy for developers to create innovative solutions.
  • AI, machine learning, and deep learning: Machine learning and deep learning were previously restricted to the most highly trained data scientists working with massive investments. Spark brings these capabilities within reach of a much larger audience.
  • Smaller codebase: Spark’s extensive language and API support directly translates to less code to write, which means that there’s less code to maintain.

With AI, machine learning, and deep learning beginning to take hold as standard business tools, Spark is fast becoming the preferred foundation on which to base new business application solutions.

[Learn more: IBM AI Infrastructure Reference Architecture]

The Financial Services industry offers plenty of examples of powerful ways to implement Spark-based analytics. This is a highly dynamic sector that sees a continual flow of new products and services designed to create additional revenue streams. During the course of a given business day, a bank or brokerage can be inundated with streaming data consisting of quotes, trades, settlements, etc. It’s likely that fraudulent activities will be hidden in this torrent of raw data.

Spark’s in-memory processing power and aptitude for working with streaming data make it feasible to inspect activities in real-time, compare them with historical models and other proprietary fraud indicators, and then flag suspicious happenings. Based on what’s discovered, financial firms can reject, delay, or require manual inspection of questionable transactions.
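The flagging step described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark code, and everything in it is an assumption made for the example: the `Transaction` type, the `decide` function, the z-score rule standing in for a "historical model", and the two thresholds. In a Spark application, logic of this kind would be applied to each record of a stream.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Transaction:
    account: str
    amount: float


def decide(txn, history, review_z=3.0, reject_z=6.0):
    """Return 'accept', 'review', or 'reject' for one incoming transaction.

    `history` is a list of past amounts for the same account. It stands in
    for the historical model; the z-score thresholds are illustrative only.
    """
    if len(history) < 2:
        return "review"  # too little history to judge automatically
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return "accept" if txn.amount == mu else "review"
    z = abs(txn.amount - mu) / sigma  # distance from typical behavior
    if z >= reject_z:
        return "reject"
    if z >= review_z:
        return "review"
    return "accept"


# Hypothetical account history and incoming stream.
history = {"acct-1": [20.0, 25.0, 22.0, 18.0, 24.0]}
stream = [
    Transaction("acct-1", 23.0),
    Transaction("acct-1", 35.0),
    Transaction("acct-1", 900.0),
]
decisions = [decide(t, history.get(t.account, [])) for t in stream]
print(decisions)
```

The three outcomes mirror the options named above: let the transaction through, hold it for manual inspection, or reject it outright.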

ATMs are a principal point of customer contact for retail financial services firms, and every large organization’s network is made up of thousands of individual machines. Service outages, errors, or other interruptions reflect very badly on the institution, but it can be difficult to identify which ATMs are about to experience a failure.

The tandem of Hadoop and Spark can deliver a solution that helps prevent downtime. Hadoop and MapReduce can be tasked with analyzing reams of historical maintenance data in batch mode, while Spark, using its MLlib machine learning library, can pair the results of this analysis with up-to-the-minute customer complaints and ATM diagnostic messages in real time. By using the results of the linkage between these two previously isolated systems, maintenance personnel can be guided to service machines that are on the verge of a breakdown.
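As a rough illustration of how batch results and live signals might be combined, the sketch below uses plain Python with no Hadoop, Spark, or MLlib involved. The baseline risk values, the event weights, the threshold, and the `service_queue` function are all hypothetical, and a simple additive score is swapped in for a trained machine learning model.

```python
# Hypothetical output of the batch job: baseline failure risk per ATM (0 to 1).
batch_risk = {"atm-001": 0.10, "atm-002": 0.55, "atm-003": 0.30}

# Hypothetical real-time events as (atm_id, kind) pairs.
live_events = [
    ("atm-002", "diagnostic_error"),
    ("atm-003", "customer_complaint"),
    ("atm-003", "diagnostic_error"),
    ("atm-001", "customer_complaint"),
]

# Illustrative weights only; a real system would learn these from data.
EVENT_WEIGHT = {"diagnostic_error": 0.25, "customer_complaint": 0.125}


def service_queue(batch_risk, live_events, threshold=0.6):
    """Raise each ATM's baseline risk by its live events, then return the
    machines at or above the threshold, highest combined score first."""
    scores = dict(batch_risk)
    for atm_id, kind in live_events:
        bumped = scores.get(atm_id, 0.0) + EVENT_WEIGHT.get(kind, 0.0)
        scores[atm_id] = min(1.0, bumped)
    flagged = [(atm, score) for atm, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


queue = service_queue(batch_risk, live_events)
print([atm for atm, _ in queue])
```

With these sample values, the two machines that already carried elevated batch risk and then reported fresh problems move to the front of the maintenance queue.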

Healthcare is another industry that offers plenty of examples of how Spark, along with management frameworks such as IBM Spectrum Conductor, can reduce complexity and speed time to results.

With the rise of smartphones, many innovative in-home medical devices are capturing metrics such as blood pressure, blood sugar, epileptic incidents, and numerous other potentially serious events. Although they serve diverse purposes, one thing these devices have in common is that they can each generate hundreds of events per hour. The vast majority of them are unremarkable, but certain ones are potentially deadly.

A MapReduce solution would be a great start toward churning through the vast logs created by these sensors. It could report on potentially serious happenings and alert appropriate personnel. However, MapReduce’s batch processing nature would make this system suboptimal. A better strategy would be to use Spark to monitor the same events, but in real-time as they arrive directly from the devices. If a serious situation arises, emergency personnel can be instantaneously alerted.
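The difference from a batch report is that each reading is checked the moment it arrives. The sketch below shows that per-event pattern in plain Python, not Spark. The `Reading` type, the metric names, and the alert limits are assumptions made for illustration and are not clinical guidance.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    patient: str
    metric: str
    value: float


# Hypothetical (low, high) alert limits per metric; not clinical guidance.
LIMITS = {
    "systolic_bp": (90.0, 180.0),
    "blood_glucose": (70.0, 300.0),
}


def alerts(readings: Iterable[Reading]) -> Iterator[str]:
    """Yield an alert as soon as a reading falls outside its limits,
    without waiting for the rest of the stream."""
    for r in readings:
        limits = LIMITS.get(r.metric)
        if limits is None:
            continue  # unknown metric: nothing to compare against
        low, high = limits
        if r.value < low or r.value > high:
            yield f"ALERT {r.patient}: {r.metric}={r.value}"


incoming = [
    Reading("p1", "systolic_bp", 120.0),
    Reading("p2", "blood_glucose", 55.0),
    Reading("p1", "systolic_bp", 195.0),
]
raised = list(alerts(incoming))
print(raised)
```

Because `alerts` is a generator, it works the same way on an open-ended source of readings as on the short list used here.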

Partially driven by human error and insufficient oversight, drug interactions sicken and even kill untold numbers of patients each year. Applications that can monitor a patient’s specific medication profile and warn of potential cross-drug issues would go a long way toward reducing these unnecessary events. The best place for this type of solution is the pharmacy, before the patient even starts treatment, rather than after a dangerous interaction has occurred. A Spark-based application could evaluate new prescriptions and associate them with the medications already being taken by the patient. This computationally intensive solution would then compare the results against a detailed database of side effects and interactions, all in real time.
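The core lookup can be sketched as a pairwise check of each new prescription against the patient's current medications. The example below is plain Python with placeholder drug names and severity labels; the `INTERACTIONS` table and the `check_prescription` function are hypothetical and stand in for the detailed interaction database a real system would query.

```python
from itertools import product

# Hypothetical interaction table keyed by unordered drug pairs.
# Names and severity labels are placeholders, not medical data.
INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "severe",
    frozenset({"drug_c", "drug_d"}): "moderate",
}


def check_prescription(new_drugs, current_drugs):
    """Return (new_drug, current_drug, severity) for every known
    interaction between a new prescription and a current medication."""
    warnings = []
    for new, current in product(new_drugs, current_drugs):
        severity = INTERACTIONS.get(frozenset({new, current}))
        if severity is not None:
            warnings.append((new, current, severity))
    return warnings


found = check_prescription(["drug_b"], ["drug_a", "drug_c"])
print(found)
```

Keying the table by `frozenset` makes the lookup independent of the order in which the two drugs are named.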

The list of ways that Spark-accelerated analytics can be leveraged to make business faster and smarter is very long – and growing rapidly. The advent of AI-enhanced applications will only increase the value of IT infrastructure components, such as Spark, that enable more efficient and effective processing of the new types of data streams pouring into 21st century enterprises. Plus, Spark works very well with established Big Data analytics solutions such as MapReduce/Hadoop.

When it comes to helping modern business be faster, smarter, safer, more agile, and more profitable, the real-time analytics capabilities of Spark are hard to beat. To learn more about Spark, download Spark For Dummies, IBM Limited Edition.


The Wall Street Journal: The Economic Impact of Artificial Intelligence on the World Economy, November 2018

Forrester: The Total Economic Impact of Apache Spark, September 2016

