Deploying Big Data with a Spark

By Jeff Karmiol, Offering Manager, IBM Spectrum Computing

June 4, 2019

Deploying Apache Spark has the potential to transform the way your organization does business.[2]

This is a powerful statement. Imagine: one single step with the potential to change the competitive posture of almost any enterprise on the planet. Unfortunately, there is another statement that is also true, and it is as discouraging as the first is inspiring:

More than 50% of IT software projects still fail because they run out of time, resources, funds, etc.[1]

Apache Spark is the well-known open source, memory-optimized data processing engine that can deliver up to one hundred times the performance of earlier Big Data analytics solutions.[2]
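To make the idea of in-memory processing concrete, here is a minimal PySpark sketch; the input path and column names are hypothetical. It caches a data set in cluster memory so that repeated queries reuse the in-memory copy rather than re-reading it from disk, which is the behavior behind Spark's speed advantage over disk-bound engines.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this application.
spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input: a directory of CSV event logs with a header row.
events = spark.read.csv("/data/events", header=True, inferSchema=True)

# cache() asks Spark to keep the DataFrame in executor memory, so the
# two aggregations below reuse the cached copy instead of rescanning disk.
events.cache()

events.groupBy("event_type").count().show()
events.groupBy("user_id").count().orderBy("count", ascending=False).show(10)

spark.stop()
```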

Earlier articles in this series on Apache Spark have explored how Spark and related technologies accelerate data analytics, enable essentially real-time results, and help organizations big and small derive greater value and deeper insights from their burgeoning information assets.


But once you are convinced that Spark is right for your enterprise, the question becomes: what steps should you take to implement this powerful tool most effectively and minimize the very real risk of deployment failure?

The truth is, adoption rates for data analytics solutions are so high across the commercial, scientific, and governmental landscapes that there is a very good chance you have already embarked on your Spark journey. Along the way there is a lot to think about, mostly concerning your analytics applications and how you will leverage their power to gain competitive advantage. But you also need to figure out how to actually get Spark into production at an effective scale, because as its advantages become clear, more and more groups within your enterprise will want the power of its results and the value of using Spark in business-critical, operational deployments.[3]

Lessons learned from successful Spark deployments

The good news is that plenty of companies are already successfully deploying Spark at large scale. Their experiences often reveal several common themes within successful Spark implementations:

  • Effective Spark deployments include open-ended agility. The data analytics environment is moving and changing quickly, especially now that artificial intelligence (AI) is disrupting the domain in new ways. Successful Spark deployments are agile enough to keep up with new versions, tools, technologies, and processes as they emerge.
  • Enterprise-grade reliability and security are key. As soon as your Spark-based application grows, becomes mainstream, and people start depending on it, IT and the other folks who are in the deployment business will start to care a great deal about its reliability and security.
  • The underlying infrastructure should truly enable real-time performance. Most enterprises deploy Spark to accelerate the velocity of their data analytics. The underlying infrastructure must provide the performance and functionality needed to keep up.
  • Finally, and perhaps most importantly for successful Spark implementations and ongoing growth and agility, support for the entire Spark software stack is crucial. You can gather the componentry and try to stitch it all together yourself, but a lower-risk and less costly alternative is to choose a complete Spark solution that comes with everything you need to make life easy.

Let’s explore this final point more thoroughly. Spark — and the entire Big Data open-source universe for that matter — is continually evolving. Because you’ll be basing new business-crucial applications on these technologies, it’s vital that you select a commercial Spark solution from a vendor with a strong commitment to the open-source community. This can smooth out the rough edges that are inherent in open-source-based initiatives, particularly in projects such as Spark that are advancing so rapidly. Your chosen Spark environment must be capable of running multiple versions as well as smoothly interacting with the other components that are already operating in your technology portfolio. It must also be open and flexible enough to incorporate new and evolving technologies like machine learning and deep learning frameworks.


Spark makes building Big Data solutions simpler, particularly compared with the original Hadoop and MapReduce approach. Nevertheless, it can still be daunting for organizations to launch their first Spark efforts. Outsiders with proven success can provide guidance at every phase of a project and make it easier to get things off the ground. Seek out consultants who not only help your organization get up to speed but also enthusiastically transfer knowledge to your team. Your goal should be to take over maintenance after you go live and then develop new solutions internally.

Plenty of Big Data initiatives are marked by extensive upfront hardware expenditures to store and process all the information that will drive the new solutions. Because money still doesn’t grow on trees, keeping costs under control is important. To increase the likelihood of a successful Spark deployment, you should try to select infrastructure elements and technologies that use storage-side virtualization, dynamic resource allocation, and other related functionality to ensure that your Spark environment fully utilizes all your current processing capabilities. This can translate to big savings on hardware and other ancillary costs.
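As one illustration of the resource-utilization point, the sketch below shows how dynamic executor allocation can be enabled in plain open-source Spark so that compute is requested only while work is queued and released when it sits idle. The executor limits here are hypothetical, and commercial platforms typically layer their own resource managers on top of settings like these.

```python
from pyspark.sql import SparkSession

# A sketch of enabling Spark's built-in dynamic resource allocation, so
# executors are requested when work queues up and released when they go idle.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")   # hypothetical floor
    .config("spark.dynamicAllocation.maxExecutors", "50")  # hypothetical ceiling
    # Dynamic allocation needs shuffle data to survive executor removal;
    # the external shuffle service (run on each worker node) is one way to do that.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```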

Finally, it’s very likely that the structured and unstructured data sets needed to power your new Spark applications don’t all reside in the same location. Given this reality, make sure that your Spark implementation will let you interact with multiple types of information repositories. To derive maximum value from your entire portfolio, you should select Spark technology that can interact with the full range of storage and file systems.
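For instance, a single Spark job can pull from several repository types in one pass. In the sketch below the paths, bucket name, and column names are hypothetical, and reading from cloud object storage assumes the appropriate connector and credentials are already configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-repo-demo").getOrCreate()

# Structured records from a POSIX or clustered file system mount (hypothetical path).
customers = spark.read.parquet("/mnt/shared/customers")

# Semi-structured logs stored in HDFS (hypothetical path).
clicks = spark.read.json("hdfs:///logs/clickstream")
print("click events:", clicks.count())

# Objects in an S3-compatible store; needs the s3a connector and credentials configured.
orders = spark.read.csv("s3a://example-bucket/orders", header=True, inferSchema=True)

# Combine sources into one analysis regardless of where each data set lives.
# Column names (customer_id, region) are hypothetical.
summary = orders.join(customers, "customer_id").groupBy("region").count()
summary.show()
```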

Getting help

IBM is one technology vendor that offers powerful solutions to all of the challenges noted above. Since the beginning, IBM has been a strong supporter of the open source community and of the Apache Spark initiative in particular. And IBM offers a broad spectrum of software and hardware systems, expertise proven in thousands of engagements, and a worldwide support organization that can help ensure the success of any Spark deployment.

Two IBM solutions in particular are worth mentioning in any discussion of Spark: IBM Spectrum Conductor and IBM Spectrum Scale. IBM Spectrum Conductor is an enterprise-class, multi-tenant platform for deploying and managing Apache Spark, Anaconda, Cassandra, MongoDB, and other application frameworks and services on a common shared cluster of resources. It can support multiple concurrent, differing versions of these applications while dynamically allocating and sharing resources between tenants and applications. IBM Spectrum Conductor provides field-proven enterprise security, enables performance at scale, and maximizes resource usage and sharing to consolidate resource silos that would otherwise be tied to separate application environments. By maximizing infrastructure performance and operational efficiency, Spectrum Conductor delivers faster analytics results and drives down total cost of ownership (TCO).

IBM Spectrum Scale is an easily deployed solution that can help you utilize and optimize all your existing infrastructure and resources for Spark-powered Big Data analytics, reducing complexity, increasing agility, and saving considerable time and money.

IBM Spectrum Scale is a true software-defined storage (SDS) infrastructure solution designed to provide high-performance and highly functional data management for all the types of data that your business activities may generate, including structured data, unstructured data, and objects. It’s a full-featured set of data management tools with advanced storage virtualization, massively parallel data access, automated tiered storage management, and leading-edge security features. It is designed to support a wide range of Spark application workloads across many different repositories, protocols, and geographical locations and has been proven extremely effective in large, demanding environments.

As a true software-defined infrastructure (SDI) solution, IBM Spectrum Conductor can be installed on any appropriate commercial servers, or deployed alongside IBM Spectrum Scale delivered as a pre-assembled, comprehensive, high-performance, highly scalable shared storage solution called IBM Elastic Storage Server (ESS). The ESS building-block architecture enables you to start small and grow your storage capacity and performance as needed. ESS is increasingly popular for supporting Hadoop and Spark workloads and has been deployed in some of the largest AI environments, including Summit at Oak Ridge National Laboratory and Sierra at Lawrence Livermore National Laboratory, two of the most powerful supercomputers in the world.

Apache Spark has the power to transform gushers of raw data into the refined currencies of faster business insight and greater competitive advantage – but only if it’s actually working effectively for your organization. Many IT projects fail for lack of proper commitment, expertise, or budget, but yours doesn’t have to – if you leverage the technologies, expertise, and worldwide resources of an IT vendor such as IBM.


References:

[1] Objectstyle, "Why 50% of software projects fail, and how not to let that happen to you," February 2018. https://www.objectstyle.com/agile/software-projects-failure-statistics-and-reasons

[2] Wiley, Spark for Dummies, 2019.

[3] MapR, "Taking Your Spark to Production Scale," July 2015. https://mapr.com/blog/taking-your-spark-production-scale/
