Addressing Cluster Sprawl – a Key Challenge in High-Performance Analytics

By Lisa Waddell, IBM Spectrum Computing

June 25, 2019

It’s no secret that the use of analytics is soaring in the enterprise. The capacity to perform deep analysis of large datasets has become critical across industries, with Telecom, Insurance, Advertising, and Financial Services leading the charge. According to Forbes magazine, enterprise adoption of big data grew from 17% in 2015 to 59% in 2018, a Compound Annual Growth Rate (CAGR) of 36%(1). Managing and analyzing big data is an ongoing challenge – especially as organizations embrace new tools and race to take advantage of AI and data-hungry machine learning models to make better business decisions and enable new, differentiated services.

Open source is where the action is

While time-tested tools such as SPSS, MATLAB, and SAS remain essential, organizations are rapidly supplementing these applications with open-source software to meet new requirements. Examples include Python, R, Dask, Jupyter, TensorFlow, and many others. In fact, in two separate KDnuggets surveys in 2018 and 2019, data scientists reported using seven different software tools(2) and more than seven different analytic techniques on average to do their jobs(3). Owing to their ease of use and the vast number of available statistical and scientific packages, Python and R have overtaken SQL to become the most popular languages for data science.

Read also: Deploying Big Data with a Spark

Parallelism and the challenge it presents for IT

In the age of big data and AI, parallelism is the only game in town when it comes to analyzing large datasets quickly. Customers need to deliver faster, better results and demonstrate a clear ROI to the business. Not surprisingly, most analytic tools support parallel frameworks so that processing can be distributed across multiple nodes – examples include IPython, Dask, Spark, and Horovod, an MPI-based framework for distributed training of deep learning models.
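As a simple illustration of this kind of fan-out, the hedged sketch below uses ipyparallel (the IPython parallel framework) to spread a placeholder per-chunk calculation across worker engines. It assumes a cluster of engines is already running (for example, started with `ipcluster start -n 4`); the `summarize` function and the chunk sizes are made up for the example.

```python
# Illustrative sketch only: assumes an ipyparallel cluster is already running,
# e.g. started with `ipcluster start -n 4`. The function and data are placeholders.
import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running engines
view = rc.load_balanced_view()     # schedule tasks wherever an engine is free

def summarize(chunk):
    # stand-in for a real per-chunk analysis
    return sum(chunk) / len(chunk)

# split a dataset into chunks and fan them out across the engines
chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]
means = view.map_sync(summarize, chunks)
print(means)
```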

The challenge for IT departments is that most of these tools have been developed independently and have an affinity for different workload managers. For example:

  • Tools such as MapReduce and Spark can run in multiple environments but are most often deployed on Hadoop clusters running YARN.
  • Dask (for distributed Python) supports multiple schedulers, including SGE, Condor, IBM Spectrum LSF, and Mesos, but the integration is left to the user (see the sketch following this list).
  • Parallel R frameworks can be made to work across most environments, but most parallel integrations have an affinity for batch-oriented schedulers such as IBM Spectrum LSF.
  • Commercial tools such as SAS, MATLAB, and IBM DataStage provide parallel frameworks but typically support a limited set of workload managers.
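To make the Dask point concrete, the sketch below shows roughly what that user-side integration looks like with the community dask-jobqueue package against an LSF cluster. It is a minimal, hedged example: it assumes dask-jobqueue is installed and configured for your site, and the queue name, core and memory sizes, walltime, and job count are placeholder values.

```python
# Minimal sketch, assuming dask-jobqueue is installed and the local LSF cluster
# has a queue named "normal". All resource values below are placeholders.
from dask_jobqueue import LSFCluster
from dask.distributed import Client
import dask.array as da

# Each Dask "job" is submitted to LSF and hosts one or more workers
cluster = LSFCluster(queue="normal", cores=8, memory="16GB", walltime="01:00")
cluster.scale(jobs=4)        # ask LSF for four worker jobs
client = Client(cluster)     # point Dask computations at the cluster

# A toy calculation spread across the LSF-managed workers
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
print(x.mean(axis=0).compute())
```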

This patchwork of compatibility means that IT organizations are often forced to deploy different analytic tools on separate clusters, leading to “cluster sprawl”. It’s not uncommon to see separate physical clusters dedicated to individual applications such as Hadoop, SAS Grid Manager, or TensorFlow, making it difficult for applications to interoperate and share data. Interestingly, cloud computing can actually aggravate this problem because cloud-based PaaS offerings tend to be deployed on siloed clusters composed of dedicated machine instances.

Read also: Resource Management in the Age of Artificial Intelligence

A similar challenge exists with workflow tools. For example, Hadoop environments may rely on Apache Oozie or other open-source tools such as Azkaban or Luigi. IPython users may turn to NetworkX to create directed acyclic graphs (DAGs) for Python-based flows. SAS Grid Manager and DataStage users will use the workflow managers native to their tools.
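As a small illustration of that Python-side approach, the hedged sketch below uses NetworkX to describe a toy analytic flow as a DAG and derive a valid execution order. The step names are invented for the example and do not correspond to any particular product.

```python
# Illustrative sketch: a toy analytic flow expressed as a DAG with NetworkX.
# Step names are placeholders, not part of any particular workflow product.
import networkx as nx

flow = nx.DiGraph()
flow.add_edges_from([
    ("ingest", "clean"),
    ("clean", "feature_engineering"),
    ("clean", "summary_stats"),
    ("feature_engineering", "train_model"),
])

# Confirm the flow has no cycles, then derive one valid execution order
assert nx.is_directed_acyclic_graph(flow)
print(list(nx.topological_sort(flow)))
```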

Consolidating silos to drive better business results

To tackle the challenge of cluster sprawl for analytic applications, IBM announced the new IBM Spectrum Computing Suite for High-Performance Analytics (Suite for HPA) in May 2019.

The IBM Spectrum Computing Suite for HPA is designed to support a broad range of analytic applications and frameworks on a single shared cluster, whether on-premises or cloud-resident, while helping deliver faster, deeper analytics. Based on IBM’s proven Spectrum LSF technology, the Suite for HPA can help organizations:

  • Deliver faster, deeper analysis using modern analytic frameworks by enabling analysis of larger datasets with advanced GPU and container support and the capability to spread calculations across many nodes.
  • Reduce cost and boost ROI by maximizing hardware utilization and throughput through superior workload management.
  • Reduce complexity and administration costs by avoiding the need to operate multiple clusters.

The IBM Spectrum Computing Suite for HPA allows IT organizations to side-step the challenge of integrating commercial and open-source distributed frameworks. It supports pre-existing integrations with open-source frameworks such as Python/IPython, R/RStudio, Dask, TensorFlow, Caffe, Spark, Hadoop, Horovod, and Jupyter. It also supports commercial grid-friendly applications such as IBM InfoSphere DataStage, MathWorks MATLAB(4), and SAS(5).

Depending on the mix of applications that customers run, the IBM Spectrum Computing Suite for HPA can help improve flexibility and productivity and reduce infrastructure costs by avoiding cluster sprawl. It supports modern container-based applications, provides sophisticated GPU-aware scheduling, and offers transparent, application-aware cloud bursting to your choice of public clouds.

References
