Resource Management in the Age of Artificial Intelligence

By Jeff Karmiol, Offering Management, IBM Spectrum Computing

January 16, 2019

New challenges demand fresh approaches

Fueled by GPUs, big data, and rapid advances in software, the AI revolution is upon us. Enterprises are re-tooling systems and exploring AI for everything from customer service to fraud surveillance to enhanced decision making. For IT organizations, deploying, managing and sustaining these environments is a significant challenge. In this article, we look at AI through the prism of workload and resource management and explain how new challenges are driving fresh innovation.

AI resource management is an “all-of-the-above” challenge

Building and deploying AI applications is a multi-stage workflow, and each stage involves different applications and frameworks with unique workload and resource management challenges.

AI models are fueled by vast amounts of training data from sources that include SQL databases, NoSQL stores, and semi-structured data in object stores or data lakes. When it comes to extracting, cleansing, and manipulating data, Spark has emerged as the tool of choice. Spark is fast, allows easy access to almost any data source, and supports familiar programming and query languages including Spark SQL.
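To make that concrete, here is a minimal PySpark sketch of the kind of extract-cleanse-query step described above. The object store path, column names, and query are all hypothetical; only the Spark APIs themselves are standard.

# A minimal PySpark sketch: read semi-structured data, cleanse it, and query it
# with Spark SQL. The source path and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Read semi-structured JSON from an object store or data lake (path is illustrative).
raw = spark.read.json("s3a://training-data/transactions/*.json")

# Cleanse: drop rows missing key fields and normalize a timestamp column.
clean = (raw.dropna(subset=["customer_id", "amount"])
            .withColumn("event_time", F.to_timestamp("event_time")))

# Expose the cleansed data to Spark SQL for familiar, query-based manipulation.
clean.createOrReplaceTempView("transactions")
features = spark.sql("""
    SELECT customer_id, COUNT(*) AS tx_count, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
""")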

From a resource management perspective, not only does Spark need to be orchestrated on a cluster alongside other frameworks, but a variety of Spark operations need to be managed and prioritized. Some operations may be multi-step flows or batch jobs, while others may be interactive queries made from data science notebooks or applications. ETL workflows may need to be triggered automatically on a periodic basis, refreshing training data and storing it in an intermediate datastore such as Cassandra or MongoDB. Workload management is the key to running these processes reliably and efficiently.
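The final step of such a periodic refresh might look like the rough sketch below, which persists a feature set to Cassandra. It assumes the DataStax spark-cassandra-connector package is available on the cluster; the connection host, keyspace, table, and schema are hypothetical.

# A hedged sketch of the last stage of a scheduled ETL flow: write refreshed
# training data to an intermediate datastore for downstream training jobs.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("etl-refresh")
         .config("spark.cassandra.connection.host", "cassandra.internal")  # illustrative host
         .getOrCreate())

# Stand-in for the cleansed feature set produced by earlier ETL stages.
features = spark.createDataFrame(
    [("c-001", 42, 1310.50), ("c-002", 7, 88.25)],
    ["customer_id", "tx_count", "total_spend"])

# Append the refreshed rows so training jobs can pick them up on their next run.
(features.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="training", table="customer_features")
    .mode("append")
    .save())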

In multi-tenant environments, there are typically many analysts and applications running diverse queries that manifest themselves as Spark jobs. Workload managers need to weigh urgency, business priorities, sharing policies, deadlines, user response times, and pre-emption policies, and balance all of these when allocating resources. As if this were not complicated enough, in production environments different applications may require different versions of Spark or Spark libraries, so the multi-tenant environment must support all of these capabilities while managing multiple versions of Spark running simultaneously on the cluster.
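Inside a single Spark application, one small piece of this puzzle is visible to users through the fair scheduler, sketched below. Pool names are hypothetical, and pool weights would normally be defined in a fairscheduler.xml file; the cluster-level resource manager still decides how much the application as a whole receives.

# A minimal sketch of Spark's fair scheduler pools, one mechanism for keeping
# interactive queries responsive alongside lower-priority batch work.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("multi-tenant-queries")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
sc = spark.sparkContext

# An urgent, interactive query from a data science notebook.
sc.setLocalProperty("spark.scheduler.pool", "interactive")
urgent = spark.range(1_000_000).selectExpr("sum(id)").collect()

# A lower-priority batch refresh submitted on the same cluster.
sc.setLocalProperty("spark.scheduler.pool", "batch")
nightly = spark.range(10_000_000).selectExpr("count(*)").collect()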

Once training data is prepared, data scientists may run software frameworks such as TensorFlow or Caffe to train AI models, a compute-intensive and highly iterative process. Scientists look for optimal model topologies and hyperparameter sets that will deliver the highest predictive quality (accurately identifying a face in a picture, for example). Just as with Spark workloads, in multi-tenant environments multiple learning models and frameworks may be running at the same time, competing for scarce and expensive resources such as GPUs. Model training jobs should ideally be "elastic," so that resources can be dialed up and down at run time or shifted between tenants to accommodate deadlines or changes in business priorities, or to allocate additional resources to models that are showing promise.
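The iterative search itself can be as simple as the toy sketch below, which trains the same small Keras model under a few hyperparameter settings and keeps the best result. The data, model, and search grid are entirely illustrative, and TensorFlow 2.x is assumed; real searches spawn many such runs in parallel, which is exactly where GPU scheduling and elasticity matter.

# A toy hyperparameter sweep: train a tiny model on synthetic data for each
# combination of learning rate and hidden size, keep the best validation score.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")   # synthetic features
y = (x.sum(axis=1) > 10).astype("float32")       # synthetic labels

best = None
for lr in [1e-2, 1e-3]:
    for hidden in [32, 64]:
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden, activation="relu", input_shape=(20,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="binary_crossentropy", metrics=["accuracy"])
        hist = model.fit(x, y, epochs=3, validation_split=0.2, verbose=0)
        acc = hist.history["val_accuracy"][-1]
        if best is None or acc > best[0]:
            best = (acc, lr, hidden)

print("best validation accuracy %.3f with lr=%g, hidden=%d" % best)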

Some AI frameworks bear a strong resemblance to the parallel MPI workloads familiar to HPC users. For example, users typically run TensorFlow with different numbers of parameter servers and workers requiring hosts with GPUs. Like MPI workloads, TensorFlow processes communicate among themselves, usually over high-speed interconnects. The scheduler needs to be GPU-aware and consider details like the internal bus architectures of machines, GPU capabilities, and GPU interconnect technologies to place software components optimally.
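The sketch below shows one standard way distributed TensorFlow processes learn about each other: a TF_CONFIG cluster description naming the parameter-server and worker endpoints. The hostnames and ports are hypothetical; a GPU- and topology-aware scheduler is what decides which physical hosts and devices those addresses actually map to.

# Each TensorFlow process is started with the same cluster map plus its own
# role and index, set here via the TF_CONFIG environment variable.
import json
import os

cluster = {
    "ps":     ["ps0.cluster.internal:2222"],
    "worker": ["worker0.cluster.internal:2222",
               "worker1.cluster.internal:2222"],
}

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})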

Finally, running trained models (referred to as inference) presents other challenges. While some models may run on embedded devices such as phones or automotive electronics, others are deployed in software and invoked through software APIs. Asking Alexa a question or refreshing your Uber app to get an updated arrival time for your ride are good examples. For inference, timely, predictable model execution is critical, and workload and resource scheduling play an essential role in auto-scaling resources to ensure application service levels are met.
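A minimal sketch of model inference behind a software API is shown below: a trained model loaded once and served over HTTP. The framework choice (Flask), model path, and input format are illustrative; in production, such endpoints are typically replicated and auto-scaled so response times stay predictable as request volume changes.

# Serve a saved Keras model behind a simple HTTP prediction endpoint.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("model.h5")  # hypothetical saved model

@app.route("/predict", methods=["POST"])
def predict():
    features = np.array(request.get_json()["features"], dtype="float32")
    score = float(model.predict(features.reshape(1, -1))[0][0])
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)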

Today’s resource managers fall short

Most resource managers were designed to solve specific types of problems but fall short when it comes to addressing the entire AI workflow. For example, HPC schedulers excel at managing a variety of jobs and at topology-aware scheduling on GPU clusters, but most were not designed or optimized to manage long-running software frameworks such as Spark or containerized application environments.

Open-source YARN was designed to decouple resource management from scheduling and to overcome limitations of MapReduce in early versions of Hadoop. While YARN brings multi-tenancy to Hadoop and can support long-running frameworks like HBase, Storm, or Spark, it lacks other capabilities such as dynamically re-prioritizing jobs, GPU-aware scheduling, and managing containerized applications. Running multiple versions of the same application framework simultaneously is also challenging.

Kubernetes (K8s) is a popular open-source container orchestration tool, originally built by Google, that has a native resource manager. While K8s is excellent for deploying and managing containerized applications (scalable web applications or AI inference workloads, for example), it has no notion of things like queues, parallel jobs, or topology-aware scheduling. K8s can orchestrate frameworks, but only if they already live in containers and were specifically built to run on K8s. It also provides only basic multi-tenancy and limited GPU scheduling features.
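As a hedged illustration of what that basic GPU scheduling looks like from the user's side, the sketch below creates a pod that requests a whole-number count of GPUs via the nvidia.com/gpu resource, with no notion of GPU model, interconnect, or bus topology. It assumes the official kubernetes Python client and the NVIDIA device plugin are installed; the pod name, image, and namespace are illustrative.

# Request one GPU for a containerized training pod using the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="tensorflow/tensorflow:latest-gpu",  # illustrative image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}),       # just a count, nothing more
        )],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)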

Apache Mesos is another open-source resource manager that supports multiple tenants and applications. Mesos can orchestrate big data frameworks such as Spark or Cassandra, but it lacks the granular controls necessary to manage application SLAs and to place GPU-enabled workloads optimally based on topology. It also lacks workflow capabilities and the dynamic resource-sharing controls needed to share resources optimally among multiple tenants and training jobs.

Because of these limitations, enterprises often end up with a patchwork of siloed workload and resource management tools, each managing a different aspect of the AI environment.

Challenges in AI resource management are driving fresh innovation

To meet these challenges, IBM is investing heavily in scheduling and resource management solutions tailored to the diverse needs of AI environments, drawing on 25 years of HPC resource management experience.

IBM Spectrum Conductor provides a common resource management foundation supporting Spark and a wide variety of AI workloads in a shared, multi-tenant environment, on-premises or in the cloud. An add-on, IBM Spectrum Conductor Deep Learning Impact, provides an end-to-end solution that helps data scientists be more productive in training, tuning, and deploying models into production.

You can learn more about IBM Spectrum Conductor at https://www.ibm.com/it-infrastructure/spectrum-computing
