New challenges demand fresh approaches
Fueled by GPUs, big data, and rapid advances in software, the AI revolution is upon us. Enterprises are re-tooling systems and exploring AI for everything from customer service to fraud surveillance to enhanced decision making. For IT organizations, deploying, managing and sustaining these environments is a significant challenge. In this article, we look at AI through the prism of workload and resource management and explain how new challenges are driving fresh innovation.
AI resource management is an “all-of-the-above” challenge
Building and deploying AI applications is a multi-stage workflow, and each stage involves different applications and frameworks with unique workload and resource management challenges.
AI models are fueled by vast amounts of training data from sources that include SQL databases, NoSQL stores, and semi-structured data in object stores or data lakes. When it comes to extracting, cleansing, and manipulating data, Spark has emerged as the tool of choice. Spark is fast, allows easy access to almost any data source, and supports familiar programming and query languages including Spark SQL.
From a resource management perspective, Spark not only needs to be orchestrated on a cluster alongside other frameworks, but its many kinds of operations also need to be managed and prioritized. Some operations may be multi-step flows or batch jobs, while others may be interactive queries made from data science notebooks or applications. ETL workflows may need to be triggered automatically on a periodic basis, refreshing training data and storing it in an intermediate datastore such as Cassandra or MongoDB. Workload management is the key to running these processes reliably and efficiently.
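One way to picture the prioritization described above is a simple priority queue. The following Python sketch is illustrative only: the job classes, their ranking, and the JobQueue name are assumptions made for this example, not part of Spark or of any particular workload manager.

```python
import heapq
import itertools

# Hypothetical ranking of job classes: a lower number runs first.
PRIORITY = {"interactive": 0, "etl": 1, "batch": 2}


class JobQueue:
    """Orders jobs by class, then by submission order within a class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves FIFO order

    def submit(self, name, kind):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), name))

    def next_job(self):
        """Return the name of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = JobQueue()
queue.submit("nightly-report", "batch")
queue.submit("refresh-training-data", "etl")
queue.submit("notebook-query", "interactive")
print([queue.next_job() for _ in range(3)])
```

In this sketch the interactive notebook query is dispatched ahead of the scheduled ETL refresh, which in turn runs ahead of the batch report, regardless of submission order. A production workload manager would layer deadlines, quotas, and pre-emption on top of this basic ordering.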
In multi-tenant environments, there are typically many analysts and applications running diverse queries that manifest themselves as Spark jobs. Workload managers need to weigh urgency, business priorities, sharing policies, deadlines, user response times, and pre-emption policies, and balance all of these considerations when allocating resources. As if this were not complicated enough, different applications in production environments may require different versions of Spark or Spark libraries, so the multi-tenant environment must also manage multiple versions of Spark running simultaneously on the cluster.
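To make the idea of a sharing policy concrete, here is a minimal Python sketch of weighted fair sharing among tenants. It covers only one of the considerations listed above (sharing policy) and ignores deadlines and pre-emption. The Tenant class, the tenant names, and the weights are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    weight: int  # relative share of the cluster (assumed policy input)
    demand: int  # number of slots the tenant currently wants


def fair_share(total_slots, tenants):
    """Hand out slots one at a time to the tenant furthest below its weighted share.

    Tenants never receive more than they ask for, so unused share flows
    to tenants that still have unmet demand.
    """
    alloc = {t.name: 0 for t in tenants}
    for _ in range(total_slots):
        hungry = [t for t in tenants if alloc[t.name] < t.demand]
        if not hungry:
            break  # every tenant is satisfied; leave remaining slots idle
        nxt = min(hungry, key=lambda t: alloc[t.name] / t.weight)
        alloc[nxt.name] += 1
    return alloc


tenants = [
    Tenant("fraud", weight=2, demand=10),
    Tenant("marketing", weight=1, demand=10),
    Tenant("adhoc", weight=1, demand=2),
]
print(fair_share(12, tenants))
```

With 12 slots, the small "adhoc" tenant receives everything it asked for, and the remaining slots are split between the other two tenants roughly in line with their 2:1 weights.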
Once training data is prepared, data scientists may run software frameworks such as TensorFlow or Caffe to train AI models, a compute-intensive and highly iterative process. Scientists look for optimal model topologies and hyperparameter sets that will deliver the highest predictive quality (accurately identifying a face in a picture, for example). Just as with Spark workloads, in multi-tenant environments multiple learning models and frameworks may be running at the same time, competing for scarce and expensive resources such as GPUs. Model training jobs should ideally be “elastic” so that resources can be dialed up and down at run time, or shifted between tenants to accommodate deadlines, respond to changes in business priorities, or give additional resources to models that are showing promise.
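The following Python sketch shows one simple way an elastic policy could shift GPUs toward promising models. It is an assumption-laden illustration, not how any specific product works: the TrainingJob fields, the use of a validation score as the measure of "promise", and the job names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    name: str
    min_gpus: int  # the job cannot make progress with fewer GPUs
    max_gpus: int  # the job cannot use more GPUs than this
    score: float   # latest validation score; higher means more promising


def rebalance(total_gpus, jobs):
    """Give every job its minimum, then hand spare GPUs to the best-scoring jobs."""
    alloc = {j.name: j.min_gpus for j in jobs}
    spare = total_gpus - sum(alloc.values())
    if spare < 0:
        raise ValueError("not enough GPUs to satisfy every job's minimum")
    for job in sorted(jobs, key=lambda j: j.score, reverse=True):
        extra = min(spare, job.max_gpus - job.min_gpus)
        alloc[job.name] += extra
        spare -= extra
    return alloc


jobs = [
    TrainingJob("resnet-a", min_gpus=1, max_gpus=4, score=0.91),
    TrainingJob("resnet-b", min_gpus=1, max_gpus=8, score=0.72),
    TrainingJob("resnet-c", min_gpus=2, max_gpus=4, score=0.85),
]
print(rebalance(8, jobs))
```

Calling rebalance again after each evaluation round, with updated scores, dials allocations up and down at run time. A real scheduler would also need the training framework to tolerate workers joining and leaving.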
Some AI frameworks bear a strong resemblance to the parallel MPI workloads familiar to HPC users. For example, users typically run TensorFlow with different numbers of parameter servers and workers, which require hosts with GPUs. Like MPI workloads, TensorFlow processes communicate among themselves, usually over high-speed interconnects. The scheduler needs to be GPU-aware and consider details like the internal bus architectures of machines, GPU capabilities, and GPU interconnect technologies to place software components optimally.
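As a rough illustration of topology-aware placement, the Python sketch below prefers to put all of a job's GPUs inside a single interconnect "island" (for example, a group of GPUs sharing a fast link) and only spans islands as a fallback. The host names, island names, and the data layout are hypothetical; real schedulers discover topology from the hardware and consider many more factors.

```python
def place(request, hosts):
    """Pick a host and island(s) for a job needing `request` GPUs.

    `hosts` maps host name -> {island name: free GPU count}.
    Returns (host, [islands]) or None if no host can fit the job.
    """
    # First choice: the single island with the tightest fit, so that
    # large islands stay available for large jobs.
    best = None
    for host, islands in hosts.items():
        for island, free in islands.items():
            if free >= request:
                leftover = free - request
                if best is None or leftover < best[0]:
                    best = (leftover, host, island)
    if best is not None:
        return best[1], [best[2]]

    # Fallback: span islands on one host, taking the largest islands first.
    for host, islands in hosts.items():
        if sum(islands.values()) >= request:
            used, need = [], request
            for island, free in sorted(islands.items(), key=lambda kv: kv[1], reverse=True):
                if need <= 0:
                    break
                if free > 0:
                    used.append(island)
                    need -= free
            return host, used
    return None


cluster = {
    "node1": {"island0": 1, "island1": 2},
    "node2": {"island0": 4, "island1": 2},
}
print(place(2, cluster))
print(place(5, cluster))
```

The design choice here is to treat crossing an island boundary as a last resort, since traffic between islands typically travels over a slower path than traffic within one.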
Finally, running trained models (referred to as inference) presents other challenges. While some models may run on embedded devices such as phones or automotive electronics, other models are deployed in software and invoked using software APIs. Asking Alexa a question or refreshing your Uber app to get an updated arrival time for your ride are good examples. For inference, timely, predictable model execution is critical, and workload and resource scheduling play an essential role in auto-scaling resources to ensure that application service levels are met.
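A minimal sketch of the auto-scaling decision might look like the following Python function, which sizes the number of model-serving replicas from the observed request rate. The function name, the per-replica capacity figure, the 70% utilization target, and the replica limits are all assumptions for illustration, not values taken from any product.

```python
import math


def desired_replicas(request_rate, per_replica_rps,
                     target_utilization=0.7, min_replicas=1, max_replicas=20):
    """Return how many serving replicas are needed for the current load.

    Each replica is assumed to handle `per_replica_rps` requests per second
    at full load; running below `target_utilization` leaves headroom so that
    response times stay predictable during bursts.
    """
    usable_rps = per_replica_rps * target_utilization
    needed = math.ceil(request_rate / usable_rps)
    return max(min_replicas, min(max_replicas, needed))


# Example: 400 requests/s against replicas rated at 100 requests/s each.
print(desired_replicas(400, 100))
```

The clamp to a minimum keeps the service warm when traffic drops to zero, and the maximum protects other tenants sharing the cluster from a runaway scale-out.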
Today’s resource managers fall short
Most resource managers were designed to solve specific types of problems but fall short when it comes to addressing the entire AI workflow. For example, HPC schedulers excel at managing a variety of jobs and at topology-aware scheduling on GPU clusters, but most were not designed or optimized to manage long-running software frameworks such as Spark, or containerized application environments.
Open-source YARN was designed to decouple resource management from scheduling and overcome limitations of MapReduce in early versions of Hadoop. While YARN brings multitenancy to Hadoop and can support long-running frameworks like HBase, Storm, or Spark, it lacks other capabilities such as dynamically re-prioritizing jobs, GPU-aware scheduling, and managing containerized applications. Running multiple versions of the same application framework simultaneously is also challenging.
Kubernetes (K8s) is a popular open-source container orchestration tool, originally built by Google, that has a native resource manager. While K8s is excellent for deploying and managing containerized applications (scalable web applications or AI inference workloads, for example), it has no notion of things like queues, parallel jobs, or topology-aware scheduling. K8s can orchestrate frameworks, but only if they already live in containers and were specifically built to run on K8s. K8s provides only basic multitenancy and limited GPU scheduling features.
Apache Mesos is another open-source resource manager that supports multiple tenants and applications. Mesos can orchestrate big data frameworks such as Spark or Cassandra, but it lacks the granular management controls necessary to manage application SLAs and place GPU-enabled workloads optimally based on topology. It also lacks workflow capabilities and the dynamic resource sharing controls needed to share resources among multiple tenants and training jobs optimally.
Because of these limitations, enterprises often end up with a patchwork of siloed workload and resource management environments managing different aspects of the AI environment.
Challenges in AI resource management are driving fresh innovation
To meet these challenges, IBM is investing heavily in scheduling and resource management solutions tailored to the diverse needs of AI environments, drawing on 25 years of HPC resource management experience.
IBM Spectrum Conductor provides a common resource management foundation supporting Spark and a wide variety of AI workloads in shared, multi-tenant environments, on-premises or in the cloud. An add-on to IBM Spectrum Conductor called IBM Spectrum Conductor Deep Learning Impact provides an end-to-end solution that helps data scientists be more productive in training, tuning, and deploying models into production.
You can learn more about IBM Spectrum Conductor at https://www.ibm.com/it-infrastructure/spectrum-computing