AI and deep learning are quickly finding their way into applications across all industries, ranging from algorithms that predict earthquakes, market movements, and fraud to self-driving vehicles and automated assessment of insurance claims. One of the practical barriers impeding enterprise adoption of AI is the amount of computing power required for optimization and training, along with the difficulty of implementing distributed deep learning frameworks. In this article, we'll discuss elastic training and how new distributed training approaches can yield higher-quality AI models that are faster and less expensive to train, and easier to deploy and maintain.
The challenge of building and training deep neural networks
It’s no secret that training a deep neural network, a type of machine learning algorithm, requires a lot of computing power. Deep learning is an iterative process that involves analyzing batches of training data and continually updating model parameters until an acceptable level of predictive accuracy is achieved. As an example, ResNet-50 is a well-known image classification model with a 50-layer neural network. Training the model requires choosing optimal values for approximately 25 million parameters, and the popular ImageNet dataset used for training contains over 14 million images. Training a single model can take days on a host with a single GPU.
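To make the batch-and-update cycle concrete, here is a minimal sketch in plain Python: a toy one-parameter linear model trained by mini-batch gradient descent. The dataset, learning rate, and batch size are all illustrative choices, not taken from ResNet-50 or ImageNet, but the loop structure is the same one that real training repeats millions of times.

```python
import random

# Toy dataset: y = 3x + noise. Real training sets (e.g. ImageNet) hold millions of examples.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 100 for i in range(100)]]

w = 0.0   # the single "parameter" of our toy model (ResNet-50 has ~25 million)
lr = 0.5  # learning rate

def batches(dataset, size):
    for i in range(0, len(dataset), size):
        yield dataset[i:i + size]

# Iterate over the data in batches, nudging the parameter after each one.
for epoch in range(20):
    random.shuffle(data)
    for batch in batches(data, 10):
        # Gradient of mean squared error for the prediction w * x
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad

print(round(w, 1))  # converges near the true slope of 3.0
```

Scaling this loop from one parameter to 25 million, and from 100 examples to 14 million images, is what turns training into a multi-day, multi-GPU workload.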
Training a neural network for computer vision is roughly analogous to how humans learn to classify objects. Imagine presenting a toddler with millions of cue cards (a training set) – “This is a collie, which is a type of dog,” “this is a balloon,” and “this is a strawberry, which is a type of fruit.” As synapses form, connecting millions of neurons, the toddler gets better at recognizing objects in the real world. The human toddler has a big advantage, however. Millennia of evolution have equipped our brains with a neural network already remarkably adapted to vision. For the computer, suitable neural networks need to be constructed – a massive and laborious process of determining an optimal set of layers and model-specific parameters from countless possible permutations. Even then, today’s AI is still in its infancy compared to the human brain.
Compounding the problem, neural network models need to be continually trained and evaluated with new data to maintain their effectiveness. For example, a recommendation engine for an online store is only as effective as the model’s knowledge of all the SKUs in the catalog, what’s in fashion, and the latest trends affecting consumer behaviors.
Distributed training limitations
To discover, train and continuously update neural networks within a reasonable timeframe, most deep learning frameworks rely on parallelism, distributing the training problem across many computers and GPUs. Distributed TensorFlow, Caffe2, and PyTorch all have native distributed training capabilities.
While distributed training helps train neural networks faster, these native approaches have several limitations:
- The process of parallelizing execution can be manual and tedious. For example, with TensorFlow, developers need to hard-code details like IP addresses and port numbers to specify the server locations of software components bound to specific CPUs and GPUs.
- Workload managers help, but each distributed training framework is unique and requires a separate, often complex, integration with the workload manager.
- The server placement of each part of the parallel training job is critical to performance and most workload managers lack the sophistication to place training jobs optimally, especially when considering factors such as NVLink and GPU topologies.
- Finally, once a distributed training job is started, its “geometry” (numbers and locations of GPUs) remains fixed for the duration of the job. If there are any changes, say due to a failure or lost connection, the whole training job fails and must be re-run.
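The hard-coding problem in the first bullet is visible in TensorFlow's `TF_CONFIG` environment variable, which every process in the job must be given before training starts. The host addresses below are placeholders, but the shape of the configuration is what TensorFlow's native distribution expects, and it illustrates why the cluster geometry is fixed for the life of the job:

```python
import json
import os

# Illustrative only: the IPs and ports are placeholders. Each process in a
# native TensorFlow distributed job must be handed the full, static cluster
# layout up front via the TF_CONFIG environment variable.
cluster = {
    "worker": ["10.0.0.11:2222", "10.0.0.12:2222"],  # hard-coded worker addresses
    "ps": ["10.0.0.13:2223"],                        # parameter server address
}
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},  # this process's fixed role in the job
})

print(os.environ["TF_CONFIG"])
```

Because every participant bakes in this map at startup, adding a GPU, losing a host, or moving a worker invalidates the configuration, which is exactly the rigidity described above.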
With traditional distributed training, the software components that train neural networks are placed on a cluster in a static configuration and communicate among themselves.
Unfortunately, this approach is poorly suited to enterprise environments. Model training is part of a deep learning workflow that involves many steps including data ingest, cleansing and pre-processing, creating training and validation datasets, training models, hyper-parameter search, etc.
Also, not all jobs have the same level of urgency. Data Scientists may want to run lightweight models interactively through shared Jupyter notebooks, even as long-running ETL or validation jobs chug along in the background. Rather than having some users wait for hours until a long-running training job completes, all workflow steps should ideally be dynamic, loaning and borrowing resources to accommodate priority jobs and maximize overall effectiveness.
Dynamic, elastic training boosts performance and productivity
This is where dynamic, elastic training comes in. State-of-the-art deep learning environments no longer require that distributed training jobs remain static at runtime. In an elastic training environment, multiple users, departments and frameworks can run deep learning and other analytic workloads simultaneously, sharing CPU and GPU resources based on sophisticated sharing policies.
Rather than the number of GPUs being fixed at runtime, training jobs can flex up and down based on sharing policies, job priorities, and flexible rules. Resources can be shifted dynamically among diverse application tenants, users and groups while respecting ownership and SLAs. Elastic training is a game changer for data scientists because it allows them to maximize training performance, easily accommodate ad-hoc jobs, and shift additional resources at runtime to models that are showing the greatest promise.
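A toy allocation routine can illustrate the idea of priority-weighted sharing. This is not the scheduler used by any particular product; the tenant names, demands, and weights are invented for illustration. It simply hands out GPUs one at a time to whichever tenant is furthest below its weighted share, which is enough to show how a small high-priority job can borrow capacity from a long-running one:

```python
# A toy illustration of priority-based GPU sharing (not any product's actual
# scheduler): each tenant states a demand and a priority weight, and GPUs are
# granted to whichever tenant is furthest below its weighted share.
def allocate(total_gpus, tenants):
    """tenants: {name: {"demand": int, "weight": int}} -> {name: gpus granted}"""
    alloc = {name: 0 for name in tenants}
    remaining = total_gpus
    while remaining > 0:
        candidates = [n for n, t in tenants.items() if alloc[n] < t["demand"]]
        if not candidates:
            break  # all demand satisfied; leftover GPUs stay idle for future borrowers
        # Pick the tenant whose current allocation is lowest relative to its weight.
        pick = min(candidates, key=lambda n: alloc[n] / tenants[n]["weight"])
        alloc[pick] += 1
        remaining -= 1
    return alloc

# An ad-hoc, high-priority notebook job arrives while a long training run is in flight:
print(allocate(8, {
    "long-training": {"demand": 8, "weight": 1},
    "notebook":      {"demand": 2, "weight": 3},
}))  # → {'long-training': 6, 'notebook': 2}
```

The notebook job gets its two GPUs immediately, and the training job keeps the rest, rather than the notebook user waiting hours for the long job to finish.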
While solutions for elastic training are beginning to appear in the cloud, these solutions typically support only a single deep learning framework, and they don’t allow sharing among Spark jobs, big data frameworks, inference services, and the variety of other applications typically used in production.
IBM Watson Machine Learning Accelerator for elastic AI
At the IBM Think Conference 2019, IBM unveiled the most sophisticated solution for elastic training to date. IBM Watson Machine Learning Accelerator provides end-to-end workflow support for deep learning applications. It includes complete lifecycle management for data ingest, data preparation, and building, optimizing, training and testing models.
In addition to supporting multi-tenant resource sharing for Spark, Keras, PyTorch and other deep learning frameworks and libraries, IBM Watson Machine Learning Accelerator provides elastic distributed training for both TensorFlow and Caffe. With only minor changes to existing models, distributed training can be made fully elastic, adding or removing hosts or GPUs at runtime to accelerate model training, boost analyst productivity, and maximize resource usage. The solution also provides built-in support for automated hyperparameter optimization.
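The core idea behind elastic distributed training can be sketched in a few lines. To be clear, this is not the Watson Machine Learning Accelerator API; it is a conceptual simulation in plain Python showing how re-sharding work at epoch boundaries lets a job survive a changing GPU count instead of failing outright:

```python
# Conceptual sketch of elastic data-parallel training (NOT a product API):
# work is re-sharded across whatever workers exist at the start of each
# epoch, so the job adapts when hosts or GPUs join or leave.
def shard(data, n_workers):
    """Split the dataset round-robin across n_workers."""
    return [data[i::n_workers] for i in range(n_workers)]

def train_epoch(model, data, n_workers):
    # Each "worker" computes a partial result on its shard; the partials are
    # then averaged, mimicking the gradient all-reduce of real frameworks.
    partials = [sum(s) / len(s) for s in shard(data, n_workers) if s]
    model["steps"] += 1
    model["value"] = sum(partials) / len(partials)
    return model

data = list(range(100))
model = {"steps": 0, "value": 0.0}  # stands in for checkpointed weights
schedule = [2, 4, 4, 1]             # GPUs available at each epoch changes over time
for n in schedule:
    model = train_epoch(model, data, n)
print(model["steps"])  # all 4 epochs complete despite the changing geometry
```

With a static design, the drop from four workers to one mid-run would have killed the job; here the next epoch simply re-shards over the workers that remain.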
In addition to making training elastic, the solution also provides an elastic inference engine, allowing you to put your models into production on the same multi-tenant cluster. Organizations can now consolidate model training and deployment environments for greater efficiencies while ensuring service levels are met.