The M5 program within Amazon Search owns the discovery learning strategy for Amazon and builds large-scale models across modalities: multilingual, multi-entity, and multitask. To build and train models with billions of parameters at scale, M5 uses accelerated compute such as Amazon Elastic Compute Cloud (Amazon EC2) instances with GPUs and AWS Trainium. One of our central tenets is to keep the infrastructure and operational costs under control.
In this post, we focus on how we evolved our systems to manage accelerated compute resources efficiently and schedule distributed deep learning workloads using AWS Batch fair-share scheduling. By continuously improving our approach to managing compute resources and scheduling, we have: 1) reduced idle resources by 14%; 2) increased GPU utilization of our fleet by 19%; and 3) eliminated downtime during reallocation of compute.
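To make the fair-share scheduling mechanism concrete, the sketch below shows how a scheduling policy can be registered with the AWS Batch API via boto3. The share identifiers, weights, and decay window are hypothetical illustrations, not M5's actual configuration; in AWS Batch, a lower `weightFactor` gives a share identifier a larger slice of compute.

```python
# Hypothetical fair-share policy: two workload classes, "train" and "eval",
# where "train" jobs receive roughly twice the compute share of "eval" jobs.
fair_share_policy = {
    "shareDecaySeconds": 3600,   # past usage decays over a 1-hour window
    "computeReservation": 10,    # reserve headroom for inactive share identifiers
    "shareDistribution": [
        {"shareIdentifier": "train", "weightFactor": 0.5},  # lower weight => larger share
        {"shareIdentifier": "eval", "weightFactor": 1.0},
    ],
}

def create_policy(name: str) -> dict:
    """Register the scheduling policy with AWS Batch (requires AWS credentials)."""
    import boto3  # imported here so the module loads without the SDK installed

    client = boto3.client("batch")
    return client.create_scheduling_policy(
        name=name,
        fairsharePolicy=fair_share_policy,
    )
```

Once attached to a job queue, this policy lets the scheduler balance capacity across share identifiers instead of serving jobs strictly first-in, first-out.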