In science fiction and futures studies, the word “singularity” refers to a rapidly snowballing artificial intelligence that, by repeatedly iterating on itself, eclipses all human knowledge and ability. It is this word that Microsoft, perhaps ambitiously, has chosen for its new AI project, a “globally distributed scheduling service for highly efficient and reliable execution of deep learning training and inference workloads.”
Microsoft’s Singularity is a response to the computational cost of training deep learning workloads, a cost that has quickly spiraled as those workloads have grown in size, complexity and number. It is also an attempt to put otherwise idle capacity to work, an increasingly prominent concern in discussions of how to reduce the costs and environmental footprint of high-performance computing systems and of the AI models trained on them.
“Singularity is built with one key goal,” explains the preprint paper, which was written by a team of more than two dozen Microsoft researchers and published on arXiv, “driving down the cost of AI by maximizing the aggregate useful throughput on a given fixed pool of capacity of accelerators on a planet scale, while providing stringent [service-level agreements] for multiple pricing tiers.”
“At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance across a global fleet of AI accelerators (e.g., GPUs, FPGAs).”
The researchers say that Singularity treats this entire fleet of accelerators as a “single logical, shared cluster, and avoids any resource fragmentation or static reservation of capacity.” Singularity manages this by scaling jobs elastically as available capacity grows and shrinks and, where necessary, by checkpointing, preempting and migrating jobs across nodes, clusters or regions. This scheduler, they say, transcends cluster, region and workload boundaries, while ensuring resilience to failure by resuming jobs from where they were preempted.
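To make the preempt-checkpoint-migrate-resume pattern concrete, here is a minimal, illustrative Python sketch. It is not Microsoft’s implementation, and every name in it (Job, migrate, the cluster names) is hypothetical; it only shows the general idea of a job whose progress is checkpointed so it can be preempted on one node and resumed elsewhere without losing work.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    total_steps: int
    completed_steps: int = 0  # doubles as the "checkpoint": where to resume from

    def run(self, steps: int) -> None:
        # Run a bounded slice of work, then yield control back to the scheduler.
        end = min(self.completed_steps + steps, self.total_steps)
        for _ in range(self.completed_steps, end):
            pass  # stand-in for one training step
        self.completed_steps = end

    @property
    def done(self) -> bool:
        return self.completed_steps >= self.total_steps


def migrate(job: Job, src: str, dst: str) -> None:
    # "Migration" here is just carrying the checkpointed step count to another
    # node, cluster or region and resuming from it there.
    print(f"preempting {job.name} on {src} at step {job.completed_steps}; resuming on {dst}")


if __name__ == "__main__":
    job = Job("train-resnet", total_steps=1000)
    job.run(steps=400)   # the scheduler grants a slice of accelerator time
    migrate(job, src="cluster-us-west", dst="cluster-eu-north")
    job.run(steps=600)   # resumes from step 400 rather than starting over
    print(f"{job.name} finished: {job.done}")
```

In the real system the checkpointed state would include model weights, optimizer state and data-loader position, but the scheduling logic follows the same shape: no work is repeated after a preemption.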
The paper devotes most of its attention to the scheduler: “in this paper, we focus only on the above core mechanisms of the Singularity scheduler,” it reads, even though Singularity is a “significantly broad and complex distributed system[.]” For more detail on how the scheduler works, read the paper, titled “Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads,” here.
But the researchers did give a glimpse into a hardware implementation of Singularity, which they say is capable of scaling to a global fleet of hundreds of thousands of accelerators. The paper describes an evaluation using Nvidia DGX-2 servers, each comprising two Intel Xeon Platinum 8168 CPUs, 1384GB of RAM and 16 V100 GPUs. Microsoft didn’t specify how many of these DGX-2 servers were used in the evaluation.
“Singularity achieves all of this with a remarkably simple user experience,” the paper adds. “The user focuses only on the ML task and does not need to think about checkpointing or elasticity; these mechanisms are infrastructure optimizations that are completely transparent to the user.”
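For a sense of what that transparency means from the user’s side, here is a small, speculative sketch. The transparent_checkpoint decorator and the state.pkl path are hypothetical, and in Singularity the equivalent mechanism lives in the infrastructure rather than in user code at all; the point is only that the training logic itself never mentions checkpointing or elasticity.

```python
import functools
import os
import pickle


def transparent_checkpoint(path: str = "state.pkl"):
    # Wraps a step function so its state is persisted after every call,
    # without the training code having to know about it.
    def wrap(step_fn):
        @functools.wraps(step_fn)
        def inner(state):
            state = step_fn(state)
            with open(path, "wb") as f:
                pickle.dump(state, f)  # saved state lets any node pick up from here
            return state
        return inner
    return wrap


@transparent_checkpoint()
def train_step(state):
    # The user writes only the ML task; no checkpoint or elasticity logic here.
    state["step"] += 1
    return state


if __name__ == "__main__":
    state = {"step": 0}
    if os.path.exists("state.pkl"):
        with open("state.pkl", "rb") as f:
            state = pickle.load(f)  # resume wherever a previous run stopped
    for _ in range(3):
        state = train_step(state)
    print("reached step", state["step"])
```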
To learn more about this research, read the paper here.