Microsoft Azure continues to infuse its cloud platform with HPC- and AI-directed technologies. Today the cloud services purveyor announced a new virtual machine family aimed at “supercomputer-class AI,” backed by Nvidia A100 Ampere GPUs, AMD Epyc Rome CPUs, 1.6 Tbps of HDR InfiniBand, and PCIe 4.0 connectivity. The NDv4 VM instances scale to support models with more than 100 billion parameters and deliver exaops of compute, according to Evan Burness, principal program manager for HPC & Big Compute at Azure.
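The 1.6 Tbps per-VM figure is consistent with one 200 Gbps HDR InfiniBand link per GPU across an eight-GPU node; the per-link rate is an assumption based on the HDR standard, not a detail stated in the announcement. A quick sanity check:

```python
# Back-of-envelope: aggregate InfiniBand bandwidth for an 8-GPU ND A100 v4 VM.
# Assumes one HDR InfiniBand link (200 Gbps) per GPU, which matches the
# article's 1.6 Tbps aggregate figure; the per-link rate is an assumption.
GPUS_PER_VM = 8
HDR_LINK_GBPS = 200  # HDR InfiniBand rate per link

aggregate_gbps = GPUS_PER_VM * HDR_LINK_GBPS
aggregate_tbps = aggregate_gbps / 1000

print(f"{aggregate_gbps} Gbps = {aggregate_tbps} Tbps per VM")  # 1600 Gbps = 1.6 Tbps
```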
“In our continuum of Azure innovation, we’re excited to announce the new ND A100 v4 VM series, our most powerful and massively scalable AI VM, available on-demand from eight to thousands of interconnected Nvidia GPUs across hundreds of VMs,” said Ian Finder, senior program manager, accelerated HPC infrastructure at Azure.
Before building these instances into its Azure cloud service, Microsoft first designed and deployed an AI supercomputer for OpenAI out of similar elements: Nvidia GPUs and AMD Epyc Rome chips. With more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server in the cluster, Microsoft claimed the system would place among the top five of the Top500 list (although it did not appear on the June 2020 edition of the bellwether list).
The supercomputer allowed researchers to train OpenAI’s 175-billion-parameter GPT-3 model, which can perform tasks it wasn’t explicitly trained for, such as composing poetry and translating between languages, a step toward OpenAI’s goal of broadly capable artificial intelligence.
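The 175-billion-parameter figure makes clear why such models demand multi-GPU clusters: the weights alone exceed any single GPU’s memory. A rough, illustrative calculation (the FP16 precision and 40 GB A100 variant are assumptions, not OpenAI’s actual training configuration):

```python
import math

# Back-of-envelope: why a 175-billion-parameter model like GPT-3 cannot fit
# on one GPU. Figures are illustrative assumptions, not OpenAI's actual setup.
PARAMS = 175e9
BYTES_PER_PARAM_FP16 = 2   # half-precision weights only, no optimizer state
A100_MEM_GB = 40           # the 40 GB A100 variant

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
min_gpus_weights_only = math.ceil(weights_gb / A100_MEM_GB)

print(f"FP16 weights alone: {weights_gb:.0f} GB -> "
      f"at least {min_gpus_weights_only} A100s just to hold them")
```

In practice, gradients and optimizer state multiply that footprint several times over, which is why training runs spread across hundreds or thousands of interconnected GPUs rather than the nine this minimum suggests.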
The new instances are part of Azure’s ND-series VMs, designed for the needs of AI and deep learning workloads.
NDv4 VMs are a follow-on to the NDv2-series virtual machines, built on the Nvidia HGX platform and powered by eight Nvidia V100 GPUs with 32 GB of memory each, 40 non-hyperthreaded Intel Xeon Platinum 8168 processor cores, and 672 GiB of system memory. The Azure NDv3 series, currently in preview, features the Graphcore IPU, a novel architecture that enables high-throughput processing of neural networks even at small batch sizes.
The ND A100 v4 VM series brings Ampere A100 GPUs into the Azure cloud just four months after their launch at GTC (Nvidia’s GPU Technology Conference), illustrating the accelerating adoption cycle of AI- and HPC-class technologies flowing into the cloud. Google Cloud introduced its A2 family, based on A100 GPUs, less than two months after Ampere’s arrival. Cloud giant AWS has said it will offer A100 GPUs as well.
“The ND A100 v4 VM series is backed by an all-new Azure-engineered AMD Rome-powered platform with the latest hardware standards like PCIe Gen4 built into all major system components. PCIe Gen 4 and NVIDIA’s third-generation NVLink architecture for the fastest GPU-to-GPU interconnection within each VM keeps data moving through the system more than 2x faster than before,” Finder stated in a blog post.
He added that most customers can expect “an immediate boost of 2x to 3x compute performance over the previous generation of systems based on Nvidia V100 GPUs with no engineering work,” while customers leveraging A100 features such as multi-precision support, sparsity acceleration and multi-instance GPU (MIG) can realize up to a 20x boost.
“Azure’s A100 instances enable AI at incredible scale in the cloud,” Nvidia said in a statement. “To power AI workloads of all sizes, its new ND A100 v4 VM series can scale from a single partition of one A100 to an instance of thousands of A100s networked with Nvidia Mellanox interconnects.”
The accelerated compute leader added, “This [announcement] comes on the heels of top server makers unveiling plans for more than 50 A100-powered systems and Google Cloud’s announcement of A100 availability.”
Azure ND A100 v4 machines are available now in preview.
For more details, see https://azure.microsoft.com/en-us/blog/bringing-ai-supercomputing-to-customers/.