AI-optimized Azure VM series featuring AMD’s flagship MI300X GPU

By Marc Charest, Microsoft Azure Specialized Compute, Sr Program Manager

October 24, 2024

Artificial intelligence is transforming every industry and creating new opportunities for innovation and growth. At the same time, AI models continue to advance, becoming more complex and more accurate. Fueling these advances requires more powerful computers with purpose-built AI accelerators that offer high bandwidth memory (HBM), specialized data formats, and exceptional compute performance.

To meet this need, Azure is proud to be the first cloud service to offer general availability of the new Azure ND MI300X v5 virtual machine (VM) series, based on AMD’s latest Instinct GPU, the MI300X. This new VM series is the first cloud offering of its kind, delivering the highest high bandwidth memory (HBM) capacity of any available VM at industry-leading speeds, letting customers serve larger models faster and with fewer GPUs.

Unmatched infrastructure optimized at every layer

The new Azure ND MI300X virtual machine series is the product of a long collaboration with AMD to build powerful cloud systems for AI with open-source software, with optimizations across the entire hardware and software stack. Each VM is powered by 8x AMD MI300X GPUs, giving it 1.5 TB of high bandwidth memory (HBM) and 5.3 TB/s of HBM bandwidth. HBM is essential for AI applications because of its high bandwidth, low power consumption, and compact size, making it ideal for workloads that must process vast amounts of data quickly. The result is a VM with industry-leading performance, HBM capacity, and HBM bandwidth, enabling you to fit larger models in GPU memory or use fewer GPUs. In the end, you save power, cost, and time-to-solution.
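To make the memory headroom concrete, here is a minimal back-of-the-envelope sketch, not a benchmark, assuming 192 GB of HBM per MI300X GPU (8 x 192 GB = 1.5 TB per VM) and 2 bytes per parameter for fp16 weights, of how many GPUs are needed just to hold model weights; KV cache and activations are deliberately ignored:

```python
def gpus_for_weights(params_billions, bytes_per_param=2, hbm_per_gpu_gb=192):
    """Rough lower bound: weight storage only; KV cache and activations ignored."""
    weight_gb = params_billions * bytes_per_param   # 1e9 params * bytes / 1e9 bytes-per-GB
    gpus = -(-weight_gb // hbm_per_gpu_gb)          # ceiling division
    return weight_gb, int(gpus)

for size_b in (70, 180, 405):
    gb, n = gpus_for_weights(size_b)
    print(f"{size_b}B params @ fp16 ~ {gb:.0f} GB of weights -> at least {n} GPU(s)")
```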

On the software side, Azure ND MI300X uses the AMD ROCm open-source software platform, which provides a comprehensive set of tools and libraries for AI development and deployment. The ROCm platform supports popular frameworks such as TensorFlow and PyTorch, as well as Microsoft libraries for AI acceleration like ONNX Runtime, DeepSpeed, and MSCCL. The ROCm platform also enables seamless porting of models and solutions from one platform to another, lowering your engineering costs and speeding up time to market for your AI solutions.
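As a minimal illustration of that portability, the sketch below assumes a ROCm build of PyTorch on an ND MI300X v5 VM; existing CUDA-style PyTorch code runs unchanged because ROCm exposes the GPUs through the familiar torch.cuda interface:

```python
import torch

# On ROCm builds of PyTorch, MI300X GPUs show up through the torch.cuda API.
assert torch.cuda.is_available(), "no GPU visible to PyTorch"
print(torch.cuda.get_device_name(0))   # e.g. an AMD Instinct MI300X
print(torch.version.hip)               # set on ROCm builds; None on CUDA builds

# The same tensor code written for CUDA runs on the MI300X via ROCm/HIP.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
y = x @ x
print(y.shape, y.dtype, y.device)
```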

For customers looking to scale out efficiently to thousands of GPUs, it’s as simple as deploying Azure ND MI300X v5 VMs with a standard Azure Virtual Machine Scale Set (VMSS). Azure ND MI300X v5 VMs feature high-throughput, low-latency InfiniBand communication between VMs: each GPU has its own dedicated 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand link, for 3.2 Tb/s of inter-node bandwidth per VM. InfiniBand is the standard interconnect for AI workloads that need to scale out to large numbers of VMs and GPUs.
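From the framework side, one hedged sketch of what scale-out looks like is PyTorch’s distributed package launched with torchrun on each VM in the scale set; collectives then ride the per-GPU InfiniBand links (on ROCm, PyTorch’s "nccl" backend name is backed by RCCL):

```python
import os
import torch
import torch.distributed as dist

def main():
    # "nccl" is PyTorch's backend name; on ROCm builds it maps to RCCL.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # A trivial all-reduce whose traffic crosses VM boundaries over InfiniBand.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}, sum of ranks={t.item():.0f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A job like this would be launched on each VM with, for example, torchrun --nnodes=<number of VMs> --nproc-per-node=8, so that all eight GPUs per VM join the process group.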

Scalable AI infrastructure running capable OpenAI models

These Azure ND MI300X VMs, and the software that powers them, were purpose-built for our own Azure AI services production workloads. We have already optimized the most capable natural language model in the world, GPT-4 Turbo, for these VMs. Whether you want to generate text, answer questions, summarize documents, or create new applications, you can leverage the power and scalability of the Azure AI infrastructure to run these models at lightning speed, at huge scale, and with optimized efficiency.

ND MI300X v5 VMs offer leading cost performance for popular OpenAI and open-source models.
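If you consume these models through the Azure OpenAI Service rather than hosting them yourself, a minimal hedged sketch with the openai Python SDK (v1.x) looks like the following; the endpoint, API version, and deployment name are placeholders to replace with your own:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",                             # example API version
)

response = client.chat.completions.create(
    model="gpt-4-turbo",                                  # your deployment name
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
)
print(response.choices[0].message.content)
```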

Leading with innovation to advance the ecosystem

We are also working closely with our partners and customers so they can take full advantage of these new VMs and accelerate their AI projects and applications. One of our partners, Hugging Face, is a popular provider of open-source natural language processing (NLP) models. Hugging Face ported their models to Azure ND MI300X VMs without any code changes and achieved 2x to 3x performance gains over AMD’s MI250 on these VMs. You can now use these open-source models and Hugging Face libraries on Azure ND MI300X VMs to create and deploy your own NLP applications with ease and efficiency.
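As a minimal sketch of that no-code-changes experience, the example below loads an open model from the Hugging Face Hub with the transformers pipeline API on an ND MI300X v5 VM; the model name is only an illustrative choice, and the ROCm build of PyTorch handles the GPU transparently:

```python
import torch
from transformers import pipeline

# device=0 targets the first MI300X GPU; no MI300X-specific code is required.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",   # illustrative open model
    torch_dtype=torch.float16,
    device=0,
)

print(generator("Azure ND MI300X v5 VMs are", max_new_tokens=40)[0]["generated_text"])
```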

Get started today

We’re excited to see what our customers will do with the new VMs. Whether you bring your own models, use our models through the Azure OpenAI Service, or use open models from the Azure AI model catalog or from Hugging Face, you can get the best performance at the best price on the new Azure AI infrastructure VMs. You can also scale your VMs up or down as needed, thanks to the flexibility and elasticity of the Azure cloud.

Learn more about Azure ND MI300X and get started today.

Learn more about Azure and AMD

Azure AI Infrastructure
Azure High Performance Computing
Achieve more with Microsoft Azure and AMD
