AWS Releases Amazon EC2 Trn1 Instances Powered by AWS-Designed Trainium Chips

October 11, 2022

SEATTLE, Oct. 11, 2022 — Amazon Web Services, Inc. (AWS) today announced the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances powered by AWS-designed Trainium chips. Trn1 instances are purpose built for high-performance training of machine learning models in the cloud while offering up to 50% cost-to-train savings over comparable GPU-based instances.

Trn1 instances provide the fastest time to train popular machine learning models on AWS, enabling customers to reduce training times, rapidly iterate on models to improve accuracy, and increase productivity for workloads like natural language processing, speech and image recognition, semantic search, recommendation engines, fraud detection, and forecasting. There are no minimum commitments or upfront fees to use Trn1 instances, and customers pay only for the amount of compute used.

More customers are building, training, and deploying machine learning models to power applications that have the potential to reinvent their businesses and customer experiences. These machine learning models are becoming increasingly complex and consume ever-growing amounts of training data to help improve accuracy. As a result, customers must scale their models across thousands of accelerators, which makes them more expensive to train. This directly impacts the ability of research and development teams to experiment and train different models, which limits how quickly customers are able to bring their innovations to market. AWS already provides the broadest and deepest choice of compute offerings featuring hardware accelerators for machine learning, including Inf1 instances with AWS-designed Inferentia chips, G5 instances, P4d instances, and DL1 instances. But even with the fastest accelerated instances available today, training more complex machine learning models can still be prohibitively expensive and time consuming.

New Trn1 instances powered by AWS Trainium chips offer the best price performance and the fastest machine learning model training on AWS, providing up to 50% lower cost to train deep learning models compared to the latest GPU-based P4d instances. AWS Neuron, the software development kit (SDK) for Trn1 instances, enables customers to get started with minimal code changes and is integrated into popular machine learning frameworks like PyTorch and TensorFlow. Trn1 instances feature up to 16 AWS Trainium accelerators that are purpose built for training deep learning models. Trn1 instances are the first Amazon EC2 instances to offer up to 800 Gbps of networking bandwidth (lower latency and 2x faster than the latest EC2 GPU-based instances), using the second generation of AWS’s Elastic Fabric Adapter (EFA) network interface to improve scaling efficiency. Trn1 instances also use NeuronLink, a high-speed intra-instance interconnect, for faster training. Customers can deploy Trn1 instances in Amazon EC2 UltraClusters consisting of tens of thousands of Trainium accelerators to rapidly train even the most complex deep learning models with trillions of parameters. With EC2 UltraClusters, customers can scale the training of machine learning models across up to 30,000 Trainium accelerators interconnected with EFA petabit-scale networking, giving them on-demand access to supercomputing-class performance and cutting training times from months to days. Each Trn1 instance supports up to 8 TB of local NVMe SSD storage for fast access to large datasets. AWS Trainium supports a wide range of data types (FP32, TF32, BF16, FP16, and configurable FP8) as well as stochastic rounding, a way of rounding probabilistically that enables higher performance and higher accuracy compared to the legacy rounding modes often used in deep learning training. AWS Trainium also supports dynamic tensor shapes and custom operators, delivering a flexible infrastructure designed to evolve with customers’ training needs.
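To illustrate why stochastic rounding matters for low-precision training, the following is a minimal, hardware-independent Python sketch (not Trainium or Neuron SDK code; the quantization step size and update values are illustrative assumptions). It accumulates many small updates in a simulated low-precision accumulator: round-to-nearest silently drops every update that is smaller than half the representable step, while stochastic rounding remains unbiased in expectation and tracks the true sum.

```python
import random

EPS = 0.25  # illustrative quantization step of a coarse number format

def round_to_nearest(x, eps=EPS):
    """Round x to the nearest multiple of eps (legacy rounding mode)."""
    return round(x / eps) * eps

def stochastic_round(x, eps=EPS):
    """Round x down or up to a multiple of eps, rounding up with
    probability equal to the fractional remainder, so the result is
    unbiased in expectation."""
    lo = (x // eps) * eps
    frac = (x - lo) / eps
    return lo + eps if random.random() < frac else lo

def accumulate(rounder, step=0.1, n=1000):
    """Add n small steps into a low-precision accumulator,
    re-quantizing after every addition (as accelerator hardware does)."""
    acc = 0.0
    for _ in range(n):
        acc = rounder(acc + step)
    return acc

random.seed(0)
print(accumulate(round_to_nearest))  # every 0.1 update rounds back to 0.0
print(accumulate(stochastic_round))  # tracks the true sum of ~100
```

With round-to-nearest, each `acc + 0.1` is closer to the old accumulator value than to the next representable step, so the sum never moves; stochastic rounding rounds up 40% of the time, recovering the correct mean. This is the effect the hardware-accelerated stochastic rounding on Trainium exploits when training in reduced-precision data types such as BF16.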

“Over the years we have seen machine learning go from a niche technology used by the largest enterprises to a core part of many of our customers’ businesses, and we expect machine learning training will rapidly make up a large portion of their compute needs,” said David Brown, vice president of Amazon EC2 at AWS. “Building on the success of AWS Inferentia, our high-performance machine learning chip, AWS Trainium is our second-generation machine learning chip purpose built for high-performance training. Trn1 instances powered by AWS Trainium will help our customers reduce their training time from months to days, while being more cost efficient.”

Trn1 instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that streamline the delivery of isolated multi-tenancy, private networking, and fast local storage. The AWS Nitro System offloads the CPU virtualization, storage, and networking functions to dedicated hardware and software, delivering performance that is nearly indistinguishable from bare metal. Trn1 instances will be available via additional AWS services including Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Batch. Trn1 instances are available for purchase as On-Demand Instances, with Savings Plans, as Reserved Instances, or as Spot Instances. Trn1 instances are available today in US East (N. Virginia) and US West (Oregon), with availability in additional AWS Regions coming soon.

Amazon’s product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world. “We are training large language models that are multi-modal, multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience,” said Trishul Chilimbi, senior principal scientist at Amazon Search. “Amazon EC2 Trn1 instances provide a more sustainable way to train large language models, delivering the best performance per watt compared to other accelerated machine learning solutions and offering us high performance at the lowest cost. We plan to explore the new configurable FP8 data type and hardware-accelerated stochastic rounding to further increase our training efficiency and development velocity.”

PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment. “At PyTorch, we want to accelerate taking machine learning from research prototyping to production readiness for customers. We have collaborated extensively with AWS to provide native PyTorch support for new AWS Trainium-powered Trn1 instances. Developers building PyTorch models can start training on Trn1 instances with minimal code changes,” said Geeta Chauhan, applied AI engineering manager at PyTorch. “Additionally, we have worked with the OpenXLA community to enable PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All these capabilities make Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance.”

Helixon builds next-generation artificial intelligence (AI) solutions for protein-based therapeutics, developing AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. “Today, we use training distribution libraries like Fully Sharded Data Parallel to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model,” said Jian Peng, CEO at Helixon. “We are excited to utilize Amazon EC2 Trn1 instances featuring the highest networking bandwidth available on AWS to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs.”

Money Forward, Inc. serves businesses and individuals with an open and fair financial platform. “We launched a large-scale AI chatbot service on Amazon EC2 Inf1 instances and reduced our inference latency by 97% compared to comparable GPU-based instances, while also reducing costs. Because we periodically fine-tune tailored natural language processing models, reducing model training times and costs is also important,” said Takuya Nakade, CTO at Money Forward. “Based on our experience successfully migrating inference workloads to Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end machine learning performance and cost.”

Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive. “Training large autoregressive transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16-bit and 8-bit data types,” said Eric Steinberger, co-founder and CEO at Magic. “Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about the native support for BF16 stochastic rounding in Trainium, which increases performance while delivering numerical accuracy indistinguishable from full precision.”

About Amazon Web Services

For over 15 years, Amazon Web Services has been the world’s most comprehensive and broadly adopted cloud offering. AWS has been continually expanding its services to support virtually any cloud workload, and it now has more than 200 fully featured services for compute, storage, databases, networking, analytics, machine learning and artificial intelligence (AI), Internet of Things (IoT), mobile, security, hybrid, virtual and augmented reality (VR and AR), media, and application development, deployment, and management from 87 Availability Zones within 27 geographic regions, with announced plans for 21 more Availability Zones and seven more AWS Regions in Australia, Canada, India, Israel, New Zealand, Spain, and Switzerland. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—trust AWS to power their infrastructure, become more agile, and lower costs.


Source: AWS
