Optimizing your AWS Batch architecture for scale with observability dashboards

By Amazon Web Services

February 7, 2023

AWS Batch is a fully managed service that enables you to run computational jobs at any scale without the need to manage compute resources. Customers often ask for guidance on optimizing their architectures and making their workloads scale rapidly using the service. Every workload is different, and some optimizations may not yield the same results in every case. For this reason, we built an observability solution that provides insights into your AWS Batch architectures, allows you to optimize them for scale, and helps you quickly identify potential throughput bottlenecks for jobs and instances. This solution is designed to run at scale and is regularly used to monitor and optimize workloads running on millions of vCPUs; in one notable case, the observability solution supported an analysis across 2.38M vCPUs within a single AWS Region.

In this blog post, you will learn:

  1. How AWS Batch scales Amazon Elastic Compute Cloud (Amazon EC2) instances to process jobs.
  2. How Amazon EC2 provisions instances for AWS Batch.
  3. How to leverage our observability solution to peek under the covers of AWS Batch and optimize your architectures.

The learnings in this blog post apply to regular and array jobs on Amazon EC2. AWS Batch compute environments that leverage AWS Fargate resources and multi-node parallel jobs are not discussed here.

AWS Batch: how instances and jobs scale on demand

To run a job on AWS Batch, you submit it to a Batch job queue (JQ), which manages the lifecycle of the job, from submission intake to dispatch to tracking the return status. Upon submission, the job is in the SUBMITTED state and then transitions to the RUNNABLE state, which means it is ready to be scheduled for execution. The AWS Batch scheduler service regularly looks at your job queues and evaluates the requirements (vCPUs, GPUs, and memory) of the RUNNABLE jobs. Based on this evaluation, the service identifies whether the associated Batch compute environments (CEs) need to be scaled up in order to process the jobs in the queues.
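For illustration, here is a minimal sketch of submitting a job with the AWS SDK for Python (Boto3) and watching it move through these states. It assumes an existing job queue and job definition; the names used are placeholders, not resources created in this post.

    import time
    import boto3

    batch = boto3.client("batch")

    # Submit a job to an existing queue ("my-queue" and "my-job-def" are placeholders).
    response = batch.submit_job(
        jobName="lifecycle-demo",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
    )
    job_id = response["jobId"]

    # Poll the job to observe the SUBMITTED -> RUNNABLE -> STARTING -> RUNNING transitions.
    while True:
        status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        print(status)
        if status in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(15)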

To initiate this action, AWS Batch generates a list of Amazon EC2 instances based on the jobs’ compute requirements (number of vCPUs, GPUs, and memory) from the Amazon EC2 instance types you selected when configuring your CE. Depending on the allocation strategy selected, this list is provided to an Amazon EC2 Auto Scaling group or a Spot Fleet.

Once instances are created by Amazon EC2, they register with the Amazon ECS cluster attached to the scaling CE and become ready to run your jobs. At this point, depending on the scheduling policy, the jobs waiting in RUNNABLE that fit the free resources transition to the STARTING state, during which their containers are started. Once its container is executing, a job transitions to the RUNNING state. It remains there until it either completes successfully or encounters an error. When there are no jobs in the RUNNABLE state and instances sit idle, AWS Batch detaches the idle instances from the compute environment and requests that Amazon ECS deregister them from the cluster. AWS Batch then terminates them and scales down the EC2 Auto Scaling group or Spot Fleet associated with your CE.

Figure 1: High level structure of AWS Batch resources and interactions. The diagram depicts a user submitting jobs based on a job definition template to a job queue, which then communicates to a compute environment that resources are needed. The compute environment resource scales compute resources on Amazon EC2 or AWS Fargate and registers them with a container orchestration service, either Amazon ECS or Amazon Elastic Kubernetes Service (Amazon EKS). In this post we only cover Amazon ECS clusters with EC2 instances.

Job placement onto instances

While the CEs are scaling up and down, AWS Batch continuously asks Amazon ECS to place jobs on your instances by calling the RunTask API. Jobs are placed onto an instance if there is room to fit them. If a placement cannot be fulfilled and the API call fails, Amazon ECS provides a reason logged in AWS CloudTrail, such as a lack of available resources to run the job. The rate of successful API calls can help you compare how fast two architectures schedule jobs to run.
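As a quick way to sample this signal outside the dashboards, the sketch below uses CloudTrail’s LookupEvents API to tally recent RunTask outcomes. This is an illustrative assumption on our part, not a component of the published solution; note that placement shortfalls are reported in the failures list of the RunTask response rather than as API errors.

    import json
    from collections import Counter
    import boto3

    cloudtrail = boto3.client("cloudtrail")
    outcomes = Counter()

    # Tally recent RunTask calls: API errors, placement failures, and successes.
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RunTask"}]
    ):
        for event in page["Events"]:
            detail = json.loads(event["CloudTrailEvent"])
            failures = (detail.get("responseElements") or {}).get("failures") or []
            if detail.get("errorCode"):
                outcomes[detail["errorCode"]] += 1
            elif failures:
                for failure in failures:
                    outcomes[failure.get("reason", "UnknownFailure")] += 1
            else:
                outcomes["Placed"] += 1

    print(outcomes)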

How instances are provisioned

You have seen how instances are requested by AWS Batch; we will now discuss how they are provisioned. Instance provisioning is handled by Amazon EC2, which picks instances based on the list generated by AWS Batch and the allocation strategy selected when creating the CE.

With the BEST_FIT (BF) allocation strategy, Amazon EC2 picks the least expensive instance type; with BEST_FIT_PROGRESSIVE (BFP), it picks the least expensive instance types first and falls back to additional instance types if the previously selected ones are not available. In the case of SPOT_CAPACITY_OPTIMIZED, Amazon EC2 Spot selects instances that are less likely to see interruptions.
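For reference, the allocation strategy is set when you create the compute environment. Below is a minimal Boto3 sketch; the subnet, security group, and role identifiers are placeholders you would substitute with your own.

    import boto3

    batch = boto3.client("batch")

    # A managed Spot CE that lets Amazon EC2 pick from the pools least likely
    # to be interrupted.
    batch.create_compute_environment(
        computeEnvironmentName="CE1",
        type="MANAGED",
        computeResources={
            "type": "SPOT",
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": 4000,
            "instanceTypes": ["c5", "c4", "m5", "m4"],
            "subnets": ["subnet-0123456789abcdef0"],       # placeholder
            "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
            "instanceRole": "ecsInstanceRole",             # placeholder instance profile
        },
        serviceRole="AWSBatchServiceRole",                 # placeholder
    )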

Why optimize the Amazon EC2 instance selection in your Compute Environments?

Understanding how AWS Batch selects instances helps you optimize the environment to achieve your objective. The instances you get through AWS Batch are picked based on the number of jobs, their shapes in terms of the number of vCPUs and amount of memory they require, the allocation strategy, and the price. If your jobs have low requirements (1-2 vCPUs, <4 GB of memory), then it is likely that you will get small instances from Amazon EC2 (e.g., c5.xlarge), since they can be cost effective and belong to the deepest Spot Instance pools. In this case, you may reach the maximum number of container instances per ECS cluster (sitting beneath each compute environment) and therefore potentially hinder your ability to run at large scale. One solution to this challenge is to configure larger instances in your compute environments, which helps you achieve the desired capacity while staying under the per-cluster limit imposed by ECS. To do that, you need visibility into the instances picked for you.

Another case where observability on your AWS Batch environment is important is when your jobs need scratch storage space while running. If you set your EBS volumes to a fixed size, you may see some jobs crash on larger instances with DockerTimeoutErrors or CannotInspectContainerError. This can be caused by the higher number of jobs packed per instance consuming your EBS burst balance, slowing I/O operations to the point that the Docker daemon cannot run checksums before it times out. Possible solutions are to increase the EBS volume sizes in your launch template, or to prefer EC2 instances with instance stores. Having the visibility to link job failures to instance types is helpful in quickly determining the relevant solutions.
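As one sketch of the first mitigation, the snippet below creates a launch template with a larger gp3 root volume that a CE can reference; the template name, device name, and volume size are illustrative assumptions.

    import boto3

    ec2 = boto3.client("ec2")

    # A larger root volume (gp3 also avoids the gp2 burst-credit mechanism) so
    # that many jobs packed on one instance do not starve each other of I/O.
    ec2.create_launch_template(
        LaunchTemplateName="batch-scratch-space",  # placeholder name
        LaunchTemplateData={
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",  # root device on Amazon Linux 2 ECS-optimized AMIs
                    "Ebs": {"VolumeSize": 500, "VolumeType": "gp3"},
                }
            ]
        },
    )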

To help you understand and tune your AWS Batch architectures, we developed and published a new open-source observability solution on GitHub. This solution provides metrics that you can use to evaluate your AWS Batch architectures at scale. The metrics include: the instances provisioned per compute environment and their types, the Availability Zones where they are located, the instances your jobs are scheduled on, how your jobs transition between states, and the placement rate per instance type.

We will discuss this observability solution in detail and demonstrate how it can be used to understand the behavior of your compute environment for a given workload. The examples provided here can be deployed in your account using AWS CloudFormation templates.

AWS Batch observability solution

This open-source observability solution for AWS Batch is a serverless application that collects events from different services, combines them into a set of Amazon CloudWatch metrics, and presents them in a set of dashboards. The solution is an AWS Serverless Application Model (SAM) template that you can deploy in an AWS account using the SAM CLI. The application is self-contained and can monitor AWS Batch architectures running from a few instances to tens of thousands. The architecture below provides details of the SAM application. For instructions on how to install the solution, please refer to the GitHub repository.

SAM application architecture

The SAM application collects data points from your AWS Batch environment using CloudWatch events. Depending on their nature, these events trigger AWS Step Functions to capture the relevant data, add it to Amazon DynamoDB tables, and write the metrics using the CloudWatch embedded metric format. These metrics are aggregated in Amazon CloudWatch using a SEARCH expression and displayed on dashboards that you can use to visualize the behavior of your workloads. The architecture diagram of the application is shown in Figure 2.
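For context, the embedded metric format is plain structured JSON written to a log stream; CloudWatch extracts the metrics from it asynchronously, with no PutMetricData calls. Here is a minimal sketch of emitting one such record from a Lambda function; the namespace, dimension, and metric names are illustrative, not the ones used by the solution.

    import json
    import time

    def emit_placement_metric(instance_type: str, placed: int) -> None:
        # CloudWatch Logs parses this JSON and creates the metric automatically.
        record = {
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [
                    {
                        "Namespace": "BatchObservability",  # illustrative namespace
                        "Dimensions": [["InstanceType"]],
                        "Metrics": [{"Name": "PlacedJobs", "Unit": "Count"}],
                    }
                ],
            },
            "InstanceType": instance_type,
            "PlacedJobs": placed,
        }
        print(json.dumps(record))  # stdout from Lambda lands in CloudWatch Logs

    emit_placement_metric("c5.4xlarge", 12)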

Figure 2: SAM application architecture. The diagram depicts how the different API calls made by Amazon ECS, AWS Batch, and Amazon EC2 are collected and displayed in the dashboards. For Amazon ECS, the RunTask API call and the EC2 instance registration/deregistration are monitored, triggering AWS Step Functions, which in turn trigger AWS Lambda to publish the metrics to a queue and then to the respective dashboards. For AWS Batch, job state transitions trigger Step Functions to publish the transitions to the dashboard. Amazon EC2’s CreateAutoScalingGroup API call uses Lambda to monitor the Auto Scaling group metrics.

Using Amazon CloudWatch Events with AWS Batch

This solution only uses events generated by services instead of calling APIs to describe the infrastructure. This is a good practice when running at large scale to avoid potential API throttling when trying to retrieve information from thousands of instances or millions of jobs.

The events captured by the serverless application are the following:

  • ECS container instance registration and deregistration: When an instance reaches the RUNNING state, it registers itself with ECS to become part of the resource pool where your jobs will run. When an instance is removed, it deregisters and will no longer be used by ECS to run jobs. Both API calls contain information on the instance type, its resources (vCPUs, memory), the Availability Zone where the instance is located, the ECS cluster, and the container instance ID. Instance registration status is collected in an Amazon DynamoDB table to determine the EC2 instance, Availability Zone, and ECS cluster used by a job via the container instance ID.
  • RunTask API calls: These are used by Batch to request the placement of jobs on instances. If successful, a job is placed. If an error is returned, the cause can be a lack of available instances or resources (vCPUs, memory) to place a job. Calls are captured in an Amazon DynamoDB table to associate a Batch job with the container instance it runs on.
  • Batch job state transitions: These trigger an event when a job transitions between states. When a job moves to the RUNNING state, its JobID is matched against the ECS instances and RunTask DynamoDB tables to identify where the job is running (the sketch after this list shows how such an event rule can be wired up).
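As an illustration of that wiring, the sketch below creates an EventBridge (formerly CloudWatch Events) rule matching AWS Batch job state transitions and targets a Step Functions state machine. The rule name, state machine ARN, and role ARN are placeholders; the actual rules are created for you by the SAM template.

    import json
    import boto3

    events = boto3.client("events")

    # Match every AWS Batch job state transition in the Region.
    events.put_rule(
        Name="batch-job-state-transitions",  # placeholder name
        EventPattern=json.dumps(
            {"source": ["aws.batch"], "detail-type": ["Batch Job State Change"]}
        ),
    )

    # Route matched events to a Step Functions state machine (placeholder ARNs).
    events.put_targets(
        Rule="batch-job-state-transitions",
        Targets=[
            {
                "Id": "state-machine",
                "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:example",
                "RoleArn": "arn:aws:iam::123456789012:role/example-events-role",
            }
        ],
    )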

The serverless application must be deployed in your account before moving to the next section. If you haven’t done so, please follow the documentation in the application repository.

Understanding your Batch architecture

Now that you can visualize how AWS Batch acquires instances and places jobs, we will use this data to evaluate the effect of instance selection on a test workload.

To demonstrate this, you will deploy a set of AWS CloudFormation templates as described in Part I of this workshop. The first deploys a VPC stack and the second deploys an AWS Batch stack (see Figure 3). Then you will create and upload a container image to Amazon ECR and submit two Batch array jobs of 500 child jobs each that will use the CPU for a period of 10 minutes per job, as sketched below.
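A minimal sketch of that submission with Boto3 follows; the queue and job definition names are placeholders for the resources created by the Batch stack.

    import boto3

    batch = boto3.client("batch")

    # Two array jobs of 500 child jobs each; every child burns CPU for ~10 minutes.
    for i in range(2):
        response = batch.submit_job(
            jobName=f"stress-test-{i}",
            jobQueue="test-queue",     # placeholder for the stack's job queue
            jobDefinition="cpu-burn",  # placeholder for the stack's job definition
            arrayProperties={"size": 500},
        )
        print(response["jobId"])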

Figure 3: AWS Batch architecture for the test workloads. This diagram depicts the environment deployed by the AWS CloudFormation templates. The VPC stack deploys a VPC in 2 or more Availability Zones. In one Availability Zone you have both a public and a private subnet; in the rest, only a private subnet. An internet gateway is created in the public subnet. The Batch stack creates the Amazon Elastic Container Registry, the job queue, and the compute environment spanning the private subnets.

Understanding the instance selection of your Compute Environments

When creating a Batch compute environment, you can select the type and size of the instances that will be used to run your jobs (e.g., c5.9xlarge and c5.18xlarge), use entire families of instances (e.g., c5, m5a), or let AWS Batch pick instances for you using the optimal setting. When running jobs, Batch builds a list of instances from your initial selection that can fit your jobs (number of vCPUs, GPUs, and memory); then AWS Batch asks Amazon EC2 to provide the capacity using this refined list. Amazon EC2 then picks instances based on the purchasing model and allocation strategy.

In this scenario, you will explore the instance selection for one compute environment called CE1. The objective is to demonstrate how you can use the serverless application to identify which instances are picked, so you can see how AWS Batch performs instance selection for a given workload.

CE1 is configured to pick instances from the c5, c4, m5, and m4 Amazon EC2 instance families regardless of instance size, and uses the Spot capacity-optimized allocation strategy…

Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.
