AWS Batch updates: higher compute utilization, AWS PrivateLink support, and updatable compute environments

The AWS Batch team has been busy this year, releasing features that give Batch customers better performance, more advanced security and compliance controls, and simpler operational procedures. This blog post describes a few of them.

Faster and more efficient job placements

AWS Batch is a container-centric, fully managed service for submitting work for background processing. You define a set of resources that tells the Batch managed service what kind of resources, and how many, to provision for your jobs. Batch can handle workload requests of any size, and scales automatically in response to your job queue. Our customers use Batch for a truly diverse set of workloads, from background image processing to massive-scale genomics analysis.

Last year we talked about Batch’s faster scaling features that improved resource scaling by up to 5x and job placement by up to 2x. These improvements to the managed services are valuable when you need to scale up and crunch through the job queue as quickly as possible.

Figure 1 – The AWS Batch request flow from job submission through to job execution. The diagram shows which parts of the process were improved: job submission rate by up to 1.6 times, internal job scheduling and execution start by up to 2 times, and scaling of resources by up to 5 times compared with before the scaling improvements.

The other side of this challenge is keeping costs as low as possible by using resources efficiently. When there is a lot of work to be done, Batch launches compute resources as fast as it can to meet the need, and then starts placing jobs as quickly as possible. As the work queue drains, fastest-possible job placement can have the side effect of keeping launched capacity running longer than necessary. This lowers the utilization of those instances at the tail end of a batch of work and adds to the overall cost of the analysis.

We recently changed our job placement logic to take into account the number of jobs remaining in the queue and to switch to a more conservative approach that packs jobs onto a smaller set of the launched instances, at the expense of slightly longer job placement times. This dynamic job placement strategy resulted in 29% better utilization of available capacity for Batch across a wide range of test scenarios. This improved instance utilization results in faster scale-down of the fleet, which in turn lowers the overall cost of running jobs.
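To make the trade-off concrete, here is a toy sketch in Python. It is not Batch's actual scheduler: the function names, the first-fit packing rule, and the queue-length threshold are all assumptions made purely for illustration. It only shows why packing jobs near the end of a queue leaves more instances empty and therefore free to scale down.

```python
from typing import List


def should_pack(remaining_jobs: int, threshold: int) -> bool:
    # Hypothetical rule: switch to packing once the queue is nearly drained.
    return remaining_jobs <= threshold


def place_jobs(job_vcpus: List[int], instance_free: List[int], pack: bool) -> List[int]:
    """Return the index of the instance chosen for each job.

    pack=False spreads jobs, picking the instance with the most free vCPUs.
    pack=True packs jobs, picking the first instance that fits (first fit).
    """
    free = list(instance_free)  # copy so the caller's list is not modified
    placement = []
    for need in job_vcpus:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            raise ValueError("no instance has enough free vCPUs")
        if pack:
            chosen = candidates[0]
        else:
            chosen = max(candidates, key=lambda i: free[i])
        free[chosen] -= need
        placement.append(chosen)
    return placement


# Four 1-vCPU jobs left in the queue, three running 4-vCPU instances.
jobs, fleet = [1, 1, 1, 1], [4, 4, 4]
spread = place_jobs(jobs, fleet, pack=False)
packed = place_jobs(jobs, fleet, pack=should_pack(len(jobs), threshold=10))
print("instances used when spreading:", len(set(spread)))
print("instances used when packing:", len(set(packed)))
```

In this toy case, spreading touches all three instances while packing touches one, so two instances stay idle and can be terminated earlier.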

Support for AWS PrivateLink

While AWS Batch could already launch and manage resources within private Amazon Virtual Private Cloud (Amazon VPC) subnets, customers still needed to route requests to the Batch API through publicly accessible endpoints. Some customers, for security or compliance reasons, do not want to expose any internet-accessible endpoint to their internal services running on premises or within private subnets.

AWS Batch now supports AWS PrivateLink (PrivateLink) for access to the Batch APIs. AWS PrivateLink provides private connectivity between VPCs, AWS services, and your on-premises networks.

To use Batch with PrivateLink, you will need to create an interface VPC endpoint for AWS Batch in your VPC using the VPC management console, SDK, or CLI. You can also access the VPC endpoint from on-premises environments or from other VPCs using AWS VPN, AWS Direct Connect, or VPC Peering.
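As a minimal sketch of the SDK route, the Python below builds the request for an interface endpoint and passes it to the boto3 EC2 client's `create_vpc_endpoint` call. The helper names, the resource IDs, and the choice to enable private DNS are assumptions for illustration. The service name follows the usual `com.amazonaws.<region>.batch` pattern, which you should confirm for your Region before relying on it.

```python
from typing import Dict, List


def batch_endpoint_request(
    region: str,
    vpc_id: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, object]:
    """Build keyword arguments for the EC2 create_vpc_endpoint call."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed service name pattern; verify it for your Region.
        "ServiceName": f"com.amazonaws.{region}.batch",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: resolve the default Batch hostname to the endpoint.
        "PrivateDnsEnabled": True,
    }


def create_batch_endpoint(region: str, **kwargs) -> dict:
    """Create the endpoint. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(**batch_endpoint_request(region, **kwargs))


# Placeholder IDs, not real resources.
request = batch_endpoint_request(
    "us-east-1", "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
print(request["ServiceName"])
```

The security group attached to the endpoint needs to allow inbound HTTPS (port 443) from the clients that will call the Batch API.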

New compute environment update capabilities

Batch compute environments (CEs) define the set of compute and storage resources your jobs will run on. You can define the minimum and maximum total vCPU capacity of the fleet, as well as storage, security groups, and a number of other parameters. You can also define whether the underlying compute resource provider is AWS Fargate or Amazon Elastic Compute Cloud (Amazon EC2).
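The parameters above map onto the Batch `CreateComputeEnvironment` API. The following is a sketch of how a request might be assembled for the boto3 `create_compute_environment` call. The helper name, the environment names, the `"optimal"` instance type choice, and all resource IDs are placeholders chosen for illustration, and only a small subset of the available settings is shown.

```python
from typing import Dict, List, Optional


def compute_environment_request(
    name: str,
    provider: str,  # "EC2" or "FARGATE"
    max_vcpus: int,
    subnets: List[str],
    security_group_ids: List[str],
    min_vcpus: int = 0,
    instance_role: Optional[str] = None,
) -> Dict[str, object]:
    """Build keyword arguments for batch.create_compute_environment."""
    resources: Dict[str, object] = {
        "type": provider,
        "maxvCpus": max_vcpus,
        "subnets": list(subnets),
        "securityGroupIds": list(security_group_ids),
    }
    if provider == "EC2":
        if instance_role is None:
            raise ValueError("EC2 compute environments need an instance role")
        # These settings apply to EC2 only and are omitted for Fargate.
        resources["minvCpus"] = min_vcpus
        resources["instanceTypes"] = ["optimal"]
        resources["instanceRole"] = instance_role
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": resources,
    }


# Placeholder values, not real resources.
ec2_ce = compute_environment_request(
    "example-ec2-ce", "EC2", 256, ["subnet-0example"], ["sg-0example"],
    instance_role="ecsInstanceRole",
)
fargate_ce = compute_environment_request(
    "example-fargate-ce", "FARGATE", 64, ["subnet-0example"], ["sg-0example"]
)
print(sorted(ec2_ce["computeResources"]))
print(sorted(fargate_ce["computeResources"]))
```

With credentials configured, either dictionary could be passed as `boto3.client("batch").create_compute_environment(**ec2_ce)`.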

Before today, once a compute environment was created, you could update only certain settings, such as the minimum and maximum number of allocatable vCPUs, or the service role used by your job requests. Any other change, such as updating an AMI for a security patch, required you to create a new compute environment and swap it in for the existing one in your Batch job queues.

Today we are pleased to announce the release of new capabilities…

Read the full blog to learn more about AWS Batch updates.

Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.