Rearchitecting AWS Batch managed services to leverage AWS Fargate

AWS service teams continuously improve the underlying infrastructure and operations of managed services, and AWS Batch is no exception. The AWS Batch team recently moved most of their job scheduler fleet to a serverless infrastructure model leveraging AWS Fargate. I had a chance to sit with Devendra Chavan, Senior Software Development Engineer on the AWS Batch team, to discuss the move to AWS Fargate and its impact on the Batch managed scheduler service component.

First off, what is AWS Batch and what benefits does it provide our customers?

AWS Batch enables customers to run batch computing jobs on AWS. It removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, much like traditional batch computing software. The Batch service efficiently provisions resources in response to submitted jobs in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly. It plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as AWS Fargate and Amazon EC2 On-Demand and Spot Instances.

What is a Batch scheduler and what does it do?

AWS Batch provides a cloud-native scheduler that is responsible for evaluating jobs that you have submitted to a queue, and managing the lifecycle of those jobs. The scheduler is a managed service that handles job dependencies, timeouts, and retries. It also helps dynamically provision the optimal quantity and type of compute resources based on the aggregate resource requirements of the submitted jobs. The scheduler performs these operations using a mix of poll-based and event-driven mechanisms. It also periodically checks in with the Batch control plane to determine if any configurations have changed.
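To make the lifecycle handling concrete, here is a toy model of a poll-based pass over a queue that respects job dependencies and retries. It is only an illustrative sketch, not the Batch scheduler's implementation: the Job class, the status strings, and the functions schedule_pass and run_until_settled are hypothetical names, the rule that a failed dependency fails the dependent job is an assumption of this model, and timeouts and resource provisioning are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record for this toy model."""
    name: str
    depends_on: list = field(default_factory=list)
    max_attempts: int = 1
    attempts: int = 0
    status: str = "PENDING"  # PENDING -> SUCCEEDED or FAILED


def schedule_pass(jobs, run):
    """One polling pass over the queue. Returns True if anything changed."""
    progressed = False
    for job in jobs.values():
        if job.status != "PENDING":
            continue
        dep_states = [jobs[d].status for d in job.depends_on]
        if any(state == "FAILED" for state in dep_states):
            # Assumption of this model: a failed dependency fails the job.
            job.status = "FAILED"
            progressed = True
        elif all(state == "SUCCEEDED" for state in dep_states):
            job.attempts += 1
            if run(job):
                job.status = "SUCCEEDED"
            elif job.attempts >= job.max_attempts:
                job.status = "FAILED"
            # Otherwise the job stays PENDING and is retried on the next pass.
            progressed = True
    return progressed


def run_until_settled(jobs, run):
    """Repeat polling passes until no job makes progress."""
    while schedule_pass(jobs, run):
        pass
    return {name: job.status for name, job in jobs.items()}
```

Here, run is any callable that takes a Job and returns True on success. A job whose dependencies never finish simply stays PENDING, which is where a real scheduler would apply a timeout.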

Recently, the engineering team decided to move some services to a serverless model based on AWS Fargate. What was there before and what motivated that move?

Each AWS Batch customer gets their own scheduler process running as an ECS task. In the previous architecture, Batch schedulers ran on EC2 instances in Auto Scaling groups managed by AWS CloudFormation. In AWS Regions where Batch has many customers, we were reaching some scaling performance limits for our CloudFormation-managed capacity. Specifically, we have a goal for updates to the underlying EC2 instances to complete within a 1-hour window. In large Regions, these updates would at times fail to meet this service level objective. This limited how well we could scale out the scheduler fleet as Batch grew more popular. We needed to find another solution, which was to build on AWS Fargate.

AWS Fargate allows you to use Amazon ECS to run containers without managing clusters of EC2 instances. Instead, you package your ECS workload up as a Fargate task, specifying the operating system, security and networking configuration, and resource requirements, then launch the application. Scaling Fargate tasks becomes a matter of setting a desired task count, with the Fargate service transparently taking care of monitoring and scaling on your behalf.
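As a rough illustration of what "setting a desired task count" can look like, the sketch below wraps the ECS UpdateService call, whose cluster, service, and desiredCount parameters are part of the ECS API. The helper set_scheduler_task_count, the RecordingClient stand-in, and the cluster and service names are hypothetical and exist only so the example runs without AWS credentials; this is not how the Batch team's own tooling is written.

```python
def set_scheduler_task_count(ecs_client, cluster, service, desired_count):
    """Ask ECS to run `desired_count` copies of the service's task."""
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )


class RecordingClient:
    """Hypothetical stand-in for an ECS client; it only records calls."""

    def __init__(self):
        self.calls = []

    def update_service(self, **kwargs):
        self.calls.append(kwargs)
        return {"service": {"desiredCount": kwargs["desiredCount"]}}


client = RecordingClient()
# Hypothetical cluster and service names, used for illustration only.
set_scheduler_task_count(client, "scheduler-cell-1", "batch-schedulers", 42)
```

With real credentials, passing boto3.client("ecs") in place of RecordingClient would send the same request to ECS.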

Moving to Fargate tasks had a couple of advantages. First, Fargate tasks eliminated the overhead of periodically patching the host EC2 instances that powered the Batch scheduler fleet, work that took valuable engineering time away from development. One reason it took so much effort was that during the update process, there were sometimes transient failures while connecting with dependent services. This could cause EC2 instance replacements to fail, which prevented the Auto Scaling group from stabilizing. This in turn caused a CloudFormation rollback, which could time out (or at least take a long time). Moving to Fargate completely eliminated this issue since the Fargate service team handles this work for us.

Second, Fargate tasks gave us granular control of our fleet capacity. AWS Batch would periodically adjust the desired capacity in our Auto Scaling groups based on how many schedulers needed to run. These groups were configured to use large EC2 instances to efficiently utilize available capacity, and to launch in multiple Availability Zones to provide high availability. This meant that AWS Batch would scale up its fleet with multiple large instances at a time across the Availability Zones. For small Regions in particular, this resulted in significant idle capacity, with up to 80% of the fleet sitting unused by customer schedulers. Moving to Fargate has allowed us to scale up one task at a time as new customer schedulers are provisioned, rather than in large chunks. We maintain high availability as the ECS service is now responsible for balancing tasks across Availability Zones.
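A small back-of-the-envelope calculation shows why the size of each scaling step matters in a small Region. The numbers below (three Availability Zones, 20 scheduler slots per large instance, 12 customer schedulers) are hypothetical, chosen only to illustrate how an idle fraction near the 80% mentioned above can arise; the function idle_fraction is likewise just an illustrative helper.

```python
import math


def idle_fraction(schedulers_needed, slots_per_step):
    """Fraction of provisioned slots left unused when capacity can only
    grow in multiples of `slots_per_step`."""
    if schedulers_needed <= 0 or slots_per_step <= 0:
        raise ValueError("both arguments must be positive")
    steps = math.ceil(schedulers_needed / slots_per_step)
    capacity = steps * slots_per_step
    return (capacity - schedulers_needed) / capacity


# Hypothetical small Region (illustrative numbers, not from the Batch team).
availability_zones = 3
slots_per_instance = 20
schedulers_needed = 12

# Old model: one large instance per Availability Zone per scaling step.
chunked = idle_fraction(schedulers_needed, availability_zones * slots_per_instance)

# Fargate model: capacity grows one task (one scheduler) at a time.
per_task = idle_fraction(schedulers_needed, 1)

print(f"instance-sized steps: {chunked:.0%} idle")
print(f"one task at a time:   {per_task:.0%} idle")
```

The gap narrows as the number of schedulers grows relative to the step size, which is consistent with the idle capacity being most noticeable in small Regions.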

Moving to such a different architecture does not seem straightforward. What changes to your team’s overall development methods and operational tooling were implemented as part of this move?

The Batch scheduler was already a containerized application. Rather than run those containers on hosts we manage, our new approach runs them as ECS services on Fargate. We use pipelines built with the AWS Cloud Development Kit (CDK) to deploy the schedulers over a series of cells in all supported AWS Regions. To reduce our operational overhead, we invested early in automating the management of this cellular infrastructure. To achieve this, we built CDK-based stacks that include infrastructure to manage the scheduler containers, provide monitoring (dashboards and alarms), and support compliance.

To ensure a safe migration, we deployed the new scheduler fleet to all supported Regions using a new set of AWS accounts, while the existing schedulers were running production customer workloads…

Read the full blog to learn more. Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.