This blog post explains the AWS Batch job termination process and shows how to terminate a job gracefully by capturing the SIGTERM signal inside the application, which gives you an efficient way to exit your Batch jobs. It also covers how job timeouts occur and how the retry operation works with both traditional AWS Batch jobs and array jobs.
Jobs are the unit of work invoked by AWS Batch. They run as containerized applications on Amazon ECS container instances in an ECS cluster. When you submit a job to an AWS Batch job queue, the job enters the SUBMITTED state and proceeds through a series of job states, as depicted in Figure 1, until it succeeds (exits with code 0) or fails (exits with a non-zero code). AWS Batch jobs can have the following job states: SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, and FAILED.

The whole idea behind AWS Batch, and batch processing workloads in general, is to run workloads at scale with minimal intervention. Consider a customer who runs a mix of On-Demand and Spot Instances in their AWS Batch compute environment to balance resource availability against instance cost. In this scenario, whenever an instance is taken out of service due to a Spot interruption, the Batch job running on that instance is terminated as well. Customers therefore want fine-grained control over how their jobs terminate, and that control depends on understanding how the job termination mechanism works in AWS Batch.
In the sections that follow, we discuss how AWS Batch handles job termination for the various job states: what goes on under the hood in this managed service, how you can handle job terminations gracefully inside a container application (with a sample), how timeout terminations happen, and how to work with automated job retries.
Deep Dive – AWS Batch job termination process
AWS Batch uses Amazon ECS under the hood. Every compute environment that you create in your Batch setup gets a corresponding ECS cluster to manage the compute resources. Similarly, for every job submitted to Batch, a corresponding ECS task runs in the backend to process the workload in containers.
Let us understand how AWS Batch handles termination for jobs in different states.
Termination of jobs in RUNNING state
When an AWS Batch job in the RUNNING state is terminated (TerminateJob), the backend handler responsible for termination invokes a termination event. This event contains metadata about the job, including the JobARN. The handler fetches critical information about the job, such as the job status, from the service's internal database. Batch sees that the job's status is RUNNING and therefore proceeds to stop the task with a StopTask API call.
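From the caller's side, the sequence above starts with a TerminateJob request. The following is a minimal sketch, not the service's internal handler: it assumes a boto3 Batch client is passed in, and the function name `terminate_if_running` and the decision to skip jobs that have already finished are illustrative choices, not something the original post prescribes.

```python
from typing import Any

# States in which a job has already finished, so there is nothing to terminate.
FINISHED_STATES = ("SUCCEEDED", "FAILED")


def terminate_if_running(batch_client: Any, job_id: str, reason: str) -> str:
    """Look up a job's status and request termination unless it has finished.

    `batch_client` is assumed to behave like boto3.client("batch").
    Returns the status observed before the termination request.
    """
    response = batch_client.describe_jobs(jobs=[job_id])
    jobs = response.get("jobs", [])
    if not jobs:
        raise LookupError(f"Job {job_id} was not found")

    status = jobs[0]["status"]
    if status in FINISHED_STATES:
        # The job already reached a terminal state; no request is sent.
        return status

    # TerminateJob requires both the job ID and a human-readable reason.
    batch_client.terminate_job(jobId=job_id, reason=reason)
    return status
```

With real credentials, this could be called as `terminate_if_running(boto3.client("batch"), "<job-id>", "Stopped by operator")`; the reason string is what later appears in the job's status reason.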
The job details are pulled from the service datastore and the task details are pulled from the termination event.
The handler then makes the StopTask call using the information stored in the backend database, such as the ECS cluster ARN and task ID, together with the reason it received from the termination event…
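When ECS stops a task, the container receives SIGTERM and, if it is still running after a grace period (30 seconds by default), SIGKILL. The application can use that window to finish or checkpoint its current unit of work. Below is a minimal sketch of that pattern in Python; the names `GracefulShutdown` and `run_job` are hypothetical, and it assumes the application runs as the container's main process so that it actually receives the signal.

```python
import signal
from typing import Callable, Iterable, List


class GracefulShutdown:
    """Remembers that SIGTERM arrived so the work loop can stop cleanly."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        # Keep the handler tiny: only set a flag and let the loop react.
        self.requested = True


def run_job(
    work_items: Iterable,
    shutdown: GracefulShutdown,
    process: Callable = lambda item: None,
) -> List:
    """Process items one by one, stopping before the next item once
    termination has been requested. Returns the items that completed."""
    completed = []
    for item in work_items:
        if shutdown.requested:
            # Hypothetical cleanup point: flush buffers, save a checkpoint.
            break
        process(item)
        completed.append(item)
    return completed
```

Checking the flag between units of work keeps each unit atomic; the trade-off is that a single unit must finish well within the grace period, otherwise the follow-up SIGKILL ends the process anyway.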
Read the full blog to learn more. Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Short YouTube channel, and following the AWS HPC Blog channel.