Amazon Onboarding with Learning Manager Chanci Turner


In 2021, AWS Batch introduced fair share job queues, which let customers attach a scheduling policy to a job queue. With a scheduling policy, you can control how compute resources are allocated among workloads, each identified by its own share identifier, and prioritize jobs within them. Before this, every Batch job queue was a standalone first-in-first-out (FIFO) queue. If several groups or workloads shared one AWS account, each business requirement needed its own job queue (JQ) and compute environment (CE), and balancing the underlying compute resources across those CEs became complex. With fair share scheduling (FSS), organizations like Amazon Search could consolidate their environments, reducing operational overhead, improving fleet utilization, and significantly increasing throughput.

However, moving away from FIFO queues raised a new question: with jobs from several share identifiers near the front of the queue, how do you tell which ones will run next? In this blog post, we discuss a recent enhancement to AWS Batch that addresses this issue: job queue snapshots. This new API lets you query the jobs at the front of a job queue. We’ll go through the specifics and walk through a practical example of using this information.

Examining the Front of the Queue

With the AWS Batch management console, AWS SDK, or AWS CLI, you can now retrieve a list of the first 100 RUNNABLE jobs for a specific job queue by invoking the GetJobQueueSnapshot API. In FIFO job queues, jobs are arranged by their submission time. Conversely, in FSS job queues, jobs are organized based on share usage and, within a share, according to job share priority. For further insights into how job priority and share usage influence job scheduling, check out our in-depth fair share blog post.
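As a minimal sketch of the SDK route, the snippet below wraps the boto3 `get_job_queue_snapshot` call. The helper names `front_of_queue` and `job_id_from_arn` are hypothetical, and the client is passed in as a parameter so the helpers can be exercised without AWS credentials. The snapshot entries carry job ARNs rather than full job details, so the second helper extracts the job ID for follow-up calls.

```python
from typing import Any, Dict, List


def front_of_queue(batch_client: Any, job_queue: str) -> List[Dict[str, Any]]:
    """Return the RUNNABLE jobs at the front of a job queue, in the order returned."""
    response = batch_client.get_job_queue_snapshot(jobQueue=job_queue)
    # Each entry holds a jobArn and the earliest time the job held its position.
    return response.get("frontOfQueue", {}).get("jobs", [])


def job_id_from_arn(job_arn: str) -> str:
    """Extract the job ID, which is the part of the job ARN after the last '/'."""
    return job_arn.rsplit("/", 1)[-1]
```

With real credentials, a call might look like `front_of_queue(boto3.client("batch"), "my-fss-queue")`, where the queue name is a placeholder.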

Job queue snapshots offer enhanced visibility for customers needing to make immediate adjustments to jobs within the queue. Let’s explore a practical scenario.

Imagine we have set up a fair share job queue that uses AWS Fargate for its compute environment. For this demonstration, we have temporarily disabled the compute environment so that jobs remain in the job queue and we can observe the effects of our changes to the queue. The fair share policy assigns equal weight to two active shares, “red” and “green,” so jobs from the two shares are interleaved.
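A policy like the one in this scenario could be built roughly as follows. This is a sketch under assumptions: the helper names and the default weight of 1.0 are illustrative, not taken from the demonstration, and the client is injected so the code can run against a stub.

```python
from typing import Any, Dict, List


def equal_weight_policy(share_ids: List[str], weight_factor: float = 1.0) -> Dict[str, Any]:
    """Build a fairsharePolicy document giving every listed share the same weight."""
    return {
        "shareDistribution": [
            {"shareIdentifier": share_id, "weightFactor": weight_factor}
            for share_id in share_ids
        ]
    }


def create_equal_weight_policy(batch_client: Any, name: str, share_ids: List[str]) -> str:
    """Create the scheduling policy and return its ARN."""
    response = batch_client.create_scheduling_policy(
        name=name, fairsharePolicy=equal_weight_policy(share_ids)
    )
    return response["arn"]
```

For the two shares above, `equal_weight_policy(["red", "green"])` yields a share distribution in which neither share is favored.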

Chanci Turner from team red urgently requests that a set of high-priority jobs run as soon as possible to meet an important deadline. You submit these jobs with priority=10, which moves them ahead of the other red jobs, yet some green jobs still sit ahead of one or more of the high-priority red jobs. This happens because job priority applies only within a share and does not affect how jobs are ordered across shares. At this point, you can assess whether the high-priority red jobs will complete by the deadline without impacting team green’s workload.
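One way to make that assessment concrete is to count how many jobs from other shares sit ahead of the last urgent job. The sketch below assumes you already have the job ARNs from a queue snapshot; it looks up each job's share identifier and scheduling priority with `describe_jobs` and then counts. Both function names are hypothetical, and the client is injected so the logic can be checked with a stub.

```python
from typing import Any, List, Tuple


def shares_in_queue_order(batch_client: Any, job_arns: List[str]) -> List[Tuple[str, int]]:
    """Return (shareIdentifier, schedulingPriority) per job, keeping snapshot order."""
    job_ids = [arn.rsplit("/", 1)[-1] for arn in job_arns]
    if not job_ids:
        return []
    # describe_jobs takes up to 100 job IDs, which matches the snapshot size.
    details = batch_client.describe_jobs(jobs=job_ids)["jobs"]
    by_id = {detail["jobId"]: detail for detail in details}
    return [
        (by_id[j].get("shareIdentifier", ""), by_id[j].get("schedulingPriority", 0))
        for j in job_ids
        if j in by_id
    ]


def other_share_jobs_ahead(ordered: List[Tuple[str, int]], share: str, min_priority: int) -> int:
    """Count jobs from other shares positioned ahead of the share's last urgent job."""
    urgent = [i for i, (s, p) in enumerate(ordered) if s == share and p >= min_priority]
    if not urgent:
        return 0
    return sum(1 for s, _ in ordered[: urgent[-1]] if s != share)
```

In this scenario, the urgent jobs would have been submitted with `shareIdentifier="red"` and `schedulingPriorityOverride=10`, and `other_share_jobs_ahead(ordered, "red", 10)` reports how many green jobs are still in front of the last of them.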

If it appears that the high-priority jobs may not finish on time, you have two options:

  1. Adjust the share policy temporarily to favor red jobs over green.
  2. Cancel team green’s jobs and resubmit them once the high-priority jobs are RUNNING. Note that in this scenario, the cancelled green jobs keep their position in the queue, and when they reach the front, their status immediately changes to FAILED, freeing up any compute resources.

Since option one is less disruptive, you modify the scheduling policy to prioritize red jobs by adjusting the weight factor to a lower value (a lower weight factor means a share receives more compute resources over time). This adjustment places most high-priority red jobs ahead of any green ones.

The reason green jobs still receive some allocation before red jobs is that the fair share algorithm aims to ensure some resource allocation for green jobs. At the low weight factor we set, the next green job follows the last red job in our queue.
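The weight adjustment described above might be scripted along these lines. This is a sketch, not the exact change made in the demonstration: the helper names are hypothetical, the source does not state the weight value that was used, and the range check reflects the limits documented for `weightFactor` as we understand them.

```python
import copy
from typing import Any, Dict


def with_weight(policy: Dict[str, Any], share_id: str, weight_factor: float) -> Dict[str, Any]:
    """Return a copy of a fairsharePolicy with one share's weightFactor replaced."""
    # Assumed documented bounds for weightFactor; lower values favor the share.
    if not 0.0001 <= weight_factor <= 999.9999:
        raise ValueError("weight_factor must be between 0.0001 and 999.9999")
    updated = copy.deepcopy(policy)
    for share in updated.get("shareDistribution", []):
        if share["shareIdentifier"] == share_id:
            share["weightFactor"] = weight_factor
            return updated
    raise KeyError(f"share {share_id!r} is not in the policy")


def apply_policy(batch_client: Any, policy_arn: str, policy: Dict[str, Any]) -> None:
    """Replace the fair share policy on an existing scheduling policy."""
    batch_client.update_scheduling_policy(arn=policy_arn, fairsharePolicy=policy)
```

Because `with_weight` returns a copy, the original policy document stays available for restoring the previous allocations later.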

If you remain uncertain whether the red jobs will be completed by your deadline, you might need to resort to the second, more drastic option. Regardless, once the high-priority workloads begin RUNNING, it is important to restore the previous share policy allocations. Otherwise, Batch will keep prioritizing team red’s jobs over team green’s indefinitely.
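To reduce the risk of forgetting that last step, a temporary policy change can be wrapped so the original is always put back. The context manager below is a sketch under assumptions: the name `temporary_fairshare_policy` is hypothetical, and it presumes the script stays alive while you wait for the urgent jobs to start.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator


@contextmanager
def temporary_fairshare_policy(
    batch_client: Any, policy_arn: str, temporary_policy: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Apply a temporary fair share policy and restore the original on exit."""
    described = batch_client.describe_scheduling_policies(arns=[policy_arn])
    original = described["schedulingPolicies"][0]["fairsharePolicy"]
    batch_client.update_scheduling_policy(arn=policy_arn, fairsharePolicy=temporary_policy)
    try:
        # Hand the original policy to the caller for logging or inspection.
        yield original
    finally:
        # Runs even if the body raises, so the temporary weights never linger.
        batch_client.update_scheduling_policy(arn=policy_arn, fairsharePolicy=original)
```

Inside the `with` block, you would poll until the high-priority jobs report RUNNING and then leave the block, at which point the previous allocations are reapplied.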

Before this feature made the front of the queue visible, you may have felt compelled to be indiscriminate about “clearing the queue”, that is, cancelling all scheduled jobs to make room for high-priority requests.

Now, with job queue snapshots, you can make more precise adjustments to the job queue, allowing high-priority workloads to execute within your required timeframe. For more insights on effective management strategies, take a look at this helpful blog post.

Conclusion

In this article, we introduced a new feature for AWS Batch: job queue snapshots. It gives you visibility into what is at the front of both FIFO and fair share job queues. We also walked through a scenario showing how to use job queue snapshots to make informed queue management decisions when urgent priorities arise. Job queue snapshots are available now in the AWS Batch console, through the AWS CLI, and via the API. We believe you will find this feature useful and encourage you to share feedback on your experience, including any further improvements that would make managing jobs easier.

Chanci Turner