AWS Batch Strangeness

In a recent project at work I stumbled across an annoying gotcha with AWS Batch. In case you don’t know what AWS Batch is, there are more details here. In short, it’s a fairly simple way to scale out workloads that run on an irregular basis. In our case, we build a Docker image with everything needed and then run Batch whenever we have data to analyse. You end up with something similar in functionality to an HPC cluster running a queueing system such as SGE, PBS or SLURM. In previous roles I often saw such clusters used for this type of batch job, which always seemed like a bit of a waste of the expensive InfiniBand/Mellanox network interconnects!

In case you’re interested in trying it yourself, I found this blog post incredibly helpful to get started with AWS Batch. Hopefully I’ll be able to write more on the specific use case in the future.

The symptoms

We had a nicely working system, but to allow bigger jobs to be processed we upped the volume size to 2.4TB so that the Docker volume had more space. From that point on, every job failed.

It’s important to know that we really wanted to use Managed Compute Environments, since we didn’t want to have to worry about configuring AMIs and maintaining them in the long run. The instances are really just a ‘vessel’ for our Docker containers, which contain all of the science smarts (archived using ECR of course!).

The issue

So what was going on? After some debugging by SSHing into the AWS Batch EC2 instance, we realised that the Docker volume was no longer getting mounted. Curious… After a little bit of Google-fu, the problem became obvious. Amazon Linux uses MBR partitioning by default, and MBR addresses a disk with 32-bit sector numbers, so with 512-byte sectors it only supports volumes up to 2TiB.
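To make the limit concrete, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors and that the volume sizes are specified in GiB, so the 2TB and 2.4TB figures in this post are read as 2000 and 2400 GiB. The function name fits_in_mbr is just an illustrative helper, not part of any AWS tooling.

```python
# MBR stores sector addresses in 32 bits, so it can address 2**32 sectors.
SECTOR_BYTES = 512          # assumed sector size
MBR_MAX_SECTORS = 2**32
MBR_LIMIT_BYTES = SECTOR_BYTES * MBR_MAX_SECTORS  # 2**41 bytes = 2 TiB

GIB = 2**30


def fits_in_mbr(size_gib: int) -> bool:
    """Return True if a volume of size_gib GiB is addressable with MBR."""
    return size_gib * GIB <= MBR_LIMIT_BYTES


# Sizes assumed from the post: "2TB" -> 2000 GiB, "2.4TB" -> 2400 GiB.
for size_gib in (2000, 2400):
    print(size_gib, "GiB fits in MBR:", fits_in_mbr(size_gib))
```

The limit works out to 2048 GiB, which is why 2400 GiB tips over the edge while 2000 GiB is fine.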

The solution

The quick fix was to drop the volume size down to 2TB, at which point the issue was confirmed and we were up and running again. A more permanent fix will be to switch to Amazon Linux 2 AMIs, where GPT is the default partitioning scheme. In any case, AWS recommend that you do that upgrade by June 30th, 2020. It has the side benefit that you no longer have to set the dm.basesize option for the Docker daemon. I guess that we’ll have other problems to deal with when we need 64 ZiB EBS volumes…
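For the curious, the GPT ceiling can be worked out the same way. This is only a sketch of the arithmetic: GPT uses 64-bit block addresses, and the result depends on the block size. The 64 ZiB figure matches 4096-byte blocks, which is my assumption about where that number comes from. The helper name gpt_limit_zib is made up for illustration.

```python
# GPT stores block addresses in 64 bits, so it can address 2**64 blocks.
GPT_MAX_BLOCKS = 2**64
ZIB = 2**70  # one zebibyte in bytes


def gpt_limit_zib(block_bytes: int) -> int:
    """Return the largest GPT-addressable volume in ZiB for a block size."""
    return GPT_MAX_BLOCKS * block_bytes // ZIB


print("512-byte blocks: ", gpt_limit_zib(512), "ZiB")
print("4096-byte blocks:", gpt_limit_zib(4096), "ZiB")
```

Either way, the limit is far beyond any volume size we are likely to ask for.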