Announcing NVIDIA GPU Support for Bottlerocket on Amazon ECS

Last year, we announced the general availability of the Amazon Elastic Container Service (Amazon ECS)-optimized Bottlerocket AMI. Bottlerocket is an open-source project that prioritizes security and maintainability, offering a dependable and consistent Linux distribution for hosting container-based workloads. We are excited to announce that you can now run NVIDIA GPU-accelerated workloads on Amazon ECS using Bottlerocket.

In this article, we will guide you through the process of creating an Amazon ECS task to execute an NVIDIA GPU workload on Bottlerocket.

Why Bottlerocket?

As customers increasingly turn to containers for their workloads, AWS recognized the need for a Linux distribution tailored to optimize these containerized applications. Bottlerocket OS was designed to provide a secure platform for hosts running containers while minimizing operational overhead to manage them at scale. Its architecture supports reliable updates that can be automated seamlessly.

To learn more about starting with Bottlerocket and Amazon ECS, check out this blog post, which is an excellent resource.

Setting Up an ECS Cluster with Bottlerocket and NVIDIA GPUs

Let’s dive into how this is accomplished in practice. We’ll be using the us-west-2 (Oregon) Region.

Prerequisites

The AWS CLI with the necessary credentials
A default VPC in your preferred region (you can also use an existing VPC in your account)

First, let’s create the ECS cluster named ecs-bottlerocket:

aws ecs --region us-west-2 create-cluster --cluster-name ecs-bottlerocket

The instance we will launch requires an AWS Identity and Access Management (IAM) role to interact with both the ECS APIs and the Systems Manager Session Manager APIs. For this walkthrough, we created an IAM role named ecsInstanceRole, with an instance profile of the same name, that has both the AmazonSSMManagedInstanceCore and AmazonEC2ContainerServiceforEC2Role managed policies attached.
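
If the role does not exist in your account yet, the following is a minimal sketch of one way to create it from the command line. The function name create_ecs_instance_role and the trust policy file name ec2-trust-policy.json are choices made for this example, not part of the original walkthrough. The sketch assumes the instance profile should share the role's name, because the run-instances command later passes Name=ecsInstanceRole.

```shell
# Sketch: create the ecsInstanceRole role and a same-named instance profile.
# The function name and the trust policy file name are hypothetical.
create_ecs_instance_role() {
  local role_name="${1:-ecsInstanceRole}"

  # Trust policy that lets EC2 instances assume the role.
  cat > ./ec2-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

  aws iam create-role \
    --role-name "$role_name" \
    --assume-role-policy-document file://ec2-trust-policy.json

  # Attach the two managed policies named in the text.
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  # run-instances takes an instance profile, so wrap the role in one.
  aws iam create-instance-profile --instance-profile-name "$role_name"
  aws iam add-role-to-instance-profile \
    --instance-profile-name "$role_name" \
    --role-name "$role_name"
}
```

Calling create_ecs_instance_role once, with no arguments, uses the ecsInstanceRole name from this walkthrough. A newly created instance profile can take a few seconds to become usable by run-instances.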

The IDs of the Bottlerocket Amazon Machine Images (AMIs) that support NVIDIA GPUs are published as public parameters in AWS Systems Manager Parameter Store. Let’s retrieve the AMI ID for the latest Bottlerocket release. AMIs are available for both the x86_64 and aarch64 architectures; in this article, we will use the x86_64 AMI.

latest_bottlerocket_ami=$(aws ssm get-parameter --region us-west-2 \
  --name "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/latest/image_id" \
  --query Parameter.Value --output text)
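
The subnet lookup that follows reads a vpc_id shell variable that the walkthrough does not set. Here is a minimal sketch of one way to populate it, assuming you want the default VPC listed in the prerequisites. The function name default_vpc_id is hypothetical.

```shell
# Sketch: print the ID of the default VPC in a Region.
# The function name is hypothetical; the Region defaults to us-west-2.
default_vpc_id() {
  local region="${1:-us-west-2}"
  aws ec2 describe-vpcs \
    --region "$region" \
    --filters "Name=is-default,Values=true" \
    --query 'Vpcs[0].VpcId' \
    --output text
}
```

You could then set the variable with vpc_id=$(default_vpc_id us-west-2). If you prefer an existing non-default VPC, assign its ID to vpc_id directly instead.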

Next, let’s fetch the list of subnets configured to allocate a public IP address:

aws ec2 describe-subnets \
  --region us-west-2 \
  --filters "Name=vpc-id,Values=$vpc_id" \
  --query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId'

[
    "subnet-bc8993e6",
    "subnet-b55f6bfe",
    "subnet-e1e27fca",
    "subnet-21cbc058"
]

To connect our EC2 instance to the ECS cluster, we need to provide some information to the instance during creation: a simple configuration file (userdata.toml) containing the ECS cluster details, which we will save in the current directory.

A complete list of supported settings is available here.

cat > ./userdata.toml << 'EOF'
[settings.ecs]
cluster = "ecs-bottlerocket"
EOF

Let’s deploy one Bottlerocket instance within one of the aforementioned subnets. For this blog post, we will choose a public subnet, making it easier to debug and connect to the instances if necessary. You can opt for private or public subnets based on your requirements.

We will utilize the p3.2xlarge instance type, which features one NVIDIA Tesla V100 GPU.

aws ec2 run-instances \
  --subnet-id subnet-bc8993e6 \
  --image-id "$latest_bottlerocket_ami" \
  --instance-type p3.2xlarge \
  --region us-west-2 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=bottlerocket,Value=quickstart}]' \
  --user-data file://userdata.toml \
  --iam-instance-profile Name=ecsInstanceRole
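
The instance needs a little time to boot and register with the cluster before it can accept tasks. The sketch below polls the cluster until at least one container instance is registered. The function name wait_for_container_instance and the default retry count and delay are assumptions made for this example.

```shell
# Sketch: poll until at least one container instance is registered
# with the cluster, or give up after a number of attempts.
# The function name and the retry defaults are hypothetical.
wait_for_container_instance() {
  local cluster="${1:-ecs-bottlerocket}"
  local region="${2:-us-west-2}"
  local attempts="${3:-30}"   # assumed default: 30 tries
  local delay="${4:-10}"      # assumed default: 10 seconds between tries
  local count i

  i=0
  while [ "$i" -lt "$attempts" ]; do
    count=$(aws ecs describe-clusters \
      --region "$region" \
      --clusters "$cluster" \
      --query 'clusters[0].registeredContainerInstancesCount' \
      --output text)
    # A non-numeric value (for example "None") is treated as "not yet".
    if [ "${count:-0}" -ge 1 ] 2>/dev/null; then
      echo "registered container instances: $count"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no container instance registered with $cluster" >&2
  return 1
}
```

Running wait_for_container_instance with no arguments checks the ecs-bottlerocket cluster in us-west-2 and returns a non-zero status if nothing registers in time.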

Next, let’s create the task definition for the sample application.

cat > ./sample-gpu.json << 'EOF'
{
  "containerDefinitions": [
    {
      "memory": 80,
      "essential": true,
      "name": "gpu",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
         {
           "type":"GPU",
           "value": "1"
         }
      ],
      "command": [
        "sh",
        "-c",
        "nvidia-smi"
      ],
      "cpu": 100,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
           "awslogs-group": "/ecs/bottlerocket",
           "awslogs-region": "us-west-2",
           "awslogs-stream-prefix": "demo-gpu"
           }
      }
    }
  ],
  "family": "bottlerocket-gpu"
}
EOF

In the task definition, we assign one NVIDIA GPU to our task through the resourceRequirements parameter. We also configure the awslogs-group for the task to direct the log output from our container into Amazon CloudWatch.

The log group configuration is as follows:

Region: us-west-2
Log group name: /ecs/bottlerocket
Log stream prefix: demo-gpu

Create the CloudWatch log group specified in the task definition:

aws logs create-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2

Now, register the task in ECS:

aws ecs register-task-definition \
  --region us-west-2 \
  --cli-input-json file://sample-gpu.json

To run the task, execute:

aws ecs run-task --region us-west-2 \
  --cluster ecs-bottlerocket \
  --task-definition bottlerocket-gpu:1

The task will execute a command inside the container to display GPU configuration details and then terminate.
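
If you would rather confirm the result from the command line, here is a sketch that starts the task, waits for it to stop, and prints the container's exit code. The function name run_gpu_task is hypothetical, and the sketch assumes the task definition has a single container, as in this walkthrough.

```shell
# Sketch: run the task, wait for it to stop, and print the exit code
# of its first container. The function name is hypothetical.
run_gpu_task() {
  local cluster="${1:-ecs-bottlerocket}"
  local task_def="${2:-bottlerocket-gpu:1}"
  local region="${3:-us-west-2}"
  local task_arn

  # Start the task and capture its ARN.
  task_arn=$(aws ecs run-task \
    --region "$region" \
    --cluster "$cluster" \
    --task-definition "$task_def" \
    --query 'tasks[0].taskArn' \
    --output text)

  # Block until ECS reports the task as STOPPED.
  aws ecs wait tasks-stopped \
    --region "$region" --cluster "$cluster" --tasks "$task_arn"

  # An exit code of 0 means nvidia-smi completed successfully.
  aws ecs describe-tasks \
    --region "$region" --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' \
    --output text
}
```

Note that run_gpu_task starts a new task each time it is called, so use it in place of the run-task command above rather than in addition to it.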

Afterwards, you can visit the ECS console in your account. Click on “Clusters” in the left menu, select the ecs-bottlerocket cluster, and then click on the “Tasks” tab.

Select the task ID and view the Logs tab to see the output from the executed task.

You can also retrieve the log output from the command line using the log group name, log stream name, and timeframe. For instance:

aws logs tail '/ecs/bottlerocket' \
  --region us-west-2 \
  --log-stream-names 'demo-gpu/gpu/7af782059c644872977da89a06023483' \
  --since 1h --format short
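
With the awslogs driver, the stream name is made up of the stream prefix, the container name, and the task ID. The sketch below builds that name from a task ARN so you do not have to copy it from the console. The function name log_stream_for_task is hypothetical, and the defaults match the prefix and container name in this walkthrough's task definition.

```shell
# Sketch: build the awslogs stream name for a task.
# Stream names follow <prefix>/<container name>/<task ID>.
# The function name is hypothetical.
log_stream_for_task() {
  local task_arn="$1"
  local prefix="${2:-demo-gpu}"
  local container="${3:-gpu}"
  # The task ID is the last path segment of the task ARN.
  echo "${prefix}/${container}/${task_arn##*/}"
}
```

For a task ARN ending in /7af782059c644872977da89a06023483, the function prints demo-gpu/gpu/7af782059c644872977da89a06023483, which is the value passed to --log-stream-names above.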

Cleanup

To remove the resources created during this tutorial, run the following commands:

aws ecs deregister-task-definition \
  --region us-west-2 \
  --task-definition bottlerocket-gpu:1

delete_instances=$(aws ec2 describe-instances --region us-west-2 \
  --filters "Name=tag-key,Values=bottlerocket" "Name=tag-value,Values=quickstart" \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text)

for instance in $delete_instances
  do aws ec2 terminate-instances --instance-ids $instance --region us-west-2
done 
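
Amazon ECS requires that all container instances be deregistered before a cluster can be deleted, so the delete-cluster call can fail if it runs while the instance is still shutting down. As a hedged sketch, you could wait for the tagged instances to reach the terminated state first. The function name wait_for_quickstart_termination is hypothetical.

```shell
# Sketch: wait until the instances tagged bottlerocket=quickstart
# are terminated. The function name is hypothetical.
wait_for_quickstart_termination() {
  local region="${1:-us-west-2}"
  local ids

  ids=$(aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=tag:bottlerocket,Values=quickstart" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)

  if [ -z "$ids" ]; then
    echo "no tagged instances found"
    return 0
  fi

  # $ids is left unquoted on purpose so each ID becomes its own argument.
  aws ec2 wait instance-terminated --region "$region" --instance-ids $ids
  echo "terminated: $ids"
}
```

Calling wait_for_quickstart_termination after the loop above, and before deleting the cluster, gives the instance time to shut down.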

aws ecs delete-cluster \
  --region us-west-2 \
  --cluster ecs-bottlerocket

aws logs delete-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2

Conclusion

In this article, we explored how to create an ECS task definition configured to run a GPU-enabled workload within a container on Bottlerocket, efficiently and securely. We also examined how the container logs are accessible in CloudWatch and how to retrieve them via the command line. If you’re interested in more examples of GPU-accelerated workloads to deploy with Bottlerocket on ECS, you can explore the NVIDIA GPU-optimized containers available in the NVIDIA NGC catalog on AWS Marketplace. For further insights, check out this informative piece.