Last year, we announced the general availability of the Amazon Elastic Container Service (Amazon ECS)-optimized Bottlerocket AMI. Bottlerocket is an open-source project that prioritizes security and maintainability, offering a reliable and consistent Linux distribution for hosting container-based workloads. We are excited to announce that you can now run NVIDIA GPU-accelerated workloads on Amazon ECS using Bottlerocket.
In this article, we will guide you through the process of creating an Amazon ECS task to execute an NVIDIA GPU workload on Bottlerocket.
Why Bottlerocket?
As customers increasingly turn to containers for their workloads, AWS recognized the need for a Linux distribution tailored to optimize these containerized applications. Bottlerocket OS was designed to provide a secure platform for hosts running containers while minimizing operational overhead to manage them at scale. Its architecture supports reliable updates that can be automated seamlessly.
To learn more about getting started with Bottlerocket and Amazon ECS, see this blog post.
Setting Up an ECS Cluster with Bottlerocket and NVIDIA GPUs
Let’s dive into how this is accomplished in practice. We’ll be using the us-west-2 (Oregon) Region.
Prerequisites
- The AWS CLI with the necessary credentials
- A default VPC in your preferred region (you can also use an existing VPC in your account)
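The subnet query later in this walkthrough reads a vpc_id shell variable that is not set anywhere. As a minimal sketch, assuming your account still has a default VPC in the Region, its ID can be looked up with describe-vpcs. The helper name default_vpc_id is ours and is not part of the AWS CLI.

```shell
# Hypothetical helper (not part of the AWS CLI): print the ID of the
# default VPC for the Region passed as the first argument.
default_vpc_id() {
  aws ec2 describe-vpcs \
    --region "$1" \
    --filters "Name=is-default,Values=true" \
    --query 'Vpcs[0].VpcId' \
    --output text
}
```

For example, vpc_id=$(default_vpc_id us-west-2) sets the variable used by the subnet query. If you use a non-default VPC, assign its ID to vpc_id directly instead.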
First, let’s create an ECS cluster named ecs-bottlerocket:
aws ecs --region us-west-2 create-cluster --cluster-name ecs-bottlerocket
The instance we will launch needs an AWS Identity and Access Management (IAM) role so that it can call both the ECS APIs and the Systems Manager Session Manager APIs. I created an IAM role named ecsInstanceRole that has both the AmazonSSMManagedInstanceCore and AmazonEC2ContainerServiceforEC2Role managed policies attached.
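The role itself is not created anywhere in this walkthrough. Here is a hedged sketch of one way to create it from the CLI, assuming the role and its instance profile share the name ecsInstanceRole (the run-instances call later passes that name as the instance profile). The helper name and the trust policy file name are illustrative.

```shell
# Hypothetical helper: create an EC2 instance role plus a same-named
# instance profile, and attach the two managed policies named above.
create_ecs_instance_role() {
  role_name=$1

  # Trust policy that lets EC2 instances assume the role.
  cat > ./ec2-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

  aws iam create-role \
    --role-name "$role_name" \
    --assume-role-policy-document file://ec2-trust-policy.json

  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  # run-instances takes an instance profile, so wrap the role in one.
  aws iam create-instance-profile --instance-profile-name "$role_name"
  aws iam add-role-to-instance-profile \
    --instance-profile-name "$role_name" \
    --role-name "$role_name"
}
```

Calling create_ecs_instance_role ecsInstanceRole once is enough. If the role already exists in your account, skip this step.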
The IDs of the Bottlerocket Amazon Machine Images (AMIs) with NVIDIA GPU support are publicly available from AWS Systems Manager Parameter Store. Let’s retrieve the AMI ID for the latest Bottlerocket release. AMIs are available for both the x86_64 and aarch64 architectures; in this article, we will use the x86_64 AMI.
latest_bottlerocket_ami=$(aws ssm get-parameter --region us-west-2 \
  --name "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/latest/image_id" \
  --query Parameter.Value --output text)
Next, let’s fetch the list of subnets configured to allocate a public IP address (the vpc_id variable must hold the ID of your VPC):
aws ec2 describe-subnets \
  --region us-west-2 \
  --filters "Name=vpc-id,Values=$vpc_id" \
  --query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId'
[
"subnet-bc8993e6",
"subnet-b55f6bfe",
"subnet-e1e27fca",
"subnet-21cbc058"
]
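If you would rather not copy a subnet ID by hand, the same query can return just the first match. This is a small sketch, assuming vpc_id is already set. The name first_public_subnet is illustrative.

```shell
# Hypothetical helper: print the first subnet in the VPC that assigns
# public IP addresses at launch. Arguments: Region, VPC ID.
first_public_subnet() {
  aws ec2 describe-subnets \
    --region "$1" \
    --filters "Name=vpc-id,Values=$2" \
    --query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId | [0]' \
    --output text
}
```

For example, subnet_id=$(first_public_subnet us-west-2 "$vpc_id") stores one subnet ID for the launch step.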
To connect our EC2 instance to the ECS cluster, we need to provide some information to the instance at launch: a simple configuration file (userdata.toml) containing the ECS cluster details, which we will save in the current directory.
A complete list of supported settings is available here.
cat > ./userdata.toml << 'EOF'
[settings.ecs]
cluster = "ecs-bottlerocket"
EOF
Let’s deploy one Bottlerocket instance within one of the aforementioned subnets. For this blog post, we will choose a public subnet, making it easier to debug and connect to the instances if necessary. You can opt for private or public subnets based on your requirements.
We will use the p3.2xlarge instance type, which features one NVIDIA Tesla V100 GPU.
aws ec2 run-instances \
  --subnet-id subnet-bc8993e6 \
  --image-id "$latest_bottlerocket_ami" \
  --instance-type p3.2xlarge \
  --region us-west-2 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=bottlerocket,Value=quickstart}]' \
  --user-data file://userdata.toml \
  --iam-instance-profile Name=ecsInstanceRole
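The instance needs some time to boot and register with the cluster, and a task started before that has nowhere to run. Here is a minimal polling sketch, assuming the ecs-bottlerocket cluster in us-west-2. The helper name, retry count, and delay are illustrative choices.

```shell
# Hypothetical helper: poll until the cluster reports at least one
# registered container instance. Arguments: cluster, Region,
# max attempts (default 30), seconds between attempts (default 10).
wait_for_container_instance() {
  cluster=$1
  region=$2
  attempts=${3:-30}
  delay=${4:-10}

  i=0
  while [ "$i" -lt "$attempts" ]; do
    arns=$(aws ecs list-container-instances \
      --cluster "$cluster" \
      --region "$region" \
      --query 'containerInstanceArns' \
      --output text)
    # Treat empty output or the literal "None" as "nothing registered yet".
    if [ -n "$arns" ] && [ "$arns" != "None" ]; then
      echo "$arns"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no container instance registered in $cluster" >&2
  return 1
}
```

Running wait_for_container_instance ecs-bottlerocket us-west-2 before the run-task step avoids starting a task against an empty cluster.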
Next, let’s create the task definition for the sample application.
cat > ./sample-gpu.json << 'EOF'
{
  "containerDefinitions": [
    {
      "memory": 80,
      "essential": true,
      "name": "gpu",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
        {
          "type": "GPU",
          "value": "1"
        }
      ],
      "command": [
        "sh",
        "-c",
        "nvidia-smi"
      ],
      "cpu": 100,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/bottlerocket",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "demo-gpu"
        }
      }
    }
  ],
  "family": "example-ecs-gpu"
}
EOF
In the task definition, we assign one NVIDIA GPU to our task through the resourceRequirements parameter. We also configure the awslogs log driver so that the log output from our container is sent to Amazon CloudWatch.
The log group configuration is as follows:
- Region: us-west-2
- Log group name: /ecs/bottlerocket
- Log stream prefix: demo-gpu
Create the CloudWatch log group specified in the task definition:
aws logs create-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2
Now, register the task definition in ECS:
aws ecs register-task-definition \
  --region us-west-2 \
  --cli-input-json file://sample-gpu.json
To run the task, execute:
aws ecs run-task --cluster ecs-bottlerocket \
  --region us-west-2 \
  --task-definition example-ecs-gpu:1
The task will execute a command inside the container to display GPU configuration details and then terminate.
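To script this step end to end, the task ARN returned by run-task can be passed to the tasks-stopped waiter and then to describe-tasks to read the container exit code. This is a sketch that assumes the task has a single container, as in the definition above. The name run_task_and_wait is illustrative.

```shell
# Hypothetical helper: start a task, block until it stops, and print the
# exit code of its first container.
# Arguments: cluster, task definition (family:revision), Region.
run_task_and_wait() {
  cluster=$1
  task_def=$2
  region=$3

  task_arn=$(aws ecs run-task \
    --cluster "$cluster" \
    --task-definition "$task_def" \
    --region "$region" \
    --query 'tasks[0].taskArn' \
    --output text) || return 1

  # If placement fails, run-task returns no task, so there is no ARN to wait on.
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "run-task did not start a task" >&2
    return 1
  fi

  aws ecs wait tasks-stopped \
    --cluster "$cluster" --tasks "$task_arn" --region "$region" || return 1

  aws ecs describe-tasks \
    --cluster "$cluster" --tasks "$task_arn" --region "$region" \
    --query 'tasks[0].containers[0].exitCode' \
    --output text
}
```

Call it with the cluster name, the family and revision you registered, and the Region. An exit code of 0 indicates that nvidia-smi ran successfully inside the container.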
Afterwards, you can visit the ECS console in your account. Click on “Clusters” in the left menu, select the ecs-bottlerocket cluster, and then click on the “Tasks” tab.
Select the task ID and open the Logs tab to see the output from the executed task.
You can also retrieve the log output from the command line using the log group name, log stream name, and timeframe. For instance:
aws logs tail '/ecs/bottlerocket' \
  --region us-west-2 \
  --log-stream-names 'demo-gpu/gpu/7af782059c644872977da89a06023483' \
  --since 1h --format short
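With the awslogs driver and a stream prefix, ECS names each stream prefix/container-name/task-id, so the stream name can be derived from a task ARN rather than copied from the console. Here is a small sketch. The helper name is illustrative, and the account ID in the example ARN is a placeholder.

```shell
# Hypothetical helper: build the awslogs stream name for a task.
# Arguments: stream prefix, container name, task ARN (or bare task ID).
log_stream_for_task() {
  prefix=$1
  container=$2
  # The task ID is the last "/"-separated segment of the task ARN.
  task_id=${3##*/}
  printf '%s/%s/%s\n' "$prefix" "$container" "$task_id"
}

log_stream_for_task demo-gpu gpu \
  "arn:aws:ecs:us-west-2:111122223333:task/ecs-bottlerocket/7af782059c644872977da89a06023483"
# prints demo-gpu/gpu/7af782059c644872977da89a06023483
```

The result can be passed straight to the --log-stream-names option of aws logs tail.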
Cleanup
To remove the resources created during this tutorial, run the following commands:
aws ecs deregister-task-definition \
  --region us-west-2 \
  --task-definition example-ecs-gpu:1
delete_instances=$(aws ec2 describe-instances --region us-west-2 \
  --filters "Name=tag-key,Values=bottlerocket" "Name=tag-value,Values=quickstart" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
for instance in $delete_instances
do aws ec2 terminate-instances --instance-ids "$instance" --region us-west-2
done
aws ecs delete-cluster \
  --region us-west-2 \
  --cluster ecs-bottlerocket
aws logs delete-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2
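One caveat with the sequence above: terminate-instances returns before the instance is actually gone, and delete-cluster is rejected while container instances are still registered. Here is a hedged sketch that waits for termination before you delete the cluster, assuming the instance IDs were collected with the tag filter above. The helper name is illustrative.

```shell
# Hypothetical helper: terminate the given instances and block until EC2
# reports them terminated. Arguments: Region, then one or more instance IDs.
terminate_and_wait() {
  region=$1
  shift
  [ "$#" -gt 0 ] || { echo "no instance IDs given" >&2; return 1; }

  aws ec2 terminate-instances --region "$region" --instance-ids "$@" || return 1
  aws ec2 wait instance-terminated --region "$region" --instance-ids "$@"
}
```

For example, terminate_and_wait us-west-2 $delete_instances (left unquoted so that each ID becomes its own argument) can stand in for the termination loop, followed by the delete-cluster command.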
Conclusion
In this article, we explored how to create an ECS task definition configured to run a GPU-enabled workload within a container on Bottlerocket. We also examined how the container logs are accessible in CloudWatch and how to retrieve them via the command line. If you’re interested in more examples of GPU-accelerated workloads to deploy with Bottlerocket on ECS, you can explore the NVIDIA GPU-optimized containers available in the NVIDIA NGC catalog on AWS Marketplace.