Train Machine Learning Models On Premises with AWS Outposts Rack and Amazon S3 on Outposts
In this article, you will discover how to train machine learning (ML) models on-site using AWS Outposts rack, utilizing datasets stored locally in Amazon S3 on Outposts. As concerns around data sovereignty and privacy regulations continue to rise, organizations are increasingly in search of adaptable solutions that provide compliance while leveraging the agility of cloud services. Sectors like healthcare and finance are leveraging machine learning to improve patient care and secure transactions, all while maintaining strict confidentiality. AWS Outposts rack offers a seamless hybrid solution by extending AWS capabilities to any on-premises or edge location, allowing you the freedom to store and process data in your preferred environment. This blog will delve into data sovereignty scenarios where training datasets need to be stored and processed in geographic locations without an AWS Region.
Amazon S3 on Outposts
When preparing datasets for ML model training, the storage and retrieval of your data become crucial, especially when considering data residency and regulatory compliance. You can store training datasets as object data in local buckets using Amazon S3 on Outposts. To interact with S3 on Outposts buckets for data operations, you will need to create access points and route requests through an S3 on Outposts endpoint linked to your VPC. These endpoints can be accessed both from within the VPC and on-premises via the local gateway.
Solution Overview
In this example, you will train a YOLOv5 model using a subset of categories from the Common Objects in Context (COCO) dataset. The COCO dataset, renowned for object detection tasks, offers a broad range of image categories with comprehensive annotations and is available under the AWS Open Data Sponsorship Program via fast.ai datasets.
This architecture utilizes an Amazon Elastic Compute Cloud (Amazon EC2) g4dn.8xlarge instance for model training on the Outposts rack. Depending on your Outposts rack configuration, you might opt for different instance sizes or types, adjusting training parameters such as learning rate and model architecture accordingly. You will launch your EC2 instance using the AWS Deep Learning AMI, which is preloaded with frameworks, dependencies, and tools to accelerate deep learning in the cloud.
For dataset storage, an S3 on Outposts bucket will be employed, with connection established from your on-premises location through the Outposts local gateway. The local gateway routing mode can either be direct VPC routing or Customer-owned IP (CoIP), depending on your workload needs. This routing mode will influence the S3 on Outposts endpoint configuration you choose.
- Download and Populate Training Dataset
Begin by downloading the training dataset to your local machine using the following AWS CLI command:
aws s3 sync s3://fast-ai-coco/ .
After downloading, unzip the files annotations_trainval2017.zip, val2017.zip, and train2017.zip.
$ unzip annotations_trainval2017.zip
$ unzip val2017.zip
$ unzip train2017.zip
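If you prefer to script the extraction instead of running unzip manually, the same step can be sketched with Python's standard library (the archive names match those downloaded above; the destination path is an assumption):

```python
import zipfile


def extract_archives(archives, dest="."):
    """Extract each downloaded COCO archive into dest."""
    for name in archives:
        with zipfile.ZipFile(name) as zf:
            zf.extractall(dest)
        print(f"extracted {name}")


# The three archives pulled down by `aws s3 sync` above:
# extract_archives(["annotations_trainval2017.zip", "val2017.zip", "train2017.zip"])
```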
In the annotations folder, look for the files instances_train2017.json and instances_val2017.json, which contain the relevant annotations for the images in both the training and validation folders.
- Filtering and Preparing Training Dataset
Using the training, validation, and annotation files from the COCO dataset, you will focus on a selection of 10 categories of popular food items found on supermarket shelves: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, and cake. Models like these could be used for shelf-stock monitoring, automatic checkout, or optimizing product placement with computer vision in retail environments. As YOLOv5 requires a specific annotation format, you will need to convert the COCO annotations into that target format.
- Load Training Dataset to S3 on Outposts Bucket
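Before uploading anything, the category filtering and COCO-to-YOLO label conversion from the previous step can be sketched as follows. The JSON layout follows the standard COCO instances_*.json schema; the bounding-box math is the usual conversion from COCO's absolute top-left corner plus size to YOLO's normalized box center, and the file paths in the usage comment are placeholders:

```python
import json
from collections import defaultdict

TARGET = ["banana", "apple", "sandwich", "orange", "broccoli",
          "carrot", "hot dog", "pizza", "donut", "cake"]


def coco_to_yolo(instances, target=TARGET):
    """Filter COCO annotations to the target categories and emit YOLO label
    lines keyed by image file name. instances is a parsed instances_*.json dict."""
    # Map COCO category ids to contiguous YOLO class indices 0..9
    cat_ids = {c["id"]: target.index(c["name"])
               for c in instances["categories"] if c["name"] in target}
    images = {img["id"]: img for img in instances["images"]}
    labels = defaultdict(list)
    for ann in instances["annotations"]:
        if ann["category_id"] not in cat_ids:
            continue  # skip the other COCO categories
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]           # COCO: top-left corner + size, pixels
        xc = (x + w / 2) / img["width"]    # YOLO: normalized box center
        yc = (y + h / 2) / img["height"]
        labels[img["file_name"]].append(
            f"{cat_ids[ann['category_id']]} {xc:.6f} {yc:.6f} "
            f"{w / img['width']:.6f} {h / img['height']:.6f}")
    return labels


# labels = coco_to_yolo(json.load(open("annotations/instances_train2017.json")))
# then write each entry to labels/train/<file_name>.txt
```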
To upload the training data to S3 on Outposts, first create a new bucket using the AWS Console or CLI, along with an access point and endpoint for the VPC. You can employ a bucket-style access point alias for the upload, using this CLI command:
$ cd /your/local/target/upload/path/
$ aws s3 sync . s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3
Make sure to replace the alias in the command to match your environment. The s3 sync command will synchronize the folders, maintaining the structure containing the images and labels for both training and validation data, which will later be loaded into the EC2 instance for model training.
- Launch the EC2 Instance
You can launch the EC2 instance using the Deep Learning AMI by following a getting started tutorial. For this exercise, the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) has been used.
- Download YOLOv5 and Install Dependencies
Once connected to the EC2 instance via SSH, activate the pre-configured PyTorch environment and clone the YOLOv5 repository.
$ ssh -i /path/key-pair-name.pem ubuntu@instance-ip-address
$ conda activate pytorch
$ git clone https://github.com/ultralytics/yolov5.git
$ cd yolov5
Then, install the required dependencies.
$ pip install -U -r requirements.txt
You may need to adjust existing packages on your instance running the AWS Deep Learning AMI for compatibility.
- Load the Training Dataset from S3 on Outposts to the EC2 Instance
To copy the training dataset to the EC2 instance, use the s3 sync CLI command, directing it to your local workspace.
aws s3 sync s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3 .
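After the sync completes, a quick check that every image has a matching label file can catch transfer or conversion problems early. This is a stdlib-only sketch assuming the images/&lt;split&gt; and labels/&lt;split&gt; layout described above; the root path in the usage comment is a placeholder:

```python
from pathlib import Path


def check_split(root, split):
    """Return image files under images/<split> missing a labels/<split> .txt twin."""
    images = sorted(Path(root, "images", split).glob("*.jpg"))
    missing = [img.name for img in images
               if not Path(root, "labels", split, img.stem + ".txt").exists()]
    print(f"{split}: {len(images)} images, {len(missing)} missing labels")
    return missing


# for split in ("train", "val"):
#     check_split("/your/ec2/path/to/data", split)
```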
- Prepare the Configuration Files
Create data configuration files to reflect your dataset’s structure, categories, and other parameters.
data.yml
train: /your/ec2/path/to/data/images/train
val: /your/ec2/path/to/data/images/val
nc: 10 # Number of classes in your dataset
names: ['banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake']
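A mismatch between nc and the names list is a common source of training errors, so a small sanity check is worth running. The dict below simply mirrors the data.yml above (in practice you would load the file with a YAML parser such as PyYAML, which YOLOv5's requirements already include):

```python
data_cfg = {
    "train": "/your/ec2/path/to/data/images/train",
    "val": "/your/ec2/path/to/data/images/val",
    "nc": 10,
    "names": ["banana", "apple", "sandwich", "orange", "broccoli",
              "carrot", "hot dog", "pizza", "donut", "cake"],
}


def validate(cfg):
    """Ensure the class count matches the class-name list."""
    assert cfg["nc"] == len(cfg["names"]), (
        f"nc={cfg['nc']} but {len(cfg['names'])} names listed")
    assert len(set(cfg["names"])) == len(cfg["names"]), "duplicate class names"
    return True
```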
Next, create the model training parameter file using a sample configuration file from the YOLOv5 repository. Update the number of classes to 10, while also having the option to modify other parameters as you fine-tune the model for improved performance.
parameters.yml
# Parameters
nc: 10 # number of classes in your dataset
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32

# Backbone
backbone:
[[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
[-1, 1, Conv, [128, 3, 2]], # 1-P2/4
[-1, 3, C3, [128]],
[-1, 1, Conv, [256, 3, 2]], # 3-P3/8
[-1, 6, C3, [256]],
[-1, 1, Conv, [512, 3, 2]], # 5-P4/16
[-1, 9, C3, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
[-1, 3, C3, [1024]],
[-1, 1, SPPF, [1024, 5]], # 9
]

# Head
head:
[[-1, 1, Conv, [512, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P4
[-1, 3, C3, [512, False]], # 13

The head section above is abbreviated; refer to the full sample configuration in the YOLOv5 repository (for example, models/yolov5s.yaml) for the remaining layers. With both files in place, you can start training with the repository's train.py script, passing data.yml via --data and parameters.yml via --cfg.