Streamlining TensorFlow 2 Workflows in Amazon SageMaker: A Comprehensive Guide


Navigating the full lifecycle of a deep learning project can be daunting, especially when juggling various tools and services. Often, teams rely on distinct platforms for data preprocessing, model training, tuning, and deployment, alongside workflow automation to integrate these processes for production. The complications arising from switching between these tools can lead to project delays and increased expenses. This article explores how Amazon SageMaker can simplify the management of deep learning project lifecycles, using TensorFlow 2 for illustrative purposes—though the principles apply to other frameworks as well.

For those eager to dive in, there’s a sample notebook available that can be executed in under an hour to showcase the features discussed. For further details, check out the GitHub repository.

Overview of the Amazon SageMaker Workflow

Every data science project, whether using TensorFlow 2 or another framework, begins with data collection, exploration, and preprocessing. Within an Amazon SageMaker workflow, data exploration commonly takes place in notebooks. These notebooks are best run on smaller, cost-effective instance types, as they may need to operate for extended periods.

However, when it comes to processing large datasets or conducting model training and inference, notebooks are not the ideal environment due to their limited computational power. Instead, leveraging Amazon SageMaker’s capability to deploy appropriately sized clusters of more powerful instances is a more efficient and cost-effective solution. Charges for these instances are billed by the second, and they automatically terminate upon job completion. Therefore, typical costs in an Amazon SageMaker workflow are largely driven by the inexpensive notebooks used for data exploration and prototyping, rather than the more costly GPU and accelerated compute instances.

Once prototyping is complete, transitioning to workflow automation is essential for orchestrating the entire process through model deployment in a consistent manner. Amazon SageMaker offers a native solution for this need. The following sections will introduce various features of Amazon SageMaker that facilitate different stages of the project lifecycle.

Data Transformation Using Amazon SageMaker Processing

Amazon SageMaker Processing allows you to preprocess large datasets in a managed cluster separate from notebooks. It includes native support for Scikit-learn and can accommodate any other containerized technology. For instance, you can deploy ephemeral Apache Spark clusters for feature transformations within Amazon SageMaker Processing.
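
As a rough sketch of that bring-your-own-container pattern, the snippet below runs a placeholder PySpark script with the generic ScriptProcessor class. The container image URI and script name are assumptions; you would first need to build and push a Spark-enabled processing image to Amazon ECR, and inputs and outputs would be specified the same way as in the Scikit-learn example that follows.

from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor

# Hypothetical: a custom Spark-enabled container image pushed to Amazon ECR
spark_image_uri = '<account-id>.dkr.ecr.<region>.amazonaws.com/spark-processing:latest'

spark_processor = ScriptProcessor(image_uri=spark_image_uri,
                                  command=['python3'],
                                  role=get_execution_role(),
                                  instance_type='ml.m5.xlarge',
                                  instance_count=3)

# 'spark_feature_transform.py' is a placeholder PySpark feature-engineering script
spark_processor.run(code='spark_feature_transform.py')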

To utilize Amazon SageMaker Processing with Scikit-learn, you simply need to provide a Python data preprocessing script that adheres to standard Scikit-learn conventions. The script requires minimal specifications regarding input and output data locations. Amazon SageMaker Processing automatically fetches input data from Amazon Simple Storage Service (Amazon S3) and uploads the transformed data back to S3 after job completion.

Before initiating an Amazon SageMaker Processing job, instantiate a SKLearnProcessor object as shown below, specifying the instance type and the number of instances to use.

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

# Managed Scikit-learn processor running on two ml.m5.xlarge instances
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=get_execution_role(),
                                     instance_type='ml.m5.xlarge',
                                     instance_count=2)

To distribute the data files across the cluster instances, specify the ShardedByS3Key distribution type in the ProcessingInput object, which gives each instance a roughly equal share of the files from the designated S3 bucket. This kind of scalable, stateless data transformation is one of the many advantages of Amazon SageMaker Processing.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime 

# bucket, s3_prefix, and raw_s3 (the S3 URI of the raw input data) are assumed
# to have been defined earlier in the notebook
processing_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/data'.format(bucket, s3_prefix)

# Run the preprocessing script; each instance receives its shard of the input
# and writes its train and test splits back to Amazon S3
sklearn_processor.run(code='preprocessing.py',
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                        source=raw_s3,
                        destination='/opt/ml/processing/input',
                        s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(output_name='train',
                                                destination='{}/train'.format(output_destination),
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test',
                                                destination='{}/test'.format(output_destination),
                                                source='/opt/ml/processing/test')])

Prototyping Training and Inference Code in Local Mode

Once your dataset is prepared for training, the next step is to develop the training code. For TensorFlow 2, the simplest approach is to provide a training script compatible with the prebuilt TensorFlow 2 container in Amazon SageMaker. This feature is known as script mode and integrates seamlessly with Amazon SageMaker’s local mode training functionality.
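
A script-mode training script is mostly ordinary TensorFlow 2 code: it reads hyperparameters from command-line arguments and locates its data and model directories through environment variables set by the training container. The sketch below illustrates the general shape; the model architecture and data format are illustrative assumptions, not the exact train.py from the sample repository.

# train.py -- minimal sketch of a TensorFlow 2 script-mode training script
import argparse
import os

import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed by SageMaker as command-line arguments
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--learning_rate', type=float, default=0.01)
    # Data and model locations provided by the training container
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    args, _ = parser.parse_known_args()

    x_train = np.load(os.path.join(args.train, 'x_train.npy'))
    y_train = np.load(os.path.join(args.train, 'y_train.npy'))

    # Placeholder regression model; substitute your own architecture
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(args.learning_rate), loss='mse')
    model.fit(x_train, y_train, batch_size=args.batch_size, epochs=args.epochs)

    # Save in SavedModel format so the prebuilt serving container can load it
    model.save(os.path.join(args.model_dir, '1'))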

Local mode allows you to verify that your code works in a notebook environment before scaling up to full-scale hosted training on a dedicated cluster managed by Amazon SageMaker. Typically, in local mode, you perform short training sessions for a few epochs on a sample of the dataset to validate your code's functionality, thus avoiding unnecessary use of resources during full-scale training. You can specify the instance type as either local_gpu or local, depending on whether your notebook is running on a GPU or CPU instance.

import sagemaker
from sagemaker.tensorflow import TensorFlow

# Pull the training script directly from the sample repository on GitHub
git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode', 
              'branch': 'master'}

model_dir = '/opt/ml/model'
# Use 'local' on a CPU notebook instance, or 'local_gpu' on a GPU instance
train_instance_type = 'local'
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}

# Script mode estimator using the prebuilt TensorFlow 2.1 container
local_estimator = TensorFlow(git_config=git_config,
                             source_dir='tf-2-workflow/train_model',
                             entry_point='train.py',
                             model_dir=model_dir,
                             train_instance_type=train_instance_type,
                             train_instance_count=1,
                             hyperparameters=hyperparameters,
                             role=sagemaker.get_execution_role(),
                             base_job_name='tf-2-workflow',
                             framework_version='2.1',
                             py_version='py3',
                             script_mode=True)
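
With the estimator defined, a local training run is a single fit call. The input locations below are hypothetical file:// paths pointing at small local copies of the processed data; S3 URIs work equally well in local mode.

# Hypothetical local input locations; S3 URIs also work in local mode
inputs = {'train': 'file:///tmp/tf-2-workflow/train',
          'test': 'file:///tmp/tf-2-workflow/test'}

local_estimator.fit(inputs)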

While local mode is invaluable for testing training code, it also serves as a convenient way to prototype inference code before deploying a model to a hosted endpoint.
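
For example, you can deploy the locally trained model to a local endpoint and send it a few test requests; the payload below is a hypothetical batch of feature vectors.

import numpy as np

# Deploy to a local endpoint; the serving container runs on the notebook instance
local_predictor = local_estimator.deploy(initial_instance_count=1,
                                         instance_type='local')

# Hypothetical payload: a small batch of preprocessed feature vectors
# (the feature count of 13 is an assumption, not taken from the sample)
sample = np.random.rand(5, 13).tolist()
predictions = local_predictor.predict(sample)
print(predictions)

# Stop the local serving container when finished
local_predictor.delete_endpoint()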
