Amazon SageMaker Debugger – Streamline Your Machine Learning Model Debugging

We are thrilled to introduce Amazon SageMaker Debugger, an innovative feature of Amazon SageMaker designed to automatically detect intricate issues that can arise during machine learning (ML) training jobs.

Developing and training ML models requires a blend of scientific knowledge and craftsmanship. From gathering and refining datasets to experimenting with various algorithms and pinpointing optimal training parameters (the often dreaded hyperparameters), ML professionals must navigate numerous challenges to create high-performance models. This complexity is precisely what motivated us to create Amazon SageMaker: a modular, fully managed service that accelerates and simplifies ML workflows.

In my experience, ML often seems to be a favorite haunt of Mr. Murphy, where anything that can go wrong usually does! Many obscure problems can occur during the training phase, hindering the model’s ability to accurately learn and extract patterns from the dataset. I’m not merely referring to software bugs in ML libraries (though those can happen too); the majority of failed training jobs arise from improper parameter initialization, poor hyperparameter combinations, design flaws in your code, and more.

Compounding these challenges, such issues often remain hidden initially, gradually eroding your training process and resulting in low-accuracy models. Let’s face it, even for seasoned experts, identifying and resolving these problems can be a daunting and time-consuming task, which is why we developed Amazon SageMaker Debugger.

Introducing Amazon SageMaker Debugger

With your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch, and XGBoost, you can leverage the new SageMaker Debugger SDK to capture the internal model state at regular intervals; this data will be stored in Amazon Simple Storage Service (Amazon S3).

The captured state encompasses:

  • The parameters being learned by the model, such as weights and biases for neural networks
  • The modifications made to these parameters by the optimizer, also known as gradients
  • The optimization parameters themselves
  • Scalar values, like accuracies and losses
  • The output from each layer
  • And more

Each specific set of values—such as the sequence of gradients flowing through a particular neural network layer over time—is saved independently and referred to as a tensor. These tensors are organized into collections (weights, gradients, etc.), and you can decide which ones to save during training. By utilizing the SageMaker SDK and its estimators, you can configure your training job as you normally would, while also specifying additional parameters that define the rules you want SageMaker Debugger to apply.
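For example, here’s a quick sketch of selecting collections and save intervals with the SageMaker Python SDK; the S3 bucket and the intervals below are placeholders, not values from this post.

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Save weights and gradients every 100 steps and losses every 10 steps.
# The S3 output path is a placeholder.
hook_config = DebuggerHookConfig(
    s3_output_path='s3://my-bucket/debugger-output',
    collection_configs=[
        CollectionConfig(name='weights', parameters={'save_interval': '100'}),
        CollectionConfig(name='gradients', parameters={'save_interval': '100'}),
        CollectionConfig(name='losses', parameters={'save_interval': '10'})
    ])

Passing this object to the estimator’s debugger_hook_config parameter tells SageMaker which tensors to capture and where to store them.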

A rule consists of Python code that examines tensors for the model in training, searching for specific undesirable conditions. Predefined rules are available for common issues like exploding/vanishing tensors (when parameters reach NaN or zero values), exploding/vanishing gradients, loss stagnation, and more. Moreover, you can create your own custom rules.
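For instance, attaching a couple of built-in rules to an estimator might look like the sketch below; the job name, instance type, and entry point are placeholders.

import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

# Built-in rules watching for exploding tensors and vanishing gradients.
rules = [
    Rule.sagemaker(rule_configs.exploding_tensor()),
    Rule.sagemaker(rule_configs.vanishing_gradient())
]

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-rules-demo',      # placeholder job name
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',       # placeholder instance type
    entry_point='script.py',                  # placeholder training script
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    rules=rules)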

Once the SageMaker estimator is configured, you can initiate the training job, which automatically launches a debug job for each configured rule and starts inspecting the available tensors. If a debug job identifies a problem, it stops and logs additional information. A CloudWatch Events notification is also sent, enabling you to trigger further automated steps.
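For example, once the job has run, you could check the outcome of each rule with a quick boto3 call; the training job name below is a placeholder.

import boto3

sm = boto3.client('sagemaker')

# Replace with your actual training job name.
description = sm.describe_training_job(TrainingJobName='my-training-job')
for status in description.get('DebugRuleEvaluationStatuses', []):
    print(status['RuleConfigurationName'], '->', status['RuleEvaluationStatus'])

An 'IssuesFound' status tells you which rule fired and therefore what to investigate first.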

Now, if you discover that your deep learning job is suffering from vanishing gradients, you’ll know where to direct your attention: perhaps your neural network is too deep, or perhaps your learning rate is too small. With the internal state saved to S3, you can use the SageMaker Debugger SDK to analyze the evolution of tensors over time, validate your hypothesis, and address the root cause.
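For instance, here’s a minimal sketch of exploring the saved tensors with the smdebug library; the S3 path and tensor name below are placeholders rather than output from this example.

from smdebug.trials import create_trial

# Point the trial at the S3 prefix where the debug hook wrote its tensors
# (placeholder path below).
trial = create_trial('s3://my-bucket/debugger-simple-demo/debug-output')

print(trial.tensor_names())   # every tensor that was saved
print(trial.steps())          # steps at which values were captured

# Follow one tensor over time ('gradients/weight1_0' is a hypothetical name).
tensor = trial.tensor('gradients/weight1_0')
for step in tensor.steps():
    print(step, tensor.value(step))

Plotting a tensor’s values across steps is often enough to spot where training started to go off the rails.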

Debugging Machine Learning Models with Amazon SageMaker Debugger

The heart of SageMaker Debugger is its ability to capture tensors during training, which requires lightly instrumenting your training code to choose the tensor collections you wish to save, the frequency of saving, and whether to store the full values or a reduction (mean, max, etc.).

For this purpose, the SageMaker Debugger SDK provides straightforward APIs for each supported framework. Here’s a brief demonstration with a simple TensorFlow script aimed at fitting a 2-dimensional linear regression model. You’ll find additional examples in this GitHub repository.

Here’s a glimpse of the initial code:

import argparse
import numpy as np
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)

args = parser.parse_args()

with tf.name_scope('initialize'):
    # 2-dimensional inputs; learned weights deliberately initialized far from the target
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # Ground-truth weights used to generate the labels
    w0 = [[1.], [1.]]
with tf.name_scope('multiply'):
    y = tf.matmul(x, w0)        # labels
    y_hat = tf.matmul(x, w)     # predictions
with tf.name_scope('loss'):
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(args.steps):
        # Random inputs, scaled by the --scale hyperparameter
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')

Let’s run this script using the TensorFlow Estimator. I’m utilizing SageMaker’s local mode, which is a great way to quickly iterate on experimental code.

import sagemaker
from sagemaker.tensorflow import TensorFlow

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='local',
    entry_point='script-v1.py',
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters)

estimator.fit()

Upon examining the training log, the results are concerning.

algo-1-hrvqg_1 | Step=0, Loss=7.883463958023267e+23
algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23
algo-1-hrvqg_1 | Step=2, Loss=nan
algo-1-hrvqg_1 | Step=3, Loss=nan
algo-1-hrvqg_1 | Step=4, Loss=nan
algo-1-hrvqg_1 | Step=5, Loss=nan
algo-1-hrvqg_1 | Step=6, Loss=nan
algo-1-hrvqg_1 | Step=7, Loss=nan
algo-1-hrvqg_1 | Step=8, Loss=nan
algo-1-hrvqg_1 | Step=9, Loss=nan

The loss does not decrease; it explodes and quickly turns into NaN… This points to a potential exploding tensor issue, which is one of the built-in rules in SageMaker Debugger. Let’s dive in.

Utilizing the Amazon SageMaker Debugger SDK

To capture tensors, I need to enhance the training script with:

  • A SaveConfig object defining the tensor saving frequency
  • A SessionHook object linked to the TensorFlow session to manage tensor saving during training
  • An (optional) ReductionConfig object, which stores a reduction of each tensor (mean, max, etc.) instead of the full values; see the sketch after this list.
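Here’s a sketch of how the linear regression script above could be instrumented with these objects; the debug output argument and the save interval follow the pattern used in the Debugger examples, but the exact values are assumptions on my part.

import argparse
import numpy as np
import tensorflow as tf
import smdebug.tensorflow as smd

parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
parser.add_argument('--steps', type=int, default=100)
parser.add_argument('--scale', type=float, default=1.0)
# Hypothetical argument for where the hook writes tensors locally before upload to S3.
parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors')
args = parser.parse_args()

# SessionHook: save weights, gradients, and losses every 10 steps.
hook = smd.SessionHook(
    out_dir=args.debug_path,
    include_collections=['weights', 'gradients', 'losses'],
    # reduction_config=smd.ReductionConfig(reductions=['mean']),  # optional: store reductions instead of full values
    save_config=smd.SaveConfig(save_interval=10))

x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
w0 = [[1.], [1.]]
y = tf.matmul(x, w0)
y_hat = tf.matmul(x, w)
loss = tf.reduce_mean((y_hat - y) ** 2, name='loss')

# Wrapping the optimizer lets the hook capture gradients.
optimizer = hook.wrap_optimizer(tf.train.AdamOptimizer(args.lr))
optimizer_op = optimizer.minimize(loss)

# MonitoredSession initializes variables and runs the hook at every step.
with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, _ = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')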
