Amazon Onboarding with Learning Manager Chanci Turner

Amazon SageMaker is a fully managed service for training and deploying machine learning models at scale. We are excited to introduce multiclass classification to the linear learner algorithm in Amazon SageMaker. Linear learner already provides convenient APIs for linear models such as logistic regression, which can be used for classification tasks like predicting ad clicks or detecting fraud, and linear regression, which can be used to forecast numerical outcomes like sales or delivery times. If you are new to linear learner, consider starting with the documentation or our previous blog post on this algorithm. If you are new to Amazon SageMaker itself, start with an introduction to the service.

In this article, we will focus on three key areas of training a multiclass classifier using the linear learner:

  1. Training a multiclass classifier
  2. Metrics for multiclass classification
  3. Utilizing balanced class weights during training

Training a Multiclass Classifier

Multiclass classification is the machine learning task of predicting an output from a finite set of labels. For instance, we might classify emails into the categories inbox, work, shopping, or spam. Alternatively, we may want to predict what a customer will purchase from a selection of options like shirt, mug, bumper sticker, or no purchase. If our dataset consists of numerical features and corresponding categorical labels, we can train a multiclass classifier.

Multiclass classification is related to two other machine learning tasks: binary classification and the multilabel problem. Linear learner already supports binary classification and now supports multiclass classification, but it does not yet support multilabel problems.

In a binary classification scenario, there are only two potential labels in the dataset. Examples include determining whether a transaction is fraudulent or not based on transaction and customer information, or recognizing if a person is smiling in a photo based on extracted features. Each training example corresponds to one correct label and one incorrect label.

When there are more than two labels in the dataset, we encounter a multiclass classification task. For example, we might want to predict whether a transaction is fraudulent, canceled, returned, or completed. In such cases, multiple labels exist, but only one is correct at any given time.

On the other hand, multilabel problems arise when a single training example can possess more than one correct label. For example, an image of a dog catching a Frisbee at the park could be tagged with multiple labels like outdoors, dog, and park. While we haven’t added multilabel support yet, there are methods to address multilabel problems using the linear learner today. You can either train independent binary classifiers for each label or train a multiclass classifier to predict the top class, top k classes, or all classes that exceed a certain probability threshold.
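The second workaround, deriving several labels from one multiclass prediction, can be sketched as follows. This is a minimal illustration that assumes the class probabilities are already available as a NumPy array; the function names and the example probabilities are hypothetical and are not part of the linear learner API.

```python
import numpy as np

def top_class(probs):
    """Index of the single most probable class for each row."""
    return np.argmax(probs, axis=1)

def top_k_classes(probs, k):
    """Indices of the k most probable classes per row, most probable first."""
    return np.argsort(-probs, axis=1)[:, :k]

def classes_above_threshold(probs, threshold):
    """For each row, the class indices whose probability exceeds the threshold."""
    return [np.flatnonzero(row > threshold).tolist() for row in probs]

# hypothetical probabilities for two examples over four classes
probs = np.array([[0.50, 0.30, 0.15, 0.05],
                  [0.10, 0.20, 0.25, 0.45]])

print(top_class(probs))
print(top_k_classes(probs, 2))
print(classes_above_threshold(probs, 0.2))
```

The threshold variant can return an empty list for an example when no class is confident enough, which may or may not be acceptable depending on the application.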

The linear learner employs a softmax loss function to train multiclass classifiers, learning a set of weights for each class and predicting a probability for each. In certain cases, we may want to use these probabilities directly, for instance, classifying emails as inbox, work, shopping, or spam, with a policy to flag as spam only when the class probability exceeds 99.99%. However, in many multiclass scenarios, we typically select the class with the highest probability as the predicted label.
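To make the two ways of using the probabilities concrete, here is a small sketch of a softmax followed by the spam policy described above. It assumes raw per-class scores are given as a NumPy array. The fallback to the most probable non-spam class when the spam probability is too low is an assumption made for this illustration, not something the linear learner does for you.

```python
import numpy as np

CLASSES = ['inbox', 'work', 'shopping', 'spam']
SPAM = CLASSES.index('spam')

def softmax(scores):
    """Convert each row of raw scores into probabilities that sum to 1."""
    # subtract the row maximum for numerical stability
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict_with_spam_policy(scores, spam_threshold=0.9999):
    """Pick the most probable class, but flag spam only above the threshold."""
    labels = []
    for row in softmax(scores):
        best = int(np.argmax(row))
        if best == SPAM and row[SPAM] <= spam_threshold:
            # assumption: fall back to the most probable non-spam class
            masked = row.copy()
            masked[SPAM] = -1.0
            best = int(np.argmax(masked))
        labels.append(CLASSES[best])
    return labels

# hypothetical scores for three emails
scores = np.array([[0.0, 0.0, 0.0, 20.0],
                   [0.0, 0.0, 0.0, 2.0],
                   [3.0, 1.0, 0.0, 0.0]])
print(predict_with_spam_policy(scores))
```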

Hands-on Example: Predicting Forest Cover Type

To illustrate multiclass prediction, let’s examine the Covertype dataset (copyright Jock A. Blackard and Colorado State University). This dataset contains information gathered by the US Geological Survey and the US Forest Service regarding wilderness areas in northern Colorado. The features include measurements such as soil type, elevation, and distance to water, while the labels indicate the types of trees—essentially the forest cover type—for each location. The goal of the machine learning task is to predict the cover type in a specific location based on the provided features.

We will download and analyze the dataset, then train a multiclass classifier using the linear learner with the Python SDK. For those interested in running this example on their own, please refer to the notebook version of this blog post.

Here’s a glimpse of the code to get started:

# import data science and visualization libraries
%matplotlib inline
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import seaborn as sns 

# download the raw data and unzip
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
!gunzip covtype.data.gz 

# read the csv (the file has no header row) and extract features and labels
covtype = pd.read_csv('covtype.data', delimiter=',', header=None, dtype='float32').to_numpy()
covtype_features, covtype_labels = covtype[:, :54], covtype[:, 54]
# transform labels to 0 index
covtype_labels -= 1
# shuffle and split into train and test sets
np.random.seed(0)
train_features, test_features, train_labels, test_labels = train_test_split(
    covtype_features, covtype_labels, test_size=0.2)
# further split the test set into validation and test sets
val_features, test_features, val_labels, test_labels = train_test_split(
    test_features, test_labels, test_size=0.5) 

Note that we shifted the labels to be zero-indexed instead of starting from one. This step is required because linear learner expects the class labels to fall within the range [0, k-1], where k is the total number of labels. Amazon SageMaker algorithms also expect the dtype of all feature and label values to be float32. Finally, we shuffled the training examples by using the train_test_split function from scikit-learn, which shuffles the rows by default. Shuffling is important for algorithms trained with stochastic gradient descent, including linear learner and many deep learning algorithms. Always shuffle your training examples unless a natural order needs to be maintained, such as in forecasting problems.
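A quick sanity check of the label requirements can catch indexing mistakes before a training job is launched. The helper below is a hypothetical sketch, not part of any SageMaker API; it only verifies that labels are whole numbers in the range [0, k-1].

```python
import numpy as np

def check_labels(labels, num_classes):
    """Return True if all labels are whole numbers in [0, num_classes - 1]."""
    labels = np.asarray(labels)
    is_integral = np.all(labels == np.round(labels))
    in_range = labels.min() >= 0 and labels.max() <= num_classes - 1
    return bool(is_integral and in_range)

# hypothetical one-indexed labels, as they appear in the raw Covertype file
raw = np.array([1, 7, 3, 5], dtype='float32')
zero_indexed = raw - 1

print(check_labels(raw, 7))           # one-indexed labels include 7, out of range
print(check_labels(zero_indexed, 7))  # shifted labels fall within [0, 6]
print(zero_indexed.dtype)             # subtraction keeps the float32 dtype
```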

We divided the data into training, validation, and test sets with an 80/10/10 ratio. Providing a validation set improves training because linear learner uses the validation data to stop training once overfitting is detected. This leads to shorter training times and more accurate predictions. We can also provide a test set to linear learner. Although the test set does not influence the final model, the algorithm logs will report metrics for the final model's performance on the test set. Later in this article, we will analyze the test set locally to gain deeper insights into model performance.
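The idea of halting training based on validation data can be illustrated with a generic patience-based rule. This is only a sketch of the general technique under assumed settings; the function name, the patience parameter, and the loss values are hypothetical, and the actual stopping criterion used inside linear learner may differ.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a simple patience-based rule.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs; the model from best_epoch
    would be kept.
    """
    best_loss = float('inf')
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # never triggered: training ran through every epoch
    return len(val_losses) - 1, best_epoch

# hypothetical validation losses that start rising after epoch 2
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
print(early_stopping_epoch(losses))
```

In this example the loss stops improving after epoch 2, so training would end at epoch 4 and skip the remaining epoch.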

Exploring the Data

Let’s examine the distribution of class labels present in the training dataset. We will assign meaningful category names using the mapping provided in the dataset documentation.

# assign label names and count label frequencies
label_map = {0:'Spruce/Fir', 1:'Lodgepole Pine', 2:'Ponderosa Pine', 3:'Cottonwood/Willow', 
             4:'Aspen', 5:'Douglas-fir', 6:'Krummholz'}
label_counts = pd.DataFrame(data=train_lab

This exploration will provide insights into the representation of various classes within the training set, aiding in the understanding of model performance.
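As a rough sketch of how class frequencies can be computed, the snippet below counts labels with NumPy and maps them to the cover type names. It uses a small hypothetical label array rather than the real training labels, and the helper name is an assumption made for this illustration.

```python
import numpy as np

label_map = {0: 'Spruce/Fir', 1: 'Lodgepole Pine', 2: 'Ponderosa Pine',
             3: 'Cottonwood/Willow', 4: 'Aspen', 5: 'Douglas-fir', 6: 'Krummholz'}

def label_frequencies(labels, label_map):
    """Return the fraction of examples belonging to each named class."""
    # labels are stored as float32, so cast to int before counting
    counts = np.bincount(np.asarray(labels).astype(int), minlength=len(label_map))
    total = counts.sum()
    return {label_map[i]: float(counts[i] / total) for i in range(len(label_map))}

# hypothetical zero-indexed labels, stored as float32 like the real data
example_labels = np.array([0, 1, 1, 1, 2, 6, 1, 0], dtype='float32')
freqs = label_frequencies(example_labels, label_map)
for name, fraction in freqs.items():
    print(f'{name}: {fraction:.3f}')
```

Classes that never occur still appear with a frequency of zero, which makes rare or missing classes easy to spot.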
