Implementing a Continuous Learning Machine Learning Pipeline with Amazon SageMaker, AWS Glue DataBrew, and SAP S/4HANA

Machine learning is becoming an essential component of digital transformation, enabling businesses to identify patterns at scale and discover innovative ways to enhance customer satisfaction, optimize operations, and gain a competitive edge. When constructing a machine learning architecture, it’s crucial to comprehend the data, address challenges in data preparation, and maintain model accuracy through a continuous feedback loop. In this article, we will outline how to establish an end-to-end integration between SAP S/4HANA systems and Amazon SageMaker, leveraging the virtually limitless resources provided by AWS for swift feedback cycles.

Introduction

Our approach begins with extracting data from SAP S/4HANA systems using a combination of SAP OData, ABAP CDS, and AWS Glue to transfer data into an Amazon S3 bucket. After that, we utilize AWS Glue DataBrew for data preparation, followed by training the model in Amazon SageMaker. Lastly, we feed the prediction results back into the SAP system.

Prerequisites

  1. We will use credit card transaction data to simulate data within an SAP system, which can be downloaded from Kaggle.
  2. Deployment of SAP S/4HANA, which can be easily accomplished using the AWS Launch Wizard for SAP.

Walkthrough

Step 1: SAP Data Preparation

Multiple methods exist for extracting data from SAP systems into AWS. In our case, we utilize ABAP Core Data Services (CDS) views, RESTful Open Data Protocol (OData) services, and AWS Glue.

  • In the SAP S/4HANA system, create a custom database table using SAP transaction SE11.
  • Import data from Kaggle into the SAP HANA table using the IMPORT FROM CSV statement.
  • Develop an SAP ABAP CDS view in the ABAP Development Tools (ADT) and add the annotation @OData.publish: true to expose it as an OData service. An example can be found in the AWS Samples GitHub repository.
  • Activate the OData service via SAP transaction /IWFND/MAINT_SERVICE.
  • Optionally, test the OData service in the SAP gateway client using transaction /IWFND/GW_CLIENT.
  • In the AWS Console, create an AWS Glue job for data extraction by following the AWS documentation for a Python shell job. In our scenario, it took less than 1 minute to extract 284,807 entries from SAP using just 0.0625 Data Processing Units (DPU).
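
The exact Python shell script depends on your OData service and authentication setup. The sketch below is a minimal, hypothetical example: the host name, the service path ZCREDITCARD_CDS, the credentials, and the bucket are placeholders, and the requests library may need to be supplied through the job's Python library settings if it is not already available in your Glue environment.

```python
import io

import boto3
import pandas as pd
import requests  # may need to be added to the Glue job's Python library path

# Hypothetical values; replace with your SAP host, OData service path,
# credentials (ideally read from AWS Secrets Manager), and target bucket.
SAP_HOST = "https://sap-s4hana.example.com:44300"
SERVICE_PATH = "/sap/opu/odata/sap/ZCREDITCARD_CDS/Zcreditcard"
S3_BUCKET = "sap-ml-fraud-detection"
S3_KEY = "raw/creditcard.csv"


def extract_to_s3():
    """Read all entries from the SAP OData service and store them as CSV in S3."""
    records = []
    url = f"{SAP_HOST}{SERVICE_PATH}?$format=json"
    while url:
        response = requests.get(url, auth=("ODATA_USER", "ODATA_PASSWORD"), timeout=120)
        response.raise_for_status()
        payload = response.json()["d"]
        records.extend(payload["results"])
        # SAP Gateway returns a __next link when server-side paging is active.
        url = payload.get("__next")

    # Drop the OData metadata object attached to every entity and write CSV.
    frame = pd.DataFrame(records).drop(columns=["__metadata"], errors="ignore")
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=S3_KEY, Body=buffer.getvalue())
    print(f"Wrote {len(frame)} records to s3://{S3_BUCKET}/{S3_KEY}")


if __name__ == "__main__":
    extract_to_s3()
```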

Step 2: AWS Glue DataBrew for Data Wrangling

Before training the fraud detection model, we need to prepare the datasets for machine learning. This process typically includes data cleansing, normalization, encoding, and sometimes creating new data features. AWS Glue DataBrew, launched in November 2020, facilitates data preparation, allowing data analysts and scientists to clean and normalize data up to 80% faster. With over 250 pre-built transformations available, you can automate tasks without any coding.

Step 2.1: Create a Project
  • Log in to the AWS console and select AWS Glue DataBrew from the services menu.
  • In the Projects pane, choose Create project.
  • Name the project SAP-ML.
  • For the dataset name, enter CreditcardfraudDB, select the Amazon S3 location created in Step 1 as the data source, and choose the entire folder.
  • Create a new IAM role for access permissions, using a suffix like fraud-detection-role to allow DataBrew to read from your Amazon S3 input location.
  • Select Create Project. AWS Glue DataBrew will initially show a sample dataset of 500 rows, which can be adjusted to include up to 5000 rows.
Step 2.2: Create Data Profile

A notable feature is the ability to create a data profile, which assesses the quality of your dataset to uncover patterns and detect anomalies. Creating a profile job takes only a few clicks, and the AWS documentation walks through the steps. By default, the data profile is limited to the first 20,000 rows, but you can request a service limit increase for larger datasets; in our case, we requested an increase to 300,000 rows.
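
The same profile job can also be created and started programmatically. Below is a minimal sketch with boto3, assuming the dataset name CreditcardfraudDB from Step 2.1 and a hypothetical IAM role and output bucket:

```python
import boto3

databrew = boto3.client("databrew")

# Hypothetical role ARN and output bucket; replace with your own.
ROLE_ARN = "arn:aws:iam::111122223333:role/databrew-fraud-detection-role"
OUTPUT_BUCKET = "sap-ml-fraud-detection"

# Create a profile job over the dataset registered in Step 2.1.
databrew.create_profile_job(
    Name="creditcard-profile-job",
    DatasetName="CreditcardfraudDB",
    OutputLocation={"Bucket": OUTPUT_BUCKET, "Key": "profile-output/"},
    RoleArn=ROLE_ARN,
)

# Start the job; progress and results are visible in the DataBrew console.
run = databrew.start_job_run(Name="creditcard-profile-job")
print("Started profile job run:", run["RunId"])
```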

The profile gives deeper insight into the dataset: we analyzed a total of 284,807 rows with 32 columns and found no missing values, which is excellent. However, the correlation matrix was empty because of a schema issue (every column was ingested as a string), which we address in the next step.

Step 2.3: Data Preparation

First, modify the data type of all columns. Under Schema, change the type from string to number. AWS Glue DataBrew will automatically calculate data statistics and generate box plots for each column once you change the data type.

Some algorithms are sensitive to the scale of input features, and columns with very different ranges can skew results. We apply Z-score normalization to the amount and time columns to standardize them: select Column Actions, then Z-score normalization, and apply it. This generates two new columns, amount_normalized and time_normalized.
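
Z-score normalization rescales a column to zero mean and unit standard deviation. The short pandas sketch below shows the equivalent computation, which is handy for spot-checking the DataBrew output (the local file path is just for illustration):

```python
import pandas as pd

# Load the extracted data (hypothetical local path, for illustration only).
df = pd.read_csv("creditcard.csv")

# Z-score normalization: subtract the column mean and divide by the standard deviation.
for column in ["amount", "time"]:
    df[f"{column}_normalized"] = (df[column] - df[column].mean()) / df[column].std()

print(df[["amount_normalized", "time_normalized"]].describe())
```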

Remove the now-redundant columns by selecting amount, time, and recorded, then choosing Delete. Move the class column to the end via Column Actions, followed by Move column. We are now ready to apply the transformation to the entire dataset: create a job named fraud-detection-transformation and select an S3 folder as the destination, with CSV as the output format.

Alternatively, you can apply the transformations from a recipe, with examples available in the AWS Samples GitHub repository. For validation, you can import the transformed data into AWS Glue DataBrew in a new project to verify the transformation and view the feature correlation matrix.
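
If you prefer to script this step as well, the transformation job can be created against the SAP-ML project with boto3 so that it runs the recipe built interactively above. The role ARN and output bucket below are placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Hypothetical role ARN and output bucket; replace with your own.
ROLE_ARN = "arn:aws:iam::111122223333:role/databrew-fraud-detection-role"
OUTPUT_BUCKET = "sap-ml-fraud-detection"

# The job runs the recipe attached to the SAP-ML project and writes CSV to S3.
databrew.create_recipe_job(
    Name="fraud-detection-transformation",
    ProjectName="SAP-ML",
    Outputs=[{
        "Format": "CSV",
        "Location": {"Bucket": OUTPUT_BUCKET, "Key": "transformed/"},
    }],
    RoleArn=ROLE_ARN,
)

databrew.start_job_run(Name="fraud-detection-transformation")
```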

Step 3: Amazon SageMaker

When building a machine learning workload in AWS, you can select from various abstraction levels to balance speed and customization. This blog utilizes Amazon SageMaker, a fully-managed platform that simplifies the building, training, and deployment of machine learning models. For this part, we will use Amazon SageMaker Studio.

  • Launch Amazon SageMaker Studio by following AWS documentation.
  • Clone the Jupyter notebook from the AWS Samples GitHub repository.
  • Open SAP_Fraud_Detection/SAP Credit Card Fraud Prediction.ipynb and carefully follow the instructions provided. Once all steps are completed, the output will be saved in the S3 bucket.
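
The notebook contains the complete training and evaluation workflow. As a rough illustration of what training a binary classifier on the transformed data looks like with the SageMaker Python SDK, here is a sketch; the use of the built-in XGBoost algorithm, the S3 paths, and the hyperparameters are assumptions for illustration, not necessarily the notebook's exact configuration.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical S3 locations for the DataBrew output split into train/validation sets.
bucket = "sap-ml-fraud-detection"
train_uri = f"s3://{bucket}/transformed/train/"
validation_uri = f"s3://{bucket}/transformed/validation/"
output_uri = f"s3://{bucket}/model-output/"

# Built-in XGBoost container for binary classification (fraud / not fraud).
# Note: it expects headerless CSV input with the label in the first column.
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=output_uri,
    sagemaker_session=session,
)
# scale_pos_weight compensates for the heavy class imbalance in the fraud data.
estimator.set_hyperparameters(objective="binary:logistic", num_round=100, scale_pos_weight=100)

estimator.fit({
    "train": TrainingInput(train_uri, content_type="text/csv"),
    "validation": TrainingInput(validation_uri, content_type="text/csv"),
})
```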

Step 4: SAP Data Import

Similar to the data export in Step 1, we use an ABAP CDS view and OData for the import: the prediction results stored in Amazon S3 are posted back to the SAP system, where confirmed outcomes can later feed the next training cycle and close the continuous feedback loop.
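
How the results are posted back depends on the update-enabled OData service you expose. The sketch below assumes a hypothetical service (ZFRAUD_RESULT_CDS), basic authentication, and prediction columns named transaction_id and prediction. SAP Gateway requires a CSRF token for modifying requests, so the script fetches one first.

```python
import boto3
import pandas as pd
import requests

# Hypothetical values; replace with your SAP host, OData service, and credentials.
SAP_HOST = "https://sap-s4hana.example.com:44300"
SERVICE_PATH = "/sap/opu/odata/sap/ZFRAUD_RESULT_CDS/Zfraudresult"

# Read the prediction output the SageMaker notebook wrote to S3 (hypothetical key).
obj = boto3.client("s3").get_object(Bucket="sap-ml-fraud-detection",
                                    Key="predictions/predictions.csv")
predictions = pd.read_csv(obj["Body"])

session = requests.Session()
session.auth = ("ODATA_USER", "ODATA_PASSWORD")

# SAP Gateway: fetch a CSRF token with a GET request before any modifying request.
token_response = session.get(f"{SAP_HOST}{SERVICE_PATH}", headers={"x-csrf-token": "fetch"})
headers = {
    "x-csrf-token": token_response.headers["x-csrf-token"],
    "Content-Type": "application/json",
}

# Post one entity per prediction; field names must match the CDS view definition.
for _, row in predictions.iterrows():
    payload = {"TransactionId": str(row["transaction_id"]),
               "FraudPrediction": str(row["prediction"])}
    response = session.post(f"{SAP_HOST}{SERVICE_PATH}", json=payload, headers=headers)
    response.raise_for_status()
```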

Chanci Turner