Amazon Onboarding with Learning Manager Chanci Turner

Learning Innovations is a data science and machine learning (ML) platform built on Python. Our multi-node, multi-GPU computing enables workflows to be accelerated by up to 100 times, significantly reducing the time needed to achieve business outcomes. Compatible with the broader Python ecosystem and various tools, Learning Innovations allows for highly customizable workflows and flexible computing experiences tailored to different setups. Recently, we introduced support for Amazon EC2 Spot Instances.

Amazon EC2 Spot Instances allow users to utilize unused Amazon Elastic Compute Cloud (EC2) capacity at discounts of up to 90% compared to On-Demand pricing. By integrating Spot Instances with On-Demand options, organizations can meet the performance requirements of numerous data science and ML tasks while maximizing cost efficiency.

In this blog, we will illustrate how Learning Innovations simplifies this process using a deep learning workload as an example. We will detail how to provision a cluster of On-Demand and Spot Instances to achieve your performance goals, while benchmarking the price performance of this strategy against alternative methods.

For this demonstration, we will explore three widely-used open-source Python libraries:

  1. PyTorch for deep learning.
  2. CUDA, NVIDIA's parallel computing platform, accessed through PyTorch to run computations on GPU hardware.
  3. Dask for parallelizing and distributing computations across a cluster of EC2 nodes.
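Before diving in, here is a minimal illustration of the parallelism Dask provides. The example below uses `dask.delayed` to build a small task graph and execute it; on a real cluster, the same graph would be scheduled across EC2 worker nodes (function names here are illustrative, not from the workload described later).

```python
import dask

@dask.delayed
def square(x):
    # Each call becomes a task in the graph rather than running immediately.
    return x * x

# Build a graph of four squares feeding into a sum, then execute it.
total = dask.delayed(sum)([square(i) for i in range(4)])
result = total.compute()  # 0 + 1 + 4 + 9
```

The same pattern scales unchanged from a laptop to a multi-node cluster: only the scheduler backing `.compute()` changes.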

Amazon EC2 Spot Instances represent spare compute capacity in the AWS Cloud, offered at significant discounts compared to On-Demand instance prices. The primary distinction is that Spot Instances can be interrupted by Amazon EC2 with a two-minute notice if the capacity is needed back. Thankfully, a considerable amount of spare capacity is typically available, and managing interruptions to create resilient workloads with Spot Instances is straightforward. These instances are particularly suited for stateless, fault-tolerant, loosely coupled, and flexible workloads that can accommodate interruptions.

Learning Innovations Architecture

Learning Innovations operates as an application within Kubernetes, leveraging AWS services like EC2, AWS Identity and Access Management (IAM), and Amazon Virtual Private Cloud (VPC) to deliver secure, scalable infrastructure for running data science and ML workloads within your AWS ecosystem. The architecture enables users to connect to their AWS storage, real-time data sources, and management tools. Users can authenticate their Learning Innovations projects with IAM credentials, connecting to various services via the CLI, REST API, or Python packages like boto3.

Fortunately for our users, Learning Innovations has designed Dask clusters for high fault tolerance, ensuring seamless workflow continuity. For instance, if an EC2 Spot worker node in your cluster is interrupted, a replacement node of the same instance type can automatically spin up when available, utilizing Auto Scaling Groups. The centralized Dask scheduler coordinates this recovery; because the scheduler holds the cluster's state, that portion of the cluster is always provisioned with On-Demand resources rather than Spot.

This architecture is especially beneficial for compute-intensive workloads, such as computer vision and natural language processing (NLP), where cluster sizes may need to scale significantly to achieve desired performance. A key advantage is that a fixed data science budget can stretch much further when utilizing Spot Instances within Learning Innovations.

Optimizing Cost Efficiency for Image Classification

Next, we will guide you through the essential steps for executing an image classification inference using the popular ResNet50 deep learning model on a GPU cluster. By running this workload on a Dask cluster using Spot Instances, we can demonstrate a performance increase of 38 times compared to a non-parallelized approach, and at a cost reduction of 95%.

While this post provides a conceptual overview of the steps involved, you can find the full details and code snippets in this blog post: Computer Vision at Scale with Dask and PyTorch.

Step 1: Establish a GPU Cluster on Learning Innovations with Spot Instances Enabled

Setting up a Dask cluster on Spot Instances is straightforward. Simply check the Spot Instance option in the Dask user interface, and Learning Innovations manages the rest. To begin, we need to make our image dataset accessible. A practical approach is to store data in Amazon Simple Storage Service (Amazon S3) and utilize the s3fs library to download it to a local directory. To enhance computation efficiency, we will mirror all image files across all worker nodes in our cluster.

Next, we will verify that the Jupyter instance and each of our worker nodes are GPU-capable by using the torch.cuda.is_available() function. If you’re unfamiliar with CUDA, it’s a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPUs; PyTorch exposes it through the torch.cuda module. After this, we will set the “device” to use CUDA whenever a GPU is available.
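The device check and selection can be written in a few lines. This is the standard PyTorch idiom; the `Linear` module and random input below are stand-ins just to show that models and tensors move to the device the same way.

```python
import torch

# Prefer the GPU when one is visible; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Any module and tensor are moved to the device the same way.
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(1, 4, device=device)
y = model(x)  # computation runs on the selected device
```

Running this once on the Jupyter node and once on each worker confirms the whole cluster is GPU-capable.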

Step 2: Execute Inference Tasks Using PyTorch and Batch Processing for Acceleration

Now we are set to start the classification process. To optimize parallelization on the GPU cluster, we will employ the built-in PyTorch DataLoader class to load, transform, and batch our images. Each batch the DataLoader yields contains two tensors: the image data and the ground-truth labels. Because predictions and ground truths are strings whose formatting may vary, it’s beneficial to use regular expressions to compare them automatically when checking model accuracy.
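A minimal sketch of the batching step, using random tensors as a stand-in for the real image dataset (ResNet50 expects 3×224×224 inputs; the batch size of 8 is an arbitrary choice for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the image dataset: 32 "images" with integer labels.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 10, (32,))

# DataLoader groups samples into batches for efficient GPU inference.
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=False)

batch_images, batch_labels = next(iter(loader))
```

In the real workload, the dataset would apply the usual ResNet preprocessing transforms (resize, crop, normalize) before batching.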

Step 3: Combine the Steps into a Single Function

Rather than calling each step by hand, we will combine them into a single function that can be mapped across all batches of images in the cluster. The full code snippets are available in the article on the Learning Innovations website: Put It All Together.
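The shape of that combined function can be sketched with stubs. The `preprocess` and `infer` functions below are hypothetical placeholders for the real transform and ResNet50 forward pass; the point is that one self-contained callable can then be mapped over every batch with the Dask client.

```python
# Hypothetical stubs standing in for the real preprocessing and model call.
def preprocess(batch):
    return [item.lower() for item in batch]

def infer(batch):
    return [f"label_{item}" for item in batch]

def classify_batch(batch):
    """One function combining every step, suitable for mapping over batches."""
    return infer(preprocess(batch))

batches = [["A", "B"], ["C"]]
results = [classify_batch(b) for b in batches]
# On the cluster, the same call becomes:
#   futures = client.map(classify_batch, batches)
#   results = client.gather(futures)
```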

Step 4: Runtime and Model Evaluation

Finally, we will prepare our label set for ResNet so that we can interpret the predictions based on the corresponding classes. We will first execute preprocessing on the cluster, followed by our inference workflow, mapped across all data batches. Upon evaluating the results, we find:

  • Number of images analyzed: 20,580
  • Number of images correctly classified: 13,806
  • Percentage of correct classifications: 67.085 percent
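The correctness check that produces these numbers can be sketched with regular expressions, as mentioned in Step 2. The normalization rule and sample labels below are illustrative assumptions, not the exact logic from the workload:

```python
import re

def label_matches(pred: str, truth: str) -> bool:
    """Count a prediction as correct if the ground-truth label appears
    as a whole phrase, ignoring case and separator differences
    (e.g. 'Border_collie' matches 'border collie')."""
    norm = lambda s: re.sub(r"[\s_-]+", " ", s).strip().lower()
    return re.search(rf"\b{re.escape(norm(truth))}\b", norm(pred)) is not None

preds = ["Border_collie", "beagle", "Siberian husky"]
truths = ["border collie", "basset", "siberian husky"]

correct = sum(label_matches(p, t) for p, t in zip(preds, truths))
accuracy = 100 * correct / len(truths)
```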

Assessing Cost Efficiency

In this instance, Learning Innovations successfully classified over 20,000 images in approximately five minutes. Let’s examine how various approaches to this problem compare regarding cost efficiency.

As illustrated in the table below, our Spot GPU cluster achieved 38 times the speed at merely 5% of the cost of our single-node test. This improvement in speed and cost-effectiveness is attributed to the transition from single-node, serial processing to multi-node, multiprocessing.
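The comparison itself is simple arithmetic. The dollar figures and single-node runtime below are assumed placeholders chosen to be consistent with the 38x speedup and 5% cost figures above; the real values come from the benchmark table.

```python
# Hypothetical inputs consistent with the reported results.
single_node_runtime_s = 11_400   # assumed: ~190 minutes on one node
cluster_runtime_s = 300          # ~5 minutes on the Spot GPU cluster

speedup = single_node_runtime_s / cluster_runtime_s

single_node_cost = 10.00         # assumed On-Demand cost, USD
cluster_cost = 0.50              # assumed Spot cluster cost, USD

cost_ratio_pct = 100 * cluster_cost / single_node_cost
```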
