As the field of deep learning (DL) continues to evolve rapidly, practitioners are constantly exploring innovative models and methods to enhance their performance. One effective approach to achieve this is through the use of custom operators, which allow developers to extend the capabilities of existing machine learning (ML) frameworks such as PyTorch. An operator defines the mathematical function of a layer within a deep learning model, while a custom operator enables the creation of unique mathematical functions tailored to specific needs.
AWS Trainium and AWS Inferentia2 are purpose-built accelerators for DL training and inference, and both can be extended with custom operators (often referred to as CustomOps). The AWS Neuron SDK, which supports these accelerators, integrates seamlessly with the standard PyTorch interface for CustomOps, so developers can easily adapt their existing code when using Trainium-based Amazon EC2 Trn1 instances or Inferentia2-based Amazon EC2 Inf2 instances. In this article, we discuss the advantages of CustomOps, explain how to implement them efficiently on Trainium, and walk through practical examples to help you get started on Trainium-powered Trn1 instances.
A basic understanding of core AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2), is assumed, along with some familiarity with deep learning, PyTorch, and C++.
Benefits of Custom Operators in PyTorch
CustomOps were introduced in PyTorch version 1.10 through the PyTorch C++ frontend, which provides a straightforward way to register CustomOps written in C++. The benefits of CustomOps include:
- Performance Optimization: CustomOps can be tailored for specific scenarios, resulting in quicker model executions and enhanced performance.
- Enhanced Model Expressiveness: With CustomOps, developers can articulate complex computations that may be cumbersome to express with the built-in PyTorch operators.
- Modularity: CustomOps can serve as building blocks for more intricate models, facilitating a modular development process and making it easier to experiment rapidly.
- Flexibility: They offer the ability to define complex operations beyond the scope of built-in operators, providing a flexible means of extending functionality.
Support for Custom Operators on Trainium
Trainium and AWS Inferentia2 support CustomOps through the Neuron SDK, which accelerates them at the hardware level via the GPSIMD engine (General Purpose Single Instruction Multiple Data engine). Let’s explore how these technologies enable effective CustomOps implementation and enhance flexibility and performance in DL model development.
Neuron SDK
The Neuron SDK aids developers in training models on Trainium and deploying them on AWS Inferentia accelerators. It integrates natively with popular frameworks like PyTorch and TensorFlow, allowing users to continue utilizing their existing workflows and application code for training on Trn1 instances.
Using the standard PyTorch interface for CustomOps, developers can write CustomOps in C++ and extend the operator support provided by Neuron. The Neuron SDK compiles these CustomOps for efficient execution on the GPSIMD engine. As a result, developers can experiment with new CustomOps and optimize them on dedicated hardware without needing extensive knowledge of the underlying architecture.
General Purpose Single Instruction Multiple Data Engine
Central to Trainium’s optimizations is the NeuronCore architecture, which comprises four primary engines: tensor, vector, scalar, and the GPSIMD engine. The scalar and vector engines excel at parallel processing and floating-point operations, while the tensor engine is designed for power-efficient, mixed-precision computation.
The GPSIMD engine is specifically crafted for running and accelerating CustomOps. This engine features eight fully programmable 512-bit processors capable of executing straightforward C code and accessing other NeuronCore-v2 engines and memory directly. These attributes enable efficient execution of CustomOps on Trainium.
For instance, operators like TopK, LayerNorm, or ZeroCompression perform only a few ALU calculations per element read from memory, so on traditional CPU systems they are bounded by memory bandwidth. In contrast, the GPSIMD engines on Trainium are tightly coupled to the on-chip caches through a high-bandwidth streaming interface that can sustain 2 TB/sec of memory bandwidth, allowing such CustomOps to execute rapidly.
Implementing Neuron SDK Custom Operators
For this discussion, we assume that a DLAMI is being used to launch an EC2 Trn1 instance (either trn1.2xlarge or trn1.32xlarge). All necessary tools, drivers, and software are already installed on the DLAMI, and you only need to activate the Python environment to begin working with the tutorial. The CustomOps functionality within Neuron is referred to as “Neuron CustomOps.”
As with standard PyTorch C++ integration, Neuron CustomOps require a C++ implementation of the operator using a NeuronCore-ported subset of the Torch C++ API. This implementation is called the kernel function, and the C++ API port includes everything needed for CustomOps development and model integration, including tensor and scalar classes in the c10 namespace and a subset of ATen operators.
To define the kernel function, you must include the torch.h header to access the NeuronCore-ported subset of the PyTorch C++ API:
#include <torch/torch.h>
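As an illustration, the following is a minimal sketch of a kernel function for a hypothetical element-wise ReLU operator (the operator name, the one-dimensional float input, and the file layout are assumptions for this example); it could serve as the contents of a compute source file such as CustomOP.cpp:

#include <torch/torch.h>

// Kernel function: computes ReLU element-wise on a 1-D float tensor
torch::Tensor relu_forward(const torch::Tensor& t_in) {
    // Allocate an output tensor of the same shape as the input
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    // Accessors provide element-wise access to the tensor data
    auto t_in_acc = t_in.accessor<float, 1>();
    auto t_out_acc = t_out.accessor<float, 1>();
    auto shape = t_in.sizes();
    for (int i = 0; i < shape[0]; i++) {
        t_out_acc[i] = t_in_acc[i] > 0.0f ? t_in_acc[i] : 0.0f;
    }
    return t_out;
}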
Neuron CustomOps also require a shape function, which has the same signature as the kernel function but performs no computation; it only defines the shape of the output tensor, not its values.
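Continuing the hypothetical ReLU example, a shape function might look like the following sketch; it allocates an output tensor of the correct shape and type but never computes the values:

#include <torch/torch.h>

// Shape function: same signature as the kernel function, but only the
// output tensor's shape and dtype matter; no values are computed here
torch::Tensor relu_forward_shape(torch::Tensor t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    return t_out;
}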
Neuron CustomOps are organized into libraries, and macros within the NEURON_LIBRARY scope are used to register them from the shape function's source file. The shape function executes on the host at compilation time, and registration requires the register.h header from the torchneuron library:
#include "torchneuron/register.h"
Finally, the custom library is created by invoking the load API. If a build_directory parameter is provided, the library file will be stored in the specified directory:
import os

import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name=name,  # name of the library (e.g., 'relu')
    compute_srcs=['CustomOP.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd()
)
To use the CustomOp within a PyTorch model, load the library with the load_library API and call the Neuron CustomOp the same way you would call a traditional CustomOp in PyTorch, through the torch.ops namespace. The format is typically torch.ops.<library_name>.<operator_name>.
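Assuming the hypothetical my_ops library and relu_forward operator from the sketches above, loading the built library and calling the CustomOp from Python might look like this (the library path is illustrative):

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

# Load the compiled CustomOp library (path is illustrative)
custom_op.load_library('libmy_ops.so')

# Call the Neuron CustomOp through the torch.ops namespace
t_in = torch.rand(512) - 0.5
t_out = torch.ops.my_ops.relu_forward(t_in)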