In the current landscape, organizations are confronted with a myriad of data forms, including structured datasets housed in relational database management systems (RDBMS) and enterprise resource planning (ERP) software, semi-structured datasets such as web logs, and unstructured datasets like images and videos. This diversity compels larger enterprises to implement a data lake architecture, which serves as a centralized repository for various data formats. Such an approach facilitates a wide array of analytics, big data processing, real-time analytics, and machine learning, ultimately yielding more profound insights.
Amazon Web Services (AWS) offers a secure, scalable, and cost-effective suite of services that enable businesses to build data lakes in the cloud and analyze all of their data, including data from Internet of Things (IoT) devices, using a variety of analytical approaches such as machine learning.
Despite these advancements, many organizations continue to store crucial but infrequently accessed data in traditional commercial database systems such as Oracle and Microsoft SQL Server. A notable example is an audit management system, where audit data may be archived as BLOB or CLOB columns within an RDBMS.
This presents a significant opportunity for organizations to migrate their data securely to AWS, where it can be stored in Amazon Simple Storage Service (Amazon S3) via AWS Database Migration Service (DMS). By leveraging AWS Big Data Services, companies can perform data transformation and analytics, resulting in reduced operational and licensing costs.
Customer Success Story
Working with a prominent U.S. healthcare provider, AWS and FutureTech, a Premier Consulting Partner in the AWS Partner Network, identified that the customer was facing soaring operational costs for data management and was looking for ways to reduce these expenses.
Together, AWS and FutureTech identified the opportunity to migrate the customer's audit management system, which was lightly used and running on an on-premises Oracle database. A quick proof of concept (POC) was built to demonstrate data migration to Amazon S3, the central storage component of an AWS data lake, using AWS Database Migration Service.
Data transformation was performed with AWS Glue, and reports were generated through the customer's existing business intelligence (BI) tools using the Amazon Athena ODBC driver. The success of the POC led to the start of the full migration of the audit management system to the AWS data lake solution.
In this article, we outline a strategy for migrating similar datasets to Amazon S3 using DMS, and for applying AWS analytics services for performance measurement and reporting. We also walk through the migration process step by step and share best practices for a successful execution.
Crafting a Cost-Optimized Solution
A proven strategy for companies embracing cloud technologies is to identify critical but seldom-used datasets and assess their business impact before initiating the migration. Here’s how we constructed a cost-optimized solution for the healthcare provider:
Data Migration
The customer's audit data, originally stored in an Oracle database, was migrated to Amazon S3 using AWS Database Migration Service. To speed up the migration, multiple DMS tasks were run in parallel.
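As a minimal sketch, here is how one of those parallel full-load tasks might be created and started with the AWS SDK for Python (boto3). The ARNs, task identifier, and the AUDIT schema name are placeholders, not the customer's actual resources:

```python
import boto3

# Placeholder region and ARNs; the source and target endpoints and the
# replication instance are assumed to exist already.
dms = boto3.client("dms", region_name="us-east-1")

# Include every table in a hypothetical AUDIT schema.
table_mappings = """{
  "rules": [{
    "rule-type": "selection",
    "rule-id": "1",
    "rule-name": "include-audit-tables",
    "object-locator": {"schema-name": "AUDIT", "table-name": "%"},
    "rule-action": "include"
  }]
}"""

task = dms.create_replication_task(
    ReplicationTaskIdentifier="audit-to-s3-task-1",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=table_mappings,
)

# Start the task once DMS reports it as ready.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```

Running several such tasks, each scoped to a different set of tables or key ranges, is what allows the transfer to proceed in parallel.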
Data Discovery
AWS Glue streamlined the often-laborious process of data discovery by crawling the data sources and building a data catalog, the AWS Glue Data Catalog. Metadata was organized as tables within the catalog and used when authoring the extract, transform, load (ETL) jobs.
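A minimal sketch of that discovery step with boto3 is shown below; the crawler name, IAM role, database, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at the S3 prefix that DMS wrote to; discovered tables
# land in the AWS Glue Data Catalog under the given database.
glue.create_crawler(
    Name="audit-data-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="audit_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/audit/"}]},
)

glue.start_crawler(Name="audit-data-crawler")
```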
Extract, Transform, Load (ETL)
AWS Glue was also used to author and schedule two ETL jobs: one to Base64-encode CLOB and multi-line data, and another to partition the data and convert it to Apache Parquet format. These steps improved cost efficiency while making the data queryable through a variety of analytical tools.
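The sketch below illustrates the shape of such a job as a Glue PySpark script, assuming a hypothetical audit_trail table with a CLOB column audit_note and a partition column audit_year; it is not the customer's actual job:

```python
import base64

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table that the crawler registered in the Data Catalog.
df = glue_context.create_dynamic_frame.from_catalog(
    database="audit_db", table_name="audit_trail"
).toDF()

# Base64-encode the free-text column so embedded newlines and delimiters
# cannot break downstream consumers; decode at query time in Athena.
to_b64 = udf(
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii")
    if s is not None else None,
    StringType(),
)
df = df.withColumn("audit_note", to_b64(df["audit_note"]))

# Partition by year and write Parquet so Athena scans only what it needs.
(df.write.mode("overwrite")
    .partitionBy("audit_year")
    .parquet("s3://example-data-lake/curated/audit_trail/"))
```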
Analytics
Amazon Athena allows users to analyze data in S3 using standard SQL. Storing the data in Parquet format significantly reduces both analysis time and query costs. Tools such as IBM Cognos or SAP BusinessObjects can use the Athena ODBC driver to visualize and explore the audit data.
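As a sketch of what such a query can look like, the snippet below runs an Athena query through boto3 and decodes the Base64-encoded column inline with Presto's from_base64 and from_utf8 functions; the table, column, and bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Decode the Base64-encoded CLOB column at query time.
query = """
SELECT audit_id,
       from_utf8(from_base64(audit_note)) AS audit_note
FROM audit_db.audit_trail
WHERE audit_year = '2018'
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={
        "OutputLocation": "s3://example-data-lake/athena-results/"
    },
)
print("Query execution ID:", execution["QueryExecutionId"])
```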
Here is the architecture of the cost-optimized solution for RDBMS data:
In this data lake approach, Amazon S3 serves as the central repository for data in any format. The architecture uses DMS to migrate datasets from the on-premises database to S3. The data is cataloged with an AWS Glue crawler, and AWS Glue ETL jobs cleanse it by Base64-encoding CLOB and multi-line data.
Migration Strategy
The following diagram illustrates the migration strategy we employed.
Lessons Learned from the Migration
Migrating data from relational databases to Amazon S3 requires careful organization of the data in the target system. Here are some best practices we learned from this experience, which you can follow for fast, cost-effective data retrieval.
Data Migration Strategy
The strategy is a pivotal element in migrating data to S3 using DMS. Key components include:
- Classifying tables based on their size.
- Identifying columns with multi-line or CLOB data so they can be Base64-encoded during ETL.
- Pinpointing partition keys for the target based on common queries or reporting needs.
- Determining the number of parallel DMS tasks needed for the data transfer (see the table-mapping sketch after this list).
- Establishing a validation strategy post-data migration.
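To make the parallelization point concrete, here is a hedged sketch of a DMS table mapping that restricts one task to a slice of a large table, so that several tasks can each load a disjoint AUDIT_ID range. The schema, table, and column names are hypothetical:

```python
import json

# One slice of a hypothetical large AUDIT.AUDIT_TRAIL table; sibling tasks
# would use different AUDIT_ID ranges.
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "audit-trail-slice-1",
        "object-locator": {"schema-name": "AUDIT", "table-name": "AUDIT_TRAIL"},
        "rule-action": "include",
        "filters": [{
            "filter-type": "source",
            "column-name": "AUDIT_ID",
            "filter-conditions": [{
                "filter-operator": "between",
                "start-value": "1",
                "end-value": "1000000"
            }]
        }]
    }]
})

# Pass table_mappings as the TableMappings argument of create_replication_task.
```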
Data Transformation
Data transformation is essential for enhancing query performance. Key aspects include:
- Utilizing AWS Glue for data partitioning and storing it in S3 in Parquet format.
- Implementing ETL processes that Base64-encode multi-line or CLOB data so it can be queried reliably through Amazon Athena. These columns must then be decoded at query time in Athena or other SQL tools, as in the Athena sketch above.
Using the AWS Total Cost of Ownership (TCO) Calculator, we estimated the monthly cost for one server with two CPU cores and 32 GB of memory to be around $4,800, excluding database licensing and operational costs. The projected annual expenses, encompassing infrastructure, software, and operational costs, totaled $71,000.
After implementing the solution proposed by FutureTech, the monthly costs for the customer plummeted to just a few hundred dollars. This figure includes expenses related to operating the data lake and conducting analytics with AWS Big Data services like Amazon Athena.