Update (7/26/2024): There’s no longer a need to optimize the S3 Inventory report using Amazon Athena. Amazon S3 now automatically enhances your S3 Batch Operations restore job for optimal retrieval throughput. For further details on how to utilize batch operations effectively, check out the S3 User Guide.
Data archiving is a universal necessity for organizations globally. This requirement extends beyond long-established companies to include digital-native enterprises as well. Workloads such as medical records, news media, and manufacturing datasets often involve storing vast amounts of data—sometimes even petabytes—indefinitely. The majority of the world’s data is cold, rarely accessed, and millions of customers opt to archive this crucial data within Amazon S3.
Within Amazon S3, users can select from three archive storage classes, each optimized for varying access patterns and durations. For instance, for archived data requiring immediate access—like medical images or media assets—the Amazon S3 Glacier Instant Retrieval class offers low-cost storage with retrieval times measured in milliseconds. Conversely, for data that does not necessitate immediate access—such as backups or disaster recovery—Amazon S3 Glacier Flexible Retrieval provides three retrieval options: expedited retrievals in 1-5 minutes, standard retrievals within 3-5 hours, and free bulk retrievals over 5-12 hours. For long-term archives, like compliance records or digital media preservation, Amazon S3 Glacier Deep Archive presents the most economical cloud storage solution, with standard retrievals taking up to 12 hours and bulk retrievals up to 48 hours.
Businesses frequently use the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes to archive extensive data at minimal expense. These classes are employed for storing backups, data lakes, media assets, and various archives. Often, these customers need to quickly retrieve millions or even billions of objects—whether restoring backups, addressing audit requests, retraining machine learning models, or conducting analyses on historical data.
To speed up the restoration of archived data, the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes now support up to a tenfold increase in restore request rates. When retrieving data from these classes, you can now issue archived-object restore requests at a rate of up to 1,000 transactions per second (TPS) per AWS account in each AWS Region. This increased restore rate allows applications to initiate requests much more quickly, significantly decreasing the time needed to complete restores of datasets with numerous small objects. The advantages of this improved restore request rate become more pronounced as the number of requests increases.
In this article, we will explore best practices for optimizing, streamlining, and simplifying the restoration of large datasets from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes, utilizing Amazon S3 Batch Operations, Amazon S3 Inventory, and Amazon Athena.
Restoring Large Numbers of Objects with S3 Batch Operations
S3 Batch Operations is a fully managed solution for executing batch actions across billions of objects and petabytes of data with one request. This service automatically applies up to 1,000 TPS when restoring objects from S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, utilizing standard and bulk retrieval options. S3 Batch Operations also handles retries, monitors progress, generates reports, and provides event notifications to AWS CloudTrail, ensuring a fully managed, auditable, and serverless experience.
In this blog, we adopt the following naming convention for AWS resources:
- 111122223333 for the AWS account number
- archive-bucket for the S3 bucket containing the archived dataset
- inventory-bucket for the S3 bucket housing the inventory reports
- athena-bucket for the S3 bucket storing Amazon Athena query results
- reports-bucket for the S3 bucket that holds S3 Batch Operations completion reports
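Once a manifest of objects to restore is available (we generate one from the S3 Inventory report in the following sections), the batch restore job itself can be created with a single AWS CLI call. The following is a minimal sketch using the resource names above; the manifest key, its ETag (retrievable with aws s3api head-object), the IAM role name, and the report prefix are illustrative placeholders to replace with your own values:

aws s3control create-job \
    --account-id 111122223333 \
    --operation '{"S3InitiateRestoreObject": {"ExpirationInDays": 7, "GlacierJobTier": "BULK"}}' \
    --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::inventory-bucket/manifest.csv", "ETag": "<manifest-etag>"}}' \
    --report '{"Bucket": "arn:aws:s3:::reports-bucket", "Format": "Report_CSV_20180820", "Enabled": true, "Prefix": "restore-job", "ReportScope": "FailedTasksOnly"}' \
    --priority 10 \
    --role-arn arn:aws:iam::111122223333:role/batch-restore-role \
    --no-confirmation-required

Note that S3InitiateRestoreObject accepts a GlacierJobTier of BULK or STANDARD; S3 Batch Operations does not support expedited retrievals.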
Configuring the Amazon S3 Inventory
Using Amazon S3 Inventory, you can create a list of the objects needed for the restore job. S3 Inventory reports provide daily or weekly lists of objects and their corresponding metadata for an S3 bucket or a specified prefix. Before setting up the inventory report, you must create a bucket policy for the inventory-bucket that permits S3 Inventory report delivery, using the following AWS CLI command:
aws s3api put-bucket-policy \
    --bucket inventory-bucket \
    --policy file://policy.json
Replace the bucket names and AWS account number with your actual values.
policy.json:
{
    "Version": "2012-10-17",
    "Id": "S3-Console-Auto-Gen-Policy-1656541301560",
    "Statement": [
        {
            "Sid": "InventoryPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::inventory-bucket/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceAccount": "111122223333"
                },
                "ArnLike": {
                    "aws:SourceArn": [
                        "arn:aws:s3:::archive-bucket"
                    ]
                }
            }
        }
    ]
}
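You can verify that the policy was applied by retrieving it:

aws s3api get-bucket-policy \
    --bucket inventory-bucket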
Next, configure the inventory for the archive-bucket that contains the dataset to be restored by running the following CLI command. Again, remember to replace the bucket names and AWS account number with your actual values.
aws s3api put-bucket-inventory-configuration \
    --bucket archive-bucket \
    --id inventory_for_restore \
    --inventory-configuration file://inventory-configuration.json
inventory-configuration.json:
{
    "Destination": {
        "S3BucketDestination": {
            "AccountId": "111122223333",
            "Bucket": "arn:aws:s3:::inventory-bucket",
            "Format": "CSV"
        }
    },
    "IsEnabled": true,
    "Id": "inventory_for_restore",
    "IncludedObjectVersions": "Current",
    "Schedule": {
        "Frequency": "Daily"
    },
    "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"]
}
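To confirm the inventory configuration is in place, list the configurations on the bucket:

aws s3api list-bucket-inventory-configurations \
    --bucket archive-bucket

Note that Amazon S3 can take up to 48 hours to deliver the first inventory report.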
Optimizing the S3 Inventory Report using Amazon Athena
When restoring extensive datasets with millions, or even billions, of archived objects, it's beneficial to issue restore requests in the same order that the objects were archived in the S3 Glacier storage classes. Objects are typically archived to S3 Glacier based on Amazon S3 Lifecycle transition rules tied to their creation date. Therefore, we can order the restore requests by the objects' creation dates using the LastModifiedDate metadata included in the inventory report.
In this section, we outline the steps to order the inventory report by LastModifiedDate and generate a manifest file for creating the batch restore job in S3 Batch Operations. First, import the inventory report into Amazon Athena and run two straightforward SQL statements: one registers the inventory as a table, and one filters for only the objects in the S3 Glacier Flexible Retrieval storage class and sorts them by LastModifiedDate to enhance the efficiency of the batch restore operation. (Note: while this walkthrough focuses on S3 Glacier Flexible Retrieval, the same process applies to S3 Glacier Deep Archive.)
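Those two statements might look like the following sketch. It assumes the CSV inventory is queried through the Hive-style symlink location that S3 Inventory writes alongside the report, pointing at a single day's delivery; the dt= date, table name, column list, and output path are illustrative placeholders, and the columns must match your inventory's configured fields:

-- Register the inventory report as an Athena table
CREATE EXTERNAL TABLE inventory_for_restore (
    bucket string,
    key string,
    size bigint,
    last_modified_date string,
    storage_class string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://inventory-bucket/archive-bucket/inventory_for_restore/hive/dt=2024-01-01-00-00/';

-- Filter to S3 Glacier Flexible Retrieval objects, sort by creation date,
-- and write a plain (uncompressed) CSV manifest for S3 Batch Operations
CREATE TABLE sorted_manifest
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    write_compression = 'NONE',
    external_location = 's3://athena-bucket/sorted-manifest/'
) AS
SELECT bucket, key
FROM inventory_for_restore
WHERE storage_class = 'GLACIER'
ORDER BY last_modified_date;

Because the LastModifiedDate values in the inventory are ISO-8601 timestamps, sorting them as strings orders the objects by creation date. The CSV files written to s3://athena-bucket/sorted-manifest/ can then be supplied as the Bucket,Key manifest when creating the S3 Batch Operations restore job.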