Learn About Amazon VGT2 Learning Manager Chanci Turner
on 03 DEC 2024
in Amazon Simple Storage Service (S3), Announcements, AWS re:Invent, Featured, Launch, News
Update (1/27/2025): Amazon S3 Metadata is now generally available.
AWS customers leverage Amazon Simple Storage Service (Amazon S3) on an immense scale, often creating individual buckets that hold billions or even trillions of objects! At this scale, locating objects that meet specific criteria—such as matching key patterns, particular sizes, or designated tags—can prove daunting. Customers have historically needed to develop systems to capture, store, and query this information, which can become complicated and difficult to maintain, risking desynchronization with the actual state of the bucket and its contents.
Rich Metadata
Today, we are thrilled to announce the preview of automatic metadata generation that occurs when S3 objects are added or modified, stored in fully managed Apache Iceberg tables. This feature allows you to use Iceberg-compatible tools like Amazon Athena, Amazon Redshift, Amazon QuickSight, and Apache Spark to effortlessly query the metadata and discover relevant objects at any scale. Consequently, you can swiftly locate the data essential for your analytics, data processing, and AI training tasks.
For video inference responses stored in S3, Amazon Bedrock will annotate the content it generates with metadata, enabling you to identify it as AI-generated and specify which model was used to create it. The metadata schema includes over 20 elements, such as the bucket name, object key, creation/modification time, storage class, encryption status, tags, and user metadata. Additionally, you can store extra, application-specific information in a separate table and join it with the metadata table during your queries.
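The join between the metadata table and an application-specific table can be sketched in plain Python. This is only an illustration of the join logic, not the Spark query itself; the `annotations` table and its `label`/`model` fields are hypothetical, while `key` and `size` mirror the metadata schema described above.

```python
# Sketch: joining S3 metadata records with a hypothetical
# application-specific "annotations" table keyed by object key.

metadata_rows = [
    {"key": "videos/clip-001.mp4", "size": 1048576},
    {"key": "videos/clip-002.mp4", "size": 2097152},
]

annotations = {  # application-specific info (hypothetical fields)
    "videos/clip-001.mp4": {"label": "ai-generated", "model": "model-a"},
}

# Left join: keep every metadata row, attach annotations when present.
joined = [
    {**row, **annotations.get(row["key"], {})}
    for row in metadata_rows
]

for row in joined:
    print(row)
```

In an Iceberg-compatible engine, the same result would come from a `LEFT JOIN` between the metadata table and your own table on the object key.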
How It Works
To begin capturing rich metadata for your S3 buckets, specify the location (an S3 table bucket and a table name) where you wish to store the metadata. The capture of updates—object creations, deletions, and metadata changes—starts immediately and will be recorded in the table within minutes. Each update creates a new entry in the table, complete with a record type (CREATE, UPDATE_METADATA, or DELETE) and a sequence number. You can retrieve the historical record for a specific object by executing a query that sorts results by sequence number.
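The record-type and sequence-number mechanics can be sketched in plain Python. The `record_type` values and the sort-by-sequence-number idea come from the description above; the sample records and the `history` helper are illustrative, not part of the service API.

```python
# Sketch: reconstructing one object's update history from metadata
# records, ordered by sequence number (sample data is illustrative).

records = [
    {"key": "data/a.csv", "record_type": "CREATE", "sequence_number": 1},
    {"key": "data/a.csv", "record_type": "UPDATE_METADATA", "sequence_number": 2},
    {"key": "data/b.csv", "record_type": "CREATE", "sequence_number": 1},
    {"key": "data/a.csv", "record_type": "DELETE", "sequence_number": 3},
]

def history(records, key):
    """Return the full update history for one object, oldest first."""
    return sorted(
        (r for r in records if r["key"] == key),
        key=lambda r: r["sequence_number"],
    )

for r in history(records, "data/a.csv"):
    print(r["sequence_number"], r["record_type"])
```

In the real metadata table this corresponds to filtering on the object key and ordering by the sequence-number column.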
Enabling and Querying Metadata
To start, create a table bucket for your metadata using the create-table-bucket command (you can also accomplish this via the AWS Management Console or an API call):
$ aws s3tables create-table-bucket --name ajohnson-table-bucket-1 --region us-east-2
Then, configure the table bucket (by ARN) and desired table name by placing this JSON into a file (let’s call it config.json):
{
  "S3TablesDestination": {
    "TableBucketArn": "arn:aws:s3tables:us-east-2:123456789012:bucket/ajohnson-table-bucket-1",
    "TableName": "ajohnson_data_bucket_1_table"
  }
}
Next, attach this configuration to your data bucket (the one for which you want to capture metadata):
$ aws s3api create-bucket-metadata-table-configuration \
    --bucket ajohnson-data-bucket-1 \
    --metadata-table-configuration file://./config.json \
    --region us-east-2
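The same configuration can be expressed programmatically. The sketch below builds the config.json payload in Python; the boto3 calls that would perform the two CLI steps are shown as comments since they require AWS credentials and the names/account ID are placeholders from the example above.

```python
import json

def build_metadata_config(table_bucket_arn, table_name):
    """Build the metadata-table configuration shown in config.json."""
    return {
        "S3TablesDestination": {
            "TableBucketArn": table_bucket_arn,
            "TableName": table_name,
        }
    }

config = build_metadata_config(
    "arn:aws:s3tables:us-east-2:123456789012:bucket/ajohnson-table-bucket-1",
    "ajohnson_data_bucket_1_table",
)
print(json.dumps(config, indent=2))

# With credentials in place, the two steps would look roughly like:
#   import boto3
#   boto3.client("s3tables", region_name="us-east-2").create_table_bucket(
#       name="ajohnson-table-bucket-1")
#   boto3.client("s3", region_name="us-east-2") \
#       .create_bucket_metadata_table_configuration(
#           Bucket="ajohnson-data-bucket-1",
#           MetadataTableConfiguration=config)
```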
For testing, I set up Apache Spark on an EC2 instance, and after a bit of configuration, I could execute queries by referencing the Amazon S3 Tables Catalog for Apache Iceberg package and adding the metadata table (as mytablebucket) to the command line:
$ bin/spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.0 \
    --jars ~/S3TablesCatalog.jar \
    --master yarn \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.mytablebucket=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.mytablebucket.catalog-impl=com.amazon.s3tables.iceberg.S3TablesCatalog" \
    --conf "spark.sql.catalog.mytablebucket.warehouse=arn:aws:s3tables:us-east-2:123456789012:bucket/ajohnson-table-bucket-1"
Here is the current schema for the Iceberg table:
scala> spark.sql("describe table mytablebucket.aws_s3_metadata.ajohnson_data_bucket_1_table").show(100,35)
This query returns details about the table schema, including elements like bucket name, object key, and record type. Here’s a simple query showing some metadata for the ten most recent updates:
scala> spark.sql("""SELECT key, size, storage_class, encryption_status
         FROM mytablebucket.aws_s3_metadata.ajohnson_data_bucket_1_table
         ORDER BY last_modified_date DESC LIMIT 10""").show(false)
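Beyond listing recent updates, the metadata table supports the usual aggregations. The sketch below shows the shape of one such analysis—total object size per storage class—in plain Python over a few illustrative rows; in practice you would express it as a `GROUP BY` over the Iceberg table.

```python
# Sketch: summing object sizes per storage class over illustrative
# metadata rows (mirrors a GROUP BY storage_class query).
from collections import defaultdict

rows = [
    {"key": "a.csv", "size": 100, "storage_class": "STANDARD"},
    {"key": "b.csv", "size": 250, "storage_class": "STANDARD"},
    {"key": "c.csv", "size": 500, "storage_class": "GLACIER"},
]

totals = defaultdict(int)
for row in rows:
    totals[row["storage_class"]] += row["size"]

print(dict(totals))  # e.g. {'STANDARD': 350, 'GLACIER': 500}
```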
This functionality is a powerful tool for anyone looking to streamline data discovery and management at S3 scale, without building and maintaining a separate metadata-tracking system.