Reducing Amazon DynamoDB Scan Latency Through Schema Design


In this article, we explore how the design of Amazon DynamoDB tables can significantly impact scan performance and provide strategies for improving scanning efficiency. Amazon DynamoDB is a NoSQL database that supports a flexible schema, allowing items within the same table to have varying attributes.

Typically, DynamoDB schemas and access patterns are structured to optimize the GetItem and Query operations, which deliver consistent response times in the single-digit millisecond range for fetching individual items. However, certain scenarios necessitate scanning entire tables or indexes.

Overview

In a database with a flexible schema, each item returned by a scan carries not only the actual data but also metadata: the attribute names and their data types. The more attributes an item has, the more work the client must do, because each attribute in the network response must be converted into the language's native representation, such as a Python dictionary, a Node.js map, or a Java object.
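To make this overhead concrete, here is a minimal sketch of DynamoDB's low-level wire format and the per-attribute conversion a client SDK performs. It handles string attributes only, and the attribute names are illustrative:

```python
# A single item as it appears in DynamoDB's low-level JSON wire format:
# every value is wrapped in a type descriptor ("S" = string), and every
# attribute name travels with every item in the response.
wire_item = {
    "pk": {"S": "user#123"},
    "sk": {"S": "profile"},
    "field1": {"S": "a 144-character random string..."},
}

def unmarshal(item):
    """Minimal unmarshaler for string attributes only -- a sketch of the
    per-attribute conversion a client SDK performs on each scanned item."""
    return {name: value["S"] for name, value in item.items()}

plain_item = unmarshal(wire_item)  # ordinary Python dict
```

Every attribute in every item goes through a conversion like this, which is why attribute count matters so much for scan cost.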

The presence of this attribute metadata consumes space, resulting in fewer items fitting within DynamoDB’s 1-MB response limit. Consequently, scanning data may require additional round trips.
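Because scans page at the 1-MB boundary, clients typically loop on the LastEvaluatedKey returned with each page. A minimal sketch, assuming a boto3-style table object with a scan() method:

```python
def scan_all(table):
    """Scan an entire table, following DynamoDB's 1-MB pagination:
    each response may carry a LastEvaluatedKey that seeds the next page."""
    items = []
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

Fewer items fitting in each 1-MB page means more iterations of this loop, which is exactly the extra round-trip cost described above.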

Methodology

We established a table with a straightforward structure comprising a primary key that includes both a partition key and a sort key (both of which are strings), along with a third string attribute named field1, containing a 144-character random string.

We also constructed other tables with different combinations of 7-character attribute names (from field01 to field24) paired with 3- and 6-character attribute values while maintaining the same primary key structure as the first table.

Due to the flexible schema nature of NoSQL databases, the attribute names must be stored with each item. Therefore, as the number of attributes increases or the attribute names lengthen, more space is consumed. Additionally, we created another table with 24 attributes, each having the same 7-character names and 100-character values.
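Test items like these can be generated with a small helper. This sketch assumes the partition and sort key attributes are named pk and sk (the article does not name them); the field01..fieldNN naming matches the tables described above:

```python
import random
import string

def make_item(pk, sk, num_attrs, value_len):
    """Build one benchmark-style item: the two key attributes plus
    num_attrs extra attributes named field01..fieldNN, each holding a
    random string of value_len characters."""
    item = {"pk": pk, "sk": sk}
    for i in range(1, num_attrs + 1):
        item[f"field{i:02d}"] = "".join(
            random.choices(string.ascii_letters + string.digits, k=value_len)
        )
    return item
```

For example, make_item("p1", "s1", 24, 6) reproduces the shape of the 24-attribute, 6-character-value table.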

For each table structure, we measured:

  • The time taken to insert 10,000 items.
  • The time required to scan 10,000 items.
  • The number of items fitting within the 1-MB limit (DynamoDB retrieves a maximum of 1 MB per request).
  • The duration to retrieve and convert those items on the client side.
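Each of these numbers is a client-side wall-clock measurement; one simple way to capture them is a timing wrapper like this sketch:

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds) -- a client-side
    wall-clock measurement of the kind reported in the results below."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Wrapping the write loop and the scan loop in timed_ms yields the per-table timings directly.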

Empirical Results

The following table summarizes our findings:

| Configuration | Total Size (MB) | Time to Write 10,000 Items (ms) | Time to Scan 10,000 Items (ms) | Single-Threaded Throughput (MB/s) | Time to Scan 1 MB (ms) | Items per 1-MB Scan |
| --- | --- | --- | --- | --- | --- | --- |
| 1 attribute (144-character value) | 2.1 | 5,057 | 569 | 3.7 | 238 | 4,969 |
| 24 attributes (3-character values) | 3.0 | 9,359 | 2,392 | 1.2 | 797 | 3,496 |
| 24 attributes (6-character values) | 3.7 | 9,928 | 2,391 | 1.5 | 682 | 2,819 |
| 24 attributes (100-character values) | 26.3 | 27,553 | 2,819 | 9.4 | 110 | 400 |

These results come from a benchmark using a Python client; other languages, such as Java and Node.js, show similar trends when handling many item attributes. All measurements were taken on the client side of a Python process; query times recorded in Amazon CloudWatch metrics do not include network transfer or client-side item conversion.

Note that the figures in the Single-Threaded Throughput column pertain to single-threaded scans; running parallel scans can significantly increase throughput.
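DynamoDB supports parallel scans natively through the Segment and TotalSegments scan parameters. A sketch, assuming a boto3-style table object:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_scan(table, total_segments=4):
    """Scan all segments of a table concurrently using DynamoDB's
    built-in parallel scan (Segment / TotalSegments parameters)."""
    def scan_segment(segment):
        items = []
        kwargs = {"Segment": segment, "TotalSegments": total_segments}
        while True:
            page = table.scan(**kwargs)
            items.extend(page["Items"])
            if "LastEvaluatedKey" not in page:
                return items
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        segments = list(pool.map(scan_segment, range(total_segments)))
    return [item for seg in segments for item in seg]
```

Each worker scans a disjoint slice of the table, so total throughput scales with the number of segments until the table's provisioned capacity or the client's bandwidth becomes the bottleneck.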

Analysis

The first two rows of the results table show that inserting items with more attributes takes nearly twice as long, while scanning them takes roughly four times longer. The slowdown comes from having to marshal each attribute on the server and unmarshal it on the client.

The last two rows show how the total size of the 10,000 items influences scan time. Both tables have 24 attributes with 7-character names, but row three uses 6-character values while row four uses 100-character values. Writing the larger data takes nearly three times longer, yet scanning the much larger items adds only about 18% to the scan time. We conclude that the number of attributes, and the marshaling and unmarshaling they require, is the primary driver of longer scan times. That said, it is not always practical to limit a table to just three attributes; additional attributes may be needed for indexing, filtering, and atomic increments.

Conclusion

The primary takeaway is to include only as many attributes as necessary for your database operations. To minimize the overhead associated with attribute-name metadata, consider consolidating additional data into a single attribute, perhaps formatted as a JSON blob. Furthermore, since attribute names consume both disk space and network throughput, it is advisable to use shorter attribute names.
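For example, rather than storing 24 separate attributes, the non-indexed data can be collapsed into one short-named attribute. The names pk, sk, and d here are illustrative:

```python
import json

# 24 top-level attributes: each name is stored and marshaled per item.
wide_item = {"pk": "user#123", "sk": "profile",
             "field01": "abc", "field02": "def"}  # ... through field24

# The same data consolidated into a single short-named JSON blob:
# attribute-name overhead drops to one name, and the SDK unmarshals
# one string instead of 24 separate attributes.
compact_item = {
    "pk": "user#123",
    "sk": "profile",
    "d": json.dumps({"field01": "abc", "field02": "def"}),
}

# The client restores the nested data when it needs it:
data = json.loads(compact_item["d"])
```

The trade-off is that attributes inside the blob can no longer be used directly for indexing, filtering, or atomic increments, as noted above.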


Chanci Turner