How Zalando Developed Its Data Lake on Amazon S3



Founded in 2008, Zalando has established itself as Europe’s premier online destination for fashion and lifestyle, boasting over 32 million active customers. As a lead data engineer at Zalando, I have been an integral part of the company’s cloud transformation journey. In this blog post, I will discuss how Amazon Simple Storage Service (Amazon S3) has become a pivotal element of our data infrastructure. I’ll start by outlining Zalando’s need for data insights, followed by the limitations of our previous technology stack, and then detail our decision to migrate to AWS and the construction of a data lake using Amazon S3. Finally, I will share how our utilization of Amazon S3 has evolved, from enabling employee access to data to optimizing storage costs through various Amazon S3 storage classes. I hope my experiences provide insights into the best practices we’ve developed to operate a data-driven enterprise at a multi-petabyte scale.

Zalando’s Technological Evolution

Back in 2015, Zalando was primarily a fashion retailer with an IT environment based on a large, on-premises monolith. The infrastructure, whether on the transactional or analytical side, was tightly integrated, and the complexity increased as more teams contributed their components. As Zalando aimed to transition from a mere online retailer to a comprehensive fashion platform, this shift necessitated scalability. After assessing our immediate and future business requirements, we concluded that moving to the cloud was the logical next step. We evaluated several cloud providers, ultimately selecting AWS due to its reliability, availability, and scalability. The extensive range of AWS services we could leverage in the future was also a significant factor. This transition led to breaking down our monolithic infrastructure into microservices, empowering teams to take full ownership of their part of the tech landscape within isolated development and operational environments.

The transformation of our tech infrastructure directly impacted our data landscape as well. Centralized databases that were once accessed by numerous components became decentralized backends, communicating through REST APIs. Our central data warehouse, previously connected directly to transactional data stores, faced challenges due to the decentralized nature of data production. To address these issues, we established a central team tasked with building a data lake at Zalando. The initial motivations for creating a data lake included establishing a centralized data archive in our new distributed environment and developing a distributed computing engine for the organization.

Once we removed the size constraints on our relational databases, we discovered that the company was generating significantly greater volumes of potentially valuable data. This realization prompted the need for a storage solution capable of accommodating the expanding data landscape. After reviewing the AWS service offerings, Amazon S3 emerged as the clear choice for the foundation of our new central data lake.

Our top priority was to integrate major data sources into the newly established data lake. By that point, we had already implemented a central event bus designed for service-to-service communication among distributed microservices. An archiving component was later added to this system to ensure that copies of all published messages were saved in the data lake. As key business processes began utilizing the event bus, its contents became increasingly valuable for analytical processing. We constructed an ingestion pipeline based on serverless components to meet the basic data preparation needs, such as reformatting and repartitioning.
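To give a sense of the kind of reformatting and repartitioning such a pipeline performs, here is a minimal sketch of a Lambda-style handler that maps each archived event-bus message to a date-partitioned object key. The field names (`occurred_at`, `event_type`, `event_id`), the dataset prefix, and the partition scheme are illustrative assumptions, not Zalando's actual implementation:

```python
import json
from datetime import datetime, timezone


def partition_key(message, dataset="event-bus-archive"):
    """Derive a date-partitioned S3 key from an event-bus message.

    Assumes each message carries an ISO-8601 'occurred_at' timestamp plus
    'event_type' and 'event_id' fields; all field names are hypothetical.
    """
    ts = datetime.fromisoformat(message["occurred_at"]).astimezone(timezone.utc)
    return (
        f"{dataset}/event_type={message['event_type']}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{message['event_id']}.json"
    )


def handler(event, context=None):
    """Lambda-style entry point: reformat each record and compute its target key."""
    results = []
    for record in event["records"]:
        message = json.loads(record["body"])
        results.append({"key": partition_key(message), "body": json.dumps(message)})
    return results
```

Partitioning by event type and date like this keeps downstream analytical queries from scanning the whole archive when they only need a slice of it.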

During the cloud transition, we continued to generate valuable datasets within our original data warehouse. For instance, the central sales logic dataset was crucial. A secondary pipeline was developed to ensure that data warehouse (DWH) datasets were also made accessible in the data lake. Additionally, we integrated web tracking data, which, due to its substantial raw data size, became highly valuable when merged with existing datasets. These three pipelines provided a steady influx of data into our initial data lake.

Utilizing S3 Service Features

As our data lake on Amazon S3 expanded, we encountered various scenarios that allowed us to leverage a multitude of features offered by S3. In this section, I will highlight the features we utilized at Zalando, their benefits, and applicable use cases.

Data Sharing and Access Management

Storing vast amounts of data is pointless if it isn’t being used. Thus, our primary challenge was ensuring effective data sharing. At Zalando, multiple teams operate their own AWS accounts, which raised the need for cross-account data sharing. Initially, we relied on bucket policies, which allowed us to define access permissions directly on our buckets. This method was effective for a small number of connections to other accounts. However, as demand for data access grew, managing our bucket policies became cumbersome, and we eventually reached the size limit of our policies.
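A cross-account bucket policy of the kind described above can be sketched as follows. The bucket name and account IDs are placeholders; the point to notice is that every consuming account must be listed in the one policy document attached to the bucket, which is why the approach stops scaling as consumers multiply:

```python
import json


def cross_account_read_policy(bucket, reader_account_ids):
    """Build a bucket policy granting read access to a list of AWS accounts.

    Each consumer appears as a principal in this single document, so the
    policy grows with every new account until it hits the size limit.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountRead",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in reader_account_ids]
                },
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)
```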

To address this need for a more scalable solution, we transitioned to using IAM roles. These grant the same kind of access as bucket policies but in a more isolated and scalable way. For each role, an inline policy specifies the permitted actions on certain resources, while the trust relationship with the consuming account defines who may assume the role. Teams could then assume the role and access the data as needed.
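The role-based setup boils down to two policy documents per role, sketched below with placeholder names and account IDs: a trust policy naming the consuming account as the principal allowed to assume the role, and an inline policy scoping the S3 actions to one dataset prefix. Because each role is its own resource, adding a consumer no longer grows a single shared bucket policy:

```python
def data_access_role_documents(bucket, prefix, consumer_account_id):
    """Return (trust_policy, inline_policy) for a per-dataset access role.

    Bucket, prefix, and account ID are illustrative placeholders.
    """
    # Who may assume the role: the consuming account.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    # What the role may do: list and read one dataset prefix only.
    inline_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }
    return trust_policy, inline_policy
```

A consuming team calls `sts:AssumeRole` against the role and reads the data with the returned temporary credentials.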

Data Backup and Recovery

With a large, shared data lake in production, it’s crucial to consider data backup and recovery strategies. The simplest solution is to enable versioning for your production buckets. Versioning allows for the retention of previous object versions, facilitating access to earlier data or recovering deleted objects that are no longer visible through standard S3 API calls. Essentially, each object key becomes a stack of versions, of which only the latest is returned by default.

Versioning proves particularly advantageous when a bug is introduced into the data pipeline, necessitating a rollback of outputs. It also aids in managing accidental deletions—rather than permanently removing objects, a delete marker is added to the object’s stack, rendering it invisible yet still retrievable if requested. This feature saved us in 2017 during a significant incident when we lost the entire history of our web tracking data due to an accidental deletion.
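The versioning behavior described above can be illustrated with a toy model. This is a simulation of the semantics, not the S3 API: each key holds a stack of versions, a plain GET returns the top of the stack, a simple delete only pushes a marker, and recovery means removing that marker (as one would via the version-aware delete call):

```python
import itertools


class VersionedBucket:
    """Toy model of S3 versioning: each key maps to a stack of versions."""

    _ids = itertools.count(1)
    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), newest last

    def put(self, key, body):
        vid = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((vid, body))
        return vid

    def get(self, key, version_id=None):
        """A plain GET returns the latest version; a delete marker hides the key."""
        stack = self._versions.get(key, [])
        if version_id is not None:
            return next(body for vid, body in stack if vid == version_id)
        if not stack or stack[-1][1] is self.DELETE_MARKER:
            raise KeyError(key)  # appears deleted, but older versions survive
        return stack[-1][1]

    def delete(self, key):
        """A simple delete pushes a marker; nothing is actually destroyed."""
        return self.put(key, self.DELETE_MARKER)

    def undelete(self, key):
        """Recover the object by popping the delete marker off the stack."""
        stack = self._versions.get(key, [])
        if stack and stack[-1][1] is self.DELETE_MARKER:
            stack.pop()
```

In the real service the equivalent recovery is listing object versions and deleting the delete marker, after which the previous version becomes current again.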


To summarize, Zalando’s transition to AWS and the establishment of a data lake on Amazon S3 have allowed us to harness our data effectively, paving the way for a more agile and scalable business model.
