Discover How AWS Flexibility Enhances Business Continuity Opportunities

Chanci Turner, Learning Manager, Amazon IXD – VGT2

A Guide for IT Professionals

Technology has never played a more significant or far-reaching role in our daily lives. We have grown accustomed to reliable, always-available systems, and we quickly notice the consequences when these systems fail. To meet customer expectations amid uncertainty, IT professionals must ensure the resilience of the systems we create. AWS provides the flexibility needed to design systems that meet the resilience requirements of each customer experience.

However, building resilient technology comes with its challenges, as disruptions are not merely inconvenient—they can lead to substantial financial losses. According to IDC, the annual cost of downtime for Fortune 1000 companies ranges from $1.25 billion to $2.5 billion, with the average cost of a critical application failure reaching between $500,000 and $1 million per hour. A faster recovery from disruptions can lead to lower business impact costs (as illustrated by the solid line in Figure 1), but it also necessitates higher recovery costs (represented by the dotted line). This blog post will present a framework to help balance costs and other impacts against resilience requirements.

Figure 1: Cost of Disruption and Recovery over Time

To create resilient systems, it is essential to understand business processes. Collaborating with business stakeholders is crucial for identifying risks and developing technological solutions that align with customer expectations. Business Continuity Planning (BCP) is a method for documenting business processes and formulating plans to sustain those processes during disruptive incidents. A key outcome of BCP is a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO) for each system. The RTO is the maximum acceptable delay between service interruption and restoration, while the RPO is the maximum acceptable amount of time since the last data recovery point, which bounds how much recent data can be lost.
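To make the two definitions concrete, here is a minimal sketch that checks one incident timeline against an RTO and an RPO. The class name, function name, objectives, and timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable delay between interruption and restoration
    rpo: timedelta  # maximum acceptable time since the last data recovery point


def evaluate_incident(objectives, last_recovery_point, interrupted_at, restored_at):
    """Compare one incident's timeline against the RTO and RPO."""
    downtime = restored_at - interrupted_at
    data_loss_window = interrupted_at - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical objectives and timeline, for illustration only.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = evaluate_incident(
    objectives,
    last_recovery_point=datetime(2024, 1, 1, 9, 50),
    interrupted_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 15, 0),
)
print(result["rto_met"], result["rpo_met"])
```

In this made-up timeline the service is down for five hours against a four-hour RTO, so the RTO is missed, while the last recovery point is ten minutes old against a fifteen-minute RPO, so the RPO is met.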

At Amazon, we prioritize customer expectations in our technical designs, and the same principle applies to resilience. In this article, you will learn how to utilize three BCP tools to determine appropriate RTO and RPO for a system. By understanding customer experiences and employing tools like Risk Assessment (RA), Business Impact Analysis (BIA), and System Impact Analysis (SIA), you can develop technology solutions that address risks such as power outages or cyberattacks.

Risk Assessment (RA)

The first step in developing a business continuity plan is to create a list of business processes together with your business partners, documenting the sub-processes, inputs, and outputs of each. Make sure to include key activities and critical customer journeys. Once you’ve documented these processes, conduct an RA, which entails identifying and analyzing the risks, hazards, and threats to each process and estimating their likelihood. These risks may stem from natural, man-made, or environmental sources.

It’s vital to assess each risk regardless of its source. Categories of risk to consider include:

  • Information Technology – Loss of Connectivity, Hardware Failure, Lost/Corrupted Data, Application Failure, Cyber threats
  • Utility Outage – Communications, Electrical Power, Water, Gas, Steam, Heating/Ventilation/Air Conditioning, Pollution Control Systems, Sewage Systems
  • Fire/Explosion – Fire (Structure, Wildland), Explosion (Chemical, Gas, or Process failure)
  • Hazardous Materials – Hazardous Material spills/releases, Radiological Accidents, Hazmat Incidents off-site, Transportation Accidents, Nuclear Power Plant Incidents, Natural Gas Leaks
  • Vendor Risk – Supplier Failure, Supply Chain Interruptions

To perform an RA, use an RA tool that includes fields for detailing risks, likelihood, and impact. The U.S. Department of Homeland Security (DHS) offers guidance and tools for conducting an RA (see Figure 2). Similar resources are available through AWS Professional Services and AWS Partners. With the DHS RA tool, you enter the business operation/process in the first column and potential hazards in the second column, then complete the remaining columns per the DHS instructions to arrive at an overall hazard rating. The outcome of an RA is a comprehensive list of business processes with associated risk data and overall hazard ratings.

Figure 2: Risk Assessment Table courtesy of U.S. Department of Homeland Security
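A risk register with the fields mentioned above (risk, likelihood, impact) can be sketched in a few lines. The scoring below multiplies likelihood by impact on assumed 1-to-3 scales. This is a common simplification used here for illustration and is not the DHS worksheet's rating method; the `RiskEntry` class and the sample entries are hypothetical.

```python
from dataclasses import dataclass

# Assumed 1-3 scales (1 = low, 3 = high); the DHS worksheet defines its own.
SCALE = {"low": 1, "medium": 2, "high": 3}


@dataclass
class RiskEntry:
    process: str
    hazard: str
    likelihood: str
    impact: str

    @property
    def hazard_rating(self) -> int:
        # Simplified rating: likelihood score times impact score.
        return SCALE[self.likelihood] * SCALE[self.impact]


# Hypothetical register entries, for illustration only.
register = [
    RiskEntry("Online banking", "Application failure", "medium", "high"),
    RiskEntry("Online banking", "Electrical power outage", "low", "high"),
    RiskEntry("Visitor check-in", "Hardware failure", "medium", "low"),
]

# Highest-rated hazards first, so they can be examined first.
ranked = sorted(register, key=lambda e: e.hazard_rating, reverse=True)
for entry in ranked:
    print(f"{entry.hazard_rating}  {entry.process}: {entry.hazard}")
```

Sorting by rating puts the process and hazard combinations that most need further analysis at the top of the list.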

Business Impact Analysis (BIA)

Following the RA, the next step is the Business Impact Analysis. With a list of business processes and hazard ratings in hand, delve deeper into each process and scenario that received higher hazard ratings. Conduct a BIA for each identified process (refer to Figure 2, column 1).

The BIA aims to provide a detailed understanding of the potential impacts of any disruption to each business process. It assesses the disruption’s potential effects from financial, reputational, operational, customer, and legal/regulatory perspectives. Furthermore, the BIA forecasts the consequences of disruption to business operations and gathers information essential for formulating recovery strategies.

Utilize a BIA tool to document the impact over time for a business process disruption. DHS supplies guidance and tools for conducting a BIA (see Figure 3). Complete a separate BIA form for each process listed in the RA. In the first column, input the timing/duration identified in the RA (see Figure 2, column 3), while the corresponding impacts are noted in the second and third columns. The resulting output is a table illustrating the impact of a business process disruption over time.

Figure 3: Business Impact Analysis Worksheet courtesy of U.S. Department of Homeland Security
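One way to use an impact-over-time table is to read off the longest outage whose impact is still tolerable, which can serve as a starting point when discussing an RTO with stakeholders. The sketch below assumes a single financial impact figure per duration and a hypothetical tolerance threshold; none of the numbers come from the DHS worksheet or from real data.

```python
from datetime import timedelta

# Hypothetical BIA rows: (outage duration, estimated cumulative impact in USD).
impact_over_time = [
    (timedelta(hours=1), 10_000),
    (timedelta(hours=4), 60_000),
    (timedelta(hours=8), 250_000),
    (timedelta(hours=24), 1_000_000),
]


def longest_tolerable_outage(rows, max_impact):
    """Return the longest listed duration whose impact stays within max_impact."""
    tolerable = [duration for duration, impact in rows if impact <= max_impact]
    return max(tolerable) if tolerable else None


# The threshold is an assumption the business would set, not a derived value.
candidate_rto = longest_tolerable_outage(impact_over_time, max_impact=100_000)
print(candidate_rto)
```

A real BIA also weighs reputational, operational, customer, and legal/regulatory impacts, so a single numeric threshold like this is only a first approximation.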

Once the impact of a disruption is assessed, classify business processes into tiers ranging from mission-critical to non-critical (see Figure 4). For instance, online banking processes with significant customer, reputational, regulatory, and financial implications may be categorized as critical, while processes like visitor check-in may be regarded as non-critical. A common strategy is to rank impacts across all processes to establish tiers.

Figure 4: Business Impact Tiers

The output of the BIA process is a list of business processes and their corresponding impact tiers (see Figure 5).

Figure 5: Business Processes with Tiers
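The ranking strategy mentioned above can be sketched as follows: sort processes by an overall impact score, then split the ranking into tiers. The equal-sized split, the tier labels, the scores, and every process other than online banking and visitor check-in are assumptions made for illustration. An organization might instead use fixed score thresholds.

```python
def assign_tiers(impact_scores, tier_names):
    """Rank processes by impact score and split the ranking into tiers."""
    ranked = sorted(impact_scores, key=impact_scores.get, reverse=True)
    per_tier = -(-len(ranked) // len(tier_names))  # ceiling division
    return {
        process: tier_names[index // per_tier]
        for index, process in enumerate(ranked)
    }


# Hypothetical overall impact scores (higher = more severe impact).
impact_scores = {
    "Online banking": 95,
    "Card payments": 90,
    "Payroll": 60,
    "Customer support desk": 55,
    "Marketing email": 20,
    "Visitor check-in": 10,
}
tier_names = ["Tier 1 (mission-critical)", "Tier 2 (important)", "Tier 3 (non-critical)"]

tiers = assign_tiers(impact_scores, tier_names)
for process, tier in tiers.items():
    print(f"{tier}: {process}")
```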

System Impact Analysis (SIA)

Having established a list of business processes with tiers, the next step is to analyze each individual IT system supporting a critical process. Use the business processes and tiers from the previous BIA as inputs for the SIA (see Figure 5).

The SIA aims to gain a clearer understanding of the potential impacts on each business process due to IT system disruptions. The analysis evaluates the financial, reputational, operational, customer, and legal/regulatory implications of IT system interruptions. This detailed understanding is essential for establishing effective recovery strategies that can mitigate risks associated with service outages.
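Using the BIA tiers as inputs, selecting which systems to analyze can be sketched as a lookup from systems to the processes they support. The rule that a system inherits the tier of the most critical process it supports is an assumption for this sketch, and the system names, tier numbers, and dependency mappings are hypothetical.

```python
# Lower number = more critical. Tiers and mappings are hypothetical.
process_tiers = {"Online banking": 1, "Payroll": 2, "Visitor check-in": 3}

# Which business processes each IT system supports.
system_dependencies = {
    "core-ledger-db": ["Online banking", "Payroll"],
    "auth-service": ["Online banking", "Visitor check-in"],
    "badge-printer": ["Visitor check-in"],
}


def system_tiers(process_tiers, system_dependencies):
    """Give each system the tier of the most critical process it supports."""
    return {
        system: min(process_tiers[p] for p in processes)
        for system, processes in system_dependencies.items()
    }


tiers = system_tiers(process_tiers, system_dependencies)

# Systems that support at least one tier 1 process go into the SIA first.
sia_scope = sorted(s for s, t in tiers.items() if t == 1)
print(sia_scope)
```

A shared system such as an authentication service ends up in scope even though it also supports a non-critical process, because its most critical dependent process decides its tier.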