AWS Outage Holds Lesson for Disaster Recovery

Each week, the WALT Labs editorial team assembles to discuss topics for our upcoming blogs. Typically, we’re looking three or four weeks into the future. Like soothsayers predicting the next snowfall, we prognosticate about what will be the next big Google Cloud Platform announcement. Sometimes, however, we have to live in the moment and shape our article ideas around the hot, cloud-related topic of the week.

Since we afford ourselves this flexibility, it’s no surprise that the team agreed that we had to write something about the massive AWS outage that occurred the last week in November. The event brought thousands of online sites and services to a screeching halt. The companies affected comprise a veritable Who’s Who of big tech and include household names like Adobe, Roku, Twilio, Flickr, and Autodesk, as well as organizations such as New York City’s Metropolitan Transit Authority and The Washington Post.

As a leading Google Cloud Partner, we could have taken the obvious angle for this post and written about how Google Cloud Platform (GCP), in its simplicity, is superior to AWS. It would have been easy to extol the virtues of Google’s $30 billion global network, which uses advanced software-defined networking and edge caching services to deliver fast, consistent, and scalable performance. We could have noted that Google’s network design safeguards against outages like the one Amazon experienced.

It may even have been an eye-opener for some readers of this blog that Google’s average annual downtime has been noted to be 33% less than that of AWS and that GCP is 100 times more reliable than Microsoft Azure.

The More Important Lesson

The focus of this article could have gone that way. Political campaigns would refer to this approach as “going negative.” But the reality is that all cloud service providers experience downtime. Providers strive for five-nines (99.999%) availability, where total downtime, planned or unplanned, does not exceed about 5.26 minutes in a given year, but that simply doesn’t happen. Focusing on downtime also obscures the more important lesson of this incident: the need for disaster recovery.
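As a rough check on the five-nines figure mentioned above, the yearly downtime budget for an availability target is simple arithmetic. The short Python sketch below is only an illustration; the function name and the choice of a 365-day year are our own assumptions, not part of any provider's SLA definition.

```python
# Minutes in a 365-day year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Compare three-, four-, and five-nines targets.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of roughly five and a quarter minutes per year, so a single outage lasting several hours uses up that budget many times over.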

Disaster Recovery Planning

Nobody knew that AWS was going to go down. Incidents like that are unannounced and can happen at any time. In this case, what seemed to be a routine adjustment to processing parameters set off a chain reaction of events that took nearly five hours to reverse. Try as IT managers might, failures are going to occur. However, being prepared for when things go sideways is critical to contending with an incident. A robust, targeted, and tested disaster recovery (DR) plan ensures your business is ready.

Disaster recovery planning begins with understanding business impact along two key metrics:

Recovery time objective (RTO)—the targeted duration of time a business process can be offline to avoid unacceptable consequences

Recovery point objective (RPO)—the maximum targeted period in which data (transactions) might be lost from an IT service due to a significant incident.

Usually, the smaller these two values, the more critical the business process. This also means it will be costlier to run and ensure uptime for that process, application, or service.
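To make the two metrics concrete, here is a minimal Python sketch of how a team might compare its current setup against RTO and RPO targets. Everything in it is hypothetical: the class and function names, the one-hour and 15-minute targets, and the assumption that worst-case data loss equals one full backup interval are illustrative choices, not values taken from any real plan.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Hypothetical targets for one business process."""
    rto: timedelta  # longest acceptable time offline
    rpo: timedelta  # longest acceptable window of lost data


def meets_targets(targets: RecoveryTargets,
                  backup_interval: timedelta,
                  measured_restore_time: timedelta) -> dict:
    """Check a backup schedule and a tested restore time against the targets.

    Assumption: in the worst case, an incident strikes just before the next
    backup, so up to one full backup interval of data could be lost.
    """
    return {
        "rpo_met": backup_interval <= targets.rpo,
        "rto_met": measured_restore_time <= targets.rto,
    }


# Illustrative numbers only: a process that tolerates one hour offline
# and 15 minutes of lost transactions.
targets = RecoveryTargets(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
result = meets_targets(
    targets,
    backup_interval=timedelta(hours=1),          # hourly backups
    measured_restore_time=timedelta(minutes=45)  # from a DR drill
)
print(result)
```

In this made-up case the 45-minute restore fits within the RTO, but hourly backups cannot satisfy a 15-minute RPO, so the process would need more frequent backups or continuous replication, which is where the added cost comes from.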

Google Cloud Can Reduce Disaster Recovery Costs

Compared with fulfilling RTO and RPO requirements on-premises, Google Cloud can be significantly more efficient. GCP helps businesses reduce or eliminate the need to plan for all of the traditional on-premises disaster recovery requirements like capacity, security, network infrastructure, support services, bandwidth, and overhead items like power, floor space, and equipment.

Google Cloud provides a fully managed solution that brings administrative simplicity and reduced costs for managing even the most complex applications. Among the features that make this platform particularly relevant for DR are:

Redundancy—Multiple points of presence (PoPs) worldwide mirror data automatically across geographically dispersed storage locations.

Scalability—Managed services such as App Engine, Compute Engine autoscalers, and Datastore provide automatic scaling that enables business applications to grow proportionately with demand.

Security—The same security protocols that keep Google applications like Gmail and Google Workspace safe ensure that cloud applications are equally secure.

Compliance—Google Cloud complies with standards such as ISO 27001, SOC 2/3, and PCI DSS 3.0 and undergoes independent third-party audits to verify that its security, privacy, and compliance practices are industry-leading.

Disaster recovery is serious business for organizations of all sizes. Don’t take a chance with your company’s data and your hard-won relationships with customers.

Consult an expert today to see how Google Cloud can keep your business up and running.