The AWS Outage and a Lesson in Cloud Infrastructure
- Alex Morris II
- Nov 12, 2025
- 3 min read
Updated: Nov 29, 2025

The Outage
As many of us know, Amazon Web Services (AWS) recently experienced a massive disruption in its services, so much so that many of the products and websites we use every day were either down or painfully slow. Like most, I refreshed my page, emptied the browser cache, and even restarted my device, but to no avail. Nothing seemed to be working. Then the news came out: global AWS outage. No one knew exactly what had happened, why it happened, or to what degree it was going to affect the public at large. I mean, how could a corporate giant like Amazon be down? Technical professionals like myself saw that headline and wondered, “Just how much of the internet is housed in one platform?” and “Could this expose a massive problem with our current digital infrastructure?” I’m sure many people even began to think of the dreaded H-word: “Was AWS hacked?”
Why Did It Happen?
According to multiple reports, the outage came down to a misconfigured DNS record, which took down services all across the us-east-1 region. Essentially, systems were no longer able to translate the domain name of a core AWS service endpoint into its associated IP address, causing other AWS services to fail, which then impacted everyone built on top of them. It’s mind-boggling to think that a simple misconfiguration or mistake in a process can have cascading effects like this, to the point that millions of users and even businesses can no longer operate the way they’re used to. But that’s the IT world in a nutshell.
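To make that failure mode concrete, here’s a minimal sketch of what a DNS resolution failure looks like to client code. I’m using the regional DynamoDB endpoint that was widely reported to be at the center of the outage purely as an example; the point is that when the record is broken, the name simply can’t be turned into an IP address, so everything that depends on it fails before a connection is even attempted.

```python
import socket

# Example endpoint; during the outage its DNS record could not be resolved.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Resolve the domain name to an IP address, the step that broke during the outage.
    ip_address = socket.gethostbyname(ENDPOINT)
    print(f"{ENDPOINT} resolved to {ip_address}")
except socket.gaierror as err:
    # If the DNS record is missing or broken, resolution fails here,
    # and anything that depends on this endpoint fails along with it.
    print(f"Could not resolve {ENDPOINT}: {err}")
```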
What Can We Learn From This?
As someone who’s worked in multiple cloud/DevSecOps environments, I had to ask myself, “If my organization’s infrastructure were running in AWS and we were affected, what mitigations could be in place to make sure we minimized downtime?” The easiest would be multi-region deployments. Unfortunately, there is no way to prevent every potential scenario. Sometimes systems fail or stop running correctly just because. But if there’s one thing I’ve learned about IT, it’s that you can be prepared. As we saw with the outage, any services or infrastructure in the us-east-1 region were not working properly, which meant downtime for those relying solely on it. The companies that experienced the least downtime were the ones that replicated their infrastructure and services to other regions, such as us-east-2, us-west-2, af-south-1, ap-east-1, ca-west-1, or others. That means their dev, test, and prod environments weren’t confined to a single region. Their presence extended across multiple regions so that if one goes down, they can spin up resources in the next and keep operating as if virtually nothing happened.
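As a rough illustration, here’s a minimal sketch (using boto3, with a hypothetical hosted zone ID, health check ID, and IP addresses) of one common building block of that posture: a Route 53 DNS failover pair, where traffic normally goes to the primary region and automatically shifts to a standby region when the primary’s health check fails. This is one piece of a multi-region setup, not a full blueprint.

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical values: swap in your own hosted zone, endpoints, and health check.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"
PRIMARY_IP = "203.0.113.10"     # app servers in us-east-1
SECONDARY_IP = "198.51.100.20"  # standby app servers in us-west-2

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active/passive failover between two regions",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": PRIMARY_IP}],
                    # Route 53 stops returning this record if the health check fails.
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": SECONDARY_IP}],
                },
            },
        ],
    },
)
```

The irony, of course, is that DNS failover is itself a DNS feature, and it only helps if the data, pipelines, and secrets your application needs already exist in the standby region. The records reroute traffic; the replicated infrastructure is what actually serves it.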
Of course, another option is to implement a multi-cloud approach, having resources deployed in both AWS and Azure, for example. While it follows a similar principle to multi-region, it can be significantly more complicated because you end up trying to marry services in AWS to their equivalents in Azure, Google Cloud, Oracle, and so on. Deployments become far more complex than they need to be, which increases the likelihood of not only errors but also security gaps.
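To give a small taste of that complexity, here’s a hedged sketch of the same trivial task, uploading one object, written against AWS and Azure. The bucket, container, and connection string are placeholders; the point is simply that once you go multi-cloud, every operation needs provider-specific code, credentials, and failure handling.

```python
import boto3
from azure.storage.blob import BlobServiceClient

DATA = b"hello from a multi-cloud deployment"

# AWS: upload an object to S3 (bucket name is a placeholder).
s3 = boto3.client("s3")
s3.put_object(Bucket="my-app-artifacts", Key="release/app.txt", Body=DATA)

# Azure: the "same" operation against Blob Storage needs a different SDK,
# different auth (a connection string here), and different resource names.
blob_service = BlobServiceClient.from_connection_string("<azure-connection-string>")
blob_client = blob_service.get_blob_client(container="my-app-artifacts", blob="release/app.txt")
blob_client.upload_blob(DATA, overwrite=True)
```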
Now, one of the primary objections to a multi-region or multi-cloud strategy is cost: why pay for resources in regions where they’re not being actively used? I’d argue that it’s about preparing for the worst-case scenario. Why rely on a single point of failure that could have lasting consequences when you can put your organization in a position to be better prepared for the worst possible outcomes? Not only that, but the beauty of cloud environments is that they often follow a pay-as-you-go model, meaning you only pay for what you use. If a business decides it’s more economical to pay upfront for resources, it can do that instead. Plus, there are resources in AWS that won’t accrue a cost at all as long as they’re not actively used.
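And if cost is the worry, it’s something you can measure rather than guess at. Below is a minimal sketch (using boto3 and the Cost Explorer API; the date range is an arbitrary example) that breaks a month’s spend down by region, which makes it easy to see what a standby region is actually costing you.

```python
import boto3

# Cost Explorer exposes a single endpoint; us-east-1 works as the client region.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-10-01", "End": "2025-11-01"},  # arbitrary example month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    # Group the spend by region so standby regions show up line by line.
    GroupBy=[{"Type": "DIMENSION", "Key": "REGION"}],
)

for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        region = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{region}: ${amount:,.2f}")
```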
A Key Component: Backups
Let’s say a business had its infrastructure replicated across regions during the outage and, while the us-east-1 region was down, was able to get back online in us-west-1. This is where it’s equally important to have a consistent backup schedule, with copies of those backups stored outside the affected region. Why? Because failing over is only useful if you can pick up from where you left off, and every hour spent reconstructing data is still downtime. And downtime means a business is not making money.
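As one concrete, hedged example of what that can look like on AWS, the sketch below (using boto3; the snapshot ID is a placeholder) copies an EBS snapshot from us-east-1 into us-west-1, so that if the primary region goes dark there is a recent copy of the data sitting next to the standby infrastructure.

```python
import boto3

SOURCE_REGION = "us-east-1"
DEST_REGION = "us-west-1"
SNAPSHOT_ID = "snap-0123456789abcdef0"  # placeholder: a recent snapshot of a production volume

# The copy request is issued from the destination region.
ec2_dest = boto3.client("ec2", region_name=DEST_REGION)

copy = ec2_dest.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Nightly cross-region copy for disaster recovery",
)

print(f"Started copy {copy['SnapshotId']} into {DEST_REGION}")
```

In practice you’d schedule something like this (or lean on AWS Backup’s cross-region copy rules, or the managed replication features of the services you use) rather than run it by hand, but the principle is the same: the backups have to live where the recovery will happen.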


