Amazon reveals cause of AWS outage that took everything from banks to smart beds offline


Amazon has revealed that this week’s hours-long AWS outage, which took everything from Signal to smart beds offline, was caused by a bug in automation software that had widespread consequences.

In a lengthy outline of the cause of the outage published on Thursday, AWS revealed a cascading set of events brought down thousands of sites and applications that host their services with the company.

AWS said customers were unable to connect to DynamoDB, the database system where AWS customers store their data, due to “a latent defect within the service’s automated DNS [domain name system] management system”.

DynamoDB maintains hundreds of thousands of DNS records. It uses automation to monitor the system and update records frequently, so that additional capacity is added as required, hardware failures are handled and traffic is distributed efficiently.

The root cause of the issue, AWS said, was an empty DNS record for the Virginia-based US-East-1 datacentre region. The automation failed to repair the record, and manual operator intervention was required to correct it.
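
To illustrate why an empty record matters, here is a minimal sketch of a guard that refuses to publish an update leaving an endpoint with no addresses. It is not AWS's code and does not describe how AWS's planner or enactor work; `DnsPlan`, `EmptyPlanError`, `apply_plan` and the `db.example.test` endpoint are all hypothetical names, and the record table is a plain in-memory dictionary rather than a real DNS service.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    """Hypothetical plan: the addresses an endpoint name should resolve to."""
    endpoint: str
    addresses: tuple[str, ...]


class EmptyPlanError(ValueError):
    """Raised when a plan would leave an endpoint with no addresses."""


def apply_plan(records: dict[str, tuple[str, ...]], plan: DnsPlan) -> None:
    """Apply a plan to an in-memory record table, rejecting empty plans.

    Assumption for illustration: keeping the last known-good record is
    preferable to publishing a record with no addresses at all.
    """
    if not plan.addresses:
        raise EmptyPlanError(
            f"refusing to publish empty record for {plan.endpoint}"
        )
    records[plan.endpoint] = plan.addresses


# Usage: a valid plan is applied, an empty one is rejected and the
# previous addresses stay in place.
records = {"db.example.test": ("192.0.2.10",)}
apply_plan(records, DnsPlan("db.example.test", ("192.0.2.11", "192.0.2.12")))
try:
    apply_plan(records, DnsPlan("db.example.test", ()))
except EmptyPlanError as err:
    print(f"rejected: {err}")
```

A check like this only covers the simplest case. It says nothing about how an empty record came to be produced in the first place.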

AWS said it had disabled the DynamoDB DNS planner and DNS enactor automation worldwide while it fixes the conditions that led to the outage and adds extra protections.

The issue also caused outages for other AWS tools.

Platforms including Signal, Snapchat, Roblox and Duolingo, as well as banking sites and the Ring doorbell company, were among the 2,000 companies affected by the outage, according to Downdetector – a site that monitors internet outages – with more than 8.1m reports of problems from users across the world.

While services were restored in a matter of hours, the impact of the outage was felt widely.

Customers of Eight Sleep – a smart bed company that connects to the internet to control the temperature and incline of a person’s bed – found they were unable to adjust the bed’s temperature or incline during the outage because their phone app could not connect to the bed.

The company’s chief executive, Matteo Franceschetti, apologised to customers on X and this week rolled out an update to its services that would allow users to control the bed’s critical functions via Bluetooth in the event of an outage.

Dr Suelette Dreyfus, a computing and information systems lecturer at the University of Melbourne, said the outages showed how dependent the world was on single points of failure on the internet.

“That single point isn’t just AWS – they’re the biggest cloud provider with 30% or so of the market – but rather the cloud as a whole, which is basically just three companies,” she said.

“The internet was designed to be resilient; many other channels existed for routing around problems or attacks, but we’ve lost some of that resilience by becoming so dependent on a handful of giant tech companies to provide not just data storage but also house data services.”
