Amazon Prime Day: A lesson in how not to handle an IT outage

Amazon website with a magnifying glass over 'Prime Day: Coming soon' message

(Image credit: Shutterstock)

Nobody likes it when their website goes down. Not only is it highly embarrassing, but for organisations that do business through the web it can also be expensive in terms of both revenue and reputation. Yet sites do go down, and even the biggest and best-known names can have problems.

You know that piece of buttered toast that, when dropped, always seems to land butter side down? Unscheduled downtime can be like that: messy, and creating the worst possible disruption at the most inconvenient moments. Unlike scheduled downtime, which often happens overnight or at times of minimal traffic, you can almost guarantee that unexpected downtime will rear its head at your busiest moment.

Yet this is to be expected. After all, sites are working hardest when they are stretched the furthest. This is precisely what Amazon faced during its recent Prime Day outage, a period of catastrophic downtime during one of the company's busiest and most lucrative times of the year.

Customers reported issues with links not loading correctly and shopping carts mysteriously emptying. A shopper's nightmare, and Amazon's too.


While it's difficult to pin down exactly how much Amazon lost as a result of the outage, estimates from a number of data analytics firms suggest the figure could be as much as $100 million.

Amazon isn't the first to experience a serious outage, and it won't be the last. Last year British Airways suffered an IT crash affecting 75,000 passengers, for example. In fact, recent figures from the Ponemon Institute suggested that more than half of organisations globally are unprepared for IT outages.

Plan for the worst

Still, it's an ill wind that blows nobody any good. While Amazon takes a look at its internal systems and figures out how to try to prevent a similar thing happening in the future, we can all learn some lessons.

One key piece of learning that applies to every site, no matter how large or small, is to plan for likely scenarios.

One solution may be to periodically put stress on a system to test it for weaknesses. Management consultancy McKinsey published an analysis of Prime Day which discussed the use of SWAT teams that bring together a range of roles, such as merchandisers, product leads, customer service, fulfilment, media managers, and IT.

These teams can stress-test systems before they go live to find weak points that need addressing, and can be on hand during the event itself to help troubleshoot issues in real time.

This approach can be scaled down to even the smallest of organisations, which can stress-test new services or new website areas using their own mix of multidisciplinary skills.
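To make the idea of a stress test concrete, here is a minimal Python sketch. It steps up the number of simulated users and reports the first level at which response times exceed an agreed limit. Everything in it is an assumption made for illustration: the function names, the latency budget and, in particular, the toy latency model, which stands in for the measurements a real load-testing tool would take against a real site.

```python
from typing import Callable, Iterable, Optional


def find_breaking_point(
    measure_latency: Callable[[int], float],
    load_levels: Iterable[int],
    latency_budget_s: float,
) -> Optional[int]:
    """Return the first load level whose latency exceeds the budget.

    Returns None if every tested level stays within the budget.
    """
    for users in load_levels:
        if measure_latency(users) > latency_budget_s:
            return users
    return None


def simulated_latency(users: int, capacity: int = 1000, base_s: float = 0.2) -> float:
    """Hypothetical stand-in for a real measurement.

    Toy model: response time rises sharply as load nears capacity.
    """
    if users >= capacity:
        return float("inf")  # the site has effectively stopped responding
    return base_s / (1 - users / capacity)


if __name__ == "__main__":
    # Ramp from 100 to 1,000 simulated users in steps of 100.
    weak_point = find_breaking_point(simulated_latency, range(100, 1100, 100), 1.5)
    print(f"Latency budget first exceeded at {weak_point} simulated users")
```

In practice the `measure_latency` argument would be replaced by a function that drives real traffic at a test environment, and the result would tell the team how much headroom exists before the big day.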

Check on cloud provision

When some big new event or service is likely to pull in additional traffic, including many people who have never been to a site before, it's important to ensure there is plenty of bandwidth.

Reports suggest Amazon could have done better in this respect. McKinsey's analysis has noted that some waits for the Amazon site to load were as long as 20 seconds. Internal documents obtained by CNBC also showed that Amazon simply didn't have enough servers on hand to cater for traffic, forcing the company to display a simpler front page and suspend international traffic for a time to reduce workload. This all happened right at the start of Prime Day.

And it gets worse. The documents also revealed that the process for automatically adding servers based on demand, known as 'auto-scaling', may have failed. The result was that Amazon had to add servers manually, which takes time and is far less efficient.
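For readers unfamiliar with auto-scaling, the following Python sketch shows the kind of rule such a process might apply. It is an illustration only and says nothing about how Amazon's own systems work: the function name, its parameters and the simple proportional rule are all assumptions made for the example.

```python
import math


def desired_servers(
    current_servers: int,
    load_per_server: float,
    target_load_per_server: float,
    min_servers: int = 1,
    max_servers: int = 100,
) -> int:
    """Hypothetical scaling rule: size the fleet in proportion to load.

    If each server carries more than the target load, the fleet grows;
    if each carries less, it shrinks. The result is kept within the
    configured minimum and maximum.
    """
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")

    # Total load divided by the load we want each server to carry.
    needed = math.ceil(current_servers * load_per_server / target_load_per_server)
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    # Ten servers each running at 90% when the target is 60%.
    print(desired_servers(10, 90.0, 60.0))
```

The upper limit matters as much as the rule itself. If demand outstrips the servers actually available, or the automated process stops working, someone has to add capacity by hand, which is the slower route the documents describe.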

Amazon also had issues with authentication and video playback, as well as a breakdown in communication with warehouses wanting to scan products and ship goods to those customers who were able to place orders.

"One of the most compelling use-cases for cloud computing is the highly variable need: an event or campaign that runs for just a few days, for instance," says Peter Groucutt, MD of disaster recovery firm DataBarracks. "You want to be able to scale up fast enough to service the demand but also scale down quickly to keep costs low. On Prime Day even the biggest public cloud service provider (Amazon Web Services) wasn't able to keep up with the demand of Amazon's Prime Day promotion."

Some may be surprised to learn that Amazon, despite having its own cloud to work from, still relies on Oracle for a part of its infrastructure. Although Amazon hasn't disclosed exactly what percentage of its cloud is Oracle-based, we know that many of the company's e-commerce systems, built prior to the creation of AWS, are running on Oracle.

Amazon has said it plans to migrate the entirety of its systems to AWS within the next two years, which could make things easier to manage during the busiest periods. Despite Oracle's claim that its database technology far outstrips anything provided by AWS, Amazon clearly thinks it can operate independently.

Regardless of how prepared you are, outages will happen. However, how you handle such outages can be as important as trying to stop them in the first place.

As is often the case with gigantic international companies, Amazon has remained relatively quiet about the details of the outage, instead focusing on the great success that Prime Day was. In a press notice issued shortly after, it boasted that Prime Day generated more than $1bn in sales.

But there was a public statement made on Twitter, which acknowledged the issue. Unfortunately, it was also very upbeat, the message being that regardless of how frustrated you are by the service, plenty of others have successfully bought their items. Oh, and there's still lots of time left.

Regardless of the size of your company, customers in these situations only care about two things: that you're working on a solution, and that you're working on a way to reimburse them for their wasted time.

Amazon's response naturally drew outrage, with over two thousand comments from customers who, unappreciative of being told how well things were going otherwise, demanded answers from the retailer.

While Twitter is a great place for this first communication, it should always be accompanied by a prepared holding page on the website acknowledging and apologising for issues, and again explaining that you are working on resolving them. You might also want to have resources on hand to update social feeds and the website information as the situation unfolds.

Amazon might truly be described as one of the world's mega-retailers, and as such its reputation may not be dented by the Prime Day outage. For most other organisations offering online services, such an outage may prove near-catastrophic for business, but provided you plan for failure, and can communicate effectively with your customers, it needn't be fatal.


Sandra Vogel is a freelance journalist with decades of experience in long-form and explainer content, research papers, case studies, white papers, blogs, books, and hardware reviews. She has contributed to ZDNet, national newspapers and many of the best known technology web sites.

At ITPro, Sandra has contributed articles on artificial intelligence (AI), measures that can be taken to cope with inflation, the telecoms industry, risk management, and C-suite strategies. In the past, Sandra also contributed handset reviews for ITPro and has written for the brand for more than 13 years in total.