Amazon S3 outage spotlights disaster recovery tradeoffs
by Trevor Jones
Tuesday's Amazon S3 outage reverberated around the internet, but cost and complexity will likely keep many users from scrambling to change their redundancy practices.
One of the largest service disruptions to hit AWS in years would have had far less impact if customers had added more redundancy safeguards, but many cloud customers are only willing to go so far to keep their workloads running seamlessly.
Companies can implement myriad contingencies to safeguard against massive cloud outages. AWS added Cross-Region Replication in 2015; IT shops also can rely on a range of disaster-recovery-as-a-service tools on the market. There also are techniques to spread workloads across regions and to back them up in other public clouds or on premises. Netflix, which as recently as last year said it used U.S. East-1, champions several of these techniques and reported no issues when Amazon Simple Storage Service (S3) went down Tuesday.
But the Amazon S3 outage, which lasted four hours in the U.S. East-1 region, took down or slowed huge chunks of the internet Tuesday after the service became unresponsive because of human error and outdated debugging techniques. It also had a knock-on effect that took down multiple other services. As a result, 54 of the top 100 internet retailers saw a decrease in performance of 20% or more, according to Apica, a web monitoring provider, and the disruption cost Standard & Poor's 500 Index companies $150 million, according to cyber-risk startup Cyence Inc.
Many companies that either didn't fail over or couldn't maintain services without interruption suffered. Nike, which has given talks at AWS conferences on security and redundancy, saw load times on its website increase by 642%, according to Apica. There were reports that Apple's iCloud experienced slower performance, even though the product reportedly relies on Microsoft Azure and Google Cloud Platform, too.
There's an inherent leap of faith that comes with passing uptime responsibility to a cloud vendor, but the far-reaching effects of this incident show many customers are willing to work without a net, to a certain degree, after weighing the cost and complexity that come with such high levels of redundancy.
"Where do you want to draw the line on redundancy on [service-level agreements] and uptime?" asked Craig Loop, director of technology at Realty Data Co. LLC, a Naperville, Ill., financial services company. "It literally is a dial-your-redundancy system, and all you have to do is throw money at it."
Realty Data previously considered using multiple regions, but ultimately decided to stick with U.S. East-1 because of the additional cost and development work that would come with preparing for outages that happen once every couple of years.
There certainly are companies that require that level of uptime, and doing it through AWS is considerably cheaper than more traditional methods, said Carl Brooks, an analyst with 451 Research. But many users decide living through the occasional outage is part of the cost of doing business.
"It might cost $500,000 to implement multiregion stability with high availability and AWS best practices, but a four-hour outage may cost you $60,000," Brooks said.
ICYMI: A play-by-play of the Amazon S3 outage
On Tuesday, Amazon Simple Storage Service suffered a disruption in the U.S. East-1 region that lasted more than four hours before being resolved. This region is the oldest and one of the most widely used AWS regions, while S3 is one of the most popular AWS cloud services. Days later, AWS released a postmortem stating that the Amazon S3 outage was a "service disruption" that was attributed to human error.
Companies' responses to this latest downtime may also reflect the type of workloads that reside in the cloud. For all its exponential growth and exciting new capabilities, public cloud largely remains the domain of test and development, startups and websites -- many of which may be willing to stomach an outage in a way that would be unacceptable with traditional mission-critical applications.
"I don't think anybody died, or we didn't get to the moon," said Jason McMunn, of Transfigure Partners LLC, a cloud migration company in Springfield, Pa. "It's real first-world problems, where I couldn't load my GIFs on Slack."
Still, the service disruption did affect McMunn. He was in the middle of a sales demo that relies on S3 when the service cut out. It was a potential client he'd spent considerable time trying to meet.
"We finally got a chance to showcase this tool set to them and, thanks to AWS, we just looked like idiots," McMunn said.
"I don't think anybody died, or we didn't get to the moon. It's real first-world problems, where I couldn't load my GIFs on Slack."
Jason McMunn | Transfigure Partners
The company also relies on S3 for all its DevOps projects, and there was a cascading effect where developers had to revert to email to manually send updates to each other. McMunn estimated the S3 problem will translate to 50 person-hours of work; however, he said it would take a multiday outage, or the loss or destruction of data, to get him to move off the public cloud.
"I feel like my gold was still safe in Fort Knox," he said. "I just couldn't get at it."
There's also a psychological component to how businesses react to cloud outages. When downtime is isolated to a single company's data center, IT pros become the bad guy. But that's not the case when everyone else is down, too, Loop said.
"It has an effect where we're all in this together, so people aren't so upset and animated about it," Loop said. "Now, it's more, 'Let me know what I can do and get it back up.' It's changed what an outage means."
To replicate or not to replicate
These types of incidents serve as a reminder to architect environments in ways that best protect workloads from downtime, even if that doesn't include cross-region replication, said Kevin Felichko, CTO of PropertyRoom.com. The online auction company houses the majority of its workloads in U.S. West-2 and noticed no issues with its production workloads. The biggest effect it saw was to some test and development in U.S. East-1 and to third-party support services.
PropertyRoom.com moved to AWS more than three years ago and opted to use the West Coast region, despite being based in Frederick, Md. U.S. East-1 may have had quicker access to new features and likely would have provided better performance due to proximity, but there was also a much higher rate of problems with the congested region, Felichko said.
"It validates us not putting mission-critical [workloads] in U.S. East-1," Felichko said. "[AWS] rolls out nice features there, and they've got strong presence there, but it's not as stable as other regions."
Replication isn't foolproof, either; companies could overlook certain scripts that reside in only one region, or they might keep a central source for authoritative transactions in a single region. Even if a company has replication policies in place, it still may not conduct a failover. Financial Industry Regulatory Authority Inc., or FINRA, which AWS cited as a reference customer when multiregion capabilities were added, opted to ride out the Amazon S3 outage.
"We were in communication with AWS during the outage, were confident that recovery was near and opted not to fail over to another region, given the brief impact the outage had on our operations," a FINRA spokesperson said.
ACI Information Group, an aggregator of social media and blogs, has all its workloads in U.S. East-1, but used to have servers on the West Coast for replication. The company ultimately terminated those instances.
"Duplicate news stories are a big deal for us, and split regions had duplicates all the time," said Chris Moyer, vice president of technology at ACI Information Group and a TechTarget contributor. "That was more likely than Amazon failing, so it's a toss-up and comes back to the whole question about security versus making it easier for users."
At one point, ACI looked at Cross-Region Replication for failover, but was dissuaded by AWS, which told the company the service was better suited for getting data closer to users because the likelihood of a region-wide outage was so small, Moyer said.
"They told us don't worry about it," he said.