Reliability Matters, Now More than Ever
Every so often, our technological world gets an important wake-up call.
Case in point: the service disruptions affecting Amazon Web Services (AWS) and its customers, mostly in the eastern US, on December 7 (80 years to the day after another wake-up call).
Amazon acknowledged the problem at 12:30 in the afternoon (eastern time) and reported around 6 pm that most services had recovered. The outage was relatively brief, but its effects were immediate and far-reaching:
- Netflix, Disney+ and other streaming services went dark
- Trading platforms such as Robinhood and Coinbase were offline
- The Associated Press had trouble getting news out due to problems with its publishing system
- Ticketmaster had to suspend ticket sales for Adele’s concert tour
- Municipal parking apps in Boston and Portland were not available
- Amazon’s own delivery and warehouse operations were severely impacted
The outage even affected thousands of individual households, shutting down their Roomba vacuum cleaners and Ring doorbells.
AWS is hardly alone. In October, Facebook and its subsidiaries were globally unavailable for over six hours, with dramatic consequences for the programmatic advertising that is the marketing lifeblood of many online enterprises. And back in July, a one-hour outage affecting Akamai’s Edge service limited access to hundreds of websites, including those of numerous large and well-known enterprises.
This blog regularly focuses on malicious cyberattacks of one type or another that threaten connected enterprises, and on the steps IT security teams can take to defend their digital assets. In these cases, however, no malicious activity or cyberattacks were involved. Instead, the outages were the result of unanticipated errors:
- In the case of Facebook, a networking issue made the company’s DNS unavailable, which in turn made the web services of the company and its subsidiaries unavailable.
- In the Akamai outage, a configuration update triggered a DNS bug, which in turn disrupted customer websites.
- It’s still early for definitive answers about the AWS outage, but the company initially acknowledged that a “network device issue” led to the outage.
In all three instances, a single failure cascaded across the many dependent parts of a much larger system. In the same way, the effects of each outage rippled through the innumerable companies that are commercial partners of the affected provider.
There are two clear takeaways from these incidents.
1. Reliability by design is more important than ever.
The criteria you use to evaluate potential business partners must include a strong focus on reliability by design – a top-to-bottom commitment to incorporating reliability as a core requirement driving the design, development, and delivery of the services they provide.
Lip service is not enough. A provider’s commitment to reliability must be deep and enduring enough to overrule short-term commercial considerations that might lead it to cut corners.
Moreover, with the complexity of digital systems, it’s not enough to tout the reliability of one visible or central component. The principles of reliability by design must be applied to the entire digital ecosystem involved in delivering services. Your providers must have a deep and thorough understanding of the interdependencies that power their services.
And you should as well.
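One modest way to start building that understanding is to write your dependencies down and ask what a single failure would take with it. The sketch below is illustrative only: the service names and the DEPENDS_ON map are hypothetical, not drawn from any of the incidents above, and a real inventory would be larger and messier. It simply walks a dependency map to list everything that relies, directly or indirectly, on one failed component.

```python
from collections import defaultdict, deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the components it depends on directly.
DEPENDS_ON = {
    "storefront": ["checkout-api", "cdn"],
    "checkout-api": ["payments", "dns"],
    "payments": ["cloud-region"],
    "cdn": ["dns"],
    "status-page": ["cloud-region"],
    "dns": [],
    "cloud-region": [],
}


def affected_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the map: component -> services that depend on it directly.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)

    # If the map contains a cycle, the failed component can reach itself;
    # it is not its own dependent, so leave it out of the result.
    seen.discard(failed)
    return seen


if __name__ == "__main__":
    for component in ("dns", "cloud-region"):
        print(component, "->", sorted(affected_by(component, DEPENDS_ON)))
```

Even in this toy map, losing either "dns" or "cloud-region" reaches the customer-facing "storefront". That is the kind of hidden single point of failure worth asking your providers, and yourself, about.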
2. No one is immune, no matter how big or well-established.
These outages all happened to market leaders with excellent reputations that have earned them huge numbers of customers. Yes, there may be reassurance in numbers, but clearly you can’t count on safety in numbers. “It can’t happen to them” does not apply.
For your enterprise, that means no potential partner gets a free pass. You need to evaluate every business, no matter how large or prominent, with the same cold, analytical questions about their commitment to reliability by design, and how it has played out in their services and infrastructure.
If you’re wondering about Neustar Security Services – and you should be – we are and always have been fully committed to the fundamentals at the core of the services we provide, including reliability by design, security, performance, and support.
For example, we just completed a major upgrade to our UltraSecurity platform, expanding our application security network to 14 locations and increasing our mitigation capacity to 11 Tbps. We also expanded our global DNS footprint to 30 sites while simultaneously overhauling virtually every aspect of our UltraDNS service delivery, including new and expanded transit, faster compute nodes, and much faster zone propagation.
These improvements were not driven by any deficiencies that affected our customers or our services. They are the result of our commitment to constantly challenge ourselves to do better.
Every service provider you choose should be doing the same.