On February 28, it seemed like half the internet went down. The Amazon AWS S3 cloud storage system in northern Virginia suffered an outage that lasted more than four hours.
In internet time, that’s an eternity – and potentially a lot of lost revenue for Amazon and its customers. The same goes for all the hair-on-fire in operations and support, and the ripples of disruption spreading down the line to end users. If you rely on a third-party service like Amazon AWS S3 to provide mission-critical functionality, your clients are going to blame you when it goes down.
AWS is by far the most popular cloud service, with about 40% of the global market.
This compares to close to 25% for the next three largest providers combined (Microsoft Azure, Google Cloud and IBM Cloud). So when something bad happens to AWS, a lot of people are going to feel it.
Amazon said that a fat-fingered employee was to blame. While investigating a billing system issue, he accidentally took down too many servers by typing in a wrong number. Could happen to any of us, right?
Well, that’s Amazon’s story.
We hadn’t seen an HTTP 503 Service Unavailable response since January 27 (itself the first since December 13, 2016). Then, on the day of the outage, we saw a 503 at 14:15 UTC – less than four hours before we detected the start of the main outage at 18:04.
Is this just a hell of a coincidence, or was there more to the incident than Amazon AWS is saying?
Did something happen about 14:15 that was a harbinger of the later outage – or its cause? Was the big outage caused by trying to fix the earlier outage, or maybe prevent its recurrence?
If the Amazon AWS S3 APIs are unavailable, the first you might know about it is when your customers complain to you that your service or APIs aren’t working. But if you monitor your mission-critical third-party APIs, you will be able to determine quickly where the buck stops when there is a problem.
- One lesson is to make sure your service doesn’t have a single point of failure – like being dependent on a cloud service in a single physical location.
- Another is to use an API monitoring service. It ensures that the APIs you depend on – your own, your clients’ and your suppliers’ – are always behaving properly and meeting their Service Level Agreements. And you’ll get alerts as soon as something goes wrong.
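A minimal monitoring probe can be as simple as timing a request and recording the status code. The sketch below uses only the Python standard library; the URL, timeout, and failure categories are illustrative assumptions, not a full monitoring service:

```python
import time
import urllib.error
import urllib.request

def probe(url, timeout=10):
    """Make one request and return (status_code, latency_seconds).

    status_code is None when no HTTP response came back at all
    (DNS failure, connection refused, timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code            # an error response, e.g. a 503
    except (urllib.error.URLError, OSError):
        status = None                # no response at all
    return status, time.monotonic() - start
```

In real use you would run a probe like this on a schedule, from several locations, and alert when the status or latency drifts out of range.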
The APImetrics Insights CASC score is a credit score-like metric that uses proprietary machine learning technology. We combine various measures of API performance to provide a single blended metric. This allows you to see at a glance the quality of an API, whether it is getting better or worse, and how it compares to other APIs.
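The actual CASC formula is proprietary, but the general idea of blending several performance measures into one number can be shown with a toy example. The weights, latency budget, and inputs below are illustrative assumptions only, not the real model:

```python
def blended_score(pass_rate, p95_latency_ms, latency_budget_ms=500.0,
                  weights=(0.7, 0.3)):
    """Toy blend of two API health measures into a single 0-1000 score.

    NOT the real CASC formula; the weights and latency budget here are
    hypothetical. pass_rate is the fraction of calls that succeeded.
    """
    w_pass, w_lat = weights
    # Latency contributes linearly until it exhausts the budget.
    latency_score = max(0.0, min(1.0, 1.0 - p95_latency_ms / latency_budget_ms))
    return round(1000 * (w_pass * pass_rate + w_lat * latency_score))
```

Even a crude blend like this makes a week of degraded performance jump out as a single falling number, which is the point of a composite score.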
The weekly CASC scores for the two APIs of the PaaS provider we monitor for the period of the outage make very interesting reading:
| Week       | OAuth 2.0 dialog | Profile Fetch |
|------------|------------------|---------------|
| 2017-03-06 | 911              | 928           |
| 2017-02-27 | 668              | 674           |
| 2017-02-20 | 914              | 949           |
So the overall performance of two generally very high-performing APIs was significantly degraded by the outage. The APIs probably weren’t meeting their Service Level Agreements that week.
These are the stats for the day of the outage, which produced a lot of crazy-looking results.
The PaaS provider might have noticed the outage itself. But many organizations are dependent on third-party APIs that may in turn be dependent on services like Amazon AWS. A downstream service might fail softly or even silently, perhaps just returning an empty payload to the user.
Forewarned is forearmed.
You don’t want to be the last one to hear that your service really isn’t working, even if DevOps is still seeing nothing but 200 OKs. It makes sense to have a look at the API and see what is going on before something worse happens.
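As a sketch of what catching a soft failure might look like, the check below treats a 200 OK with an empty or unparseable JSON payload as a failure in its own right. The function name and failure categories are hypothetical, and it assumes the API is expected to return non-empty JSON:

```python
import json

def check_response(status, body):
    """Classify a response, catching 'soft' failures hidden behind 200 OK."""
    if status != 200:
        return "hard_failure"        # the easy case: a visible error code
    if not body or not body.strip():
        return "soft_failure_empty"  # 200 OK, but nothing in the payload
    try:
        payload = json.loads(body)
    except ValueError:
        return "soft_failure_malformed"  # 200 OK, but not valid JSON
    if not payload:
        return "soft_failure_empty"  # 200 OK, but an empty object/array
    return "ok"
```

A check like this is exactly what a dashboard counting only status codes would miss: every one of the soft failures above shows up there as a 200 OK.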
That’s why you need to monitor your service and your APIs from the end-user perspective. You’ll get a holistic, complete view of how your business processes are actually performing.
An API that normally performs well but suddenly starts displaying odd behaviors, like increased call latency or the occasional non-200 response, might indicate more serious issues ahead. With APImetrics, you can pick up on the early symptoms of later problems.
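One simple way to pick up such early symptoms is to compare each new latency sample against a recent baseline. This is a minimal sketch, assuming latencies in milliseconds and a three-sigma threshold; a production monitor would use percentiles over much larger windows:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, sigmas=3.0):
    """Flag a latency sample far above its recent baseline.

    history: recent latency samples (ms); latest: the newest sample.
    Returns True when latest exceeds mean + sigmas * standard deviation.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sd = mean(history), stdev(history)
    return latest > mu + sigmas * max(sd, 1e-9)
```

The choice of threshold is a trade-off: too tight and every network blip pages someone; too loose and the early warning arrives with the outage.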
As we store the response of every call, you can drill down into the causes and consequences of the failure. You might still catch some of the blame. But at least you’ll know what is going on and can put in place measures to try to stop it happening again.