Why API Monitoring?
The Problems at the Coalface
We regularly get the question: why should I monitor my APIs? After all, we’ve had REST APIs for a long time now, relatively speaking, and we have a variety of solutions in place for IT monitoring – whether it’s cloud server monitoring, a stack monitoring solution like New Relic or AppDynamics, or the logs from your gateway.
So why do you need something else to do it?
It’s a good question, but the answer is simple – and it still comes as a shock to us when we look at some of the APIs provided by large, public organizations. We see them doing things that should be picked up by other tools, and we see implementations that would have been problematic even at the birth of the API economy, five or more years ago, still going strong today.
To illustrate this, we looked at some APIs provided by the government agencies of a G20 country recognized as one of the world leaders in digital government, and a set of APIs provided by 9 different blue-chip, Fortune 500 organizations which all do essentially the same thing – return data on branch locations. We didn’t do a deep survey; this was a high-level look to generate some data for interested parties.
Here are some of the things we found. They emphasize why organizations need to get on top of their APIs – and not just unit test them. They also need to make sure that the APIs they make public work as documented, and continue to work.
The Simple Stuff
If your API has security, then in 2018 we see no reason why you would pass the API key as a URL parameter rather than in an HTTP header.
At the very least, HTTPS should be a must. There’s certainly not much point in implementing key-based security only to make your users pass their keys around the internet unencrypted.
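To make the point concrete, here’s a minimal Python sketch contrasting the two ways of passing a key. The endpoint, key, and header name are hypothetical, for illustration only:

```python
from urllib.parse import urlencode
from urllib.request import Request

API_KEY = "example-key"                        # hypothetical key
BASE_URL = "https://api.example.com/branches"  # hypothetical endpoint

# Avoid: a key in the query string ends up in browser history,
# proxy logs, and server access logs.
bad_url = f"{BASE_URL}?{urlencode({'api_key': API_KEY})}"
bad_request = Request(bad_url)

# Prefer: a key in a header, sent over HTTPS, stays out of URLs
# and logs and is encrypted in transit.
good_request = Request(BASE_URL, headers={"X-API-Key": API_KEY})

print(API_KEY in bad_request.full_url)   # True  – the key leaks into the URL
print(API_KEY in good_request.full_url)  # False – the key stays out of it
```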
Is that ACTUALLY a parameter – or not?
In the documentation, there is a parameter called “datatype” that you set to define whether you want JSON or XML. We were interested to find that in some cases an API call worked with the parameter set to /datatype/ (it returned XML) – but in other cases it returned a 4XX error.
Conducting this type of “fuzzing” on your APIs – as deployed – spots a bunch of these edge cases.
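A minimal sketch of what that kind of parameter fuzzing can look like – the endpoint, the candidate values, and the stubbed fetch function are all hypothetical, standing in for a real HTTP client:

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/branches"

def fuzz_param(fetch, name="datatype",
               values=("json", "xml", "/json/", "", "JSON")):
    """Call the API once per variant value and record the status code.

    `fetch` is any callable mapping a URL to an HTTP status code, so
    a real client (or a stub in tests) can be plugged in.
    """
    results = {}
    for value in values:
        url = f"{BASE_URL}?{urlencode({name: value})}"
        results[value] = fetch(url)
    return results

# Stubbed fetch: pretend the API accepts only the documented values.
def fake_fetch(url):
    return 200 if "datatype=json" in url or "datatype=xml" in url else 400

statuses = fuzz_param(fake_fetch)
print(statuses["json"], statuses["/json/"])  # 200 400
```

Inconsistent status codes across near-identical inputs, like the /datatype/ case above, are exactly the edge cases this surfaces.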
What does that mean? Helpful or unhelpful errors.
In another API, we found that if you made a call in the way the documentation said you could, it would throw an error – in this case, a 4XX. It turned out that a mandatory parameter was missing from the documentation. The really helpful part was the error itself, which contained exactly 2 bytes of error message – yes, brackets, that’s all…
It took some trial and error to get an actual working call.
Document what you provide – not what you wrote down
In many instances, the API calls as written in the web documentation simply didn’t work, returning a range of 5XX and 4XX errors when input EXACTLY as defined in the documentation. There were no useful error messages, and no clue as to why they weren’t working.
In one case, the call returned a 500 error along with confirmation that GZIP compression had taken place – but the GZIP package that came back was unreadable. This told us the API didn’t work, but gave us, as users, no firm idea about what to do next.
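A check that would catch both failure modes above can be sketched as follows – the function and the sample responses are illustrative, not any particular tool’s API. It replays a documented example call and verifies the status, the advertised encoding, and that the body actually parses:

```python
import gzip
import json

def check_documented_call(status, headers, body):
    """Return a list of problems with a response to a documented example call.

    `status`, `headers`, and `body` would come from replaying an example
    request exactly as written in the public docs.
    """
    problems = []
    if status >= 400:
        problems.append(f"documented call failed with HTTP {status}")
    if headers.get("Content-Encoding") == "gzip":
        try:
            body = gzip.decompress(body)
        except OSError:
            problems.append("Content-Encoding says gzip but body won't decompress")
    try:
        json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")
    return problems

# A well-formed gzipped JSON response passes cleanly.
ok_body = gzip.compress(json.dumps({"branches": []}).encode())
print(check_documented_call(200, {"Content-Encoding": "gzip"}, ok_body))  # []

# The failure mode described above: a 500 plus an unreadable "gzip" body
# flags all three problems at once.
print(check_documented_call(500, {"Content-Encoding": "gzip"}, b"\x00garbage"))
```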
The Not So Simple
Here we have the performance of 14 different APIs from 14 different providers. To make this a valid comparison, we did some work first:
- All 14 providers are of similar size – all Fortune 500 companies
- The sample covers the same period of time, 1 month
- The APIs are called from the same set of locations and clouds (AWS, Azure, Google and IBM)
- The content retrieved is similar in nature
- There is no security on the API call, so no gateway effects to slow things down
What do assumptions do?
There’s a famous saying about assumptions, and in the case of APIs it holds true. Here’s what we assumed we would observe:
- The APIs would be largely similar in performance
- There would be a correlation between the size of the returned content and the speed
- The impact of networking effects from the major clouds would be negligible
As we can see from the graph, none of these held up under analysis. There’s a wide difference between the fastest and the slowest, even taking content size into account, with more than a 200ms difference between APIs returning similar quantities of data.
Even though the data is meant to be largely the same, and on inspection has all the same fields, some providers don’t populate all of them – even though we assume that some of those providers do have, for example, the data on accessibility.
Finally, there remain significant differences in networking impacts on APIs – the fastest performing service took 10% of the time the slowest did to handle DNS lookup. Looking deeper into the DNS lookup data, it was clear that the worst performing APIs had significant DNS registration problems with 3 of the 4 major clouds, adding significant delays to all transactions.
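Measuring the DNS contribution separately from the rest of the request is straightforward; here is a sketch using only the Python standard library. The hostname here is a local placeholder – in practice you would substitute the API hostnames you monitor and compare the spread across vantage points:

```python
import socket
import time

def dns_lookup_ms(hostname, attempts=3):
    """Time bare name resolution for a hostname, returning the best of a
    few attempts in milliseconds (best-of smooths out cache warm-up)."""
    best = float("inf")
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, 443)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Substitute the real API hostnames you care about; a large spread
# here points at DNS, not at the API servers themselves.
print(f"localhost: {dns_lookup_ms('localhost'):.1f} ms")
```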
What does this mean for monitoring?
- You can’t rely on any single product or source. You might have blisteringly fast lookup times on your server, but if your DNS settings or registrations aren’t working with more than half the cloud services on the planet, your customers will see poor performance.
- A bad update to an API can take down all its users without anybody being any the wiser.
- Looking only at your services, which may be up, won’t spot that a specific backend service has stopped communicating correctly or that your load balancer is down.
- Mis-set alerting thresholds could leave you down for large chunks of the day, and be hard to spot except through long, slow crawls through thousands of Splunk records.
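Pulling the points above together, the key is evaluating each vantage point separately rather than one global average, which would hide a region that is failing while the rest of the world looks healthy. A sketch with hypothetical probe data and illustrative thresholds:

```python
# Hypothetical per-region probe results: (HTTP status, latency in ms).
probes = {
    "aws-us-east": [(200, 120), (200, 130), (200, 125)],
    "gcp-europe":  [(200, 140), (503, 0), (503, 0)],
    "ibm-asia":    [(200, 480), (200, 510), (200, 495)],
}

def evaluate(probes, max_latency_ms=400, max_error_rate=0.34):
    """Flag regions whose error rate or median latency breaches a threshold.

    Checking per region is the point: a single global number can look
    fine while one cloud's users are getting errors or slow responses.
    """
    alerts = []
    for region, samples in probes.items():
        errors = sum(1 for status, _ in samples if status >= 500)
        if errors / len(samples) > max_error_rate:
            alerts.append((region, "error rate"))
            continue
        latencies = sorted(ms for status, ms in samples if status < 400)
        if latencies and latencies[len(latencies) // 2] > max_latency_ms:
            alerts.append((region, "latency"))
    return alerts

print(evaluate(probes))  # [('gcp-europe', 'error rate'), ('ibm-asia', 'latency')]
```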
We’re not saying you have to use our product, but if you provide or consume APIs, you do need something that tells you how they’re really working – and not just how you think they do.