70% of all API issues have no easy-to-identify root cause
In a survey of 20 of the leading corporate infrastructure APIs, we found that for over 70% of the performance issues we detected, there was no clear root cause in the cluster of poor performance. We used our Machine Learning system to learn the normal performance of each of the APIs, which included services from DocuSign, Microsoft and Dropbox, and looked for periods where the performance degraded. We then clustered the events that appeared to be linked or related (again using machine learning techniques) and looked [...]
It has come to our attention that a significant issue affected data collection from our remote agents over the 4th of July weekend. We've traced the problem to our services bus that connects the remote agents to our data store. This has been rectified, and we are taking steps to ensure this type of event can't happen again. Unfortunately, it will have resulted in a weekend of lost data from the different collection points. Calls made from our default server are unaffected. If you have any additional questions, please don't hesitate to contact us.
We're in the process of rolling out some new analytics tools and we've been looking at some headline numbers. The first headline number is 83,000,000: that's how many API calls we've run since we started APImetrics and, because we believe in learning from data, that's how many records we have in our database. The second number is the one that surprised us: of those calls, about 1,600,000 were out-and-out failures, that is, the API returned, for some reason, a 5XX error. We have a much higher rate of 4XX errors, but they could be related to token [...]
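As a rough check on those two headline figures, the 5XX failure rate can be worked out directly. This is only a sketch: the variable and function names are made up for illustration, and both inputs are the rounded numbers quoted above rather than exact counts.

```python
# Rounded headline figures quoted in the post (not exact counts).
total_calls = 83_000_000
server_errors = 1_600_000  # calls that returned a 5XX status


def failure_rate(failures: int, total: int) -> float:
    """Return failures as a percentage of total calls."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failures / total


rate = failure_rate(server_errors, total_calls)
print(f"5XX failure rate: {rate:.2f}%")  # prints "5XX failure rate: 1.93%"
```

On these rounded numbers, that is roughly one outright failure in every 52 calls.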
This article covers a number of things we've discovered over the last year, and it is very handy for anybody who is starting to figure out what a Service Level Agreement (SLA) actually means for an API, a micro-service or, for that matter, the cloud. We're going to be expanding our feature set around SLA monitoring and we'll keep you informed of the status over the next few months. Enjoy this one; it's an excellent read from CIO.com.
API Time Travel
We frequently check general performance data to look for 'odd' responses, and we found an interesting one today: an API call on a test server which took -284,027ms, or just a little under 5 minutes. We assume that the host had its clock reset in the middle of making the API call, but it was an interesting result and one which has led to a small change on our side to error out such platform-induced time issues in the future. Once again, this makes it clear to us that just [...]
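One way to guard against this kind of result is sketched below. It is not a description of the change we actually made: the function names are hypothetical, and the sketch simply assumes that a negative wall-clock duration should be treated as an error, and that new measurements can use a monotonic clock, which is not affected by system clock resets.

```python
import time
from typing import Callable, Optional


def timed_call_ms(func: Callable[[], object]) -> float:
    """Run func and return the elapsed time in milliseconds.

    time.monotonic() cannot go backwards, so a clock reset on the host
    during the call cannot produce a negative duration.
    """
    start = time.monotonic()
    func()
    return (time.monotonic() - start) * 1000.0


def validate_latency_ms(start_wall: float, end_wall: float) -> Optional[float]:
    """Turn two wall-clock timestamps (seconds) into a latency in ms.

    Returns None when the end is earlier than the start, which signals
    that the clock was adjusted mid-call and the sample should be
    flagged as an error rather than stored.
    """
    elapsed_ms = (end_wall - start_wall) * 1000.0
    if elapsed_ms < 0:
        return None
    return elapsed_ms
```

With the figures above, an end timestamp 284.027 seconds earlier than the start would be rejected, while an ordinary sample such as `validate_latency_ms(10.0, 10.5)` would return `500.0`.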
As one of our feature enhancements, we've improved the way our agents work. We are now consistently capturing key data on the actual API performance, including:
DNS Lookup time: e.g. 28355µs (28ms)
Time to Connect: e.g. 76106µs (76ms)
Time for Handshake: e.g. 0µs (0ms)
Upload time: e.g. 48µs (0ms)
Processing time: e.g. 120568µs (121ms)
Download time: e.g. 2545µs (3ms)
However, this improvement has raised two issues. Firstly, we have realized that we had some minor reporting issues with our old collection agents, which means that some of the latencies we were recording were better than those actually being experienced. [...]
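To show how the per-phase figures relate to each other, here is a small sketch that converts the example values from microseconds to milliseconds and adds them up. The dictionary keys and the helper name are invented for this illustration and are not the field names our agents use; treating the end-to-end time as the plain sum of the phases is also an assumption.

```python
# Example phase timings in microseconds, taken from the list above.
phases_us = {
    "dns_lookup": 28355,
    "connect": 76106,
    "handshake": 0,
    "upload": 48,
    "processing": 120568,
    "download": 2545,
}


def us_to_ms(microseconds: int) -> int:
    """Convert microseconds to whole milliseconds, rounding to nearest."""
    return round(microseconds / 1000)


for name, value_us in phases_us.items():
    print(f"{name}: {value_us}µs ({us_to_ms(value_us)}ms)")

# Assumption: the phases do not overlap, so the total is their sum.
total_us = sum(phases_us.values())
print(f"total: {total_us}µs ({us_to_ms(total_us)}ms)")
```

For these example values the phases add up to 227622µs, or about 228ms, with processing time accounting for just over half of the total.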
Twitter had an extremely rare outage today. As our recent social media report showed, they're one of the most reliable services we measure, but they do have outages, and given the range of services plugged into the APIs, it affected people enough that the news outlets noticed: BBC coverage here. So, were you ready for it? The problems seem to have started around 13:00 UTC (5am Pacific) and started clearing up around 15:00. They were intermittent, and we'll publish a more detailed review of the anatomy of this outage and its identifying characteristics later. From our initial insights, the US was [...]
We'll have more details out shortly, but there are a lot of features and changes coming in the next few days, once we're through the final testing. These include, but are not limited to:
System-wide variables with the ability to set different production environments
Improved data visualizations with new graphing options and improved heatmaps
Global SLA settings
View all deployments from a single view
We now support a full range of webhook integrations to services like PagerDuty and GitHub, and we have a general set of webhooks that can be used for all integrations. Watch this space!
For our second API Health Report we decided to look at Social Networking APIs, specifically Twitter, Facebook and Tumblr. As with our first report, we were interested in how your choice of cloud could impact the results but, unlike the first report, we've also added a range of global locations so you can see how cloud choices and location of users could impact performance. One of the key takeaways from this report is that there are differences between the clouds and large differences between the regions once you are outside of the United States. If you have customers [...]