In their recent white paper, ThousandEyes concludes that all the clouds offer good choices from the perspective of network performance with solid backbones for data transfer. But that’s only part of the story.
When we started with our unique multi-cloud approach, we assumed that there wouldn’t be much difference between different clouds when they’re used to call APIs. The fact is, they’re all VERY different.
How we analyze performance
Our product makes a series of calls to API endpoints. We use the same security (OAuth, OAuth+JWT, etc.) that a TPP (Third-Party Provider) or AISP (Account Information Service Provider) would use, and we verify how well the calls are working from two perspectives. From a call perspective: does it return what you expected, and does it do everything it is meant to do? From a networking perspective: can you connect, how long does the call take to resolve DNS, and so on?
This dataset is based purely on what we have observed in the interactions between the cloud services we are hosted on (AWS, Azure, IBM and Google) and the endpoint for our target bank, and here we are looking only at DNS performance. We'll share the raw data, what it means from a cross-European perspective, some thoughts on proximate causes, and why TPPs, AISPs and banks should be factoring this into their monitoring and reporting decisions.
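As a rough illustration of the "does it return what you expected?" check, the sketch below validates a response's status code and the presence of required top-level JSON fields. It is a minimal, hypothetical example: the function name `validate_api_response`, the expectation of HTTP 200, and the field names in the usage line are assumptions for illustration, not our product's implementation or any bank's actual API contract.

```python
import json
from typing import Iterable


def validate_api_response(status: int, body: str, required_fields: Iterable[str]) -> list[str]:
    """Return a list of problems found in an API response (empty means it passed).

    Hypothetical check: expecting HTTP 200 and a JSON object with certain
    top-level fields is an illustrative assumption, not a real API contract.
    """
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(payload, dict):
        problems.append("body is not a JSON object")
        return problems
    # Report every missing field rather than stopping at the first one.
    for field in required_fields:
        if field not in payload:
            problems.append(f"missing field '{field}'")
    return problems
```

For example, `validate_api_response(200, '{"Data": {}, "Links": {}}', ["Data", "Links", "Meta"])` returns `["missing field 'Meta'"]`. Returning a list of problems, rather than a single pass/fail flag, keeps the functional result separate from the network timings recorded for the same call.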
What the data says
This came to light recently in an analysis we did of network connection issues for a Tier 1 European bank that has been investigating performance problems when connecting to its Read/Write APIs.
Although it is a UK-based bank, it falls under the pan-European PSD2 regulations and will certainly have partners, suppliers and integrators all over the continent, if not the world. Rather than trying to cover every possible origin, we focused on what performance would look like from the main European data centers of the four major cloud service providers: Amazon, Microsoft, Google and IBM.
We looked at DNS resolution times, as these seemed to be having the biggest impact on performance. For each location we reviewed the median resolution time, the 95th percentile and the 99th percentile. The 95th and 99th percentiles mark the times beyond which the slowest 5% and 1% of lookups fall, i.e. how long the worst-affected traffic waits just to find the destination server.
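The three statistics can be computed from raw lookup timings along these lines. This is a minimal sketch, assuming samples are collected in milliseconds for a single cloud/location pair; `summarize_dns_times` is a hypothetical helper, and the interpolation method shown is one reasonable choice rather than necessarily the one used to produce our figures.

```python
import statistics


def summarize_dns_times(samples_ms: list[float]) -> dict[str, float]:
    """Median, 95th and 99th percentile of DNS resolution times (milliseconds).

    statistics.quantiles with n=100 returns the 99 cut points P1..P99,
    so index 94 is P95 and index 98 is P99. method="inclusive" interpolates
    between observed samples instead of extrapolating beyond them.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

With 98 lookups at 1ms and two at 2,000ms, the median and 95th percentile are both 1ms while the 99th percentile is 2,000ms, which is exactly why the upper percentiles matter alongside the median.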
The table shows the raw results.
Plotting the results by country makes the difference much clearer.
This shows a fairly wide gap, not just in median performance but also in the upper percentiles, and it varies by location as well as by cloud. UK sites generally perform better, which is unsurprising for a UK bank with a UK data center. But developers, TPPs and others will pick locations based on other factors, especially TPPs and AISPs, whose data centers for their core services may be in completely different locations.
From the UK, one provider, Cloud 4, resolves the bank's DNS in under 1ms at the median and takes only 29ms at the 99th percentile. Compare this with Cloud 1, which takes 70x longer at the median and almost 80x longer at the 99th percentile, a huge 2 seconds.
To put that another way, 1 in every 100 calls takes around 2 seconds or more just to find the server it's trying to talk to.
Further afield, the differences are even more stark. Across the EU, the gap between the best and worst median resolution times is "just" 20x, but that's the difference between 3/100ths of a second and half a second.
At the 99th percentile, we're into serious problems: the gap grows to over 170x, with a peak of almost 5 seconds. That means a lag of almost 5 seconds on 1% of API calls originating from that data center.
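If a single user journey makes several API calls, a 1% tail can affect far more than 1% of journeys. The back-of-the-envelope sketch below shows this under two loud assumptions: every call performs its own DNS lookup (no caching), and lookups are independent of one another. Resolver and client caching will reduce the effect in practice, so treat the result as illustrative only; `chance_of_slow_lookup` is a hypothetical helper.

```python
def chance_of_slow_lookup(calls_per_journey: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one call in a journey lands in the slow tail.

    Assumes (simplification) that every call does its own DNS lookup and
    that each lookup independently falls in the tail with tail_fraction.
    """
    if calls_per_journey < 0 or not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("invalid arguments")
    # Complement of "every lookup avoided the tail".
    return 1.0 - (1.0 - tail_fraction) ** calls_per_journey
```

For a hypothetical journey of 10 calls, `chance_of_slow_lookup(10)` is about 0.096, so under these assumptions roughly 1 in 10 journeys would hit at least one lookup in the slowest 1%.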
So what? And why?
At a basic level, any latency – especially on the order of seconds per transaction – is going to be bad for APIs. There's plenty of data to suggest that a 400ms lag in an application or in loading a website can really hurt adoption. Google applies these metrics to its advertising scoring, and IBM has published significant data on how lag affects the adoption of networked applications and their users.
So, yes, this matters – especially as banks have little control over where TPPs, AISPs and others are located and hosted.
The why is more complex. At least one factor is how this bank's security requirements interact with the cloud providers' handling of InfoSec standards. That is something anybody building out an architecture in the open banking and financial services sector can be aware of and take into account from the outset.
Third-party architecture matters
We've been following the statistics Credit Kudos shares on what they see from the CMA9 banks they integrate with. Their numbers don't align with the data we see from the banks we monitor, nor with the figures the OBIE receives from the banks' own self-reporting.
A contributing factor is a problem at the heart of modern monitoring: what you measure, and how you measure it, can affect the results you get, both positively and negatively. This leads to conflicting goals inside businesses. Operations wants the best possible numbers to report to management, while management wants to know what things really look like so it's prepared for potential issues.
This is just one bank. When we look at the same data for other banks, we see wildly different results – with some having solid performance across all data centers and all geographies, and others having much, much worse performance.
What ThousandEyes sees is different from what your stack monitoring tools see, or even your web tools. Understanding this, and adjusting how you monitor accordingly, is essential.
You can't afford to take it for granted that Amazon, Microsoft, IBM or even Google has this solved for you. Nor can you assume that your networking probes will spot this type of problem: once the connection is made, the traffic itself may be handled perfectly. And you can't assume that CDNs such as Akamai, Cloudflare and others are on top of it either.
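One way to keep DNS from being hidden is to time it as a separate phase before the connection is opened. The sketch below is a minimal illustration, not how any particular monitoring product works: `time_dns_and_connect` is a hypothetical helper, port 443 is an assumed default, and `resolve`/`connect` are injectable so the timing logic can be exercised without network access. A probe that only starts its clock once the connection is established would miss the DNS phase entirely.

```python
import socket
import time
from typing import Callable


def time_dns_and_connect(
    host: str,
    port: int = 443,
    resolve: Callable = socket.getaddrinfo,
    connect: Callable = socket.create_connection,
    timeout: float = 5.0,
) -> dict[str, float]:
    """Time DNS resolution and TCP connect as separate phases (milliseconds).

    resolve/connect default to the real socket functions but can be
    swapped for fakes in tests.
    """
    start = time.perf_counter()
    addrinfo = resolve(host, port, type=socket.SOCK_STREAM)
    resolved = time.perf_counter()
    # sockaddr is (ip, port) for IPv4 or (ip, port, flowinfo, scope_id) for IPv6.
    ip, resolved_port = addrinfo[0][4][:2]
    # Connect to the already-resolved IP so this phase includes no second lookup.
    sock = connect((ip, resolved_port), timeout=timeout)
    connected = time.perf_counter()
    sock.close()
    return {
        "dns_ms": (resolved - start) * 1000.0,
        "connect_ms": (connected - resolved) * 1000.0,
    }
```

Against a real host, calling `time_dns_and_connect("api.bank.example")` (a placeholder name) with the default functions would report how much of the setup time went to finding the server versus reaching it.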
In a complex world like Open Banking, where TPPs integrate from a range of data centers, architectures and other locations, you can’t afford not to be paying attention to what they might experience and why.
APImetrics is happy to offer a fast, cost-effective consultation on potential issues and provide you with insight into potential trouble spots – and, as a happy coincidence, set up all your monitoring for you and leave you with the tools never to be caught out again.
Can you afford not to do it?